Data Wrangling and Tidy Data

Published

August 22, 2024

Code

rm(list = ls())
library(tidyverse)
library(tidymodels)
#install.packages("randomNames") #cannot have install.packages() in a notebook
library(randomNames)
library(kableExtra)

options(kable_styling_bootstrap_options = c("hover", "striped"))

theme_set(theme_bw())

ikea <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-11-03/ikea.csv')

###############################
#Generate Data
###############################
set.seed(300)
items <- ikea %>%
  count(category) %>%
  sample_n(5) %>%
  pull(category)

items_dup <- items %>%
  append(c("Outdoor Furniture", "Nursary Furnature"))

products <- tibble(items = items_dup, prices = c("$125", "$675", "$1,350", "$475", "$725", "$675", "$475"))

clients <- tibble(client_id = seq(1, 10, 1), 
                  name = randomNames(10),
                  phone_number = paste0("(",
                                        sample(111:777, 10),
                                        ") ",
                                        sample(111:777, 10),
                                        "-",
                                        sample(1111:7777, 10)))

numRows <- 126735

orders <- tibble(client_id = sample(1:10, size = numRows, replace = TRUE), product = sample(items, size = numRows, replace = TRUE), quantity = 1 + rpois(numRows, lambda = 2))

purchase_history <- crossing(client_id = clients$client_id, items = products$items, year = c("2018", "2019", "2020", "2021")) %>%
  mutate(quantity = 150 + (-1)^(sample(0:1, 280, replace = TRUE))*rpois(280, 50)) %>%
  left_join(products) %>%
  mutate(prices = str_replace(prices, "\\$", ""),
         prices = str_replace(prices, ",", ""),
         prices = as.numeric(prices),
         total_purchased = quantity*prices) %>%
  select(-quantity, - prices)

#purchase_history

historical_purchases <- purchase_history %>%
  group_by(client_id, year) %>%
  summarize(total_value = sum(total_purchased)) %>%
  ungroup() %>%
  pivot_wider(names_from = year, values_from = total_value)

##################################
#End Data Generation
##################################

#Remove all intermediate variables
rm(list = setdiff(ls(), c("clients", "orders", "products", "purchase_history", "historical_purchases")))

Objectives

This notebook gives an overview of some of the most common problems that data come with, and some methods for dealing with those issues.

Inconsistent data recording issues
Unexpected data types
Multi-variable columns
Wide- and Long-format data
Data across multiple tables

Set-Up

Run all of the code in the initial code chunk, line by line. You will likely encounter an error on line 20 because you don’t have the randomNames package installed on your machine. In line 19, un-comment install.packages("randomNames") and run it. Now re-run line 20 and continue running each line in the code chunk. You can examine the code if you like, but it is not necessary.

The code generates several tables of data, exhibiting some common challenges when dealing with data. The result of running the code in that chunk will be five tables in your Global Environment: clients, orders, products, purchase_history, and historical_purchases. Use your basic data exploration functions (glimpse(), head(), tail(), summary(), etc.) to understand these data frames.

Challenge 1: Inconsistent Data Recording

Take a look at the products data frame. The data frame only has seven rows, but among those rows are seemingly duplicated records. Luckily, while the item categories are duplicated (with different spelling and capitalization), the prices are consistent. We have several options available:

products %>%
  kable() %>%
  kable_styling()

items	prices
TV & media furniture	$125
Outdoor furniture	$675
Beds	$1,350
Nursery furniture	$475
Sideboards, buffets & console tables	$725
Outdoor Furniture	$675
Nursary Furnature	$475

Manually drop the duplicated rows
- Our data frame is small enough for this, but with a larger data frame, this might be tedious.
- If this course of action is to be taken, we must make sure that we are not accidentally deleting data.
We can use case_when() to manually override the values in some cells.

There are certainly other things we could do as well. These two approaches are shown below.

#Filter out the duplicated rows
products %>%
  filter(!(items %in% c("Outdoor Furniture", "Nursary Furnature"))) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))

items	prices
TV & media furniture	$125
Outdoor furniture	$675
Beds	$1,350
Nursery furniture	$475
Sideboards, buffets & console tables	$725

#Use Case/When to overwrite values and then filter to distinct rows
products %>%
  mutate(items = case_when(
    items == "Outdoor Furniture" ~ "Outdoor furniture",
    items == "Nursary Furnature" ~ "Nursery furniture",
    TRUE ~ items
  )) %>%
  distinct() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))

items	prices
TV & media furniture	$125
Outdoor furniture	$675
Beds	$1,350
Nursery furniture	$475
Sideboards, buffets & console tables	$725

The case_when() approach is more applicable than simply removing the rows, because it isn’t often that we would actually like to remove rows. We actually do mean to remove rows here, though. Choose your favorite method and overwrite the products data frame with a version of itself that does not have the duplicated rows.

#update the products data frame here

Challenge 2: Unexpected Data Types

Perhaps you’ve noticed that the prices column in the products data frame, is being interpreted as a character data type. We would prefer for that column to be a numeric variable so that we can run computations with it. There are two issues causing the prices to be characters – the dollar sign ($) and the comma separator. We’ll need to remove those and convert the column to a numeric value.

products %>%
  kable() %>%
  kable_styling()

items	prices
TV & media furniture	$125
Outdoor furniture	$675
Beds	$1,350
Nursery furniture	$475
Sideboards, buffets & console tables	$725
Outdoor Furniture	$675
Nursary Furnature	$475

We’ll make use of the str_replace() function for this. This function allows us to search for a sub-string and, if it is found, replace it with a new substring. We’ll replace any found dollar signs and commas with nothing…effectively erasing them. Once this is done, we’ll use as.numeric() to change the data type. As a reminder, we are changing an existing column in our data frame here, so we’ll make use of mutate().

products %>%
  mutate(
    prices = str_replace(prices, "\\$", ""),
    prices = str_replace(prices, ",", ""),
    prices = as.numeric(prices)
  )  %>%
  kable() %>%
  kable_styling()

items	prices
TV & media furniture	125
Outdoor furniture	675
Beds	1350
Nursery furniture	475
Sideboards, buffets & console tables	725
Outdoor Furniture	675
Nursary Furnature	475

The dollar sign is a special character in R, so we needed to escape it with \\$ to convey that we want to match a literal dollar sign. Edit the code cell above so that the changes you made to the prices column are saved and the products data frame is updated.

Challenge 3: Multivariable Columns

We’ll take a look at the clients data frame. The name column contains lastName, firstName for each client.

clients %>%
  kable() %>%
  kable_styling()

client_id	name	phone_number
1	Taylor, Trevor	(284) 350-2197
2	Silva, Jonathan	(768) 647-3338
3	De Lara, Maria	(249) 478-2990
4	Knobel, Justin	(699) 237-5637
5	Weber, Kevin	(243) 263-2235
6	el-Mattar, Najma	(738) 441-7703
7	Campbell, Justin	(750) 177-2744
8	al-Barakat, Shamaail	(584) 518-7284
9	Schell, Jakob	(332) 722-4831
10	Fields, Briana	(417) 162-2994

Perhaps we’d like to have separate columns for first_name and last_name. We can use the separate() function for this.

clients %>%
  separate(name, into = c("last_name", "first_name"), sep = ", ") %>%
  kable() %>%
  kable_styling()

client_id	last_name	first_name	phone_number
1	Taylor	Trevor	(284) 350-2197
2	Silva	Jonathan	(768) 647-3338
3	De Lara	Maria	(249) 478-2990
4	Knobel	Justin	(699) 237-5637
5	Weber	Kevin	(243) 263-2235
6	el-Mattar	Najma	(738) 441-7703
7	Campbell	Justin	(750) 177-2744
8	al-Barakat	Shamaail	(584) 518-7284
9	Schell	Jakob	(332) 722-4831
10	Fields	Briana	(417) 162-2994

This function will separate over all instances of the separator, so we’ll need a new column name for each of the resulting parts. As a reminder, we use c() in R when providing a list of values.

Challenge 4: Wide- and Long-Format Data

Tidy data is data that satisfies the following properties:

Every column is a measured variable.
Each row corresponds to a single record or observational unit.

Data is said to be in a wide format if columns aren’t necessarily measured variables. Instead, the columns themselves correspond to a variable. While we typically prefer long-format data for analysis, there are certainly uses for wide-formats. In particular, financial data is usually presented or stored in wide format. Additionally, it is possible for data to be too long. We can use pivot_longer() to move from wide-format to long format and we can use pivot_wider() to move from long format to wide.

Let’s take a look at the historical_purchases data frame.

historical_purchases %>%
  head()  %>%
  kable() %>%
  kable_styling()

client_id	2018	2019	2020	2021
1	833875	791600	673325	640800
2	713725	800850	566900	662300
3	662375	575775	514150	597475
4	563050	814400	615950	645775
5	684500	708400	604825	835100
6	644025	734475	809775	759575

That data frame contains purchase totals for each client across multiple years. The data is in wide format because, for example "2018" is not a variable we can measure for each client_id. That column name, "2018" is the value of a hidden year variable. We’ll use pivot_longer() to transform our data, creating a year column and total_purchases column.

historical_purchases %>%
  pivot_longer(-client_id, names_to = "year", values_to = "total_purchases") %>%
  head()  %>%
  kable() %>%
  kable_styling()

client_id	year	total_purchases
1	2018	833875
1	2019	791600
1	2020	673325
1	2021	640800
2	2018	713725
2	2019	800850

Now our data frame has become longer! The year variable is currently stored as a character, we can convert it to a number using mutate() and as.numeric().

historical_purchases %>%
  pivot_longer(-client_id, names_to = "year", values_to = "total_purchases") %>%
  mutate(year = as.numeric(year)) %>%
  head()  %>%
  kable() %>%
  kable_styling()

client_id	year	total_purchases
1	2018	833875
1	2019	791600
1	2020	673325
1	2021	640800
2	2018	713725
2	2019	800850

Just so you can see how pivot_wider() works, let’s pivot this tidy data frame back to a wide format, but include a column for each client_id rather than year.

historical_purchases %>%
  pivot_longer(-client_id, names_to = "year", values_to = "total_purchases") %>%
  mutate(year = as.numeric(year)) %>%
  pivot_wider(id_cols = year, names_from = client_id, values_from = total_purchases) %>%
  kable() %>%
  kable_styling()

year	1	2	3	4	5	6	7	8	9	10
2018	833875	713725	662375	563050	684500	644025	759025	626100	716350	537425
2019	791600	800850	575775	814400	708400	734475	556975	594625	630225	433850
2020	673325	566900	514150	615950	604825	809775	734250	662200	761675	575950
2021	640800	662300	597475	645775	835100	759575	773275	715150	634100	735050

Moving between wide and long formats is a pretty useful skill. I use it often when I am summarizing data and would like to display a summary table in a report. These summary tables sometimes look more pleasing and are easier to digest in a wide format.

Challenge 5: Data Across Multiple Tables

It isn’t often the case that all of your data is neatly confined to a single table. In fact, there are very good database-design and infrastructure reasons for this not to be the case. Database Engineers and Administrators work very hard to normalize their databases. This normalization doesn’t have anything to do with finding means or scaling – it is a design strategy that removes redundancy and reduces storage size requirements.

The advantage to space-saving is fairly obvious, but the redundancy-reduction may not be. Consider, for example that one of our clients updated their phone number. If we had all of our data stored in a single table, then we would need to update their phone number in every single record corresponding to that client. Because our data has been normalized, however, the only place phone numbers are stored is in the clients data frame. We only need to update that client’s phone number in that one small table, because that is the only place the phone number has been stored.

The drawback to working with data stored this way is that we’ll need to pull information from across multiple tables in order to obtain the data we’d like to work with. For example, perhaps we’d like a version of the orders data frame which includes the ordering client’s name and phone number, along with the total dollar value of the order they made. The data we need is spread across the orders, clients, and products data frames.

orders %>%
  head() %>%
  kable() %>%
  kable_styling()

client_id	product	quantity
1	Outdoor furniture	2
2	Outdoor furniture	4
9	TV & media furniture	4
4	Beds	2
3	TV & media furniture	2
5	Outdoor furniture	2

clients  %>%
  kable() %>%
  kable_styling()

client_id	name	phone_number
1	Taylor, Trevor	(284) 350-2197
2	Silva, Jonathan	(768) 647-3338
3	De Lara, Maria	(249) 478-2990
4	Knobel, Justin	(699) 237-5637
5	Weber, Kevin	(243) 263-2235
6	el-Mattar, Najma	(738) 441-7703
7	Campbell, Justin	(750) 177-2744
8	al-Barakat, Shamaail	(584) 518-7284
9	Schell, Jakob	(332) 722-4831
10	Fields, Briana	(417) 162-2994

products %>%
  kable() %>%
  kable_styling()

items	prices
TV & media furniture	$125
Outdoor furniture	$675
Beds	$1,350
Nursery furniture	$475
Sideboards, buffets & console tables	$725
Outdoor Furniture	$675
Nursary Furnature	$475

We can use “joins” to bring all of this information into a single table using the following steps.

Begin with the orders table.
Add columns for the client’s name and phone number by joining the clients data frame onto the orders table.
Add a column for product prices by joining the products data frame onto the resulting table.
Use mutate() to compute the total dollar value for each order.

There are lots of different types of joins that can be made between data frames. I find that what I usually want is a left_join() – that is, start with the table on the left and lets join new variables on to it. We’ll be using that here. The left_join() will add new variables on to our data frame by matching values on shared columns (or columns that we explicitly define as shared via the by argument).

orders %>%
  left_join(clients) %>%
  left_join(products, by = c("product" = "items")) %>%
  head()  %>%
  kable() %>%
  kable_styling()

Joining with `by = join_by(client_id)`

client_id	product	quantity	name	phone_number	prices
1	Outdoor furniture	2	Taylor, Trevor	(284) 350-2197	$675
2	Outdoor furniture	4	Silva, Jonathan	(768) 647-3338	$675
9	TV & media furniture	4	Schell, Jakob	(332) 722-4831	$125
4	Beds	2	Knobel, Justin	(699) 237-5637	$1,350
3	TV & media furniture	2	De Lara, Maria	(249) 478-2990	$125
5	Outdoor furniture	2	Weber, Kevin	(243) 263-2235	$675

When using joins, it is important to verify that everything has worked as you expected. While not fool-proof, I usually check to ensure that the number of columns and number of rows in the resulting data frame are what I expected.

Summary

There’s lots of ways our data can be inconvenient. Wrangling data is a really important skill to gain. This notebook gave you exposure to some of the most common data wrangling/cleaning tasks.

Using case_when() to update individual values within a column, using some criteria.
Using distinct() to remove duplicate rows in a data frame.
- Be careful when removing rows of data. Make sure that you are truly removing duplicate records rather than distinct records which are identical by chance.
Using str_replace() to remove or replace characters in a column.
Using as.numeric() to convert a column to a numeric data type.
- We can also use as.character() to convert a column to a string data type or as.factor() to convert explicitly to a categorical variable.
Using separate() in the case that a column contains more than one variable.
- We can use unite() to combine multiple columns into a single column.
Use pivot_longer() to convert from wide-format to long-format.
- We can use pivot_wider() to convert from long to wide.
If our data is split across multiple tables, we can use *_join() to join variables from multiple tables together into a single table.
- There are lots of varieties of join – the left_join() begins with the table on the left and adds variables on to it. This is by far the most frequent join that I’ve used.

This notebook hasn’t covered every wrangling tool you’ll need, but it has given you a toolbox that will help in a variety of common scenarios.