Data Dirty Work: Wrangling

Dr. Gilbert

September 20, 2024

A Quote and Setting Expectations

“Life is dirty. So is your data. Get used to it.” – Thomas LaRock

Breakdown of Data Science work:

80% data cleaning
20% analysing, modeling, and interpreting

Building models is easy

Formatting your data appropriately is difficult

Interpreting your models takes care

The Highlights

What is tidy data?
Common scenarios prompting data cleaning / data wrangling
- Inconsistent data entry/data collection
- Incorrect/unexpected data types inferred
- A single column encodes multiple variables
- Data currently in wide format when we want long format, and vice-versa
- Features are stored across different tables

Play Along

Navigate to the course webpage
Download the Incomplete Quarto version of the Data Wrangling Workshop notebook
Open RStudio and ensure that you are working in your MAT300 project
Open the file you just downloaded and save it to your project space

Tidy Data

Tabular data is tidy if the following three principles are satisfied.

Each row corresponds to a single observational unit (a record).
Each column corresponds to a single measured variable.
Each cell contains the value of the corresponding variable measured on the corresponding record.
- Note: Missing values are permitted.

Simply put, data is tidy if it is in a format convenient for analysis.

A Problematic Data Set

id	date	salary	favorite_color
31	2/25/2018	55000	blue
79	6/26/2021	$58,500	purple
51	11/27/2018	60,000	orang
14	5/5/2018	22.75	Blue
67	4/19/2022	$1400	BLUE
42	6/27/2012	72250	orange

Rows: 6
Columns: 4
$ id             <int> 31, 79, 51, 14, 67, 42
$ date           <chr> "2/25/2018", "6/26/2021", "11/27/2018", "5/5/2018", "4/…
$ salary         <chr> "55000", "$58,500", "60,000", "22.75", "$1400", "72250"
$ favorite_color <chr> "blue", "purple", "orang", "Blue", "BLUE", "orange"

A Better Version

id	month	day	year	salary	favorite_color
31	2	25	2018	55000.00	blue
79	6	26	2021	58500.00	purple
51	11	27	2018	60000.00	orang
14	5	5	2018	22.75	blue
67	4	19	2022	1400.00	blue
42	6	27	2012	72250.00	orange

Rows: 6
Columns: 6
$ id             <int> 31, 79, 51, 14, 67, 42
$ month          <dbl> 2, 6, 11, 5, 4, 6
$ day            <dbl> 25, 26, 27, 5, 19, 27
$ year           <dbl> 2018, 2021, 2018, 2018, 2022, 2012
$ salary         <dbl> 55000.00, 58500.00, 60000.00, 22.75, 1400.00, 72250.00
$ favorite_color <chr> "blue", "purple", "orang", "blue", "blue", "orange"

A Better Version

To obtain this tidier version of our data frame, we…

split the date column into separate month, day, and year columns with separate()
removed the $ signs and commas from the salary column with str_replace()
fixed “case” discrepancies in the color column with mutate() and tolower()
Converted the month, day, year, and salary columns to numerics with as.numeric()

Questions: Problems with this small data frame still remain. What are they?

Try It!

$\bigstar$ Try solving Challenges 1 - 3 in the Data Wrangling notebook you just downloaded

See if you can figure out how to use the functions from the previous slide on your own
After you’ve tried, we’ll solve one challenge at a time together, with an opportunity for you to try on your own in between

A Dataset That Is Too Wide

name	2019	2020	2021	2022	2023
Porter, Andrew	2.9	8.6	7.3	6.4	5.2
Galvan, Brandon	6.0	8.3	6.0	7.3	3.8
Estrada, Nicole	4.4	10.2	9.4	7.0	5.6
Casias, Adelita	3.0	12.1	7.1	8.9	6.6
el-Qasim, Mudrika	7.0	11.4	6.8	5.4	5.7
al-Ismail, Kaatima	2.6	9.8	8.8	6.9	4.3

A Better Version

name	year	value
Porter, Andrew	2019	2.9
Porter, Andrew	2020	8.6
Porter, Andrew	2021	7.3
Porter, Andrew	2022	6.4
Porter, Andrew	2023	5.2
Galvan, Brandon	2019	6.0
Galvan, Brandon	2020	8.3
Galvan, Brandon	2021	6.0
Galvan, Brandon	2022	7.3
Galvan, Brandon	2023	3.8
Estrada, Nicole	2019	4.4
Estrada, Nicole	2020	10.2
Estrada, Nicole	2021	9.4
Estrada, Nicole	2022	7.0
Estrada, Nicole	2023	5.6
Casias, Adelita	2019	3.0
Casias, Adelita	2020	12.1
Casias, Adelita	2021	7.1
Casias, Adelita	2022	8.9
Casias, Adelita	2023	6.6
el-Qasim, Mudrika	2019	7.0
el-Qasim, Mudrika	2020	11.4
el-Qasim, Mudrika	2021	6.8
el-Qasim, Mudrika	2022	5.4
el-Qasim, Mudrika	2023	5.7
al-Ismail, Kaatima	2019	2.6
al-Ismail, Kaatima	2020	9.8
al-Ismail, Kaatima	2021	8.8
al-Ismail, Kaatima	2022	6.9
al-Ismail, Kaatima	2023	4.3

A Dataset That Is Too Long

name	measurement	value
Deleon, Alejandro	height	182.76206
Deleon, Alejandro	weight	70.26488
Padilla, Jakeob	height	170.56931
Padilla, Jakeob	weight	68.29773
al-Baten, Naaila	height	164.96564
al-Baten, Naaila	weight	75.45603

A Better Version

name	height	weight
Deleon, Alejandro	182.7621	70.26488
Padilla, Jakeob	170.5693	68.29773
al-Baten, Naaila	164.9656	75.45603

Switching Between Wide and Long Formats

We have two functions that help us make data frames wider or longer

pivot_longer() converts from wide to long format
- id_cols: a list of columns that should remain fixed (un-pivoted) in the resulting data frame
- names_to: the name of a new column in which the original column names should be placed
- values_to: the name of a new column in which will contain the values underneath the pivoted columns in the original data frame
pivot_wider() converts from long to wide format
- names_from: the existing column whose levels will become the column names
- values_from: the existing column whose values will be placed under the new column headings

Try It!

$\bigstar$ Try using pivot_wider() and pivot_longer() to fill in the code chunks in Challenge 4: Wide- and Long-Format Data

Like last time, see if you can figure out how to use the functions from the previous slide on your own
After you’ve tried, we’ll complete each task one-by-one (together), with an opportunity for you to try on your own in between

Data Across Different Tables

Orders Data Frame:

name	ordered	quantity
el-Waheed, Husni	peach	5
Caruso, Sarah	mango	6
Wright, John	pear	5
Canchola, Bernardo	mango	4
Garcia, Kai	mango	1

Prices Data Frame:

item	price
apple	3.02
banana	1.60
pear	4.72
mango	3.83
peach	3.07

Joined Data for Analysis

name	ordered	quantity	price
el-Waheed, Husni	peach	5	3.07
Caruso, Sarah	mango	6	3.83
Wright, John	pear	5	4.72
Canchola, Bernardo	mango	4	3.83
Garcia, Kai	mango	1	3.83

How Did We Do It? We joined the data together using a left_join()

There are multiple types of join (left, right, inner, full, anti, etc.)
A left join begins with the data frame on the left (prior to the join function, if piping), and adds on information (columns) from the table on the right
Joins require a key column(s) in both tables to identify when a match exists and information should be appended to the corresponding row

The left_join() function requires:

the two data frames being joined together
the column(s) over which record-matches should be identified

Try It!

$\bigstar$ Try using left_join() to fill in the code chunk in Challenge 5: Data Across Multiple Tables to join the orders, clients, and products data frames together

As usual, you’ll try this first and then we’ll finish up together

Summary: Commonly Utilized Data Wrangling Tools

Tools for dealing with inconsistent data entry
- mutate() with ifelse(), case_when(), str_replace(), etc.
Tools for changing data types
- as.numeric(), as.integer(), as.factor(), etc.
Extracting multiple variables from a single column
- separate()
Moving between wide and long formats
- pivot_longer() and pivot_wider()
Joining data from multiple tables
- left_join(), full_join(), inner_join(), anti_join(), etc.
- Almost always left_join()

Next Time…

An Introduction to the {tidymodels} Modeling Framework

A Comforting(?) Note: The coding from here on out gets a bit easier because all of our modeling code will nearly identical from one class meeting to the next. My goal over these last two weeks was to give you some familiarity with R functionality for (i) data transformation, (ii) data visualization, and (iii) data wrangling so that you know what those tools look like and you can “figure out” (with help) how to apply them to data challenges you face.