Data Dirty Work: Wrangling

Dr. Gilbert

September 20, 2024

A Quote and Setting Expectations

“Life is dirty. So is your data. Get used to it.” – Thomas LaRock

Breakdown of Data Science work:

  • 80% data cleaning
  • 20% analyzing, modeling, and interpreting

Building models is easy

Formatting your data appropriately is difficult

Interpreting your models takes care

The Highlights

  • What is tidy data?

  • Common scenarios prompting data cleaning / data wrangling

    • Inconsistent data entry/data collection
    • Incorrect/unexpected data types inferred
    • A single column encodes multiple variables
    • Data currently in wide format when we want long format, and vice-versa
    • Features are stored across different tables

Play Along

  1. Navigate to the course webpage
  2. Download the Incomplete Quarto version of the Data Wrangling Workshop notebook
  3. Open RStudio and ensure that you are working in your MAT300 project
  4. Open the file you just downloaded and save it to your project space

Tidy Data

Tabular data is tidy if the following three principles are satisfied.

  • Each row corresponds to a single observational unit (a record).

  • Each column corresponds to a single measured variable.

  • Each cell contains the value of the corresponding variable measured on the corresponding record.

    • Note: Missing values are permitted.

Simply put, data is tidy if it is in a format convenient for analysis.

A Problematic Data Set

id date salary favorite_color
31 2/25/2018 55000 blue
79 6/26/2021 $58,500 purple
51 11/27/2018 60,000 orang
14 5/5/2018 22.75 Blue
67 4/19/2022 $1400 BLUE
42 6/27/2012 72250 orange
Rows: 6
Columns: 4
$ id             <int> 31, 79, 51, 14, 67, 42
$ date           <chr> "2/25/2018", "6/26/2021", "11/27/2018", "5/5/2018", "4/…
$ salary         <chr> "55000", "$58,500", "60,000", "22.75", "$1400", "72250"
$ favorite_color <chr> "blue", "purple", "orang", "Blue", "BLUE", "orange"

A Better Version

id month day year salary favorite_color
31 2 25 2018 55000.00 blue
79 6 26 2021 58500.00 purple
51 11 27 2018 60000.00 orang
14 5 5 2018 22.75 blue
67 4 19 2022 1400.00 blue
42 6 27 2012 72250.00 orange
Rows: 6
Columns: 6
$ id             <int> 31, 79, 51, 14, 67, 42
$ month          <dbl> 2, 6, 11, 5, 4, 6
$ day            <dbl> 25, 26, 27, 5, 19, 27
$ year           <dbl> 2018, 2021, 2018, 2018, 2022, 2012
$ salary         <dbl> 55000.00, 58500.00, 60000.00, 22.75, 1400.00, 72250.00
$ favorite_color <chr> "blue", "purple", "orang", "blue", "blue", "orange"

A Better Version

To obtain this tidier version of our data frame, we…

  1. split the date column into separate month, day, and year columns with separate()
  2. removed the $ signs and commas from the salary column with str_replace()
  3. fixed “case” discrepancies in the color column with mutate() and tolower()
  4. converted the month, day, year, and salary columns to numeric with as.numeric()
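The four steps above can be sketched in tidyverse code. The tibble below is a hypothetical re-creation of the problematic data frame from the earlier slide; note that str_replace_all() (a variant of str_replace() that replaces every match) is used so that a salary like “$58,500” loses both the dollar sign and the comma in one pass.

```r
library(tidyverse)

# Hypothetical re-creation of the messy data frame from the slide
messy <- tibble(
  id = c(31L, 79L, 51L, 14L, 67L, 42L),
  date = c("2/25/2018", "6/26/2021", "11/27/2018",
           "5/5/2018", "4/19/2022", "6/27/2012"),
  salary = c("55000", "$58,500", "60,000", "22.75", "$1400", "72250"),
  favorite_color = c("blue", "purple", "orang", "Blue", "BLUE", "orange")
)

tidy <- messy %>%
  # 1. split date into separate month, day, and year columns
  separate(date, into = c("month", "day", "year"), sep = "/") %>%
  # 2. strip $ signs and commas from salary
  mutate(salary = str_replace_all(salary, "[$,]", "")) %>%
  # 3. fix case discrepancies in favorite_color
  mutate(favorite_color = tolower(favorite_color)) %>%
  # 4. convert the character columns to numeric
  mutate(across(c(month, day, year, salary), as.numeric))
```

After these steps, salary is a numeric column (e.g. the “$58,500” entry becomes 58500) and the color labels share a single case, though the “orang” typo still needs attention.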

Questions: Problems with this small data frame still remain. What are they?

Try It!


\(\bigstar\) Try solving Challenges 1 - 3 in the Data Wrangling notebook you just downloaded

  • See if you can figure out how to use the functions from the previous slide on your own
  • After you’ve tried, we’ll solve one challenge at a time together, with an opportunity for you to try on your own in between

A Dataset That Is Too Wide

name 2019 2020 2021 2022 2023
Porter, Andrew 2.9 8.6 7.3 6.4 5.2
Galvan, Brandon 6.0 8.3 6.0 7.3 3.8
Estrada, Nicole 4.4 10.2 9.4 7.0 5.6
Casias, Adelita 3.0 12.1 7.1 8.9 6.6
el-Qasim, Mudrika 7.0 11.4 6.8 5.4 5.7
al-Ismail, Kaatima 2.6 9.8 8.8 6.9 4.3

A Better Version

name year value
Porter, Andrew 2019 2.9
Porter, Andrew 2020 8.6
Porter, Andrew 2021 7.3
Porter, Andrew 2022 6.4
Porter, Andrew 2023 5.2
Galvan, Brandon 2019 6.0
Galvan, Brandon 2020 8.3
Galvan, Brandon 2021 6.0
Galvan, Brandon 2022 7.3
Galvan, Brandon 2023 3.8
Estrada, Nicole 2019 4.4
Estrada, Nicole 2020 10.2
Estrada, Nicole 2021 9.4
Estrada, Nicole 2022 7.0
Estrada, Nicole 2023 5.6
Casias, Adelita 2019 3.0
Casias, Adelita 2020 12.1
Casias, Adelita 2021 7.1
Casias, Adelita 2022 8.9
Casias, Adelita 2023 6.6
el-Qasim, Mudrika 2019 7.0
el-Qasim, Mudrika 2020 11.4
el-Qasim, Mudrika 2021 6.8
el-Qasim, Mudrika 2022 5.4
el-Qasim, Mudrika 2023 5.7
al-Ismail, Kaatima 2019 2.6
al-Ismail, Kaatima 2020 9.8
al-Ismail, Kaatima 2021 8.8
al-Ismail, Kaatima 2022 6.9
al-Ismail, Kaatima 2023 4.3

A Dataset That Is Too Long

name measurement value
Deleon, Alejandro height 182.76206
Deleon, Alejandro weight 70.26488
Padilla, Jakeob height 170.56931
Padilla, Jakeob weight 68.29773
al-Baten, Naaila height 164.96564
al-Baten, Naaila weight 75.45603

A Better Version

name height weight
Deleon, Alejandro 182.7621 70.26488
Padilla, Jakeob 170.5693 68.29773
al-Baten, Naaila 164.9656 75.45603

Switching Between Wide and Long Formats

We have two functions that help us make data frames wider or longer

  1. pivot_longer() converts from wide to long format

    • cols: the columns to pivot; any columns not listed remain fixed (un-pivoted) in the resulting data frame
    • names_to: the name of a new column into which the original column names are placed
    • values_to: the name of a new column that will contain the values stored under the pivoted columns in the original data frame
  2. pivot_wider() converts from long to wide format

    • names_from: the existing column whose levels will become the column names
    • values_from: the existing column whose values will be placed under the new column headings
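The two pivots above are inverses of one another. The sketch below uses a hypothetical two-person, two-year slice of the wide salary-style data frame from the earlier slide to show a round trip: wide to long with pivot_longer(), then back with pivot_wider().

```r
library(tidyverse)

# Hypothetical slice of the wide data frame from the slide
wide <- tibble(
  name = c("Porter, Andrew", "Galvan, Brandon"),
  `2019` = c(2.9, 6.0),
  `2020` = c(8.6, 8.3)
)

# Wide -> long: pivot every column except name
long <- wide %>%
  pivot_longer(cols = -name, names_to = "year", values_to = "value")

# Long -> wide: column names come from year, cell values from value
wide_again <- long %>%
  pivot_wider(names_from = year, values_from = value)
```

Each (name, year) pair in the wide table becomes its own row in long, so two people across two years yield four rows; wide_again matches the original wide layout.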

Try It!


\(\bigstar\) Try using pivot_wider() and pivot_longer() to fill in the code chunks in Challenge 4: Wide- and Long-Format Data

  • Like last time, see if you can figure out how to use the functions from the previous slide on your own
  • After you’ve tried, we’ll complete each task one-by-one (together), with an opportunity for you to try on your own in between

Data Across Different Tables

Orders Data Frame:

name ordered quantity
el-Waheed, Husni peach 5
Caruso, Sarah mango 6
Wright, John pear 5
Canchola, Bernardo mango 4
Garcia, Kai mango 1

Prices Data Frame:

item price
apple 3.02
banana 1.60
pear 4.72
mango 3.83
peach 3.07

Joined Data for Analysis

name ordered quantity price
el-Waheed, Husni peach 5 3.07
Caruso, Sarah mango 6 3.83
Wright, John pear 5 4.72
Canchola, Bernardo mango 4 3.83
Garcia, Kai mango 1 3.83

How Did We Do It? We joined the data together using a left_join()

  • There are multiple types of join (left, right, inner, full, anti, etc.)
  • A left join begins with the data frame on the left (prior to the join function, if piping), and adds on information (columns) from the table on the right
  • Joins require a key column(s) in both tables to identify when a match exists and information should be appended to the corresponding row

The left_join() function requires:

  • the two data frames being joined together
  • the column(s) over which record-matches should be identified
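A sketch of the join from the slides, using hypothetical re-creations of (parts of) the orders and prices data frames. Since the key column is named ordered in one table and item in the other, the by argument maps one name onto the other.

```r
library(tidyverse)

# Hypothetical slices of the two data frames from the slides
orders <- tibble(
  name = c("el-Waheed, Husni", "Caruso, Sarah", "Wright, John"),
  ordered = c("peach", "mango", "pear"),
  quantity = c(5, 6, 5)
)

prices <- tibble(
  item = c("apple", "banana", "pear", "mango", "peach"),
  price = c(3.02, 1.60, 4.72, 3.83, 3.07)
)

# Left join: keep every row of orders, pull in the matching price
joined <- orders %>%
  left_join(prices, by = c("ordered" = "item"))
```

Every row of orders survives the join; fruits in prices that nobody ordered (apple, banana) contribute no rows, and an ordered fruit missing from prices would get an NA price rather than being dropped.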

Try It!


\(\bigstar\) Try using left_join() to fill in the code chunk in Challenge 5: Data Across Multiple Tables to join the orders, clients, and products data frames together


As usual, you’ll try this first and then we’ll finish up together

Summary: Commonly Utilized Data Wrangling Tools

  • Tools for dealing with inconsistent data entry

    • mutate() with ifelse(), case_when(), str_replace(), etc.
  • Tools for changing data types

    • as.numeric(), as.integer(), as.factor(), etc.
  • Extracting multiple variables from a single column

    • separate()
  • Moving between wide and long formats

    • pivot_longer() and pivot_wider()
  • Joining data from multiple tables

    • left_join(), full_join(), inner_join(), anti_join(), etc.
    • Almost always left_join()

Next Time…


An Introduction to the {tidymodels} Modeling Framework

A Comforting(?) Note: The coding from here on out gets a bit easier because all of our modeling code will be nearly identical from one class meeting to the next. My goal over these last two weeks was to give you some familiarity with R functionality for (i) data transformation, (ii) data visualization, and (iii) data wrangling so that you know what those tools look like and you can “figure out” (with help) how to apply them to data challenges you face.