September 20, 2024
“Life is dirty. So is your data. Get used to it.” – Thomas LaRock
Breakdown of Data Science work:
Building models is easy
Formatting your data appropriately is difficult
Interpreting your models takes care
What is tidy data?
Common scenarios prompting data cleaning / data wrangling
MAT300
projectTabular data is tidy if the following three principles are satisfied.
Each row corresponds to a single observational unit (a record).
Each column corresponds to a single measured variable.
Each cell contains the value of the corresponding variable measured on the corresponding record.
Simply put, data is tidy if it is in a format convenient for analysis.
id | date | salary | favorite_color |
---|---|---|---|
31 | 2/25/2018 | 55000 | blue |
79 | 6/26/2021 | $58,500 | purple |
51 | 11/27/2018 | 60,000 | orang |
14 | 5/5/2018 | 22.75 | Blue |
67 | 4/19/2022 | $1400 | BLUE |
42 | 6/27/2012 | 72250 | orange |
Rows: 6
Columns: 4
$ id <int> 31, 79, 51, 14, 67, 42
$ date <chr> "2/25/2018", "6/26/2021", "11/27/2018", "5/5/2018", "4/…
$ salary <chr> "55000", "$58,500", "60,000", "22.75", "$1400", "72250"
$ favorite_color <chr> "blue", "purple", "orang", "Blue", "BLUE", "orange"
id | month | day | year | salary | favorite_color |
---|---|---|---|---|---|
31 | 2 | 25 | 2018 | 55000.00 | blue |
79 | 6 | 26 | 2021 | 58500.00 | purple |
51 | 11 | 27 | 2018 | 60000.00 | orang |
14 | 5 | 5 | 2018 | 22.75 | blue |
67 | 4 | 19 | 2022 | 1400.00 | blue |
42 | 6 | 27 | 2012 | 72250.00 | orange |
Rows: 6
Columns: 6
$ id <int> 31, 79, 51, 14, 67, 42
$ month <dbl> 2, 6, 11, 5, 4, 6
$ day <dbl> 25, 26, 27, 5, 19, 27
$ year <dbl> 2018, 2021, 2018, 2018, 2022, 2012
$ salary <dbl> 55000.00, 58500.00, 60000.00, 22.75, 1400.00, 72250.00
$ favorite_color <chr> "blue", "purple", "orang", "blue", "blue", "orange"
To obtain this tidier version of our data frame, we…
month
, day
, and year
columns with separate()
salary
column with str_replace()
mutate()
and tolower()
month
, day
, year
, and salary
columns to numerics with as.numeric()
Questions: Problems with this small data frame still remain. What are they?
\(\bigstar\) Try solving Challenges 1 - 3 in the Data Wrangling notebook you just downloaded
name | 2019 | 2020 | 2021 | 2022 | 2023 |
---|---|---|---|---|---|
Porter, Andrew | 2.9 | 8.6 | 7.3 | 6.4 | 5.2 |
Galvan, Brandon | 6.0 | 8.3 | 6.0 | 7.3 | 3.8 |
Estrada, Nicole | 4.4 | 10.2 | 9.4 | 7.0 | 5.6 |
Casias, Adelita | 3.0 | 12.1 | 7.1 | 8.9 | 6.6 |
el-Qasim, Mudrika | 7.0 | 11.4 | 6.8 | 5.4 | 5.7 |
al-Ismail, Kaatima | 2.6 | 9.8 | 8.8 | 6.9 | 4.3 |
name | year | value |
---|---|---|
Porter, Andrew | 2019 | 2.9 |
Porter, Andrew | 2020 | 8.6 |
Porter, Andrew | 2021 | 7.3 |
Porter, Andrew | 2022 | 6.4 |
Porter, Andrew | 2023 | 5.2 |
Galvan, Brandon | 2019 | 6.0 |
Galvan, Brandon | 2020 | 8.3 |
Galvan, Brandon | 2021 | 6.0 |
Galvan, Brandon | 2022 | 7.3 |
Galvan, Brandon | 2023 | 3.8 |
Estrada, Nicole | 2019 | 4.4 |
Estrada, Nicole | 2020 | 10.2 |
Estrada, Nicole | 2021 | 9.4 |
Estrada, Nicole | 2022 | 7.0 |
Estrada, Nicole | 2023 | 5.6 |
Casias, Adelita | 2019 | 3.0 |
Casias, Adelita | 2020 | 12.1 |
Casias, Adelita | 2021 | 7.1 |
Casias, Adelita | 2022 | 8.9 |
Casias, Adelita | 2023 | 6.6 |
el-Qasim, Mudrika | 2019 | 7.0 |
el-Qasim, Mudrika | 2020 | 11.4 |
el-Qasim, Mudrika | 2021 | 6.8 |
el-Qasim, Mudrika | 2022 | 5.4 |
el-Qasim, Mudrika | 2023 | 5.7 |
al-Ismail, Kaatima | 2019 | 2.6 |
al-Ismail, Kaatima | 2020 | 9.8 |
al-Ismail, Kaatima | 2021 | 8.8 |
al-Ismail, Kaatima | 2022 | 6.9 |
al-Ismail, Kaatima | 2023 | 4.3 |
name | measurement | value |
---|---|---|
Deleon, Alejandro | height | 182.76206 |
Deleon, Alejandro | weight | 70.26488 |
Padilla, Jakeob | height | 170.56931 |
Padilla, Jakeob | weight | 68.29773 |
al-Baten, Naaila | height | 164.96564 |
al-Baten, Naaila | weight | 75.45603 |
name | height | weight |
---|---|---|
Deleon, Alejandro | 182.7621 | 70.26488 |
Padilla, Jakeob | 170.5693 | 68.29773 |
al-Baten, Naaila | 164.9656 | 75.45603 |
We have two functions that help us make data frames wider or longer
pivot_longer()
converts from wide to long format
id_cols
: a list of columns that should remain fixed (un-pivoted) in the resulting data framenames_to
: the name of a new column in which the original column names should be placedvalues_to
: the name of a new column in which will contain the values underneath the pivoted columns in the original data framepivot_wider()
converts from long to wide format
names_from
: the existing column whose levels will become the column namesvalues_from
: the existing column whose values will be placed under the new column headings
\(\bigstar\) Try using pivot_wider()
and pivot_longer()
to fill in the code chunks in Challenge 4: Wide- and Long-Format Data
Orders Data Frame:
name | ordered | quantity |
---|---|---|
el-Waheed, Husni | peach | 5 |
Caruso, Sarah | mango | 6 |
Wright, John | pear | 5 |
Canchola, Bernardo | mango | 4 |
Garcia, Kai | mango | 1 |
Prices Data Frame:
item | price |
---|---|
apple | 3.02 |
banana | 1.60 |
pear | 4.72 |
mango | 3.83 |
peach | 3.07 |
name | ordered | quantity | price |
---|---|---|---|
el-Waheed, Husni | peach | 5 | 3.07 |
Caruso, Sarah | mango | 6 | 3.83 |
Wright, John | pear | 5 | 4.72 |
Canchola, Bernardo | mango | 4 | 3.83 |
Garcia, Kai | mango | 1 | 3.83 |
How Did We Do It? We joined the data together using a left_join()
The left_join()
function requires:
\(\bigstar\) Try using left_join()
to fill in the code chunk in Challenge 5: Data Across Multiple Tables to join the orders
, clients
, and products
data frames together
As usual, you’ll try this first and then we’ll finish up together
Tools for dealing with inconsistent data entry
mutate()
with ifelse()
, case_when()
, str_replace()
, etc.Tools for changing data types
as.numeric()
, as.integer()
, as.factor()
, etc.Extracting multiple variables from a single column
separate()
Moving between wide and long formats
pivot_longer()
and pivot_wider()
Joining data from multiple tables
left_join()
, full_join()
, inner_join()
, anti_join()
, etc.left_join()
An Introduction to the {tidymodels}
Modeling Framework
A Comforting(?) Note: The coding from here on out gets a bit easier because all of our modeling code will nearly identical from one class meeting to the next. My goal over these last two weeks was to give you some familiarity with R functionality for (i) data transformation, (ii) data visualization, and (iii) data wrangling so that you know what those tools look like and you can “figure out” (with help) how to apply them to data challenges you face.