Enter the Tidyverse: An Introduction to Tidy Data Analysis in R

Objectives

This notebook addresses the following items.

How do I install and load packages in R? In particular we’ll work with the tidyverse.
How do I read data into R from both local and remote sources?
How do I interact with, and manipulate, data using the tools and principles of the tidyverse?

Installing and Loading Packages

We can install R packages using the command install.packages("PACKAGE_NAME"). Once packages are installed, we can load them into an R Session by running library(PACKAGE_NAME). While packages only need to be installed once, they must be loaded in each R Session you intend to use them in (note: an R Session begins when R/RStudio are opened and ends when they are closed or terminated). We can install and load the tidyverse by running the code below:

install.packages("tidyverse")
library(tidyverse)

Open RStudio and run these commands in the Console pane (left/lower-left). We’ll be using the tidytext package in the summer 2021 workshop – install this package as well. You can also load it if you want, but we won’t use it here.

Loading Data

Now that you have the tidyverse loaded, the next thing we’ll need is actual data to manipulate. The tidyverse comes with a few standard data sets for practicing with, but we’ll be much more interested in working with our own data which we’ll either find locally (stored on your own computer) or remotely (accessed via a web URL). The tidyverse includes several functions for reading data in a variety of formats:

read_csv("PATH_TO_FILE") can be used to read data from a comma separated values (csv) file.
read_delim("PATH_TO_FILE", delim = "DELIMITER") is a more general version of the read_csv() function – we can use this to read text files whose delimiter is something other than a comma. Common delimiters are the tab (\t) or space (\s).
read_excel("PATH_TO_FILE", sheet = "SHEET_NAME") can be used to read data from a particular sheet within an xls or xlsx file.

The following examples show how we can read a variety of files into an R Session.

#Read the MAT241 sheet from the grades.xls file in 
#the Spring 2021 folder on my computer's desktop
grades <- read_excel("C:/Users/agilb/Desktop/Spring 2021/grades.xls", sheet = "MAT241")

#Read in data from a csv file of Tate Gallery Artists housed 
#in a public github repository on the web
tate_artists <- read_csv("https://github.com/tategallery/collection/raw/master/artist_data.csv")

#Read in data from a csv file of Tate Gallery Artworks housed
#in a public github repository on the web
#*Note* that read_csv() would have worked just fine here too
tate_works <- read_delim("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-12/artwork.csv", delim = ",")

Viewing Data

Now that we’ve got data, the first thing we should do is look at it. There are a few really handy R functions for getting a feel for the data you have access to. The View(), head(), tail(), and glimpse() functions are four that are really commonly used. For the remainder of this notebook we’ll use a data frame called mpg which is built into the tidyverse.

Running View(mpg) will open a file viewer which allows you to navigate the data frame in a familiar spreadsheet format.

Using head(mpg) and tail(mpg) give us a convenient method for looking at the first six and last six rows of a data frame, respectively. This is typically enough to give us an idea of the type of data we are working with. Running both of these functions can also make us aware of potential inconsistencies in data collection.

head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

tail(mpg)

## # A tibble: 6 x 11
##   manufacturer model  displ  year   cyl trans     drv     cty   hwy fl    class 
##   <chr>        <chr>  <dbl> <int> <int> <chr>     <chr> <int> <int> <chr> <chr> 
## 1 volkswagen   passat   1.8  1999     4 auto(l5)  f        18    29 p     midsi~
## 2 volkswagen   passat   2    2008     4 auto(s6)  f        19    28 p     midsi~
## 3 volkswagen   passat   2    2008     4 manual(m~ f        21    29 p     midsi~
## 4 volkswagen   passat   2.8  1999     6 auto(l5)  f        16    26 p     midsi~
## 5 volkswagen   passat   2.8  1999     6 manual(m~ f        18    26 p     midsi~
## 6 volkswagen   passat   3.6  2008     6 auto(s6)  f        17    26 p     midsi~

Running glimpse(mpg) provides us with a bit more technical information about how R is interpreting the columns of the mpg data frame. Knowing how R is interpreting our variables (columns) is important because certain operations are possible with numerical data but are not possible with categorical data, and vice-versa. Common data types in R are chr/fct (categorical data) and num/dbl/int (numerical data).

glimpse(mpg)

## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "~
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "~
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.~
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200~
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, ~
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto~
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4~
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1~
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2~
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p~
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c~

Manipulating Data

Now that we know how to load and view our data, let’s talk about manipulating it. We can restrict the data we are working with, produce summaries of the data, transform the data, and more.

Pipes `%>%`

Pipes are a functionality that is included in a package that is part of tidyverse library. At first, the syntax may seem a bit strange, but pipes allow you to easily manipulate data without having to rename and save the dataset along the way. I strongly encourage you get used to working with pipes! In the previous section we saw how to use R’s head() function to look at the first six rows of the dataset. Here’s how to achieve the same outcome with the use of the pipe (%>%) operator.

mpg %>%
  head()

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~

You can read the code above as saying “take the mpg dataset, and plug it into the head() function”. Putting head() indented on a new line is not necessary for the code to work, but it does make the code easier to read. This new method of asking for the head() of the dataset may seem silly and inefficient, but the real magic of the pipe is that it allows us to chain operations together in a way that mimics the way humans think about instructions. We’ll see this in action as we get exposure to more data manipulation tools below.

Restricting Data

The most common methods for restricting data deal with filtering out rows or columns so that we are only working with a subset of our original data set.

Filtering Rows (`filter()`)

Sometimes we are not interested in all of the observations in a particular dataset, but only those satisfying certain criteria. For example, maybe we only want to see vehicles falling into the class of subcompact cars. The filter() function will allow us to get rid of all other classes of vehicle.

mpg %>% 
  filter(class == "subcompact") %>%
  head()

## # A tibble: 6 x 11
##   manufacturer model  displ  year   cyl trans    drv     cty   hwy fl    class  
##   <chr>        <chr>  <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>  
## 1 ford         musta~   3.8  1999     6 manual(~ r        18    26 r     subcom~
## 2 ford         musta~   3.8  1999     6 auto(l4) r        18    25 r     subcom~
## 3 ford         musta~   4    2008     6 manual(~ r        17    26 r     subcom~
## 4 ford         musta~   4    2008     6 auto(l5) r        16    24 r     subcom~
## 5 ford         musta~   4.6  1999     8 auto(l4) r        15    21 r     subcom~
## 6 ford         musta~   4.6  1999     8 manual(~ r        15    22 r     subcom~

We can also use more complex conditions on which rows to see using and (&) and or (|) statements. Maybe we want to see only those vehicles in the made by subaru or getting at least a 35 highway mile per gallon rating (hwy).

mpg %>% 
  filter(manufacturer == "subaru" | hwy >= 35) %>%
  head()

## # A tibble: 6 x 11
##   manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class 
##   <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr> 
## 1 honda        civic      1.8  2008     4 auto(l~ f        25    36 r     subco~
## 2 honda        civic      1.8  2008     4 auto(l~ f        24    36 c     subco~
## 3 subaru       foreste~   2.5  1999     4 manual~ 4        18    25 r     suv   
## 4 subaru       foreste~   2.5  1999     4 auto(l~ 4        18    24 r     suv   
## 5 subaru       foreste~   2.5  2008     4 manual~ 4        20    27 r     suv   
## 6 subaru       foreste~   2.5  2008     4 manual~ 4        19    25 p     suv

Selecting Columns (`select()`)

Similarly to the way we can filter rows, we can select only those columns we are interested in. We can pass the names of the columns we are interested in to R’s select() function so that we only see those selected columns returned.

mpg %>%
  select(manufacturer, model, year, cty, hwy, class) %>%
  head()

## # A tibble: 6 x 6
##   manufacturer model  year   cty   hwy class  
##   <chr>        <chr> <int> <int> <int> <chr>  
## 1 audi         a4     1999    18    29 compact
## 2 audi         a4     1999    21    29 compact
## 3 audi         a4     2008    20    31 compact
## 4 audi         a4     2008    21    30 compact
## 5 audi         a4     1999    16    26 compact
## 6 audi         a4     1999    18    26 compact

We can also select all columns except certain ones by preceding the column name with a -.

mpg %>%
  select(-displ,-cyl) %>%
  head()

## # A tibble: 6 x 9
##   manufacturer model  year trans      drv     cty   hwy fl    class  
##   <chr>        <chr> <int> <chr>      <chr> <int> <int> <chr> <chr>  
## 1 audi         a4     1999 auto(l5)   f        18    29 p     compact
## 2 audi         a4     1999 manual(m5) f        21    29 p     compact
## 3 audi         a4     2008 manual(m6) f        20    31 p     compact
## 4 audi         a4     2008 auto(av)   f        21    30 p     compact
## 5 audi         a4     1999 auto(l5)   f        16    26 p     compact
## 6 audi         a4     1999 manual(m5) f        18    26 p     compact

The select() function is also useful for changing the order of the columns.

mpg %>%
  select(cty, hwy, manufacturer) %>%
  head()

## # A tibble: 6 x 3
##     cty   hwy manufacturer
##   <int> <int> <chr>       
## 1    18    29 audi        
## 2    21    29 audi        
## 3    20    31 audi        
## 4    21    30 audi        
## 5    16    26 audi        
## 6    18    26 audi

We can combine filter() and select() through the pipe as well. For any pipe, the result of the “upstream” code (the code before the pipe) is passed into the function that follows the pipe.

mpg %>%
  filter(year >= 2005) %>%
  select(manufacturer, model, year, cty, hwy, class) %>%
  head()

## # A tibble: 6 x 6
##   manufacturer model       year   cty   hwy class  
##   <chr>        <chr>      <int> <int> <int> <chr>  
## 1 audi         a4          2008    20    31 compact
## 2 audi         a4          2008    21    30 compact
## 3 audi         a4          2008    18    27 compact
## 4 audi         a4 quattro  2008    20    28 compact
## 5 audi         a4 quattro  2008    19    27 compact
## 6 audi         a4 quattro  2008    17    25 compact

A Note on Pipes: The advantage to the pipe operator is probably pretty clear by now. The code we just wrote says take the mpg data set, filter it so that we only see cars manufactured since 2005, show me only the few columns I am interested in, and just let me see the first six rows for now. The alternative to this would be writing code that looks a lot less readable:

head(select(filter(mpg, year >= 2005), manufacturer, model, year, cty, hwy, class))

Summarizing Data

There are lots of ways we can summarize our data. We can provide simple counts, compute averages, even build out our own summary functions.

Computing Counts

We can start with a simple question like, how many cars from each manufacturer are contained in this dataset? To answer this, we simply pipe the mpg data frame into the count() function, identifying the manufacturer column as the column we wish to count.

mpg %>%
  count(manufacturer) %>%
  head()

## # A tibble: 6 x 2
##   manufacturer     n
##   <chr>        <int>
## 1 audi            18
## 2 chevrolet       19
## 3 dodge           37
## 4 ford            25
## 5 honda            9
## 6 hyundai         14

The counts are displayed in alphabetical order by manufacturer. We might be interested in the most well-represented manufacturers. We’ll do this with arrange() – we can pass this function the argument desc(n) to say that we want to arrange by our new count column in descending order, and let’s ask for the top 10 rows instead of the top 6.

mpg %>%
  count(manufacturer) %>%
  arrange(desc(n)) %>%
  head(n = 10)

## # A tibble: 10 x 2
##    manufacturer     n
##    <chr>        <int>
##  1 dodge           37
##  2 toyota          34
##  3 volkswagen      27
##  4 ford            25
##  5 chevrolet       19
##  6 audi            18
##  7 hyundai         14
##  8 subaru          14
##  9 nissan          13
## 10 honda            9

Let’s say we wanted to know how many different models of car each manufacturer has released since the year 2000. This is a more complicated question. We would first need to filter the data so that we are only considering cars manufactured since the year 2000. Then we would subset to include only the manufacturer and model columns. There are lots of duplicates here, so we would want to remove them with a function called distinct(), and then finally we could count occurrences within each manufacturer

mpg %>%
  filter(year >= 2000) %>%
  select(manufacturer, model) %>%
  distinct() %>%
  count(manufacturer) %>%
  arrange(desc(n)) %>%
  head()

## # A tibble: 6 x 2
##   manufacturer     n
##   <chr>        <int>
## 1 toyota           6
## 2 chevrolet        4
## 3 dodge            4
## 4 ford             4
## 5 volkswagen       4
## 6 audi             3

Summarizing Numerical Data

Summarizing categorical data is most often done with counts, but we’ve got many more choices when we are working with numerical data. We have several measures of center or spread that we could choose from – we could even define our own metrics. Let’s say we wanted to know the median highway mile per gallon rating across all vehicles in our dataset. We’ll need the help of R’s summarize() function as well as the median() function for this.

mpg %>% 
  summarize(median_hwy = median(hwy))

## # A tibble: 1 x 1
##   median_hwy
##        <dbl>
## 1         24

With the use of summarize() we can get multiple summaries at once. Let’s compute the mean and standard deviation for both the highway and city mile per gallon ratings across all of the vehicles in our data set.

mpg %>% 
  summarize(mean_hwy = mean(hwy), std_deviation_hwy = sd(hwy), mean_cty = mean(cty), std_deviation_cty = sd(cty))

## # A tibble: 1 x 4
##   mean_hwy std_deviation_hwy mean_cty std_deviation_cty
##      <dbl>             <dbl>    <dbl>             <dbl>
## 1     23.4              5.95     16.9              4.26

It might be useful if we could get grouped summary statistics. Let’s use group_by() to see how these measures vary across the different vehicle classes.

mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy), std_deviation_hwy = sd(hwy), mean_cty = mean(cty), std_deviation_cty = sd(cty))

## # A tibble: 7 x 5
##   class      mean_hwy std_deviation_hwy mean_cty std_deviation_cty
##   <chr>         <dbl>             <dbl>    <dbl>             <dbl>
## 1 2seater        24.8              1.30     15.4             0.548
## 2 compact        28.3              3.78     20.1             3.39 
## 3 midsize        27.3              2.14     18.8             1.95 
## 4 minivan        22.4              2.06     15.8             1.83 
## 5 pickup         16.9              2.27     13               2.05 
## 6 subcompact     28.1              5.38     20.4             4.60 
## 7 suv            18.1              2.98     13.5             2.42

Let’s arrange the result here by mean highway mile per gallon rating in the default ascending order.

mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy), std_deviation_hwy = sd(hwy), mean_cty = mean(cty), std_deviation_cty = sd(cty)) %>%
  arrange(mean_hwy)

## # A tibble: 7 x 5
##   class      mean_hwy std_deviation_hwy mean_cty std_deviation_cty
##   <chr>         <dbl>             <dbl>    <dbl>             <dbl>
## 1 pickup         16.9              2.27     13               2.05 
## 2 suv            18.1              2.98     13.5             2.42 
## 3 minivan        22.4              2.06     15.8             1.83 
## 4 2seater        24.8              1.30     15.4             0.548
## 5 midsize        27.3              2.14     18.8             1.95 
## 6 subcompact     28.1              5.38     20.4             4.60 
## 7 compact        28.3              3.78     20.1             3.39

That’s pretty informative although not totally surprising. Subcompact cars seem to have a high level of variation in their mpg ratings though!

Transforming Data

Often, you may be in a situation where you would like to create new columns, using the existing columns. This can be done using the mutate() command. The syntax is

dataset %>%
  mutate(new_column_name = function_of_old_columns)

In the mpg dataset, let’s add a column which is the ratio between the city cty and highway hwy gas milages, and use the arrange() function to find cars with the highest city to highway gas milages:

mpg %>%
  mutate(mpg_ratio = cty/hwy) %>%
  select(manufacturer,model,cty,hwy,mpg_ratio) %>%
  arrange(desc(mpg_ratio))

## # A tibble: 234 x 5
##    manufacturer model               cty   hwy mpg_ratio
##    <chr>        <chr>             <int> <int>     <dbl>
##  1 nissan       pathfinder 4wd       15    17     0.882
##  2 toyota       4runner 4wd          15    17     0.882
##  3 toyota       toyota tacoma 4wd    15    17     0.882
##  4 honda        civic                28    33     0.848
##  5 toyota       toyota tacoma 4wd    15    18     0.833
##  6 chevrolet    k1500 tahoe 4wd      14    17     0.824
##  7 dodge        dakota pickup 4wd    14    17     0.824
##  8 ford         explorer 4wd         14    17     0.824
##  9 ford         explorer 4wd         14    17     0.824
## 10 ford         f150 pickup 4wd      14    17     0.824
## # ... with 224 more rows

Once pretty common step in an analysis is to create a categorical column from a variable which was originally numeric. In order to do this we can use the if_else() function. The three arguments of if_else() are a condition, and the values you want to fill if the condition is true or false, respectively.

mpg %>%
  mutate(pre_2000 = if_else(year < 2000, "yes", "no")) %>%
  select(manufacturer,model,year,pre_2000)

## # A tibble: 234 x 4
##    manufacturer model       year pre_2000
##    <chr>        <chr>      <int> <chr>   
##  1 audi         a4          1999 yes     
##  2 audi         a4          1999 yes     
##  3 audi         a4          2008 no      
##  4 audi         a4          2008 no      
##  5 audi         a4          1999 yes     
##  6 audi         a4          1999 yes     
##  7 audi         a4          2008 no      
##  8 audi         a4 quattro  1999 yes     
##  9 audi         a4 quattro  1999 yes     
## 10 audi         a4 quattro  2008 no      
## # ... with 224 more rows

Final Thoughts

There is a lot more to learn about data manipulation and R in general. Sticking to the tidyverse and the other package groups within the tidy-ecosystem (ie. tidytext, tidymodels, etc.) will be beneficial because they are all built on common syntax and programmatic principles. You can read more about this in the TidyTools Manifesto.

You won’t be an expert after working through this document, but it should provide you with a solid start. Please feel free to add your own notes to this markdown file as we encounter more advanced functionality.

Previous, Installing and Accessing R Next, Data Visualization