Enter the Tidyverse: An Introduction to Tidy Data Analysis in R

Published

August 6, 2024

Objectives

This notebook addresses the following items.

How do I install and load packages in R? In particular we’ll work with the tidyverse.
How do I read data into R from both local and remote sources?
How do I interact with, and manipulate, data using the tools and principles of the tidyverse?

Installing and Loading Packages

We can install R packages using the command install.packages("PACKAGE_NAME"). Once packages are installed, we can load them into an R Session by running library(PACKAGE_NAME). While packages only need to be installed once, they must be loaded in each R Session you intend to use them in (note: an R Session begins when R/RStudio are opened and ends when they are closed or terminated). We can install and load the tidyverse by running the code below:

install.packages("tidyverse")
library(tidyverse)

Open RStudio and run these commands in the Console pane (left/lower-left). We’ll be using the kableExtra and tidymodels “packages” in our course – install both of these packages as well. Load the kableExtra package since we’ll be using it here.

Loading Data

Now that you have the tidyverse loaded, the next thing we’ll need is actual data to manipulate. The tidyverse comes with a few standard data sets for practicing with, but we’ll be much more interested in working with our own data which we’ll either find locally (stored on your own computer) or remotely (accessed via a web URL). The tidyverse includes several functions for reading data in a variety of formats:

read_csv("PATH_TO_FILE") can be used to read data from a comma separated values (csv) file.
read_delim("PATH_TO_FILE", delim = "DELIMITER") is a more general version of the read_csv() function – we can use this to read text files whose delimiter is something other than a comma. Common delimiters are the tab (\t) or space (\s).
read_excel("PATH_TO_FILE", sheet = "SHEET_NAME") can be used to read data from a particular sheet within an xls or xlsx file.

The following examples show how we can read a variety of files into an R Session.

#Read the MAT241 sheet from the grades.xls file in 
#the Spring 2021 folder on my computer's desktop
grades <- read_excel("C:/Users/agilb/Desktop/Spring 2021/grades.xls", sheet = "MAT241")

#Read in data from a csv file of Tate Gallery Artists housed 
#in a public github repository on the web
tate_artists <- read_csv("https://github.com/tategallery/collection/raw/master/artist_data.csv")

#Read in data from a csv file of Tate Gallery Artworks housed
#in a public github repository on the web
#*Note* that read_csv() would have worked just fine here too
tate_works <- read_delim("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-12/artwork.csv", delim = ",")

Viewing Data

Now that we’ve got data, the first thing we should do is look at it. There are a few really handy R functions for getting a feel for the data you have access to. The View(), head(), tail(), and glimpse() functions are four that are really commonly used. For the remainder of this notebook we’ll use a data frame called mpg which is built into the tidyverse.

Running View(mpg) will open a file viewer which allows you to navigate the data frame in a familiar spreadsheet format.
- Try to avoid this – especially with large data frames.

Using head(mpg) and tail(mpg) give us a convenient method for looking at the first six and last six rows of a data frame, respectively. This is typically enough to give us an idea of the type of data we are working with. Running both of these functions can also make us aware of potential inconsistencies in data collection.

head(mpg) %>%
  kable() %>%
  kable_styling()

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact

tail(mpg) %>%
  kable() %>%
  kable_styling()

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
volkswagen	passat	1.8	1999	4	auto(l5)	f	18	29	p	midsize
volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

Note that the kable() %>% kable_styling() commands are used to produce visually appealing tables in our html output – they don’t actually do anything to transform our data. You are encouraged (though not required) to use these lines when you want to print out tabular output. You can see what the output looks like without using kableExtra below. I’ll continue to utilize kableExtra throughout our course.

head(mpg)

# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

tail(mpg)

# A tibble: 6 × 11
  manufacturer model  displ  year   cyl trans      drv     cty   hwy fl    class
  <chr>        <chr>  <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
1 volkswagen   passat   1.8  1999     4 auto(l5)   f        18    29 p     mids…
2 volkswagen   passat   2    2008     4 auto(s6)   f        19    28 p     mids…
3 volkswagen   passat   2    2008     4 manual(m6) f        21    29 p     mids…
4 volkswagen   passat   2.8  1999     6 auto(l5)   f        16    26 p     mids…
5 volkswagen   passat   2.8  1999     6 manual(m5) f        18    26 p     mids…
6 volkswagen   passat   3.6  2008     6 auto(s6)   f        17    26 p     mids…

Running glimpse(mpg) provides us with a bit more technical information about how R is interpreting the columns of the mpg data frame. Knowing how R is interpreting our variables (columns) is important because certain operations are possible with numerical data but are not possible with categorical data, and vice-versa. Common data types in R are chr/fct (categorical data) and num/dbl/int (numerical data).

glimpse(mpg)

Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Manipulating Data

Now that we know how to load and view our data, let’s talk about manipulating it. We can restrict the data we are working with, produce summaries of the data, transform the data, and more.

Pipes `%>%`

Pipes are a functionality that is included in a package that is part of tidyverse library. At first, the syntax may seem a bit strange, but pipes allow you to easily manipulate data without having to rename and save the dataset along the way. I strongly encourage you get used to working with pipes! In the previous section we saw how to use R’s head() function to look at the first six rows of the dataset. Here’s how to achieve the same outcome with the use of the pipe (%>%) operator.

mpg %>%
  head() %>%
  kable() %>%
  kable_styling()

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact

You can read the code above as saying “take the mpg dataset, and plug it into the head() function”. Putting head() indented on a new line is not necessary for the code to work, but it does make the code easier to read. This new method of asking for the head() of the dataset may seem silly and inefficient, but the real magic of the pipe is that it allows us to chain operations together in a way that mimics the way humans think about instructions. We’ll see this in action as we get exposure to more data manipulation tools below.

Restricting Data

The most common methods for restricting data deal with filtering out rows or columns so that we are only working with a subset of our original data set.

Sometimes we are not interested in all of the observations in a particular dataset, but only those satisfying certain criteria. For example, maybe we only want to see vehicles falling into the class of subcompact cars. The filter() function will allow us to get rid of all other classes of vehicle.

mpg %>% 
  filter(class == "subcompact") %>%
  head() %>%
  kable() %>%
  kable_styling()

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
ford	mustang	3.8	1999	6	manual(m5)	r	18	26	r	subcompact
ford	mustang	3.8	1999	6	auto(l4)	r	18	25	r	subcompact
ford	mustang	4.0	2008	6	manual(m5)	r	17	26	r	subcompact
ford	mustang	4.0	2008	6	auto(l5)	r	16	24	r	subcompact
ford	mustang	4.6	1999	8	auto(l4)	r	15	21	r	subcompact
ford	mustang	4.6	1999	8	manual(m5)	r	15	22	r	subcompact

We can also use more complex conditions on which rows to see using and (&) and or (|) statements. Maybe we want to see only those vehicles in the made by subaru or getting at least a 35 highway mile per gallon rating (hwy).

mpg %>% 
  filter(manufacturer == "subaru" | hwy >= 35) %>%
  head()  %>%
  kable() %>%
  kable_styling()

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
honda	civic	1.8	2008	4	auto(l5)	f	25	36	r	subcompact
honda	civic	1.8	2008	4	auto(l5)	f	24	36	c	subcompact
subaru	forester awd	2.5	1999	4	manual(m5)	4	18	25	r	suv
subaru	forester awd	2.5	1999	4	auto(l4)	4	18	24	r	suv
subaru	forester awd	2.5	2008	4	manual(m5)	4	20	27	r	suv
subaru	forester awd	2.5	2008	4	manual(m5)	4	19	25	p	suv

Similarly to the way we can filter rows, we can select only those columns we are interested in. We can pass the names of the columns we are interested in to R’s select() function so that we only see those selected columns returned.

mpg %>%
  select(manufacturer, model, year, cty, hwy, class) %>%
  head()  %>%
  kable() %>%
  kable_styling()

manufacturer	model	year	cty	hwy	class
audi	a4	1999	18	29	compact
audi	a4	1999	21	29	compact
audi	a4	2008	20	31	compact
audi	a4	2008	21	30	compact
audi	a4	1999	16	26	compact
audi	a4	1999	18	26	compact

We can also select all columns except certain ones by preceding the column name with a -.

mpg %>%
  select(-displ,-cyl) %>%
  head()  %>%
  kable() %>%
  kable_styling()

manufacturer	model	year	trans	drv	cty	hwy	fl	class
audi	a4	1999	auto(l5)	f	18	29	p	compact
audi	a4	1999	manual(m5)	f	21	29	p	compact
audi	a4	2008	manual(m6)	f	20	31	p	compact
audi	a4	2008	auto(av)	f	21	30	p	compact
audi	a4	1999	auto(l5)	f	16	26	p	compact
audi	a4	1999	manual(m5)	f	18	26	p	compact

The select() function is also useful for changing the order of the columns.

mpg %>%
  select(cty, hwy, manufacturer) %>%
  head() %>%
  kable() %>%
  kable_styling()

cty	hwy	manufacturer
18	29	audi
21	29	audi
20	31	audi
21	30	audi
16	26	audi
18	26	audi

We can combine filter() and select() through the pipe as well. For any pipe, the result of the “upstream” code (the code before the pipe) is passed into the function that follows the pipe.

mpg %>%
  filter(year >= 2005) %>%
  select(manufacturer, model, year, cty, hwy, class) %>%
  head() %>%
  kable() %>%
  kable_styling()

manufacturer	model	year	cty	hwy	class
audi	a4	2008	20	31	compact
audi	a4	2008	21	30	compact
audi	a4	2008	18	27	compact
audi	a4 quattro	2008	20	28	compact
audi	a4 quattro	2008	19	27	compact
audi	a4 quattro	2008	17	25	compact

A Note on Pipes: The advantage to the pipe operator is probably pretty clear by now. The code we just wrote says take the mpg data set, and then filter it so that we only see cars manufactured since 2005, and then show me only the few columns I am interested in, and then just let me see the first six rows for now. The alternative to this would be writing code that looks a lot less readable:

head(select(filter(mpg, year >= 2005), manufacturer, model, year, cty, hwy, class))

Summarizing Data

There are lots of ways we can summarize our data. We can provide simple counts, compute averages, even build out our own summary functions.

We can start with a simple question like, how many cars from each manufacturer are contained in this dataset? To answer this, we simply pipe the mpg data frame into the count() function, identifying the manufacturer column as the column we wish to count.

mpg %>%
  count(manufacturer) %>%
  head() %>%
  kable() %>%
  kable_styling()

manufacturer	n
audi	18
chevrolet	19
dodge	37
ford	25
honda	9
hyundai	14

The counts are displayed in alphabetical order by manufacturer. We might be interested in the most well-represented manufacturers. We’ll do this with arrange() – we can pass this function the argument desc(n) to say that we want to arrange by our new count column in descending order, and let’s ask for the top 10 rows instead of the top 6.

mpg %>%
  count(manufacturer) %>%
  arrange(desc(n)) %>%
  head(n = 10) %>%
  kable() %>%
  kable_styling()

manufacturer	n
dodge	37
toyota	34
volkswagen	27
ford	25
chevrolet	19
audi	18
hyundai	14
subaru	14
nissan	13
honda	9

Let’s say we wanted to know how many different models of car each manufacturer has released since the year 2000. This is a more complicated question. We would first need to filter the data so that we are only considering cars manufactured since the year 2000. Then we would subset to include only the manufacturer and model columns. There are lots of duplicates here, so we would want to remove them with a function called distinct(), and then finally we could count occurrences within each manufacturer

mpg %>%
  filter(year >= 2000) %>%
  select(manufacturer, model) %>%
  distinct() %>%
  count(manufacturer) %>%
  arrange(desc(n)) %>%
  head() %>%
  kable() %>%
  kable_styling()

manufacturer	n
toyota	6
chevrolet	4
dodge	4
ford	4
volkswagen	4
audi	3

Summarizing categorical data is most often done with counts, but we’ve got many more choices when we are working with numerical data. We have several measures of center or spread that we could choose from – we could even define our own metrics. Let’s say we wanted to know the median highway mile per gallon rating across all vehicles in our dataset. We’ll need the help of R’s summarize() function as well as the median() function for this.

mpg %>% 
  summarize(median_hwy = median(hwy)) %>%
  kable() %>%
  kable_styling()

median_hwy
24

With the use of summarize() we can get multiple summaries at once. Let’s compute the mean and standard deviation for both the highway and city mile per gallon ratings across all of the vehicles in our data set.

mpg %>% 
  summarize(mean_hwy = mean(hwy), std_deviation_hwy = sd(hwy), mean_cty = mean(cty), std_deviation_cty = sd(cty)) %>%
  kable() %>%
  kable_styling()

mean_hwy	std_deviation_hwy	mean_cty	std_deviation_cty
23.44017	5.954643	16.85897	4.255946

It might be useful if we could get grouped summary statistics. Let’s use group_by() to see how these measures vary across the different vehicle classes.

mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy), std_deviation_hwy = sd(hwy), mean_cty = mean(cty), std_deviation_cty = sd(cty)) %>%
  kable() %>%
  kable_styling()

class	mean_hwy	std_deviation_hwy	mean_cty	std_deviation_cty
2seater	24.80000	1.303840	15.40000	0.5477226
compact	28.29787	3.781620	20.12766	3.3854999
midsize	27.29268	2.135930	18.75610	1.9465416
minivan	22.36364	2.062655	15.81818	1.8340219
pickup	16.87879	2.274280	13.00000	2.0463382
subcompact	28.14286	5.375012	20.37143	4.6023377
suv	18.12903	2.977973	13.50000	2.4208791

Let’s arrange the result here by mean highway mile per gallon rating in the default ascending order.

mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy), std_deviation_hwy = sd(hwy), mean_cty = mean(cty), std_deviation_cty = sd(cty)) %>%
  arrange(mean_hwy) %>%
  kable() %>%
  kable_styling()

class	mean_hwy	std_deviation_hwy	mean_cty	std_deviation_cty
pickup	16.87879	2.274280	13.00000	2.0463382
suv	18.12903	2.977973	13.50000	2.4208791
minivan	22.36364	2.062655	15.81818	1.8340219
2seater	24.80000	1.303840	15.40000	0.5477226
midsize	27.29268	2.135930	18.75610	1.9465416
subcompact	28.14286	5.375012	20.37143	4.6023377
compact	28.29787	3.781620	20.12766	3.3854999

That’s pretty informative although not totally surprising. Subcompact cars seem to have a high level of variation in their mpg ratings though!

Transforming Data

Often, you may be in a situation where you would like to create new columns, using the existing columns. This can be done using the mutate() command. The syntax is

dataset %>%
  mutate(new_column_name = function_of_old_columns)

In the mpg dataset, let’s add a column which is the ratio between the city cty and highway hwy gas milages, and use the arrange() function to find cars with the highest city to highway gas milages:

mpg %>%
  mutate(mpg_ratio = cty/hwy) %>%
  select(manufacturer,model,cty,hwy,mpg_ratio) %>%
  arrange(desc(mpg_ratio)) %>%
  head() %>%
  kable() %>%
  kable_styling()

manufacturer	model	cty	hwy	mpg_ratio
nissan	pathfinder 4wd	15	17	0.8823529
toyota	4runner 4wd	15	17	0.8823529
toyota	toyota tacoma 4wd	15	17	0.8823529
honda	civic	28	33	0.8484848
toyota	toyota tacoma 4wd	15	18	0.8333333
chevrolet	k1500 tahoe 4wd	14	17	0.8235294

One pretty common step in an analysis is to create a categorical column from a variable which was originally numeric. In order to do this we can use the if_else() function. The three arguments of if_else() are a condition, and the values you want to fill if the condition is true or false, respectively.

mpg %>%
  mutate(pre_2000 = if_else(year < 2000, "yes", "no")) %>%
  select(manufacturer,model,year,pre_2000) %>%
  head() %>%
  kable() %>%
  kable_styling()

manufacturer	model	year	pre_2000
audi	a4	1999	yes
audi	a4	1999	yes
audi	a4	2008	no
audi	a4	2008	no
audi	a4	1999	yes
audi	a4	1999	yes

Final Thoughts

There is a lot more to learn about data manipulation and R in general. Sticking to the tidyverse and the other package groups within the tidy-ecosystem (ie. tidytext, tidymodels, etc.) will be beneficial because they are all built on common syntax and programmatic principles. You can read more about this in the TidyTools Manifesto.

You won’t be an expert after working through this document, but it should provide you with a solid start. Please feel free to add your own notes to this markdown file as we encounter more advanced functionality.