ggplot()
Data visualization is an extremely important part of any analysis. As a researcher or analyst, data visualization will likely bookend your work, serving as a reservoir for idea and hypothesis generation at the beginning of a project as well as becoming a vehicle for conveying important takeaways as you summarize and disseminate your results upon completion.
In R, we have access to an extremely popular an easy-to-use plotting
library, called ggplot2
. The main philosophy behind
ggplot2
is that data visualization should follow a layered
grammar
of graphics. The grammar of graphics is a notion first
introduced by Leland Wilkinson in 2006, and the layering functionality
is a convenient implementation in ggplot2
.
ggplot()
In this introduction to data visualization, we will be using
ggplot2
, which is part of the tidyverse
package. To begin, we will use a pipe to feed a dataset into
ggplot()
. As a sandbox, we will use the
penguins
dataset, which is part of the
palmerpenguins
library. Don’t forget to run
install.packages("palmerpenguins")
if you have not
installed this package before. The penguins
dataset
contains measurements from different types of penguins that were
collected by Gorman, Williams, and Fraser that were published in
2014:
library(palmerpenguins)
%>%
penguins head() %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "striped"))
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
Recall from our introduction to the tidyverse
that we
can read the pipe below as saying “plug the penguins
dataset into the ggplot()
function”.
library(tidyverse)
%>%
penguins ggplot()
The function ggplot()
calls (or initializes) a plot
object but won’t actually create anything interesting by itself – just
an empty rectangle. Next, we need to add a geom
etry layer
to our plot and pass it the appropriate aes
thetics. The
geometry layer determines the type of plot we will build and the
aesthetics pass the information required from our dataset to construct
the plot. You can find some of the most common geometry layers (or
geoms) and their [minimum] required aesthetics below.
geom_point()
will create a scatterplot, requiring
x
and y
aesthetics.geom_line()
will create a linegraph, requiring
x
and y
aesthetics.geom_boxplot()
will create a boxplot, requiring an
x
aesthetic.geom_histogram()
will create a histogram, requiring an
x
aesthetic.geom_bar()
will create a bargraph, requiring an
x
aesthetic.Let’s create a scatterplot between bill_length_mm
and
bill_depth_mm
using the geom_point()
layer.
%>%
penguins ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm))
The ggplot()
function assumes by default that the first
and second argument in aes()
correspond to the x- and
y-variables, so you may also write
aes(bill_length_mm, bill_depth_mm)
to save some typing.
Each layer in a ggplot
is added to the existing plot object
using a +
sign. This allows us to combine and customize
plots quite easily, as you will see further on in this notebook.
Additional aesthetics can typically be added in order to better
visualize the data or to add more information into our plot. For
example, we might wonder whether the relationship between bill length
and bill depth is different among the various penguin species. We can
color the points in the scatter plot by adding
color = species
to the aes()
function. You
might try size = species
as an aesthetic as well.
%>%
penguins ggplot() +
geom_point(aes(bill_length_mm, bill_depth_mm, color = species))
Oftentimes, you may want to preprocess the data set before feeding it
into a plot object. For example, let’s say that we wanted to see if
there was any trend in the number of penguin observations each year.
This information does not already appear as a column in the dataset, so
we will need to use the count()
function to create it
first.
%>%
penguins count(year) %>%
#Recall that the count() function makes a new column named n which contains the
#number of observations for each year
ggplot() +
geom_line(aes(year, n)) +
geom_point(aes(year, n))
In the visualization above, two layers were added- a scatter plot
which created the points, and a line graph which created the lines that
connect the points. Notice that all we needed to do was “add” the layers
with the +
symbol.
If we wanted to save a bit of typing, because both plotting layers
map the same aesthetics to the same types of objects, we can
pass aes(year, n))
to once to ggplot()
rather
than to the individual plot layers.
Data visualizations have become commonplace in our society as data collection has increased. There are many types of visualizations available to us- scatter plots, box plots, mosaic plots, and word clouds, just to name a few. How can we decide what type of visualization is appropriate for a dataset? The answer depends largely on the type of data that you are working with. Recall that a numerical variable is a variable which is obtained through measurement, and for which comparing values of the variable makes sense (think – would an average of these values be meaningful?). Examples of numerical variables include heights, distances, temperatures, budget sizes, etc. A categorical variable is a variable which sorts values into “bins”. Examples of categorical variables include a person’s sex, the answer to a yes or no question, a student’s grade level, etc. Whether to consider a variable to be numerical or categorical is not always obvious and may depend upon the context of the dataset.
A single numerical variable can be visualized using a histogram, a
box plot, or a violin plot. Let’s consider a the bill length
bill_length_mm
of Gentoo penguins. A
histogram takes the various bill lengths of the
penguins (between 40.9 mm and 59.6 mm in this case), and groups them
together into bins of equal length. Then, a bar is drawn over each bin
whose height reflects the number (or proportion) of penguins whose bill
length lies within the bin. The bins =
argument in the
geom_histogram()
layer controls the number of bins that are
created. Notice that the bins
argument falls outside of the
aes()
parentheses because its value is not being inherited
from a column in our dataset.
%>%
penguins filter(!is.na(bill_length_mm),
== "Gentoo") %>%
species ggplot() +
geom_histogram(aes(x = bill_length_mm), bins = 20)
Often, we would like to see how close a histogram is to a Normal
distribution (the bell-shaped curve). To do this, we will add a
stat_function
layer which draws a Normal curve with mean
and standard deviation equal to the mean and standard deviation of the
data set. In this case, we will let the y-coordinate be
..density..
which measures the proportion of penguins which
lie in each bin, rather than the number of penguins. The arguments of
the stat_function()
layer are fun =
, which
tells R what type of distribution to draw and args =
, which
provides R with the mean and standard deviation of the distribution.
%>%
penguins filter(!is.na(bill_length_mm),
== "Gentoo") %>%
species ggplot() +
geom_histogram(aes(x = bill_length_mm, y = ..density..), bins = 20) +
stat_function(fun = dnorm,
args = list(mean = penguins %>%
filter(!is.na(bill_length_mm),
== "Gentoo") %>%
species pull(bill_length_mm) %>%
mean(),
sd = penguins %>%
filter(!is.na(bill_length_mm),
== "Gentoo") %>%
species pull(bill_length_mm) %>%
sd()),
color = "red")
A box plot is helpful when your goal is to visualize the “center” of the data (using the median) and to get a sense of how spread out the data is.
%>%
penguins filter(!is.na(bill_length_mm),
== "Gentoo") %>%
species ggplot() +
geom_boxplot(aes(x = bill_length_mm))
The line in the interior of the white box corresponds to the median value of the bill length variable, which is 47.3 mm in this example. The length of the white box is the interquartile range. All values between the 25th and 75 percentiles of the data lie in this range. The lines (“whiskers”) coming from the interquartile range extend out to either i) the minimum or maximum value in the data, or ii) to 1.5 times the interquartile range. Any values that extend beyond 1.5 interquartile ranges away from the median are marked with dots.
You’ve probably already noticed some advantages to the
ggplot()
framework. Creating new types of plots are easy
because the syntax is consistent. Switching from a histogram to a box
plot is as simple as changing the type of layer being added to the
plot!
In order to visualize a single categorical variable, a bar plot is appropriate. The bar plot counts the number of occurrences of each level of the categorical variable. For example, we can explore whether there are similar numbers of penguins of each species contained within the dataset:
%>%
penguins filter(!is.na(species)) %>%
ggplot() +
geom_bar(aes(x = species))
To make a more visually appealing bar plot, we can have the bars be
different colors using the fill =
aesthetic. I usually add
a show.legend = FALSE
argument to the
geom_bar()
layer so that we don’t have redundant
information in the plot. (You can try it both ways to see the
difference).
%>%
penguins filter(!is.na(species)) %>%
ggplot() +
geom_bar(aes(x = species, fill = species), show.legend = FALSE)
If you would like to visualize the relationship between two numerical
variables, you should use a scatter plot. For example, we could consider
the relationship between the bill length (bill_length_mm
)
and bill depth (bill_depth_mm
) of the penguins in the
dataset:
%>%
penguins filter(!is.na(bill_length_mm)) %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm))
Each point on the scatter plot corresponds to a single penguin. The x- and y- coordinate of each point is equal to the bill length and bill depth of the corresponding penguin.
To describe the relationship between a categorical and a numerical
variable, use side-by-side box plots. Does the bill length of a penguin
change based on its species? Recall from earlier that, at a minimum, a
boxplot requires an x
aesthetic. Here we will pass it both
an x
and y
aesthetic, where x
is
a grouping variable (categorical) and y
is the numerical
variable we are interested in visualizing across the groups.
%>%
penguins filter(!is.na(bill_length_mm)) %>%
ggplot() +
geom_boxplot(aes(x = species, y = bill_length_mm))
We can also use violin plots to visualize the relationship. Violin plots are sort of a mash-up between a box plot and a histogram, where the sides of the violin plot are smoothed versions of the bars from the histogram of the data. The width of the violin plot corresponds to the proportion of the data that falls within that range.
%>%
penguins filter(!is.na(bill_length_mm)) %>%
ggplot() +
geom_violin(aes(x = species, y = bill_length_mm))
To visualize the relationship between two categorical variables, you
can use stacked bar plots or mosaic plots. A stacked bar plot separates
the bars of a bar plot according to the values of the second categorical
variable. For example, we can plot the species of a penguin against its
sex by adding the fill = sex
aesthetic:
%>%
penguins filter(!is.na(sex)) %>%
ggplot() +
geom_bar(aes(x = species, fill = sex))
A mosaic plot creates largely the same graphic, but makes it easier
to see the proportions of each level of one variable that belong to a
level of the other variable. The width of each box corresponds to the
proportion of the overall dataset which has values that lie in the
intersection of those variables. From the mosaic plot, we can see that
males are slightly over-represented among Gentoo
penguins
relative to the other species. The code below requires installing and
loading the ggmosaic
package. Notice also that the
aesthetic argument is also slightly different for mosaic plots. The
variables which you would like to have plotted against each other should
be wrapped in a product()
function within the aesthetics
for geom_mosaic()
layer.
library(ggmosaic)
%>%
penguins filter(!is.na(sex)) %>%
ggplot() +
geom_mosaic(aes(product(sex,species), fill = sex))
It is certainly not the end of the world if you choose a type of plot that is not appropriate for the variables that you are considering, but it is an important skill to be able to identify when an inappropriate plot has been created. I would like to give you some examples of poorly chosen plots, where the plots do not provide the reader with very much (or any) information:
%>%
penguins filter(!is.na(species)) %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = species))
%>%
penguins ggplot() +
geom_bar(aes(x = bill_length_mm))
%>%
penguins filter(!is.na(sex)) %>%
ggplot() +
geom_point(aes(x = sex, y = species))
From the examples above, you should see that understanding the type of variable that you are considering is an important part of data visualization!
There are many options for plot customization! One of the most common
customizations is in the axis labels and plot title, which can be
controlled with the labs()
layer. Additionally, if you are
building a plot with multiple layers, and those layers share common
aesthetics, you can place the aesthetics into the original
ggplot()
call rather than duplicating them across the
geometry layers.
%>%
penguins count(year) %>%
ggplot(aes(x = year, y = n)) +
geom_line() +
geom_point() +
labs(x = "Year", y = "Number of Penguins Observed", title = "Penguin Observations 2007-2009")
The tick marks in this plot are not great, since it doesn’t make
(from an animal cruelty perspective) sense to consider half of a
penguin. We can adjust the tick marks using the
scale_x_continuous()
and scale_y_continuous()
layers. The breaks =
argument takes a list of the tick
marks that you would like the axis to have. The seq()
function has the syntax seq(min, max, by)
and creates a
list of numbers from min
to max
counting by
increments sized according to the by
argument.
%>%
penguins count(year) %>%
ggplot(aes(year, n)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(2007,2009,1)) +
scale_y_continuous(breaks = seq(110,120,2)) +
labs(x = "Year", y = "Number of Penguins Observed") +
ggtitle("Penguin Observations 2007-2009")
It is also easy to make separate plots for each of the species of
penguin, using the color
aesthetic,
%>%
penguins count(year, species) %>%
ggplot(aes(year, n, color = species)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(2007,2009,1)) +
scale_y_continuous(breaks = seq(10,60,10)) +
labs(x = "Year", y = "Number of Penguins Observed", color = "Species") +
ggtitle("Penguin Observations 2007-2009")
or by the sex of the penguin. In this case, I used a
scale_color_discrete()
layer to adjust the labels of the
levels of the sex
variable to capitalize them, and change
NA
to “Unknown”. Notice that we passed two columns to the
count()
function here, indicating that we wanted counts
computed for each unique combination of sex
and
year
.
%>%
penguins count(year, sex) %>%
ggplot(aes(year, n, color = sex)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(2007,2009,1)) +
scale_y_continuous(breaks = seq(10,60,10)) +
scale_color_discrete(labels = c("Female","Male","Unknown")) +
labs(x = "Year", y = "Number of Penguins Observed", color = "Sex") +
ggtitle("Penguin Observations 2007-2009")
It is also possible to make separate plots for each level of a
categorical variable, using a facet_grid()
layer.
%>%
penguins count(year, sex) %>%
ggplot(aes(year, n, color = sex)) +
geom_line(show.legend = FALSE) +
geom_point(show.legend = FALSE) +
facet_grid(sex~., labeller = as_labeller(c(female = "Female", male = "Male"))) +
scale_x_continuous(breaks = seq(2007,2009,1)) +
labs(x = "Year", y = "Number of Penguins Observed") +
ggtitle("Penguin Observations 2007-2009")
I’ve given you a ton of information here- please don’t feel
overwhelmed! I am intending this document to be a place that you can
come back to reference and copy and paste from. There is also a lot of
support online through the ggplot2
documentation, the R
Graph Gallery, and sites like Stack Overflow where you can get
additional information and tips on creating data visualizations using R.
If you want a full course on tricking out your plots with
ggplot()
, check out this
extensive tutorial from Ced Scherer.