{plotnine}

The goal of any exploratory data analysis (EDA) is to learn about and
from our data. You did some of that in Days 2 and 3, where you looked at
techniques for data wrangling in Python and also inline reporting within
R Markdown documents with the help of {reticulate}. While
summary statistics are great, sometimes visuals are even more effective
at helping us (and our readers) understand the data we are working with.
This notebook will focus on data visualization using
{plotnine}, a Python module built to mirror the functionality
of R’s {ggplot2} package. Both {plotnine} and
{ggplot2} are built using a layered grammar of
graphics which allows for the convenient construction of plots by
adding new layers on top of all previous layers.
Open RStudio and the notebook you’ve been working on over the past two class meetings. Make sure you are working inside of the R Project which is managing your GitHub repository associated with MAT434.
If you’ve already split your data into training and validation sets (and you have a strong grasp of why we are doing that), you can skip to the next section of this notebook. If you haven’t done this before, then continue along with me here.
Our goal in MAT434 is to build models which help us classify observations as belonging to groups. There are two reasons we might do this.
If we (and our model) get to see all of our available data during the learning/training process, then we can’t hope to know how well our model generalizes to new observations until it is too late! For this reason, we’ll hide some data from ourselves and our models – we’ll refer to this hidden data as validation data.
At the beginning of any analysis, if one of our goals is a model to
be used for predictions, then we should split our data into
training and test sets. We can do this with the
train_test_split() function from the
sklearn.model_selection submodule. We’ll need to
import this before we can use it.
Add code to the setup chunk of your notebook to load
the {reticulate} package and to use the mat434
virtual environment you’ve built.
If you haven’t installed sklearn into your virtual
environment, you can do so by running
reticulate::virtualenv_install("mat434", "scikit-learn") in
your setup R code chunk. Remember to delete this line of
code after you’ve run it.
Now we’ll split our data into training and
test sets.
from sklearn.model_selection import train_test_split
train, test = train_test_split(<name_of_your_dataframe>, train_size = 0.75, random_state = 434)
train.head()

The first line imports the train_test_split() function
from the sklearn.model_selection submodule. The next line
splits your data frame into two random subsets (by rows), where the
training set contains 75% of observations and the
test set contains the remaining 25%. Setting the
random_state parameter ensures that we’ll get the same
random training and test sets regardless of
when, where, or by whom the code is run.
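To see the split and the effect of random_state on something concrete, here is a minimal sketch. The data frame `pitches` and its `pitch_name` column are made up for illustration; swap in your own data frame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for your own data frame (100 rows)
pitches = pd.DataFrame({
    "pitch_mph": [88.0 + 0.1 * i for i in range(100)],
    "pitch_name": ["Fastball", "Slider", "Curveball", "Changeup"] * 25,
})

# 75% of rows go to training, the remaining 25% to test
train, test = train_test_split(pitches, train_size=0.75, random_state=434)

# Re-running with the same random_state reproduces the same split
train_again, test_again = train_test_split(pitches, train_size=0.75, random_state=434)

print(train.shape, test.shape)                 # (75, 2) (25, 2)
print(train.index.equals(train_again.index))   # True
```

Every row lands in exactly one of the two sets, so nothing in `test` is seen while exploring or modeling with `train`.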
In general, we’ll do this near the beginning of any analysis. If you completed your tidy analysis work on your full data set, you can add this to the bottom of your notebook and just make note that we’ll split into training and test sets at the beginning of our analyses from here on out.
{plotnine} for Data Viz

As mentioned in the opening, the {plotnine} module was
built to mirror the functionality of R’s {ggplot2} data
visualization package. This means that, regardless of whether you are
using R or Python, much of our plotting code will look similar.
In order to use {plotnine} we’ll need to import it.
Rather than importing the entire module, it is more common to import
only the functionality that we’ll need. This also prevents the need to
namespace every function call and makes our plotting code look even more
similar to the {ggplot2} functionality.
Import the following
functions from {plotnine}: ggplot,
geom_point, geom_boxplot,
geom_bar, aes, labs. There’s lots
more {plotnine} functionality, which you can explore in the
{plotnine} documentation.

The ggplot() function provides us with a layered
plotting syntax based off of the grammar of graphics. Plotting
layers can include plot types (geometries), labels, themes,
and more. Layers of a ggplot() object are separated by
+ signs rather than the dot notation, although the behavior
is similar.
A basic ggplot() might look like the following:
(
ggplot(data = train) +
  geom_boxplot(aes(x = 0, y = "pitch_mph")) +
  labs(x = "",
       y = "Pitch Speed (mph)",
       title = "Distribution of Pitch Speed in Miles per Hour")
)

Notice that the entire plotting code is wrapped in parentheses
(). This lets Python treat the whole plot as a single expression
spanning several lines; without the parentheses, a line ending in
+ would be a syntax error.
Choose a numerical variable from your data set and replace
pitch_mph with that
variable in the plotting code.

At a minimum, every plot will need data. The way to pass
data to a plot is using the data argument to
ggplot(). Every plot also needs to include a geometry layer
(geom_*()), which will include
aesthetics – variables from data
which determine features of the plot. If you want to override an
attribute of the visual across the entire plot (say, set the
fill color of a boxplot to purple), then you can
set that parameter outside of aes() but still inside of the
geometry layer. That is,
(
  ggplot(data = train) +
  geom_boxplot(aes(x = 0, y = "pitch_mph"), fill = "purple") +
  labs(x = "",
       y = "Pitch Speed (mph)",
       title = "Distribution of Pitch Speed in Miles per Hour")
)

In general, plot aspects determined by variables in the data set must
be defined inside of aes(), while aspects which are
globally defined for the plot layer should be defined outside of
aes().
What happens if you set the fill attribute inside
of aes() instead of outside?

For now, having some basic rules of thumb for plot geometries might be helpful:
Univariate plots are plots in which a single variable is utilized.

Univariate geometries for a numerical variable:
- Histograms (geom_histogram())
- Boxplots (geom_boxplot())
- Densities (geom_density())

Univariate geometries for a categorical variable:
- Bar plots (geom_bar()) or Column plots (geom_col())

Multivariate plots are plots involving multiple variables:

Bivariate plots between two numerical variables:
- Scatterplots (geom_point() or geom_jitter())
- Heatmaps of 2D bin counts (geom_bin_2d())

Bivariate plots between one numerical and one categorical variable:
- Any of the univariate geometries for a single numerical variable with the categorical variable as a fill color. Use the alpha parameter to control transparency. You can also facet with facet_wrap() to obtain a different plot for each class of the categorical variable.

Bivariate plots between two categorical variables:
- Bar plots (geom_bar() using the fill argument for the second categorical variable; the position argument is also helpful for organizing bars)

Multivariate plots with three or more variables:
- Use facet_wrap() or facet_grid() to split plots across categorical variables without cluttering individual plot panels with more information.
- Use color, fill, shape, and/or size to show additional variables.

I don’t think you’ll learn much more by reading. Let’s start
doing. Think of some plots that you’d like to build that will
help you better understand your data. Briefly discuss those plots and
what you hope they’ll tell you, and then build them into your notebook!
Just remember to use your training data when building these
plots.
One of the really awesome things about ggplot() is that
we can very easily create complex plots because of its layered plotting
functionality. This means that we can include lots of geometry layers in
a single plot, helping us understand a variable or collection of
variables even better than a single type of visual would
allow.
You want a plot that shows the density, a boxplot, and individual observed values of a data set across several categories all at once? No problem! Try and re-create the plot below if you like, or try building something similar.
If you haven’t done so already, bookmark Ced
Scherer’s detailed {ggplot2} tutorial. If you are
interested in data visualization, I strongly recommend returning to this
and building at least a small portfolio of data visualizations to
share!