Getting Started and Basics

The goal of any exploratory data analysis (EDA) is to learn about and from our data. You did some of that in Days 2 and 3, where you looked at techniques for data wrangling in Python and also inline reporting within R Markdown documents with the help of {reticulate}. While summary statistics are great, sometimes visuals are even more effective at helping us (and our readers) understand the data we are working with. This notebook will focus on data visualization using {plotnine}, a Python module built to mirror the functionality of R’s {ggplot2} package. Both {plotnine} and {ggplot2} are built on a layered grammar of graphics, which allows for the convenient construction of plots by adding new layers on top of all previous layers.

Open RStudio and the notebook you’ve been working on over the past two class meetings. Make sure you are working inside of the R Project which is managing your GitHub repository associated with MAT434.

Training and Validation Data

If you’ve already split your data into training and validation sets (and you have a strong grasp of why we are doing that), you can skip to the next section of this notebook. If you haven’t done this before, then continue along with me here.

Our goal in MAT434 is to build models which help us classify observations as belonging to groups. There are two reasons we might do this.

  • We want to understand which features are associated with group membership, and how.
  • We want to build a model which will accurately predict group membership for new, previously unseen observations.

If we (and our model) get to see all of our available data during the learning/training process, then we can’t hope to know how well our model generalizes to new observations until it is too late! For this reason, we’ll hide some data from ourselves and our models – we’ll refer to this hidden data as validation data.

At the beginning of any analysis, if one of our goals is a model to be used for predictions, then we should split our data into training and test sets. Here the test set plays the role of the hidden validation data described above. We can do this with the train_test_split() function from the sklearn.model_selection submodule. We’ll need to import this before we can use it.

  1. Add code to the setup chunk of your notebook to load the {reticulate} package and to use the mat434 virtual environment you’ve built.

  2. If you haven’t installed sklearn into your virtual environment, you can do so by running reticulate::virtualenv_install("mat434", "scikit-learn") in your setup R code chunk. Remember to delete this line of code after you’ve run it.

Now we’ll split our data into training and test sets.

  1. Open a new Python code cell and type/run the following commands.
from sklearn.model_selection import train_test_split

train, test = train_test_split(<name_of_your_dataframe>, train_size = 0.75, random_state = 434)

train.head()

The first line imports the train_test_split() function from the sklearn.model_selection submodule. The next line splits your data frame into two random subsets (by rows), where the training set contains 75% of observations and the test set contains the remaining 25%. Setting the random_state parameter ensures that we’ll get the same random training and test sets regardless of when, where, or who runs the code.
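To see the effect of random_state for yourself, here is a minimal sketch using a small made-up data frame. The pitches data frame and its columns are placeholders for illustration only, not the course data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data with 100 rows (not the course data set).
pitches = pd.DataFrame({
    "pitch_id": range(100),
    "pitch_mph": [80 + (i % 20) for i in range(100)],
})

# Split twice with the same random_state.
train_a, test_a = train_test_split(pitches, train_size = 0.75, random_state = 434)
train_b, test_b = train_test_split(pitches, train_size = 0.75, random_state = 434)

print(len(train_a), len(test_a))             # 75 25
print(train_a.index.equals(train_b.index))   # True: same rows, same order
print(set(train_a.index).isdisjoint(test_a.index))  # True: no row is in both sets
```

Changing random_state to a different integer would generally give a different (but still reproducible) split.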

In general, we’ll do this near the beginning of any analysis. If you completed your tidy analysis work on your full data set, you can add this to the bottom of your notebook and just make note that we’ll split into training and test sets at the beginning of our analyses from here on out.

About {plotnine} for Data Viz

As mentioned in the opening, the {plotnine} module was built to mirror the functionality of R’s {ggplot2} data visualization package. This means that, regardless of whether you are using R or Python, much of our plotting code will look similar.

In order to use {plotnine} we’ll need to import it. Rather than importing the entire module, it is more common to import only the functionality that we’ll need. This also prevents the need to namespace every function call and makes our plotting code look even more similar to the {ggplot2} functionality.

  1. Either open a new Python code chunk or (preferably) go back to your initial Python code chunk and import the following functions from {plotnine}: ggplot, geom_point, geom_boxplot, geom_bar, aes, labs. There’s lots more {plotnine} functionality, which you can explore in the {plotnine} documentation.

The ggplot() function provides us with a layered plotting syntax based on the grammar of graphics. Plotting layers can include plot types (geometries), labels, themes, and more. Layers of a ggplot() object are joined with + signs rather than chained with dot notation, although the behavior is similar.

A basic ggplot() might look like the following:

(
  ggplot(data = train) +
  geom_boxplot(aes(x = 0, y = "pitch_mph")) +
  labs(x = "",
       y = "Pitch Speed (mph)",
       title = "Distribution of Pitch Speed in Miles per Hour")
)

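Notice that the entire plotting code is wrapped in parentheses (). This tells Python to treat everything up to the closing parenthesis as a single expression, even though it spans several lines. Without the parentheses, Python would treat the end of the first line as the end of the statement and raise a syntax error.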
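The role of the parentheses has nothing to do with {plotnine} specifically. Here is a small sketch using plain numbers in place of plot layers to show the same behavior.

```python
# Inside parentheses, Python keeps reading until the closing parenthesis,
# so one expression can span several lines joined by + signs.
total = (
    1 +
    2 +
    3
)
print(total)  # 6

# Without the parentheses, the first line ends with a dangling + sign.
# Here we ask Python to parse that version and record whether it succeeds.
unwrapped = "total = 1 +\n    2 +\n    3\n"
try:
    compile(unwrapped, "<example>", "exec")
    parsed = True
except SyntaxError:
    parsed = False
print(parsed)  # False
```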

  1. Add a new code cell to your notebook, type the code above into the cell, run it, and discuss the resulting plot.
  • If you are working with the FAA Birdstrikes data, then choose a numerical variable from that data set that you are interested in building a boxplot for and replace pitch_mph with that variable in the plotting code.

Basic Plotting

At a minimum, every plot will need data. The way to pass data to a plot is using the data argument to ggplot(). Every plot also needs to include a geometry layer (geom_*()), which will include aesthetics: variables from data which determine features of the plot. If you want to override an attribute of the visual across the entire layer (say, set the fill color of a boxplot to purple), then you can set that parameter outside of aes() but still inside of the geometry layer. That is,

(
  ggplot(data = train) +
  geom_boxplot(aes(x = 0, y = "pitch_mph"), fill = "purple") +
  labs(x = "",
       y = "Pitch Speed (mph)",
       title = "Distribution of Pitch Speed in Miles per Hour")
)

In general, plot aspects determined by variables in the data set must be defined inside of aes(), while aspects which are globally defined for the plot layer should be defined outside of aes().

  1. Update the fill color of the boxplot in your notebook. Use any color you like. What happens if you set the fill attribute inside of aes() instead of outside?

Helpful Rules of Thumb

For now, having some basic rules of thumb for plot geometries might be helpful:

  • Univariate plots are plots in which a single variable is utilized.

  • Multivariate plots are plots involving multiple variables:

    • Bivariate plots between two numerical variables:

    • Bivariate plots between one numerical and one categorical variable:

      • Any of the univariate geometries for a single numerical variable with the categorical variable as a fill color.

        • Often this results in overplotting, which can obscure the visual. You can use an alpha parameter to control transparency. You can also facet with facet_wrap() to obtain a different plot for each class of the categorical variable.
    • Bivariate plots between two categorical variables:

      • Barplots with Fill (geom_bar() using fill argument for second categorical variable – position argument is also helpful for organizing bars)
    • Multivariate plots with three or more variables:

      • Be careful not to include too much information in your plots – more is not always better.
      • Use facet_wrap() or facet_grid() to split plots across categorical variables without cluttering individual plot panels with more information.
      • Experiment with the plot types described above, making use of color, fill, shape, and/or size to show additional variables.

Next Steps

I don’t think you’ll learn much more by reading. Let’s start doing. Think of some plots that you’d like to build that will help you better understand your data. Briefly discuss those plots and what you hope they’ll tell you, and then build them into your notebook! Just remember to use your training data when building these plots.


Recreate my Viz

One of the really awesome things about ggplot() is that we can very easily create complex plots because of its layered plotting functionality. This means that we can include lots of geometry layers in a single plot, helping us understand a variable or collection of variables even better than a single visual type would allow.

You want a plot that shows the density, a boxplot, and individual observed values of a data set across several categories all at once? No problem! Try and re-create the plot below if you like, or try building something similar.


If you haven’t done so already, bookmark Ced Scherer’s detailed {ggplot2} tutorial. If you are interested in data visualization, I strongly recommend returning to this and building at least a small portfolio of data visualizations to share!