Purpose: This notebook addresses how to wrangle data in Python. This notebook assumes that you already have a Python installation that can be accessed from within RStudio, you know how to install and load Python modules, you know how to read data from a remote location using a web URL, and you know how to create Python code chunks from an RMarkdown notebook in RStudio.
Note: Below, you'll be asked to build an R Markdown notebook which includes many code cells to accomplish tasks. In addition to the code cells (grey background), be sure to leave yourself short notes in text cells (white background) which describe what you are doing with the code. Having a notebook that consists of 25 code cells with no additional context at the end of this discussion won't be very useful. You'll also be adding to this notebook later, so including some limited discussions now will be helpful!
Use File -> New File -> R Markdown... to create a new R Markdown file. Fill in the title field with Data Wrangling for MLB Batted Balls, and update the author field with your name if it is not already there. You can check the box to use the current date when rendering your document, and leave the default output format as HTML. You'll see several items in this sample markdown file. We'll cover them in detail later; for now, you can remove all the boilerplate from line 12 down. Then, in an initial Python code chunk, add and run the following line.
import pandas as pd
Now that you have {pandas}
loaded, the next thing we’ll
need is actual data to manipulate. We’ll be much more interested in
working with our own data which we’ll either find locally (stored on our
own computers) or remotely (accessed via a web URL).
{pandas} includes several functions for reading data in a variety of formats. For example, pd.read_csv("PATH_TO_FILE") can be used to read data from a comma separated values (csv) file. If you have a file which uses another delimiter, you can pass that as a string to pd.read_csv() using the delimiter argument. There are also several other pd.read_***() functions for reading special files (excel, fixed width, html, json, parquet, etc.). Note that if you are reading an excel file, you'll need to pass the sheet_name as either a string or an integer (sheet number) since excel files are not "flat" files. The following examples show how we can read a variety of files into a Python session in RStudio.
#Read the MAT241 sheet from the grades.xls file in
#the Spring 2021 folder on my computer's desktop
grades = pd.read_excel("C:/Users/agilb/Desktop/Spring 2021/grades.xls", sheet_name = "MAT241")
#Read in data from a csv file of Tate Gallery Artists housed
#in a public github repository on the web
tate_artists = pd.read_csv("https://github.com/tategallery/collection/raw/master/artist_data.csv")
Now it's time for you to load in some data. We'll work with two data sets in this notebook: battedballs.csv and park_dimensions.csv. You can find those data sets in one of my GitHub repositories here. In order to read these into Python, we'll need a link to the raw data sources.

1. Start with the battedballs.csv file. Click battedballs.csv in the file listing in the GitHub repository.
2. Find the Raw button appearing above the data, and click it. The button is in the grey banner.
3. Copy the URL of the raw file, pass it to pd.read_csv(), and don't forget to store the result into a named object: batted_balls.
4. Run the chunk with ctrl+Enter or cmd+Return, and then check the Environment tab of your top-right pane in RStudio to confirm that you have a new Data object called batted_balls.
5. Repeat this process for the park_dimensions.csv file.
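If everything goes well, your chunk will look something like the sketch below. The USER/REPO portions of these URLs are placeholders rather than the real repository paths, so be sure to use the raw links you copied.

#placeholder URLs -- substitute the raw links you copied from GitHub
batted_balls = pd.read_csv("https://raw.githubusercontent.com/USER/REPO/main/battedballs.csv")
park_dimensions = pd.read_csv("https://raw.githubusercontent.com/USER/REPO/main/park_dimensions.csv")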
Now that we've got data, the first thing we should do is look at it.
There are a few really handy pandas methods for getting a feel for the data you have access to. The .info(), .describe(), .head(), and .tail() methods are four that are really commonly used. Add a new code cell by clicking the +C button at the top of the top-left pane in RStudio and choosing Python as the language for the code cell. Use the methods mentioned above to explore your batted_balls and park_dimensions data frames.
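For instance, calls along the following lines will give you a quick overview of each data frame:

#column names, data types, and non-null counts
batted_balls.info()
#peek at the first and last few rows
batted_balls.head()
batted_balls.tail()
#summary statistics for the numeric columns
park_dimensions.describe()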
Now that we know how to load and view our data, let's talk about manipulating it. We can restrict the data we are working with, produce summaries of the data, transform the data, and more.
Python methods can be chained together using "dot" notation. The object to the left of the "dot" is passed as the first argument to the method on the right of the "dot". As long as we are careful and pay attention to the state of a transformed object, we can chain multiple transformations together. We'll start with code that we've already seen.
You can read the code above as saying “take the
batted_balls
data set, and plug it into the
head()
function”.
The most common methods for restricting data deal with filtering out rows or columns so that we are only working with a subset of our original data set.
Sometimes we are not interested in all of the observations in a particular data set, but only those satisfying certain criteria. For example, maybe we only want to see batted_balls where the pitch_name is "Changeup". The .loc[] accessor will allow us to get rid of all other classes of pitch_name.
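Using a boolean condition inside .loc[], that filter looks like the following:

batted_balls.loc[batted_balls["pitch_name"] == "Changeup"]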
We can also use more complex conditions on which rows to see using and (&) and or (|) operators. Maybe we want to see only those batted_balls whose bearing is "center" and whose launch_speed exceeds 100 mph.
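Each condition gets wrapped in parentheses and combined with &:

batted_balls.loc[(batted_balls["bearing"] == "center") & (batted_balls["launch_speed"] > 100)]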
Add another new code cell and use it to write and execute the filtering operations above. Verify that they work as expected.
Create a copy of the code to show batted_balls whose bearing is "center" and whose launch_speed exceeds 100 mph, and change it so that you obtain all batted_balls whose bearing is "center" or whose launch_speed exceeds 100 mph.
Add another new code cell and perform at least one additional filtering operation that you are interested in.
Similarly to the way we can filter rows, we are able to select only
the columns we are interested in. We can use .loc[]
for
this as well, but if we are just subsetting columns then it is more
common to pass a list of column names which we are interested in. Both
of the options below have the same behavior.
batted_balls.loc[:, ["batter_team", "bb_type", "bearing", "inning", "pitch_mph", "launch_speed", "launch_angle", "is_home_run"]]
batted_balls[["batter_team", "bb_type", "bearing", "inning", "pitch_mph", "launch_speed", "launch_angle", "is_home_run"]]
We can also select all columns except certain ones by using the .drop() method and setting axis = 1 to indicate that we want to drop columns rather than rows.
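For example, dropping the home run indicator (the choice of column here is just for illustration):

batted_balls.drop(["is_home_run"], axis = 1)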
Open another new code cell to confirm that the code from the blocks above works as expected.
In an additional new code cell, try subsetting several columns on your own. Can you subset columns and then subset columns again? Try it! What happens? When does it work and when does it not?
We can combine filtering and selection operations using
.loc[]
.
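The row condition goes before the comma and the column list after it. For example:

batted_balls.loc[batted_balls["bearing"] == "center", ["batter_team", "pitch_mph", "launch_speed", "launch_angle"]]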
There are lots of ways we can summarize our data. We can provide simple counts, compute averages, even build out our own summary functions.
We can start with a simple question like, how many batted balls
from each team are contained in this data set? To answer this, we
simply use the “dot” to pass the batted_balls
data frame
into the value_counts()
function, identifying the
batter_team
column as the column we wish to count.
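One way to obtain these counts appears below; passing sort = False keeps the groups in alphabetical order rather than ordering them by frequency:

batted_balls.value_counts("batter_team", sort = False)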
The counts are displayed in alphabetical order by
batter_team
. We might be interested in the teams with the
most batted_balls
. We’ll do this by chaining the
.sort_values()
method at the end of the previous
command.
It’s probably more likely that we want to see teams with lots of
batted balls at the top of our list. We can set the
ascending
argument of sort_values()
to
False
to achieve this.
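Chaining onto the previous command might look like the following:

#ascending counts -- teams with the fewest batted balls first
batted_balls.value_counts("batter_team", sort = False).sort_values()
#descending counts -- teams with the most batted balls first
batted_balls.value_counts("batter_team", sort = False).sort_values(ascending = False)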
Let's say we wanted to know how pitch_name is distributed against each of the different teams (batter_team) when there are 2 outs (outs_when_up). This is a more complicated question. We would first need to filter the data so that we are only considering batted balls when the number of outs_when_up is 2. Then we would obtain value_counts() on the list of columns we care about: ["batter_team", "pitch_name"].
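Putting the filter and the counts together:

batted_balls.loc[batted_balls["outs_when_up"] == 2].value_counts(["batter_team", "pitch_name"])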
Again, verify that the code above works as described using a new code cell.
Add another new code cell and compute summaries of categorical variables that you find interesting.
Summarizing categorical data is most often done with counts, but
we’ve got many more choices when we are working with numerical data. We
have several measures of center or spread that we could choose from – we
could even define our own metrics. Let’s say we wanted to know the
median launch_speed
across all of the batted balls in our
data set. We can use .median()
for this.
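batted_balls["launch_speed"].median()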
R and Python behave differently here. In Python you get a numeric value back, but in R you would get NA. R lets its users know that missing data is present, and users must manually override this behavior by telling functions that NA values should be removed from the calculation.
Add a new code cell and compute the maximum (.max()) launch angle for a batted ball in the data set.

If we want summaries of multiple columns at once, we can do this with
the .agg()
method.
batted_balls.agg(mean_pitch_mph = ("pitch_mph", "mean"),
std_pitch_mph = ("pitch_mph", "std"),
mean_launch_speed = ("launch_speed", "mean"),
std_launch_speed = ("launch_speed", "std")
)
It might be useful if we could get grouped summary statistics. Let's use .groupby() to see how these measures vary across pitch type (pitch_name).
batted_balls.groupby("pitch_name").agg(mean_pitch_mph = ("pitch_mph", "mean"),
std_pitch_mph = ("pitch_mph", "std"),
mean_launch_speed = ("launch_speed", "mean"),
std_launch_speed = ("launch_speed", "std")
)
Often, you may be in a situation where you would like to create new columns from the existing ones. This can be done using the familiar bracket notation ([]).
In the batted_balls
data set, let’s add a computed
column for the vertical launch velocity of the batted ball. The vertical
launch velocity can be computed as launch_speed
*
sin(launch_angle*(pi/180))
. We’ll need the
numpy
library for this. Install and load it, then you’ll
have access to np.sin()
for computing the sine function for
an angle.
batted_balls["vertical_launch_velocity"] = batted_balls["launch_speed"]*np.sin(batted_balls["launch_angle"]*(np.pi/180))
Again, add a code cell to verify that the code above works as described.
Update your code to add another column for the horizontal_launch_velocity. It is computed the same way, but uses np.cos() instead of np.sin().
Note that what we’ve done has permanently altered our
batted_balls
data frame. We’ve added two new columns to
it.
One pretty common step in an analysis is to create a categorical column from a variable which was originally numeric. I haven't found a very elegant way to do this in Python, but perhaps this is a great use-case for {reticulate}. In order to do this we can use the case_when() function in R. The structure of this function is a sequence of tests and their corresponding results when true. It will be easier to see in an example.
#in an R code chunk -- case_when() comes from {dplyr}
batted_balls_r <- py$batted_balls
batted_balls_r <- batted_balls_r %>%
  mutate(launch_speed_categorical = case_when(
    (launch_speed > 100) ~ "Very Fast",
    (launch_speed > 80) ~ "Fast",
    (launch_speed > 65) ~ "Meh",
    (launch_speed > 40) ~ "Kinda Slow",
    (is.na(launch_speed)) ~ NA_character_, #NA_character_ keeps every result the same type
    TRUE ~ "Was it even moving?"
  ))
Notice that, if you want these changes made permanent in R, you’ll
need to explicitly save the changes to the batted_balls_r
data frame. Once this is done, you can pass the results back to Python
via r.batted_balls_r
in a Python code chunk.
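If you'd rather stay in Python, one workable route for purely threshold-based categories like these is pandas' pd.cut(), which bins a numeric column at supplied cut points. The sketch below uses the same thresholds as the case_when() above; rows with a missing launch_speed simply come out missing:

#bins are open on the left and closed on the right, so (100, inf] means "Very Fast"
batted_balls["launch_speed_categorical"] = pd.cut(
    batted_balls["launch_speed"],
    bins = [-np.inf, 40, 65, 80, 100, np.inf],
    labels = ["Was it even moving?", "Kinda Slow", "Meh", "Fast", "Very Fast"]
)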
It’s not always the case that our data all lives neatly in a single
table. Most often our data is split across multiple tables. This is the
case in our baseball scenario. Useful information is stored away in the
park_dimensions data set. Storing this data separately
data set. Storing this data separately
reduces redundancy, reduces storage size, and improves reliability. If
we want to use park_dimensions
information in understanding
our batted_balls
data, then we can join the
park_dimensions
data onto our batted_balls
data. A join matches two tables using at least one shared
column – the resulting table includes columns from both of the
individual tables.
There are several different types of join, but we’ll almost always be able to get away with a left-join. This particular type of join begins with a table (on the left) and adds columns from another table (the table on the right) by matching on a shared column(s).
A left-join to add the park_dimensions
variables onto
the batted_balls
data appears below. In
pandas
, we perform joins with the
.merge()
method.
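A sketch appears below; it assumes the shared column in both tables is named park, so check your own column listings to confirm the matching column:

joined_data = batted_balls.merge(park_dimensions, how = "left", left_on = "park", right_on = "park")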
The left_on
and right_on
arguments are the
column matches. If matching on multiple columns, you can pass lists of
column names here – those lists should be the same length and the
matching columns should appear in the same positions within the
lists.
Verify that the resulting data frame contains all of the columns from batted_balls as well as the columns from park_dimensions. You can spot-check a few rows to ensure that the correct park dimension data has been appended to each row.

It is common to find a need to reshape your data. Some formats are convenient for analysis, while other formats are convenient for presentation or storage. The {tidyverse} provides two functions whose job it is to reshape data frames from wide to long (pivot_longer()) or from long to wide (pivot_wider()); in {pandas}, the analogous tools are the .melt() and .pivot() methods. The process of transforming a data frame
from wide format to long format is done by collapsing
several columns down to just two – one containing the old column names
and another containing the corresponding values under those old columns.
Let’s see that below on a simple data set that we’ll construct. For
context, let’s assume that the data set below contains growth
measurements in centimeters for plants placed in four different types of
fertilizer (A
- D
).
num_obs = 5
wide_fertilizer_data = pd.DataFrame({
"day" : range(num_obs),
"A" : np.random.uniform(0, 4, num_obs),
"B" : np.random.uniform(0, 4, num_obs),
"C" : np.random.uniform(0, 4, num_obs),
"D" : np.random.uniform(0, 4, num_obs)
})
day | A | B | C | D |
---|---|---|---|---|
0 | 1.8264057 | 1.317352 | 1.824590 | 0.4845877 |
1 | 3.6573500 | 1.724042 | 1.121196 | 3.0031416 |
2 | 0.0778902 | 3.660756 | 3.080697 | 0.1803989 |
3 | 0.8182959 | 3.341337 | 1.521192 | 2.8867121 |
4 | 1.6094134 | 3.816755 | 3.478010 | 2.7209922 |
In the simulated dataset above, the columns A
-
D
aren’t variables that are being measured. They are
levels of a categorical variable, fertilizer_type
.
In order to make this data tidy for analysis, we’ll need to
pivot those four fertilizer types into a new column, and collapse the
recorded growth values into another new column. This means that each
day
will be repeated four times…once for each of the four
fertilizer types. We should go from a data frame with five rows to one
with twenty (a longer data frame!). Let's see that with the .melt() method (pandas' analogue of pivot_longer()) below.
long_fertilizer_data = wide_fertilizer_data.melt(id_vars = ["day"], var_name = "fertilizer_type", value_name = "growth")
day | fertilizer_type | growth |
---|---|---|
0 | A | 1.8264057 |
1 | A | 3.6573500 |
2 | A | 0.0778902 |
3 | A | 0.8182959 |
4 | A | 1.6094134 |
0 | B | 1.3173524 |
1 | B | 1.7240415 |
2 | B | 3.6607563 |
3 | B | 3.3413372 |
4 | B | 3.8167553 |
0 | C | 1.8245895 |
1 | C | 1.1211957 |
2 | C | 3.0806975 |
3 | C | 1.5211921 |
4 | C | 3.4780098 |
0 | D | 0.4845877 |
1 | D | 3.0031416 |
2 | D | 0.1803989 |
3 | D | 2.8867121 |
4 | D | 2.7209922 |
The longer data frame is more difficult for us to extract meaning from than the original wide data frame. However, this long-format data frame is much better suited for analysis.
The pivot_wider() function can be thought of as the inverse of the pivot_longer() function; in {pandas}, the corresponding tool is the .pivot() method. We can use it to convert our long_fertilizer_data back into wide format. I've found this kind of reshaping to be really useful when I want to display tables in my notebooks.
wide_again = long_fertilizer_data.pivot(index = ["day"], columns = ["fertilizer_type"], values = ["growth"])
day | growth, A | growth, B | growth, C | growth, D |
---|---|---|---|---|
0 | 1.8264057 | 1.317352 | 1.824590 | 0.4845877 |
1 | 3.6573500 | 1.724042 | 1.121196 | 3.0031416 |
2 | 0.0778902 | 3.660756 | 3.080697 | 0.1803989 |
3 | 0.8182959 | 3.341337 | 1.521192 | 2.8867121 |
4 | 1.6094134 | 3.816755 | 3.478010 | 2.7209922 |
Being able to quickly reshape data using .melt() and .pivot() (or pivot_longer() and pivot_wider() in R) is a skill that will serve you quite well if you find yourself doing analysis and reporting.
There is much that we haven't covered, but a few things are definitely worth a quick mention.
Many of the changes we’ve made to the batted_balls
data frame here are temporary. If we want to make any permanent changes,
we’ll need to make explicit use of the equals (=
) operator.
Some functions have an additional inplace
argument which
can be set to True
to make permanent changes, but not every
function has this. We overwrite the original object if we use the same
name (or if we use inplace = True
), but can use a different
name and keep the original object available in our Global Environment.
For example, in the sketch below, the name fast_balls is just an illustration:
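#storing the result under a new name keeps the original batted_balls intact
fast_balls = batted_balls.loc[batted_balls["launch_speed"] > 100]
#reusing the original name (or inplace = True, where available) overwrites it
batted_balls = batted_balls.sort_values("launch_speed")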
The table outputs we’ve obtained while running our code don’t
look excellent. There is an R package called {kableExtra}
which provides functionality to convert our tables into visually
pleasing components of our HTML document. Since this is an R package,
you would need {reticulate}
within an R code chunk to use
it. For example,
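One way this might look, assuming {reticulate} is loaded so that py$ can reach the Python session from an R chunk:

library(knitr)
library(kableExtra)

py$batted_balls %>%
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))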
The code in the block above will produce a nicely formatted table in which the rows are striped and if we hover over a row of our table then it will be highlighted in the HTML document. Try adding a code cell as above and then re-knitting your document. See how nice this table looks? You might copy and paste these lines to the end of each of our code cells in this notebook that output tables. The document resulting from your knitting will be much more visually pleasing.
There is a lot more to learn about data manipulation and Python in general. I like R much better for data wrangling than Python because of the {tidyverse} and the other package groups within the tidy-ecosystem (e.g. {tidytext}, {tidymodels}, etc.). All of these packages are built on common syntax and programmatic principles, and they always work with data frames. You can read more about this in the TidyTools Manifesto. That being said, you can certainly complete all of your data wrangling tasks in Python if you like.
You won’t be an expert after working through this document, but it should provide you with a solid start. Please feel free to continue adding notes to your markdown file as we encounter more advanced functionality.