January 16, 2025
summarize() Function
Missing Values (NA) are Contagious
mutate() to construct new variables
group_by() and summarize()
summarize() Function
We can use the summarize() function to calculate summary statistics
The next slide reviews some summary statistics for numerical variables
Measures of Center:
mean(col_name)
median(col_name)
Measures of Spread:
sd(col_name)
IQR(col_name)
Additional Measures:
min(col_name)
max(col_name)
quantile(col_name, probs), where probs is the argument for the percentile(s) you wish to obtain, given as proportions between 0 and 1
\(\bigstar\) Open our Day2to5_AustinHousingData.qmd file and find the answers to several of the questions we asked last time (posted to Slack). While our questions from last class were all about averages, try using some of the other functions to gain a deeper understanding of our numeric variables.
For Example:
min_lotSize | avg_lotSize | median_lotSize | max_lotSize | sd_lotSize |
---|---|---|---|---|
100 | 21957.36 | 8232 | 34154525 | 507857 |
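A pipeline along the following lines would produce a one-row table shaped like the one above. This is only a sketch: the data frame name homes, the column name lotSize, and the five toy values are assumptions made for illustration, not the actual Austin housing data.

```r
library(dplyr)

# Hypothetical toy data; in the notebook you would start from the
# Austin housing data frame instead (its name is not shown on the slide).
homes <- data.frame(lotSize = c(100, 5000, 8232, 12000, 43560))

lot_summary <- homes %>%
  summarize(
    min_lotSize    = min(lotSize),
    avg_lotSize    = mean(lotSize),
    median_lotSize = median(lotSize),
    max_lotSize    = max(lotSize),
    sd_lotSize     = sd(lotSize),
    # probs is a proportion: 0.90 requests the 90th percentile
    p90_lotSize    = quantile(lotSize, probs = 0.90)
  )

lot_summary
```

Each argument to summarize() becomes one column of the result, so naming the arguments descriptively keeps the output readable.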
While you are doing this, do “future you” a favor by keeping your notebook organized, limiting each code chunk to one pipeline, and writing descriptive text before each code chunk.
Our Austin, TX Zillow data has no missing values.
What happens if we try to compute numerical summaries with a data set that contains missing values?
var1 | var2 |
---|---|
30.06603 | 16.70668 |
28.09459 | 13.91630 |
24.13013 | 15.35275 |
21.47366 | 12.91570 |
24.86587 | NA |
25.55923 | 14.67098 |
26.28593 | 18.33346 |
25.84654 | NA |
27.49840 | 10.47783 |
23.72113 | 12.31492 |
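To see what happens, here is a small sketch using the var2 values from the table above. In R, functions such as mean() return NA as soon as any input is NA, unless na.rm = TRUE is supplied. The data frame name toy and the output column names are assumptions for illustration.

```r
library(dplyr)

# var2 values copied from the table above, including its two NA entries.
# The name `toy` is hypothetical.
toy <- data.frame(
  var2 = c(16.70668, 13.91630, 15.35275, 12.91570, NA,
           14.67098, 18.33346, NA, 10.47783, 12.31492)
)

na_demo <- toy %>%
  summarize(
    mean_with_na    = mean(var2),                # NA: a single missing value spoils the result
    mean_without_na = mean(var2, na.rm = TRUE),  # mean of the 8 observed values only
    n_missing       = sum(is.na(var2))           # how many values were dropped
  )

na_demo
```

Reporting the number of missing values alongside the na.rm = TRUE summary makes it clear how many observations the statistic is actually based on.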
The data we have generally limits the questions we can ask and answer
Sometimes we can construct new variables from existing variables though
For Example: One of our questions from our previous class meeting was about the number of bathrooms per bedroom in a home. While we don’t have a bathrooms_per_br column, we can construct one using the numOfBathrooms and numOfBedrooms columns.
homeType | numOfBedrooms | numOfBathrooms | bathrooms_per_br |
---|---|---|---|
Single Family | 2 | 3 | 1.5000000 |
Single Family | 5 | 5 | 1.0000000 |
Multiple Occupancy | 6 | 4 | 0.6666667 |
Single Family | 1 | 1 | 1.0000000 |
Single Family | 5 | 4 | 0.8000000 |
Condo | 3 | 2 | 0.6666667 |
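The new column in the table above could be built with a mutate() call like the one sketched here. The six rows are copied from the table; the data frame name homes is an assumption, since the slide does not show the name used in the notebook.

```r
library(dplyr)

# Rows copied from the table above; `homes` is a hypothetical name.
homes <- data.frame(
  homeType       = c("Single Family", "Single Family", "Multiple Occupancy",
                     "Single Family", "Single Family", "Condo"),
  numOfBedrooms  = c(2, 5, 6, 1, 5, 3),
  numOfBathrooms = c(3, 5, 4, 1, 4, 2)
)

homes_with_ratio <- homes %>%
  # mutate() adds a column computed row by row from existing columns
  mutate(bathrooms_per_br = numOfBathrooms / numOfBedrooms)

homes_with_ratio
```

Because mutate() returns the whole data frame with the extra column attached, the result can be piped straight into further steps.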
\(\bigstar\) Starting with the code to create bathrooms_per_br, compute the average number of bathrooms per bedroom for properties in our sample.
We generally summarize categorical variables using counts
We can use count() for this, which doesn’t require summarize()
We can mutate() relative frequencies after calculating the counts for each level
\(\bigstar\) Continue adding to your notebook by calculating frequencies and relative frequencies for several categorical variables. As a reminder, there were some variables in our data set that we said could be treated as numerical or categorical. For these variables, compare the insights you obtain from numerical summary statistics versus counts.
Again, do “future you” a favor by keeping your notebook organized and narrated.
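As a starting point, the count() and relative-frequency steps described above might look like the sketch below. The data frame name homes and its six homeType values are toy assumptions for illustration; count() itself names the counts column n.

```r
library(dplyr)

# Hypothetical toy data with a single categorical column.
homes <- data.frame(
  homeType = c("Single Family", "Single Family", "Multiple Occupancy",
               "Single Family", "Single Family", "Condo")
)

home_type_freq <- homes %>%
  count(homeType) %>%             # one row per level; counts stored in column n
  mutate(rel_freq = n / sum(n))   # each count divided by the total count

home_type_freq
```

The rel_freq column should always sum to 1, which is a quick check that the pipeline did what was intended.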
All of our techniques so far allow us to analyse a single variable at a time, across the entire data set.
What if we are interested in potential associations between variables?
We can use group_by() followed by summarize() to obtain one set of summary statistics per group
For Example: Is there an association between city and number of bedrooms (numOfBedrooms)?
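One way to explore this question is sketched below. The column names city and numOfBedrooms come from the slide, but the data frame name homes and the five rows of values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data; the values are made up for this sketch.
homes <- data.frame(
  city          = c("austin", "austin", "austin", "pflugerville", "pflugerville"),
  numOfBedrooms = c(3, 4, 2, 4, 5)
)

beds_by_city <- homes %>%
  group_by(city) %>%                        # one group per city
  summarize(
    n_homes         = n(),                  # number of rows in each group
    avg_bedrooms    = mean(numOfBedrooms),
    median_bedrooms = median(numOfBedrooms)
  )

beds_by_city
```

Including n() in the summary shows how many homes each group's statistics are based on, which helps when some groups are very small.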
\(\bigstar\) Use group_by() and summarize() to answer some of our questions about associations between variables.
Warning: Make sure you are grouping by categorical variables – grouping by a numeric variable with lots of observed levels is ill-advised
As you’ve done, continue to do “future you” a favor by keeping your notebook organized and narrated.
summarize() to compute summary statistics
count() to build a frequency table for levels of a categorical variable
mutate(rel_freq = n/sum(n)) to build the relative frequency column
group_by() and then summarize() to calculate summary statistics for groups defined by a categorical variable
Next Time: Data Visualization
Homework: Complete the Topic 4 notebook and submit the hash code using the Google Form at least 30 minutes before the start of Thursday’s class