Descriptive Statistics with R and the Austin Housing Data

August 19, 2025

The Highlights

The summarize() Function
Summary Statistics for Numerical Variables
- Missing Values (NA) are Contagious
Using mutate() to construct new variables
Summary Statistics for Categorical Variables
Summary Statistics by Group with group_by() and summarize()

The `summarize()` Function

We can use the summarize() function to calculate summary statistics. Each argument names a new column and gives the expression used to compute it:

data %>%
  summarize(
    name1 = expression,
    name2 = expression,
    ...
  )

The next slide reviews some summary statistics for numerical variables.

Summary Statistics for Numerical Variables

Measures of Center:

Mean: mean(col_name)
Median: median(col_name)

Measures of Spread:

Standard Deviation: sd(col_name)
Inter-Quartile Range: IQR(col_name)

Additional Measures:

Minimum: min(col_name)
Maximum: max(col_name)
Quantiles/Percentiles: quantile(col_name, probs)
- Note: probs is the argument for the percentile(s) you wish to obtain
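
As a minimal sketch of how these functions fit inside summarize(), the example below uses a small made-up data frame. The `toy` tibble and its `price` column are hypothetical and are not part of the Austin data. Note that `probs` takes proportions between 0 and 1, so the 90th percentile is requested with `probs = 0.9`.

```r
library(dplyr)

# Hypothetical toy data (not the Austin housing data)
toy <- tibble(price = c(100, 200, 300, 400, 1000))

toy_summary <- toy %>%
  summarize(
    avg_price    = mean(price),                  # measure of center
    median_price = median(price),                # measure of center
    sd_price     = sd(price),                    # measure of spread
    iqr_price    = IQR(price),                   # measure of spread
    min_price    = min(price),
    max_price    = max(price),
    p90_price    = quantile(price, probs = 0.9)  # 90th percentile
  )

toy_summary
```

The single large value (1000) pulls the mean above the median here, which is one reason to look at more than one measure of center.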

Try It!

\(\bigstar\) Open our Day2to5_AustinHousingData.qmd file and find the answers to several of the questions we asked last time (posted to Slack). While our questions from last class were all about averages, try using some of the other functions to gain a deeper understanding of our numeric variables.

For Example:

austin %>%
  summarize(
    min_lotSize = min(lotSizeSqFt),
    avg_lotSize = mean(lotSizeSqFt),
    median_lotSize = median(lotSizeSqFt),
    max_lotSize = max(lotSizeSqFt),
    sd_lotSize = sd(lotSizeSqFt)
  )

min_lotSize	avg_lotSize	median_lotSize	max_lotSize	sd_lotSize
100	21957.36	8232	34154525	507857

While you are doing this, do “future you” a favor by keeping your notebook organized: limit each code chunk to one pipeline, and write descriptive text before each code chunk.

Warning: Missing Values are Contagious

Our Austin, TX Zillow data has no missing values.

What happens if we try to compute numerical summaries with a data set that contains missing values?

var1	var2
25.52538	16.38364
25.10974	13.62790
25.56339	16.67832
26.18156	16.50107
21.82637	16.16196
24.60080	NA
26.12753	11.01082
25.46633	10.21072
25.11542	17.28418
23.68327	NA

missing_df %>%
  summarize(
    avg_var1 = mean(var1),
    avg_var2 = mean(var2)
  )

avg_var1	avg_var2
24.91998	NA

missing_df %>%
  summarize(
    avg_var1 = mean(var1),
    avg_var2 = mean(var2, na.rm = TRUE)
  )

avg_var1	avg_var2
24.91998	14.73233

We use na.rm = TRUE to remove missing values from the calculation
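
Because na.rm = TRUE drops observations without any message, it can help to report how many values were missing alongside the summary. The sketch below assumes a hypothetical `toy_missing` data frame (not the `missing_df` shown above) and uses `sum(is.na())` to count the missing entries.

```r
library(dplyr)

# Hypothetical data with two missing values in var2
toy_missing <- tibble(
  var1 = c(1, 2, 3, 4),
  var2 = c(10, NA, 30, NA)
)

na_check <- toy_missing %>%
  summarize(
    n_rows    = n(),                       # total number of rows
    n_missing = sum(is.na(var2)),          # number of NA values in var2
    avg_var2  = mean(var2, na.rm = TRUE)   # mean of the non-missing values only
  )

na_check
```

Here the average of var2 is based on only 2 of the 4 rows, which is worth knowing before interpreting it.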

What if I Don’t Have the Variable I Want?

The data we have generally limits the questions we can ask and answer

Sometimes we can construct new variables from existing variables though

For Example: One of our questions from our previous class meeting was about the number of bathrooms per bedroom in a home. While we don’t have a bathrooms_per_br column, we can construct one using the numOfBathrooms and numOfBedrooms columns.

austin %>%
  mutate(bathrooms_per_br = numOfBathrooms / numOfBedrooms)

homeType	numOfBedrooms	numOfBathrooms	bathrooms_per_br
Single Family	2	3	1.5000000
Single Family	5	5	1.0000000
Multiple Occupancy	6	4	0.6666667
Single Family	1	1	1.0000000
Single Family	5	4	0.8000000
Condo	3	2	0.6666667

\(\bigstar\) Starting with the code to create bathrooms_per_br, compute the average number of bathrooms per bedroom for properties in our sample.

Summary Statistics for Categorical Variables

We generally summarize categorical variables using counts

A frequency table shows raw counts of each category
- We use count(), which doesn’t require summarize() for this

For Example:

austin %>%
  count(city)

city	n
austin	7413
del valle	47
driftwood	6
dripping springs	3
manchaca	3
pflugerville	24
west lake hills	2

A relative frequency table shows proportions for each category
- We use mutate() to compute relative frequencies after calculating the counts for each level

For Example:

austin %>%
  count(city) %>%
  mutate(rel_freq = n/sum(n))

city	n	rel_freq
austin	7413	0.9886636
del valle	47	0.0062683
driftwood	6	0.0008002
dripping springs	3	0.0004001
manchaca	3	0.0004001
pflugerville	24	0.0032009
west lake hills	2	0.0002667
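
A small variation that may be useful: count() accepts `sort = TRUE`, which orders the table from most to least frequent category. The sketch below uses a hypothetical `toy_homes` data frame rather than the Austin data, and checks that the relative frequencies add up to 1.

```r
library(dplyr)

# Hypothetical toy data (not the Austin housing data)
toy_homes <- tibble(
  homeType = c("Condo", "Single Family", "Single Family", "Condo",
               "Single Family", "Townhouse", "Single Family", "Single Family")
)

home_freq <- toy_homes %>%
  count(homeType, sort = TRUE) %>%   # counts, largest category first
  mutate(rel_freq = n / sum(n))      # proportion of rows in each category

home_freq

# Relative frequencies should sum to 1
sum(home_freq$rel_freq)
```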

Try It!

\(\bigstar\) Continue adding to your notebook by calculating frequencies and relative frequencies for several categorical variables. As a reminder, there were some variables in our data set that we said could be treated as numerical or categorical. For these variables, compare the insights you obtain from numerical summary statistics versus counts.

Again, do “future you” a favor by keeping your notebook organized and narrated.

Grouped Summaries

All of our techniques so far allow us to analyze a single variable at a time, across the entire data set.

What if we are interested in potential associations between variables?

We can use group_by() followed by summarize() to obtain one set of summary statistics per group

For Example: Is there an association between city and number of bedrooms (numOfBedrooms)?

austin %>%
  group_by(city) %>%
  summarize(
    num_homes = n(),
    avg_bedrooms = mean(numOfBedrooms)
  )

city	num_homes	avg_bedrooms
austin	7413	3.446918
del valle	47	3.319149
driftwood	6	4.333333
dripping springs	3	4.666667
manchaca	3	3.666667
pflugerville	24	3.458333
west lake hills	2	5.500000
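
Group sizes matter when reading a table like this: the west lake hills average rests on only 2 homes. The sketch below, built on a hypothetical `toy_city` data frame rather than the Austin data, shows one way to report the group size with n() next to more than one summary statistic per group.

```r
library(dplyr)

# Hypothetical toy data (not the Austin housing data)
toy_city <- tibble(
  city          = c("austin", "austin", "austin", "del valle", "del valle"),
  numOfBedrooms = c(3, 4, 2, 3, 5)
)

by_city <- toy_city %>%
  group_by(city) %>%
  summarize(
    num_homes       = n(),                    # rows in each group
    avg_bedrooms    = mean(numOfBedrooms),
    median_bedrooms = median(numOfBedrooms)
  )

by_city
```

The result has one row per city, with the same set of summary columns computed separately within each group.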

Try It!

\(\bigstar\) Use group_by() and summarize() to answer some of our questions about associations between variables.

Warning: Make sure you are grouping by categorical variables – grouping by a numeric variable with lots of observed levels is ill-advised

As you’ve done, continue to do “future you” a favor by keeping your notebook organized and narrated.

Summary

We can use summarize() to compute summary statistics
- We pass summarize() the names of the resulting columns and the expressions used to calculate them
We can use count() to build a frequency table for levels of a categorical variable
- To compute relative frequencies, we use mutate(rel_freq = n/sum(n)) to build the relative frequency column
We can use group_by() and then summarize() to calculate summary statistics for groups defined by a categorical variable

Next Time: Data Visualization

Homework: Complete the Topic 4 notebook and submit the hash code using the Google Form at least 30 minutes before the start of Thursday’s class
