Descriptive Statistics with R and the Austin Housing Data

January 2, 2026

The Highlights

  • The summarize() Function
  • Summary Statistics for Numerical Variables
    • Missing Values (NA) are Contagious
  • Using mutate() to construct new variables
  • Summary Statistics for Categorical Variables
  • Summary Statistics by Group with group_by() and summarize()

The summarize() Function

We can use the summarize() function to calculate summary statistics

data %>%
  summarize(
    name1 = expression,
    name2 = expression,
    ...
  )
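
To make the template concrete, here is a minimal sketch on a small made-up data frame. The names toy_homes, price, and sqft are hypothetical and are not columns from the Austin data; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration (not the Austin data)
toy_homes <- tibble(
  price = c(250, 310, 275, 420),     # in thousands of dollars
  sqft  = c(1200, 1500, 1350, 2100)
)

toy_summary <- toy_homes %>%
  summarize(
    avg_price   = mean(price),    # plays the role of name1 = expression
    median_sqft = median(sqft)    # plays the role of name2 = expression
  )

toy_summary
```

The result is a one-row data frame with one column per name = expression pair.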

The next slide reviews some common summary statistics for numerical variables

Summary Statistics for Numerical Variables

Measures of Center:

  • Mean: mean(col_name)
  • Median: median(col_name)

Measures of Spread:

  • Standard Deviation: sd(col_name)
  • Inter-Quartile Range: IQR(col_name)

Additional Measures:

  • Minimum: min(col_name)
  • Maximum: max(col_name)
  • Quantiles/Percentiles: quantile(col_name, probs)
    • Note: probs takes one or more values between 0 and 1 giving the percentile(s) you wish to obtain (for example, probs = 0.9 for the 90th percentile)
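
As a rough illustration of the functions listed above, the sketch below applies several of them inside summarize(). The data frame toy and its column x are hypothetical, invented for this example. Each quantile() call here asks for a single probability so that the summary stays on one row.

```r
library(dplyr)

# Hypothetical values, invented for illustration (not the Austin data)
toy <- tibble(x = c(2, 4, 4, 5, 7, 9, 10, 12))

toy_stats <- toy %>%
  summarize(
    q1       = quantile(x, probs = 0.25),  # 25th percentile (first quartile)
    middle   = quantile(x, probs = 0.50),  # 50th percentile, same as median(x)
    q3       = quantile(x, probs = 0.75),  # 75th percentile (third quartile)
    iqr      = IQR(x),                     # third quartile minus first quartile
    smallest = min(x),
    largest  = max(x)
  )

toy_stats
```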

Try It!

\(\bigstar\) Open the Austin Zillow Quarto document you’ve been working on and find the answers to several of the questions we asked about numerical variables last time.

Try using more variety than just the mean() or median() functions.

A list of additional questions is posted to Slack.

For Example:

austin %>%
  summarize(
    min_lotSize = min(lotSizeSqFt),
    avg_lotSize = mean(lotSizeSqFt),
    median_lotSize = median(lotSizeSqFt),
    max_lotSize = max(lotSizeSqFt),
    sd_lotSize = sd(lotSizeSqFt)
  )
min_lotSize avg_lotSize median_lotSize max_lotSize sd_lotSize
100 21957.36 8232 34154525 507857

While you are doing this, do “future you” a favor by keeping your notebook organized: limit each code chunk to one pipeline, and write descriptive text before each code chunk.

Warning: Missing Values are Contagious

Our Austin, TX Zillow data has no missing values.

What happens if we try to compute numerical summaries with a data set that contains missing values?

var1 var2
23.38405 18.61620
25.11181 12.08665
21.86741 12.53071
20.17029 NA
21.25169 17.49857
21.35059 10.63031
26.47592 19.34314
29.09288 13.53561
24.39339 NA
25.32031 12.36794
missing_df %>%
  summarize(
    avg_var1 = mean(var1),
    avg_var2 = mean(var2)
  )
avg_var1 avg_var2
23.84183 NA

missing_df %>%
  summarize(
    avg_var1 = mean(var1),
    avg_var2 = mean(var2, na.rm = TRUE)
  )
avg_var1 avg_var2
23.84183 14.57614

We use na.rm = TRUE to remove missing values from the calculation
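
The sketch below rebuilds missing_df from the values printed on this slide so the behavior can be reproduced. The n_missing_var2 column is an addition for illustration: it counts the missing values before they are dropped. The na.rm = TRUE argument works the same way in sd() and median().

```r
library(dplyr)

# missing_df rebuilt from the values shown on the slide
missing_df <- tibble(
  var1 = c(23.38405, 25.11181, 21.86741, 20.17029, 21.25169,
           21.35059, 26.47592, 29.09288, 24.39339, 25.32031),
  var2 = c(18.61620, 12.08665, 12.53071, NA, 17.49857,
           10.63031, 19.34314, 13.53561, NA, 12.36794)
)

na_check <- missing_df %>%
  summarize(
    n_missing_var2 = sum(is.na(var2)),            # how many values are missing
    avg_var2       = mean(var2, na.rm = TRUE),    # mean of the observed values only
    sd_var2        = sd(var2, na.rm = TRUE),
    median_var2    = median(var2, na.rm = TRUE)
  )

na_check
```

Reporting the number of missing values alongside the summary is a reasonable habit, since a mean computed after dropping NAs describes fewer observations than the data set contains.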

What if I Don’t Have the Variable I Want?

The data we have generally limits the questions we can ask and answer

Sometimes we can construct new variables from existing variables though

For Example: One of our questions from our previous class meeting was about the number of bathrooms per bedroom in a home. While we don’t have a bathrooms_per_br column, we can construct one using the numOfBathrooms and numOfBedrooms columns.

austin %>%
  mutate(bathrooms_per_br = numOfBathrooms / numOfBedrooms)
homeType            numOfBedrooms  numOfBathrooms  bathrooms_per_br
Single Family       2              3               1.5000000
Single Family       5              5               1.0000000
Multiple Occupancy  6              4               0.6666667
Single Family       1              1               1.0000000
Single Family       5              4               0.8000000
Condo               3              2               0.6666667
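
One caution when building a ratio with mutate(): dividing by zero in R produces Inf rather than an error. The sketch below uses a hypothetical toy data frame in which one home has zero bedrooms. Whether the Austin data contains such homes is an assumption to check, not something stated here, and dropping those rows with filter() is only one possible way to handle them.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration; the second home has no bedrooms
toy_homes <- tibble(
  numOfBedrooms  = c(2, 0, 4),
  numOfBathrooms = c(3, 1, 2)
)

toy_ratio <- toy_homes %>%
  mutate(bathrooms_per_br = numOfBathrooms / numOfBedrooms)

toy_ratio  # the zero-bedroom row has bathrooms_per_br equal to Inf

# One possible way to handle it: keep only homes with at least one bedroom
toy_clean <- toy_ratio %>%
  filter(numOfBedrooms > 0)

toy_clean
```

A single Inf is enough to make mean() return Inf, so it is worth looking at the new column before summarizing it.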

\(\bigstar\) Starting with the code to create bathrooms_per_br, compute the average number of bathrooms per bedroom for properties in our sample.

Summary Statistics for Categorical Variables

We generally summarize categorical variables using counts

  • A frequency table shows raw counts of each category
    • We use count(), which doesn’t require summarize() for this

For Example:

austin %>%
  count(city)
city              n
austin            7413
del valle         47
driftwood         6
dripping springs  3
manchaca          3
pflugerville      24
west lake hills   2

  • A relative frequency table shows proportions for each category
    • We use mutate() to add a relative frequency column after calculating the counts for each level

For Example:

austin %>%
  count(city) %>%
  mutate(rel_freq = n/sum(n))
city              n     rel_freq
austin            7413  0.9886636
del valle         47    0.0062683
driftwood         6     0.0008002
dripping springs  3     0.0004001
manchaca          3     0.0004001
pflugerville      24    0.0032009
west lake hills   2     0.0002667
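
The same count() then mutate() pattern can be tried on a small made-up example where the arithmetic is easy to check by hand. The toy_homes data frame and its values are hypothetical. The sort = TRUE argument to count() is an optional extra that lists the most common category first.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration (not the Austin data)
toy_homes <- tibble(
  homeType = c("Condo", "Single Family", "Single Family",
               "Townhouse", "Single Family")
)

toy_freq <- toy_homes %>%
  count(homeType, sort = TRUE) %>%   # frequency table, largest count first
  mutate(rel_freq = n / sum(n))      # each count divided by the total count

toy_freq
```

Because every count is divided by the same total, the rel_freq column always sums to 1, which makes a quick sanity check.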

Try It!

\(\bigstar\) Continue adding to your notebook by calculating frequencies and relative frequencies for several categorical variables. As a reminder, there were some variables in our data set that we said could be treated as numerical or categorical. For these variables, compare the insights you obtain from numerical summary statistics versus counts.

Again, do “future you” a favor by keeping your notebook organized and narrated.

Grouped Summaries

All of our techniques so far allow us to analyze a single variable at a time, across the entire data set.

What if we are interested in potential associations between variables?

We can use group_by() followed by summarize() to obtain one set of summary statistics per group

For Example: Is there an association between city and number of bedrooms (numOfBedrooms)?

austin %>%
  group_by(city) %>%
  summarize(
    num_homes = n(),
    avg_bedrooms = mean(numOfBedrooms)
  )
city              num_homes  avg_bedrooms
austin            7413       3.446918
del valle         47         3.319149
driftwood         6          4.333333
dripping springs  3          4.666667
manchaca          3          3.666667
pflugerville      24         3.458333
west lake hills   2          5.500000
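
A grouped summary can hold several statistics at once. The sketch below uses a hypothetical toy data frame (the neighborhood and price columns are invented and are not in the Austin data) and includes n(), which counts the rows in each group. Group sizes matter when reading grouped averages: in the city table above, the west lake hills average rests on only 2 homes.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration (not the Austin data)
toy_sales <- tibble(
  neighborhood = c("North", "North", "South", "South", "South"),
  price        = c(200, 300, 150, 250, 500)   # in thousands of dollars
)

toy_by_group <- toy_sales %>%
  group_by(neighborhood) %>%
  summarize(
    num_homes    = n(),             # number of rows in each group
    avg_price    = mean(price),
    median_price = median(price)
  )

toy_by_group
```

The output has one row per neighborhood, so differences between groups can be compared side by side.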

Try It!

\(\bigstar\) Use group_by() and summarize() to answer some of our questions about associations between variables.

Warning: Make sure you are grouping by categorical variables. Grouping by a numeric variable with many distinct values is ill-advised, since it produces many small groups, each with too few observations to summarize meaningfully.

As you’ve done, continue to do “future you” a favor by keeping your notebook organized and narrated.

Exit Ticket Task

Navigate to our MAT241 Exit Ticket Form, answer the questions, and complete the task below.


Note. Today’s discussion is listed as 3. Descriptive Statistics

Task: The variable measuring the number of bedrooms is one that we said we could treat as either numerical or categorical. It is also reasonable to expect that the number of bedrooms would vary by home type. Describe what you might do to use descriptive statistics to examine whether the number of bedrooms is associated with the home type.

Summary

  • We can use summarize() to compute summary statistics
    • We pass summarize() the names of the resulting columns and the expressions used to calculate them
  • We can use count() to build a frequency table for levels of a categorical variable
    • To compute relative frequencies, we use mutate(rel_freq = n/sum(n)) to build the relative frequency column
  • We can use group_by() and then summarize() to calculate summary statistics for groups defined by a categorical variable

Next Time: Data Visualization

Homework: Complete the Topic 4 notebook and submit the hash code using the Google Form at least 30 minutes before the start of our next class meeting.