Descriptive Statistics with R and the Austin Housing Data

September 16, 2024

The Highlights

  • The summarize() Function
  • Summary Statistics for Numerical Variables
    • Missing Values (NA) are Contagious
  • Using mutate() to construct new variables
  • Summary Statistics for Categorical Variables
  • Summary Statistics by Group with group_by() and summarize()

The summarize() Function

We can use the summarize() function to calculate summary statistics

data %>%
  summarize(
    name1 = expression,
    name2 = expression,
    ...
  )

The next slide reviews some summary statistics for numerical variables.

Summary Statistics for Numerical Variables

Measures of Center:

  • Mean: mean(col_name)
  • Median: median(col_name)

Measures of Spread:

  • Standard Deviation: sd(col_name)
  • Inter-Quartile Range: IQR(col_name)

Additional Measures:

  • Minimum: min(col_name)
  • Maximum: max(col_name)
  • Quantiles/Percentiles: quantile(col_name, probs)
    • Note: probs is the argument for the percentile(s) you wish to obtain
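
As a minimal sketch of how these functions fit inside summarize(), the example below uses a small made-up data frame. The name toy_homes, the column lot_size, and all of its values are assumptions for illustration; they are not part of the Austin data.

```r
library(dplyr)

# Hypothetical toy data standing in for a numeric column such as lotSizeSqFt
toy_homes <- data.frame(
  lot_size = c(4000, 5500, 6200, 7100, 8300, 9900, 25000)
)

toy_summary <- toy_homes %>%
  summarize(
    min_lot    = min(lot_size),
    q1_lot     = quantile(lot_size, probs = 0.25),  # 25th percentile
    median_lot = median(lot_size),
    mean_lot   = mean(lot_size),
    q3_lot     = quantile(lot_size, probs = 0.75),  # 75th percentile
    max_lot    = max(lot_size),
    sd_lot     = sd(lot_size),
    iqr_lot    = IQR(lot_size)                      # q3_lot minus q1_lot
  )

toy_summary
```

In this toy data the single large lot pulls the mean well above the median, which is one reason to report both measures of center.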

Try It!

★ Open our Day2to5_AustinHousingData.qmd file and find the answers to several of the questions we asked last time (posted to Slack). While our questions from last class were all about averages, try using some of the other functions to gain a deeper understanding of our numeric variables.

For Example:

austin %>%
  summarize(
    min_lotSize = min(lotSizeSqFt),
    avg_lotSize = mean(lotSizeSqFt),
    median_lotSize = median(lotSizeSqFt),
    max_lotSize = max(lotSizeSqFt),
    sd_lotSize = sd(lotSizeSqFt)
  )

min_lotSize  avg_lotSize  median_lotSize  max_lotSize  sd_lotSize
        100     21957.36            8232     34154525      507857

While you are doing this, do “future you” a favor by keeping your notebook organized, limiting each code chunk to one pipeline, and writing descriptive text before each code chunk.

05:00

Warning: Missing Values are Contagious

Our Austin, TX Zillow data has no missing values.

What happens if we try to compute numerical summaries with a data set that contains missing values?

var1      var2
23.29265  10.83647
26.66060  15.20252
26.13309  NA
25.70069  13.52328
19.07883  15.46630
27.74035  14.41281
25.17816  17.00389
25.42456  14.53529
28.95521  13.66709
31.78997  NA

missing_df %>%
  summarize(
    avg_var1 = mean(var1),
    avg_var2 = mean(var2)
  )

avg_var1  avg_var2
25.99541  NA

missing_df %>%
  summarize(
    avg_var1 = mean(var1),
    avg_var2 = mean(var2, na.rm = TRUE)
  )

avg_var1  avg_var2
25.99541  14.33096

We use na.rm = TRUE to remove missing values from the calculation
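
The same argument is accepted by the other summary functions listed earlier, and it can help to count the missing values before removing them. Here is a minimal sketch on a made-up data frame; toy_missing and its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical data frame with two missing values in var2
toy_missing <- data.frame(
  var1 = c(10, 12, 14, 16, 18),
  var2 = c(1, NA, 3, NA, 5)
)

toy_na_summary <- toy_missing %>%
  summarize(
    n_missing_var2 = sum(is.na(var2)),            # how many values are missing
    median_var2    = median(var2, na.rm = TRUE),  # na.rm works for median()
    sd_var2        = sd(var2, na.rm = TRUE),      # ... and for sd()
    max_var2       = max(var2, na.rm = TRUE)      # ... and for max()
  )

toy_na_summary
```

Reporting the number of missing values alongside the summaries shows how many observations each statistic is actually based on.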

What if I Don’t Have the Variable I Want?

The data we have generally limits the questions we can ask and answer

Sometimes we can construct new variables from existing variables though

For Example: One of our questions from our previous class meeting was about the number of bathrooms per bedroom in a home. While we don’t have a bathrooms_per_br column, we can construct one using the numOfBathrooms and numOfBedrooms columns.

austin %>%
  mutate(bathrooms_per_br = numOfBathrooms / numOfBedrooms)

homeType            numOfBedrooms  numOfBathrooms  bathrooms_per_br
Single Family                   2               3         1.5000000
Single Family                   5               5         1.0000000
Multiple Occupancy              6               4         0.6666667
Single Family                   1               1         1.0000000
Single Family                   5               4         0.8000000
Condo                           3               2         0.6666667

★ Starting with the code to create bathrooms_per_br, compute the average number of bathrooms per bedroom for properties in our sample.
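
One caveat worth checking when a new variable is a ratio: dividing by zero in R gives Inf rather than an error, and a single Inf makes the mean Inf. Whether the Austin data contains any zero-bedroom listings is not stated here, so the sketch below uses a made-up data frame (toy_listings is hypothetical). It also uses filter(), which is a dplyr function not otherwise shown in these slides, as one possible way to handle such rows.

```r
library(dplyr)

# Hypothetical listings; the second one has 0 bedrooms (e.g. a studio)
toy_listings <- data.frame(
  numOfBathrooms = c(2, 1, 3),
  numOfBedrooms  = c(4, 0, 3)
)

toy_ratio <- toy_listings %>%
  mutate(bathrooms_per_br = numOfBathrooms / numOfBedrooms)

# The zero-bedroom row produces Inf
toy_ratio$bathrooms_per_br

# One possible approach: keep only rows with at least one bedroom
toy_avg <- toy_ratio %>%
  filter(numOfBedrooms > 0) %>%
  summarize(avg_bathrooms_per_br = mean(bathrooms_per_br))

toy_avg
```

Dropping rows changes which homes the average describes, so a choice like this belongs in the narrative text of the notebook.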

Summary Statistics for Categorical Variables

We generally summarize categorical variables using counts

  • A frequency table shows raw counts of each category
    • We use count(), which doesn’t require summarize() for this

For Example:

austin %>%
  count(city)

city                 n
austin            7413
del valle           47
driftwood            6
dripping springs     3
manchaca             3
pflugerville        24
west lake hills      2

  • A relative frequency table shows proportions for each category
    • We need to mutate() relative frequencies after calculating the counts for each level

For Example:

austin %>%
  count(city) %>%
  mutate(rel_freq = n/sum(n))

city                 n   rel_freq
austin            7413  0.9886636
del valle           47  0.0062683
driftwood            6  0.0008002
dripping springs     3  0.0004001
manchaca             3  0.0004001
pflugerville        24  0.0032009
west lake hills      2  0.0002667
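
A quick sanity check for any relative frequency table is that the proportions sum to 1. The sketch below shows this on a made-up data frame (toy_cities is hypothetical) and also uses the sort = TRUE argument of count(), which orders the categories from most to least frequent.

```r
library(dplyr)

# Hypothetical data: one row per home, with the city it is in
toy_cities <- data.frame(
  city = c("austin", "austin", "austin", "del valle", "pflugerville")
)

toy_freq <- toy_cities %>%
  count(city, sort = TRUE) %>%    # largest category first
  mutate(rel_freq = n / sum(n))   # proportion of all rows in each category

toy_freq

# Relative frequencies should sum to 1
sum(toy_freq$rel_freq)
```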

Try It!

★ Continue adding to your notebook by calculating frequencies and relative frequencies for several categorical variables. As a reminder, there were some variables in our data set that we said could be treated as numerical or categorical. For these variables, compare the insights you obtain from numerical summary statistics versus counts.

Again, do “future you” a favor by keeping your notebook organized and narrated.

07:00

Grouped Summaries

All of our techniques so far allow us to analyze a single variable at a time, across the entire data set.

What if we are interested in potential associations between variables?

We can use group_by() followed by summarize() to obtain one set of summary statistics per group

For Example: Is there an association between city and number of bedrooms (numOfBedrooms)?

austin %>%
  group_by(city) %>%
  summarize(
    num_homes = n(),
    avg_bedrooms = mean(numOfBedrooms)
  )

city              num_homes  avg_bedrooms
austin                 7413      3.446918
del valle                47      3.319149
driftwood                 6      4.333333
dripping springs          3      4.666667
manchaca                  3      3.666667
pflugerville             24      3.458333
west lake hills           2      5.500000
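
Group sizes matter when reading a table like this one: the average for west lake hills rests on only 2 homes. Below is a minimal sketch of a grouped summary that reports the group size with n() alongside more than one statistic. The data frame toy_homes and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical data: five homes in two cities
toy_homes <- data.frame(
  city          = c("austin", "austin", "austin", "del valle", "del valle"),
  numOfBedrooms = c(3, 4, 5, 2, 4)
)

toy_by_city <- toy_homes %>%
  group_by(city) %>%
  summarize(
    num_homes       = n(),                    # rows in each group
    avg_bedrooms    = mean(numOfBedrooms),
    median_bedrooms = median(numOfBedrooms)
  )

toy_by_city
```

The result has one row per city, so any statistic that works inside summarize() can be compared across groups in the same way.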

Try It!

★ Use group_by() and summarize() to answer some of our questions about associations between variables.

Warning: Make sure you are grouping by categorical variables – grouping by a numeric variable with lots of observed levels is ill-advised
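
One way to check a variable before grouping by it is to count its distinct values with n_distinct() from dplyr. This is a sketch on made-up data (toy_homes and its values are hypothetical), not a rule for how many levels is too many.

```r
library(dplyr)

# Hypothetical data: one categorical and one numeric column
toy_homes <- data.frame(
  homeType    = c("Condo", "Single Family", "Single Family", "Condo", "Single Family"),
  lotSizeSqFt = c(1200, 8232, 9100, 1500, 7600)
)

toy_levels <- toy_homes %>%
  summarize(
    levels_homeType = n_distinct(homeType),    # few levels: reasonable to group by
    levels_lotSize  = n_distinct(lotSizeSqFt)  # one level per row: poor grouping variable
  )

toy_levels
```

Here every home has its own lot size, so grouping by lotSizeSqFt would return one group per home and the "summary" would just repeat the raw data.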

As you’ve done, continue to do “future you” a favor by keeping your notebook organized and narrated.

10:00

Summary

  • We can use summarize() to compute summary statistics
    • We pass summarize() the names of the resulting columns and how they should be calculated
  • We can use count() to build a frequency table for levels of a categorical variable
    • To compute relative frequencies, we use mutate(rel_freq = n/sum(n)) to build the relative frequency column
  • We can use group_by() and then summarize() to calculate summary statistics for groups defined by a categorical variable

Next Time: Data Visualization

Homework: Complete the Topic 4 notebook and submit the hash code using the Google Form at least 30 minutes before the start of Thursday’s class