{tidymodels} Overview

Dr. Gilbert

September 24, 2024

A Reminder About Our Goals

So…how do we do it?

The Highlights

  • Obtain training and test data

  • Build a model using the {tidymodels} framework

    • Model specification (declaring the type of model)
    • Recipe (defining the response, features, and transformations)
    • Workflow (encapsulate the specification and recipe together)
    • Fit the workflow to training data
  • Assess model performance

    • Global performance metrics (training data)
    • Individual term-based assessments (training data)
    • We’ll add more assessment techniques later
  • Make predictions

Splitting Training and Test Data

set.seed(080724) #set seed for reproducibility

data_splits <- my_data %>% #begin with `my_data` data frame
  initial_split(prop = 0.75) #split into 75% train / 25% test

train <- training(data_splits) #collect training rows into data frame
test <- testing(data_splits) #collect testing rows into data frame

Make sure you always set a seed. It ensures that you obtain the same training and test data each time you run your notebook or refit your model.

\(\bigstar\) Let’s try it! \(\bigstar\)

  1. Open RStudio and your MAT300 project
  2. Open the Notebook you had been using to explore our AirBnB Europe data from two weeks ago.
  3. Save a new copy of it, perhaps named MyModelingNotebook.qmd.
  4. Keep the YAML header (updating the title) and keep the first code chunk, the one that (i) loads your packages, (ii) reads in the AirBnB data, and (iii) cleans up the column names. Delete everything below it.
  5. Add code to load the {tidymodels} package into your notebook, if it is not already loaded.
  6. Adapt the code on this slide to split your AirBnB data into training and test sets (a sketch follows this list)
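Here is one way steps 5 and 6 might look, as a sketch; the data frame name air_bnb is a placeholder, so use whatever name your first code chunk actually assigns:

library(tidymodels) #loads {rsample}, {parsnip}, {recipes}, {workflows}, and friends

set.seed(080724) #same seed, same split, every run

data_splits <- air_bnb %>% #`air_bnb` is a placeholder name
  initial_split(prop = 0.75) #split into 75% train / 25% test

train <- training(data_splits)
test <- testing(data_splits)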

Build and Fit a Model

lin_reg_spec <- linear_reg() %>% #define model class
  set_engine("lm") #set fitting engine

#Set recipe, including response and predictors
#Predictors are separated by "+" signs
lin_reg_rec <- recipe(response ~ pred_1 + pred_2 + ... + pred_k, 
                      data = train)

lin_reg_wf <- workflow() %>% #create workflow
  add_model(lin_reg_spec) %>% #add the model specification
  add_recipe(lin_reg_rec) #add the recipe

#Fit the workflow to the training data
lin_reg_fit <- lin_reg_wf %>%
  fit(train)

\(\bigstar\) Let’s try it! \(\bigstar\)

  1. Construct a linear regression model specification
  2. Choose a few of the available (numeric) predictors of listing price for the AirBnB rentals in your data set and construct a recipe to predict price using those predictors
  3. Package your model and recipe together into a workflow
  4. Fit the workflow to your training data (a filled-in sketch follows this list)
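Filled in for the AirBnB setting, the workflow might look like the sketch below; price, bedrooms, and dist are hypothetical column names, so substitute the response and predictors you actually chose:

lin_reg_spec <- linear_reg() %>% #define model class
  set_engine("lm") #set fitting engine

lin_reg_rec <- recipe(price ~ bedrooms + dist, #hypothetical response and predictors
                      data = train)

lin_reg_wf <- workflow() %>% 
  add_model(lin_reg_spec) %>% 
  add_recipe(lin_reg_rec)

lin_reg_fit <- lin_reg_wf %>% 
  fit(train)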

Global Model Assessment on Training Data

#Begin with fitted model
#and then get global 
#performance metrics
lin_reg_fit %>% 
  glance() 

metric               value
r.squared        0.7595231
adj.r.squared    0.7538648
sigma            3.1320329
statistic      134.2321602
p.value          0.0000000
df               4.0000000
logLik        -445.5722346
AIC            903.1444693
BIC            922.1331851
deviance      1667.6371409
df.residual    170.0000000
nobs           175.0000000

We’ll Focus On:

  • adj.r.squared
  • sigma
  • p.value
  • df.residual
  • nobs
\(\bigstar\) Let’s try it! \(\bigstar\)

  1. Use glance() to view global model assessment metrics for your model
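Since glance() returns a one-row data frame, you can select() just the metrics we’ll focus on; a quick sketch:

lin_reg_fit %>% 
  glance() %>% 
  select(adj.r.squared, sigma, p.value, df.residual, nobs)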

Term-Based Assessments

lin_reg_fit %>% #Begin with a fitted model
  extract_fit_engine() %>% #Obtain model fit information
  tidy() #Format as data frame

term          estimate  std.error  statistic    p.value
(Intercept)  33.884098  1.3207624  25.654954  0.0000000
displ        -1.259214  0.5772141  -2.181536  0.0305166
cyl          -1.516722  0.4182149  -3.626658  0.0003793
drv_f         5.089640  0.6514639   7.812621  0.0000000
drv_r         5.046396  0.8205130   6.150294  0.0000000

You can conduct inference and interpret model coefficients from here as well!

\(\bigstar\) Let’s try it! \(\bigstar\)

  1. Extract the term-based model assessment metrics from your fitted model object
  2. Pay special attention to the estimate, std.error, and p.value columns of the output – what can you use these for?
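If you’d like interval estimates to go along with the point estimates, tidy() on an "lm" fit accepts a conf.int argument; a sketch:

lin_reg_fit %>% 
  extract_fit_engine() %>% 
  tidy(conf.int = TRUE, conf.level = 0.95) #adds conf.low and conf.high columns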

Making Predictions

We can obtain predictions using two methods, as long as the new_data you are predicting on includes all of the features from your recipe().


lin_reg_fit %>%
  predict(new_data)

Use this to create a single-column (.pred) data frame of predictions, or columns of lower and upper interval bounds.

lin_reg_fit %>%
  augment(new_data)

Use this to append a new .pred column to the new_data data frame.

Note: Your “new data” could be your training data, your testing data, some new real observations that you want to make predictions for, or even some counterfactual (hypothetical) data.
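Since our fitting engine is "lm", predict() can also return interval bounds rather than point predictions. A sketch, using the test set as the new data:

lin_reg_fit %>% 
  predict(test, type = "conf_int", level = 0.95) #bounds for the average response

lin_reg_fit %>% 
  predict(test, type = "pred_int", level = 0.95) #bounds for individual responses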

Making Predictions

\(\bigstar\) Let’s try this! \(\bigstar\)

  1. Use predict() to make predictions for the rental prices of your training observations, then use augment() to do the same – what is the difference?
  2. Plot your model’s predictions with respect to one of your predictor variables (a sketch follows this list).
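One possible approach to the plotting task, assuming the hypothetical price and dist columns from earlier (swap in your own response and predictor):

lin_reg_fit %>% 
  augment(train) %>% #training data plus a .pred column
  ggplot() + 
  geom_point(aes(x = dist, y = price), alpha = 0.5) + #observed prices
  geom_point(aes(x = dist, y = .pred), color = "blue") #predicted prices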

If you are successful doing the above…

  1. Find the minimum (MIN_VAL) and maximum (MAX_VAL) observed training values for one of your numeric predictors
  2. Create a new_data data frame using tibble() and seq() (see the sketch after this list).
    • Include a column for each of the predictors your model uses – be sure to use the exact same names as in your training set
    • Choose a fixed value for all but your selected numerical variable (they could be different from one another)
    • For your chosen variable, use seq(MIN_VAL, MAX_VAL, length.out = 250)
  3. Use predict() and augment() to make price predictions for these fictitious (counterfactual) rentals
    • Can you get interval bounds rather than predictions using predict()?
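A minimal sketch of step 2, again under the hypothetical predictors bedrooms and dist, with dist varied and bedrooms held fixed; MIN_VAL and MAX_VAL stand in for the values you found in step 1:

new_data <- tibble(
  dist = seq(MIN_VAL, MAX_VAL, length.out = 250), #vary the chosen predictor
  bedrooms = 2 #hold every other predictor at a fixed value
)

lin_reg_fit %>% 
  predict(new_data) #or augment(new_data)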

Additional Tasks

Build a simpler model, one using just a single numerical predictor. Then, for this model, try each of the following (a sketch addressing tasks 1 and 3 appears after the list)…

  1. Plotting and interpreting model predictions

    1. A scatterplot of the original training data
    2. A lineplot showing the model’s predictions
    3. An interval band showing the confidence intervals for predictions (hint: use geom_ribbon())
    4. An interval band showing the prediction intervals for predictions
    5. Meaningful plot labels
    6. Interpret the resulting plot
  2. Do you have a preference for the order in which you include the geometry layers?

  3. Calculating and analysing training residuals

    1. Can you add a column of predictions to your training data set?
    2. Can you mutate a new column of prediction errors (residuals) onto your training data set?
    3. Can you plot a histogram and/or density plot of the residuals?
    4. Analyse the plot you just created – what do you notice? Is this what you expected?
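A sketch of tasks 1 and 3, assuming you’ve stored your single-predictor fit as simple_fit (built exactly as before, but with a one-predictor recipe such as price ~ dist, where both names are hypothetical):

simple_preds <- simple_fit %>% 
  augment(train) %>% #training data plus a .pred column
  bind_cols(
    simple_fit %>% #confidence interval bounds
      predict(train, type = "conf_int")
  ) %>% 
  bind_cols(
    simple_fit %>% #prediction interval bounds, renamed to avoid a name clash
      predict(train, type = "pred_int") %>% 
      rename(.pred_int_lower = .pred_lower, .pred_int_upper = .pred_upper)
  )

#Task 1: scatterplot, prediction line, and interval bands
simple_preds %>% 
  ggplot(aes(x = dist)) + 
  geom_ribbon(aes(ymin = .pred_int_lower, ymax = .pred_int_upper), fill = "grey85") + 
  geom_ribbon(aes(ymin = .pred_lower, ymax = .pred_upper), fill = "grey65") + 
  geom_point(aes(y = price), alpha = 0.4) + 
  geom_line(aes(y = .pred), color = "blue") + 
  labs(x = "Distance to city center", y = "Price",
       title = "Price predictions with confidence and prediction intervals")

#Task 3: residuals by hand
simple_preds %>% 
  mutate(residual = price - .pred) %>% #prediction errors
  ggplot() + 
  geom_histogram(aes(x = residual))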

Next Time…

A review of hypothesis testing and confidence intervals