{tidymodels} Overview

September 24, 2024
So…how do we do it?
- Obtain training and validation data
- Build a model using the {tidymodels} framework
- Assess model performance
- Make predictions
```r
set.seed(080724)               #set seed for reproducibility
data_splits <- my_data %>%     #begin with `my_data` data frame
  initial_split(prop = 0.75)   #split into 75% train / 25% test
train <- training(data_splits) #collect training rows into data frame
test <- testing(data_splits)   #collect validation rows into data frame
```

Make sure to always set a seed. It ensures that you obtain the same training and testing data each time you run your notebook/model.
\(\bigstar\) Let’s try it! \(\bigstar\)
- Open MyModelingNotebook.qmd.
- Load the {tidymodels} package into your notebook, if it is not already loaded.
- Split the airbnb data into training and test sets.

```r
lin_reg_spec <- linear_reg() %>% #define model class
  set_engine("lm")               #set fitting engine
```
```r
#Set recipe, including response and predictors
#Predictors are separated by "+" signs
lin_reg_rec <- recipe(response ~ pred_1 + pred_2 + ... + pred_k,
                      data = train)
```
```r
lin_reg_wf <- workflow() %>%  #create workflow
  add_model(lin_reg_spec) %>% #add the model specification
  add_recipe(lin_reg_rec)     #add the recipe

#Fit the workflow to the training data
lin_reg_fit <- lin_reg_wf %>%
  fit(train)
```

\(\bigstar\) Let’s try it! \(\bigstar\)
Identify predictors of interest for the airbnb rentals in your data set and construct a recipe to predict price using those predictors.

Model-level summary output:

| metric | value |
|---|---|
| r.squared | 0.7595231 |
| adj.r.squared | 0.7538648 |
| sigma | 3.1320329 |
| statistic | 134.2321602 |
| p.value | 0.0000000 |
| df | 4.0000000 |
| logLik | -445.5722346 |
| AIC | 903.1444693 |
| BIC | 922.1331851 |
| deviance | 1667.6371409 |
| df.residual | 170.0000000 |
| nobs | 175.0000000 |
Term-level (coefficient) summary output:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 33.884098 | 1.3207624 | 25.654954 | 0.0000000 |
| displ | -1.259214 | 0.5772141 | -2.181536 | 0.0305166 |
| cyl | -1.516722 | 0.4182149 | -3.626658 | 0.0003793 |
| drv_f | 5.089640 | 0.6514639 | 7.812621 | 0.0000000 |
| drv_r | 5.046396 | 0.8205130 | 6.150294 | 0.0000000 |
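Summary tables like the two above can be produced with {broom}'s helpers once the workflow is fit. A minimal sketch, assuming the fitted workflow `lin_reg_fit` from earlier:

```r
#Pull the underlying lm object out of the fitted workflow
lm_fit <- lin_reg_fit %>%
  extract_fit_engine()

glance(lm_fit) #one row of model-level metrics (r.squared, sigma, AIC, BIC, ...)
tidy(lm_fit)   #one row per term: estimate, std.error, statistic, p.value
```

Both helpers are attached when you load {tidymodels}, so no extra `library()` call is needed.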
You can conduct inference and interpret model coefficients from here as well!
\(\bigstar\) Let’s try it! \(\bigstar\)
Examine the estimate, std.error, and p.value columns of the output – what can you use these for?

We can obtain predictions using two methods, as long as the new_data you are predicting on includes all of the features from your recipe().
- predict() creates a single-column (.pred) data frame of predictions, or columns of lower and upper interval bounds.
- augment() appends the predictions as new columns to a copy of your new_data.

Note: Your “new data” could be your training data, your testing data, some new real observations that you want to make predictions for, or even some counterfactual (hypothetical) data.
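The two methods might look like this, assuming the fitted workflow `lin_reg_fit` and the `test` data frame from earlier:

```r
#Method 1: predict() returns a data frame with a single .pred column
test_preds <- lin_reg_fit %>%
  predict(test)

#With an "lm" engine you can request interval bounds instead
test_pred_int <- lin_reg_fit %>%
  predict(test, type = "pred_int")

#Method 2: augment() appends .pred (and residuals, when the response is present)
test_augmented <- lin_reg_fit %>%
  augment(test)
```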
\(\bigstar\) Let’s try this! \(\bigstar\)
Use predict() to make predictions for the rental prices of your training observations, then use augment() to do the same – what is the difference?

If you are successful doing the above…
- Find the minimum (MIN_VAL) and maximum (MAX_VAL) observed training values for one of your numeric predictors.
- Build a new_data data frame using tibble() and seq(), letting that predictor run across the range observed in the training set: seq(MIN_VAL, MAX_VAL, length.out = 250).
- Use predict() and augment() to make price predictions for these fictitious (counterfactual) rentals.
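A sketch of the counterfactual grid, assuming the model's recipe uses only a single numeric predictor named `pred_1` (a hypothetical column name – substitute one of your own):

```r
min_val <- min(train$pred_1) #minimum observed training value
max_val <- max(train$pred_1) #maximum observed training value

#250 evenly spaced hypothetical rentals across the observed range
new_data <- tibble(
  pred_1 = seq(min_val, max_val, length.out = 250)
)

#Append price predictions for the fictitious rentals
new_data_augmented <- lin_reg_fit %>%
  augment(new_data)
```

If your recipe includes additional predictors, new_data must contain columns for those as well (held at fixed, typical values).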
…predict()?

Build a simpler model, one using just a single numerical predictor. Then, for this model, try each of the following…
- Plotting and interpreting model predictions
- Plotting interval bounds (using geom_ribbon()) – Do you have a preference for the order in which you include the geometry layers?
- Calculating and analyzing training residuals
- A review of hypothesis testing and confidence intervals
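For the plotting items above, one possible approach with {ggplot2}, assuming the single-predictor model, the hypothetical `pred_1` column, and the counterfactual new_data grid described earlier:

```r
#Predictions plus interval bounds for the counterfactual grid
plot_data <- new_data %>%
  bind_cols(predict(lin_reg_fit, new_data)) %>%                #adds .pred
  bind_cols(predict(lin_reg_fit, new_data, type = "pred_int")) #adds .pred_lower, .pred_upper

ggplot() +
  #Drawing the ribbon first keeps the points and fitted line visible on top of it
  geom_ribbon(data = plot_data,
              aes(x = pred_1, ymin = .pred_lower, ymax = .pred_upper),
              fill = "grey80") +
  geom_line(data = plot_data, aes(x = pred_1, y = .pred), color = "blue") +
  geom_point(data = train, aes(x = pred_1, y = price), alpha = 0.4) +
  labs(x = "pred_1", y = "price")
```

One answer to the layer-order question: a ribbon drawn last would cover the observed points, so the interval band usually goes first.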