{tidymodels}
Overview
September 24, 2024
So…how do we do it?
Obtain training and validation data
Build a model using the {tidymodels} framework
Assess model performance
Make predictions
set.seed(080724) #set seed for reproducibility
data_splits <- my_data %>% #begin with `my_data` data frame
  initial_split(prop = 0.75) #split into 75% train / 25% test
train <- training(data_splits) #collect training rows into data frame
test <- testing(data_splits) #collect validation rows into data frame
Make sure to always set a seed. It ensures that you obtain the same training and testing data each time you run your notebook/model.
\(\bigstar\) Let’s try it! \(\bigstar\)
Open your MyModelingNotebook.qmd. Load the {tidymodels} package into your notebook, if it is not already loaded. Split your airbnb data into training and test sets.
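If you want a starting point, here is a minimal sketch of that split, assuming your data frame is named `airbnb` (swap in the actual object name from your notebook):

```r
library(tidymodels) #loads rsample, parsnip, recipes, workflows, etc.

set.seed(080724) #set a seed so your split is reproducible

airbnb_splits <- airbnb %>% #assumed name for the airbnb data frame
  initial_split(prop = 0.75) #75% train / 25% test

train <- training(airbnb_splits)
test <- testing(airbnb_splits)
```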
lin_reg_spec <- linear_reg() %>% #define model class
  set_engine("lm") #set fitting engine
#Set recipe, including response and predictors
#Predictors are separated by "+" signs
lin_reg_rec <- recipe(response ~ pred_1 + pred_2 + ... + pred_k,
                      data = train)
lin_reg_wf <- workflow() %>% #create workflow
  add_model(lin_reg_spec) %>% #add the model specification
  add_recipe(lin_reg_rec) #add the recipe

#Fit the workflow to the training data
lin_reg_fit <- lin_reg_wf %>%
  fit(train)
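As a concrete (made-up) instance of the template above – `price`, `accommodates`, and `bedrooms` are placeholder column names, not guaranteed to match your airbnb data:

```r
#Hypothetical airbnb model – swap in the response and predictors from your data
airbnb_rec <- recipe(price ~ accommodates + bedrooms, data = train)

airbnb_wf <- workflow() %>%
  add_model(lin_reg_spec) %>% #reuse the linear regression spec defined above
  add_recipe(airbnb_rec)

airbnb_fit <- airbnb_wf %>%
  fit(train)
```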
\(\bigstar\) Let’s try it! \(\bigstar\)
Choose a set of predictors for the airbnb rentals in your data set and construct a recipe to predict price using those predictors.

Example output – overall model fit summary:

metric | value |
---|---|
r.squared | 0.7595231 |
adj.r.squared | 0.7538648 |
sigma | 3.1320329 |
statistic | 134.2321602 |
p.value | 0.0000000 |
df | 4.0000000 |
logLik | -445.5722346 |
AIC | 903.1444693 |
BIC | 922.1331851 |
deviance | 1667.6371409 |
df.residual | 170.0000000 |
nobs | 175.0000000 |
Example output – coefficient-level summary for the same model:

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 33.884098 | 1.3207624 | 25.654954 | 0.0000000 |
displ | -1.259214 | 0.5772141 | -2.181536 | 0.0305166 |
cyl | -1.516722 | 0.4182149 | -3.626658 | 0.0003793 |
drv_f | 5.089640 | 0.6514639 | 7.812621 | 0.0000000 |
drv_r | 5.046396 | 0.8205130 | 6.150294 | 0.0000000 |
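Summaries like the two tables above can be pulled from a fitted workflow; one way to do it (the exact call is not shown here, so treat this as a sketch) is to reach into the underlying `lm` object with the broom verbs:

```r
#Assumes lin_reg_fit is the fitted workflow from above
lin_reg_fit %>%
  extract_fit_engine() %>% #pull out the underlying lm object
  glance() #overall fit statistics (R-squared, AIC, BIC, ...)

lin_reg_fit %>%
  extract_fit_engine() %>%
  tidy() #term-level estimates, standard errors, and p-values
```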
You can conduct inference and interpret model coefficients from here as well!
\(\bigstar\) Let’s try it! \(\bigstar\)
Examine the estimate, std.error, and p.value columns of the output – what can you use these for?

We can obtain predictions using two methods – predict() and augment() – as long as the new_data you are predicting on includes all of the features from your recipe().
Use predict() to create a single-column (.pred) data frame of predictions, or columns of lower and upper interval bounds. Use augment() to append those predictions to your new_data.
Note: Your “new data” could be your training data, your testing data, some new real observations that you want to make predictions for, or even some counterfactual (hypothetical) data.
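A quick sketch of both approaches, assuming `lin_reg_fit` is your fitted workflow and `test` holds the observations you want predictions for:

```r
#predict() returns a small data frame of predictions
predict(lin_reg_fit, new_data = test) #single .pred column
predict(lin_reg_fit, new_data = test, type = "pred_int") #.pred_lower / .pred_upper bounds

#augment() attaches the predictions to the data you supply
augment(lin_reg_fit, new_data = test)
```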
\(\bigstar\) Let’s try this! \(\bigstar\)
Use predict() to make predictions for the rental prices of your training observations, then use augment() to do the same – what is the difference?

If you are successful doing the above…

- Find the minimum (MIN_VAL) and maximum (MAX_VAL) observed training values for one of your numeric predictors
- Build a new_data data frame using tibble() and seq(), using seq(MIN_VAL, MAX_VAL, length.out = 250) to let that predictor vary across its observed range from the training set
- Use predict() and augment() to make price predictions for these fictitious (counterfactual) rentals – what do you notice about the output of predict()? (a sketch follows below)
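Here is one way the counterfactual grid might look, reusing the hypothetical `accommodates`/`bedrooms` predictors and the `airbnb_fit` workflow sketched earlier (substitute your own predictor names and fitted workflow):

```r
#Hypothetical predictor names – replace with columns from your own recipe
min_val <- min(train$accommodates) #MIN_VAL
max_val <- max(train$accommodates) #MAX_VAL

new_rentals <- tibble(
  accommodates = seq(min_val, max_val, length.out = 250), #let this predictor vary
  bedrooms = median(train$bedrooms) #hold the other predictor at a training value
)

predict(airbnb_fit, new_data = new_rentals) #just the .pred column
augment(airbnb_fit, new_data = new_rentals) #predictions attached to new_rentals
```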
Build a simpler model, one using just a single numerical predictor. Then, for this model, try each of the following…
- Plotting and interpreting model predictions, including an uncertainty band drawn with geom_ribbon() – do you have a preference for the order in which you include the geometry layers? (a plotting sketch follows below)
- Calculating and analyzing training residuals
- A review of hypothesis testing and confidence intervals
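For the plotting bullet, here is a sketch of what a single-predictor prediction plot could look like; `simple_fit`, `accommodates`, and `price` are placeholder names for your own simpler workflow fit and variables, and `new_rentals` is the counterfactual grid from above:

```r
#Attach point predictions and confidence bounds to the counterfactual grid
plot_data <- new_rentals %>%
  bind_cols(predict(simple_fit, new_data = new_rentals)) %>%
  bind_cols(predict(simple_fit, new_data = new_rentals, type = "conf_int"))

ggplot() +
  geom_point(data = train, aes(x = accommodates, y = price), alpha = 0.4) + #observed rentals
  geom_ribbon(data = plot_data,
              aes(x = accommodates, ymin = .pred_lower, ymax = .pred_upper),
              fill = "grey70", alpha = 0.5) + #uncertainty band
  geom_line(data = plot_data, aes(x = accommodates, y = .pred)) #fitted line
```

Adding geom_ribbon() before geom_line() keeps the band from hiding the fitted line, which is one answer to the layer-order question. For the residual bullet, augment(simple_fit, new_data = train) should give you a .resid column to work with, since the outcome is present in the training data.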