Objectives: This notebook shows an example of a model being constructed and assessed using the framework proposed in the notebook x_ModelingFrameworkAndTidymodelsReview. We’ll use the batted balls and home runs data that many of you have been working with for the last several class meetings. If you chose to use the FAA BirdStrikes and Engine Damage data, you should follow along and implement models with your data set.

The Data

As a reminder, the MLB batted balls data set is accompanied by the park dimensions data set. We’ll load both of those data sets below.

batted_balls <- read_csv("https://raw.githubusercontent.com/agmath/agmath.github.io/master/data/classification/battedballs.csv")
## Rows: 13000 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): home_team, away_team, batter_team, batter_name, pitcher_name, bb_...
## dbl  (16): bip_id, batter_id, pitcher_id, is_batter_lefty, is_pitcher_lefty,...
## date  (1): game_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
park_dims <- read_csv("https://raw.githubusercontent.com/agmath/agmath.github.io/master/data/classification/park_dimensions.csv")
## Rows: 30 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): NAME, Cover
## dbl (7): park, LF_Dim, CF_Dim, RF_Dim, LF_W, CF_W, RF_W
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Now that the data sets have been read into our notebook, we can view the first few rows of each as a reminder of what we are working with. The first two rows of the batted_balls data frame are below.

batted_balls %>%
  head(n = 2) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
bip_id game_date home_team away_team batter_team batter_name pitcher_name batter_id pitcher_id is_batter_lefty is_pitcher_lefty bb_type bearing pitch_name park inning outs_when_up balls strikes plate_x plate_z pitch_mph launch_speed launch_angle is_home_run
61093 2020-09-11 TB BOS BOS bogaerts, xander snell, blake 593428 605483 0 1 ground_ball center Changeup 26 3 2 2 2 1.65 2.83 89.5 58.5 -47 0
117924 2020-08-31 CIN STL STL edman, tommy lorenzen, michael 669242 547179 1 0 ground_ball center Changeup 5 8 1 0 1 -0.97 1.82 86.4 NA -8 0

Similarly, the first few rows of park_dims is displayed next.

park_dims %>%
  head(n = 2) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
park NAME Cover LF_Dim CF_Dim RF_Dim LF_W CF_W RF_W
0 Chase Field Roof 328 407 335 8 25 8
1 SunTrust Park Outdoor 335 400 325 6 8 16

We can see that the two data frames share the park column. This means that we can join the two data sets together, adding information about the baseball stadium (including the field dimensions) to the batted_balls data frame. It is reasonable to expect that knowing the park dimensions will be helpful in identifying whether a batted ball is going to be a home run (hit over the outfield fence) or not.

batted_balls_parks <- batted_balls %>%
  left_join(park_dims,
            by = c("park" = "park"))

batted_balls_parks %>%
  head(2) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
bip_id game_date home_team away_team batter_team batter_name pitcher_name batter_id pitcher_id is_batter_lefty is_pitcher_lefty bb_type bearing pitch_name park inning outs_when_up balls strikes plate_x plate_z pitch_mph launch_speed launch_angle is_home_run NAME Cover LF_Dim CF_Dim RF_Dim LF_W CF_W RF_W
61093 2020-09-11 TB BOS BOS bogaerts, xander snell, blake 593428 605483 0 1 ground_ball center Changeup 26 3 2 2 2 1.65 2.83 89.5 58.5 -47 0 Tropicana Field Dome 315 404 322 11 9 11
117924 2020-08-31 CIN STL STL edman, tommy lorenzen, michael 669242 547179 1 0 ground_ball center Changeup 5 8 1 0 1 -0.97 1.82 86.4 NA -8 0 Great American Ballpark Outdoor 328 404 325 12 8 8

Success!

In this notebook, we’ll show how to assess, fit and use a model to classify whether a batted ball will result in a home run (is_home_run) given features of the scenario leading to the batted ball. To keep things simple, we’ll use pitch_mph, launch_speed, launch_angle, and Cover as predictors. (Note: There are some really great predictors that we’ve left on the table here.)

Splitting Data and Creating Folds

We’ll start by splitting our data into training and test sets and creating cross-validation folds from our training data.

batted_balls_parks <- batted_balls_parks %>%
  mutate(is_home_run = ifelse(is_home_run == 0, "no", "yes"))

set.seed(123)
bb_splits <- initial_split(batted_balls_parks, prop = 0.75)
train <- training(bb_splits)
test <- testing(bb_splits)

set.seed(456)
train_folds <- vfold_cv(train, v = 3)

Building Model Workflows

We’ll build workflows for a nearest neighbor model and a decision tree model.

#knn workflow
knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_rec <- recipe(is_home_run ~ pitch_mph + launch_speed + launch_angle + Cover, data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

knn_wf <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(knn_rec)

#decision tree workflow
dt_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

dt_rec <- recipe(is_home_run ~ pitch_mph + launch_speed + launch_angle + Cover, data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

dt_wf <- workflow() %>%
  add_model(dt_spec) %>%
  add_recipe(dt_rec)

Evaluating Models Using Cross-Validation

We’ll now use cross-validation to evaluate our two workflows.

knn_cv_results <- knn_wf %>%
  fit_resamples(train_folds)

dt_cv_results <- dt_wf %>%
  fit_resamples(train_folds)

knn_cv_results %>%
  collect_metrics() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
.metric .estimator mean n std_err .config
accuracy binary 0.9503590 3 0.0026369 Preprocessor1_Model1
roc_auc binary 0.8082316 3 0.0128843 Preprocessor1_Model1
dt_cv_results %>%
  collect_metrics() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
.metric .estimator mean n std_err .config
accuracy binary 0.9586667 3 0.0017070 Preprocessor1_Model1
roc_auc binary 0.7378341 3 0.0414317 Preprocessor1_Model1

It looks like the nearest neighbor model outperforms the decision tree when using ROC-AUC as the performance metric, but not when using accuracy. We’ll discuss model performance metrics in more detail at our next class meeting. For now, let’s fit the nearest neighbor model since both models had comparable accuracy but ROC-AUC was quite a bit better for nearest neighbors.
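If we wanted cross-validation estimates for additional metrics, we could hand fit_resamples() a metric_set() from {yardstick}. A minimal sketch is below; mn_log_loss is an extra metric we don’t use elsewhere in this notebook.

#request extra metrics during resampling (mn_log_loss is just an example)
knn_cv_results_more <- knn_wf %>%
  fit_resamples(
    resamples = train_folds,
    metrics = metric_set(accuracy, roc_auc, mn_log_loss)
  )

knn_cv_results_more %>%
  collect_metrics()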

Fit, Assess, and Utilize Our “Best” Model

We’ll start by fitting our best model to our training data.

knn_fit <- knn_wf %>%
  fit(train)

Now that the model has been fitted, we’ll use it to make predictions on our test observations. The performance metrics should align with our cross-validation estimates (if not, we’ll have concerns that should prevent us from putting the model into production).

knn_fit %>%
  augment(test) %>%
  mutate(is_home_run = as.factor(is_home_run)) %>%
  accuracy(is_home_run, .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.946
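Since ROC-AUC is the metric that favored the nearest neighbor model during cross-validation, it’s worth checking it on the test set as well. A quick sketch, assuming the default event_level (the first factor level, "no", is treated as the event, matching what fit_resamples() used):

knn_fit %>%
  augment(test) %>%
  mutate(is_home_run = as.factor(is_home_run)) %>%
  roc_auc(is_home_run, .pred_no)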

The model was just about as accurate as expected given the cross-validation results. Since it doesn’t look like our model is overfit, let’s use it to make a prediction on whether a new batted ball would result in a home run.

new_hit <- tibble(
  pitch_mph = 84,
  launch_speed = 88,
  launch_angle = 47,
  Cover = "Outdoor"
)

new_hit %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
pitch_mph launch_speed launch_angle Cover
84 88 47 Outdoor
knn_fit %>%
  augment(new_hit) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
pitch_mph launch_speed launch_angle Cover .pred_class .pred_no .pred_yes
84 88 47 Outdoor no 1 0

Looks like our model is very certain that this won’t be a home run.

Summary

There it is – we’ve walked through a basic modeling process using the {tidymodels} framework. We’ll discuss this in greater detail and introduce more advanced techniques throughout MAT434.


Extension: Hyperparameters and Tuning

The workflow outlined above is a standard, basic approach to modeling. We built two vanilla models (keeping all default parameters) and applied them to our home run prediction problem. Taking a closer look at these models, how sure are we that we have the “best” model? We identified that the KNN classifier outperformed the decision tree classifier, but that comparison holds only for a specific value of \(k\) (the number of nearest neighbors) and for an unconstrained decision tree. Even though we didn’t override any settings, we made modeling choices by accepting the default parameters.

These parameters, which must be set before the model sees its training data, are called model hyperparameters. Each class of model has different hyperparameters available to it, and you’ll be introduced to them throughout our course. Ideally, we’d like the optimal settings for these hyperparameters to be learned from our data, but how do we make that happen if they must be set before the model sees any training data?
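For a concrete picture of where hyperparameters live, here is a sketch of our two model specifications with a few hyperparameters set by hand. The particular values are arbitrary and only for illustration; they are not the settings used elsewhere in this notebook.

#hyperparameters are set in the model specification, before the model ever sees training data
knn_spec_manual <- nearest_neighbor(neighbors = 10) %>%
  set_engine("kknn") %>%
  set_mode("classification")

dt_spec_manual <- decision_tree(tree_depth = 6, min_n = 20) %>%
  set_engine("rpart") %>%
  set_mode("classification")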

Keep Calm and Cross-Validate

The {tidymodels} ecosystem contains a few specialized functions for using cross-validation to search over combinations of hyperparameter settings. I’ll introduce one here: tune_grid().

The process of using tune_grid() is similar to that of fit() or fit_resamples(). We’ll pass the tune_grid() function a model workflow, but rather than setting static hyperparameter values like tree_depth = 6, we’ll explicitly state that we want to tune that hyperparameter by setting tree_depth = tune().

We’ll attempt to optimize our nearest_neighbor() classifier over its neighbors hyperparameter, and our decision_tree() classifier over its tree_depth hyperparameter. We’ll create the parameter grids below. These are essentially tibbles (data frames) of candidate values for each hyperparameter. One important consideration: if we attempt to optimize a single model class over multiple hyperparameters, we’ll need to build a tibble containing every combination of candidate values (a sketch of this appears after the grids below). Be careful with this, because it can result in cross-validating over many, many models.

Note. If you are tuning across lots of hyperparameters, you can instead set the grid parameter of tune_grid() to a number, which will create a “space-filling” grid containing that many hyperparameter combinations.

grid_neighbors <- tibble(
  "neighbors" = c(1, 3, 5, 7, 11, 15, 21, 41)
  )

grid_depth <- tibble(
  "tree_depth" = c(2, 3, 4, 5, 8, 10, 12, 15, 20)
  )
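The grids above each vary a single hyperparameter. If we also wanted to tune the decision tree over min_n (a hyperparameter we don’t tune in this notebook), we could build every combination of candidate values with tidyr’s expand_grid(). The min_n values below are arbitrary choices for illustration.

#9 depths x 3 minimum node sizes = 27 models fit per fold
grid_depth_min_n <- expand_grid(
  tree_depth = c(2, 3, 4, 5, 8, 10, 12, 15, 20),
  min_n = c(2, 10, 25)
  )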

Now that we have the grids, we’ll re-create our model workflows, setting the relevant hyperparameters to tune(). Then we’ll pipe these workflows into tune_grid() and read the results.

knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_rec <- recipe(is_home_run ~ pitch_mph + launch_speed + launch_angle + Cover, data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

knn_wf <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(knn_rec)

#decision tree workflow
dt_spec <- decision_tree(tree_depth = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

dt_rec <- recipe(is_home_run ~ pitch_mph + launch_speed + launch_angle + Cover, data = train) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

dt_wf <- workflow() %>%
  add_model(dt_spec) %>%
  add_recipe(dt_rec)

The tunable workflows are built, so now let’s tune our models over our hyperparameter grids.

knn_tune_results <- knn_wf %>%
  tune_grid(
    grid = grid_neighbors,
    resamples = train_folds
  )

dt_tune_results <- dt_wf %>%
  tune_grid(
    grid = grid_depth,
    resamples = train_folds
  )

Now that we’ve run our cross-validated grid search, let’s see the results for each.

knn_tune_results %>%
  collect_metrics() %>%
  filter(.metric == "accuracy") %>%
  select(neighbors, mean, std_err) %>%
  arrange(-mean) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
neighbors mean std_err
21 0.9582564 0.0019568
15 0.9574359 0.0023856
11 0.9565128 0.0026726
41 0.9558974 0.0022069
7 0.9513846 0.0024422
5 0.9503590 0.0026369
1 0.9298462 0.0020944
3 0.9298462 0.0018800

It looks like our best-performing nearest neighbor model uses 21 nearest neighbors, although 15 and 11 are not far behind. These models all sit above 95.5% accuracy. It may be worth testing more options near 21 neighbors, since the closest candidates we considered were 15 and 41; perhaps another value will be optimal.
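A minimal sketch of that follow-up search is below; the candidate values are just a plausible finer grid around 21 neighbors.

grid_neighbors_fine <- tibble(
  "neighbors" = c(17, 19, 21, 23, 25, 27, 31, 35)
  )

knn_tune_fine_results <- knn_wf %>%
  tune_grid(
    grid = grid_neighbors_fine,
    resamples = train_folds
  )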

Now let’s look at the results from our decision tree classifier.

dt_tune_results %>%
  collect_metrics() %>%
  filter(.metric == "accuracy") %>%
  select(tree_depth, mean, std_err) %>%
  arrange(-mean) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
tree_depth mean std_err
5 0.9586667 0.0017070
8 0.9586667 0.0017070
10 0.9586667 0.0017070
12 0.9586667 0.0017070
15 0.9586667 0.0017070
20 0.9586667 0.0017070
4 0.9579487 0.0019568
3 0.9563077 0.0027919
2 0.9557949 0.0022842

Check this out! It looks like our optimized decision tree classifier (tree_depth = 5) beats our best KNN classifier. The accuracy for this model is nearly 95.9%, with a standard error across folds comparable to the KNN models. Notice also that trees of depth 5, 8, 10, 12, 15, and 20 all perform identically. This indicates that additional flexibility for our trees doesn’t improve model performance. Since the estimated performance and standard error don’t change at all across these values, there are likely some Pit of Success principles operating behind the scenes (for example, the {rpart} engine’s default cost-complexity penalty likely stops splits beyond a depth of 5, so the larger depth limits never come into play).
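If we wanted to move forward with this tuned tree, the usual next step is to pull the best hyperparameter setting out of the tuning results, finalize the workflow with it, and fit that finalized workflow to the training data. A sketch using select_best() and finalize_workflow() from {tune}:

#extract the best tree_depth (by accuracy), plug it into the workflow, and fit
best_depth <- dt_tune_results %>%
  select_best(metric = "accuracy")

dt_best_fit <- dt_wf %>%
  finalize_workflow(best_depth) %>%
  fit(train)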

Hyperparameter tuning often improves your models only on the margins, but sometimes those improvements can be quite significant.