{sklearn} Framework Example: Model Construction

Objectives: This notebook shows an example of a model being constructed and assessed using the framework proposed in the notebook x_ModelingFrameworkAndSKlearnReview. We’ll use the batted balls and home runs data that many of you have been working with for the last several class meetings. If you chose to use the FAA BirdStrikes and Engine Damage data, you should follow along and implement models with your data set.
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
As a reminder, the MLB batted balls data set was also associated with the park dimensions dataset. We’ll load both of those data sets below.
batted_balls = pd.read_csv("https://raw.githubusercontent.com/agmath/agmath.github.io/master/data/classification/battedballs.csv")
park_dims = pd.read_csv("https://raw.githubusercontent.com/agmath/agmath.github.io/master/data/classification/park_dimensions.csv")
Now that the data sets have been read into our notebook, we can view the first few rows of each as a reminder of what we are working with. The first two rows of the batted_balls data frame are below.
py$batted_balls %>%
head(n = 2) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "striped"))
bip_id | game_date | home_team | away_team | batter_team | batter_name | pitcher_name | batter_id | pitcher_id | is_batter_lefty | is_pitcher_lefty | bb_type | bearing | pitch_name | park | inning | outs_when_up | balls | strikes | plate_x | plate_z | pitch_mph | launch_speed | launch_angle | is_home_run |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
61093 | 2020-09-11 | TB | BOS | BOS | bogaerts, xander | snell, blake | 593428 | 605483 | 0 | 1 | ground_ball | center | Changeup | 26 | 3 | 2 | 2 | 2 | 1.65 | 2.83 | 89.5 | 58.5 | -47 | 0 |
117924 | 2020-08-31 | CIN | STL | STL | edman, tommy | lorenzen, michael | 669242 | 547179 | 1 | 0 | ground_ball | center | Changeup | 5 | 8 | 1 | 0 | 1 | -0.97 | 1.82 | 86.4 | NaN | -8 | 0 |
Similarly, the first few rows of park_dims are displayed next.
py$park_dims %>%
head(n = 2) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "striped"))
park | NAME | Cover | LF_Dim | CF_Dim | RF_Dim | LF_W | CF_W | RF_W |
---|---|---|---|---|---|---|---|---|
0 | Chase Field | Roof | 328 | 407 | 335 | 8 | 25 | 8 |
1 | SunTrust Park | Outdoor | 335 | 400 | 325 | 6 | 8 | 16 |
We can see that the two data frames share the park column. This means that we can join the two data sets together, adding information about the baseball stadium (including the field dimensions) to the batted_balls data frame. It is reasonable to expect that knowing the park dimensions will be helpful in identifying whether a batted ball is going to be a home run (hit over the outfield fence) or not.
batted_balls_parks = batted_balls.merge(park_dims, how = "left", left_on = "park", right_on = "park")
py$batted_balls_parks %>%
head(2) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "striped"))
bip_id | game_date | home_team | away_team | batter_team | batter_name | pitcher_name | batter_id | pitcher_id | is_batter_lefty | is_pitcher_lefty | bb_type | bearing | pitch_name | park | inning | outs_when_up | balls | strikes | plate_x | plate_z | pitch_mph | launch_speed | launch_angle | is_home_run | NAME | Cover | LF_Dim | CF_Dim | RF_Dim | LF_W | CF_W | RF_W |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
61093 | 2020-09-11 | TB | BOS | BOS | bogaerts, xander | snell, blake | 593428 | 605483 | 0 | 1 | ground_ball | center | Changeup | 26 | 3 | 2 | 2 | 2 | 1.65 | 2.83 | 89.5 | 58.5 | -47 | 0 | Tropicana Field | Dome | 315 | 404 | 322 | 11 | 9 | 11 |
117924 | 2020-08-31 | CIN | STL | STL | edman, tommy | lorenzen, michael | 669242 | 547179 | 1 | 0 | ground_ball | center | Changeup | 5 | 8 | 1 | 0 | 1 | -0.97 | 1.82 | 86.4 | NaN | -8 | 0 | Great American Ballpark | Outdoor | 328 | 404 | 325 | 12 | 8 | 8 |
Success!
In this notebook, we’ll show how to assess, fit and use a model to classify whether a batted ball will result in a home run (is_home_run) given features of the scenario leading to the batted ball. To keep things simple, we’ll use pitch_mph, launch_speed, launch_angle, and Cover as predictors.
model_data = batted_balls_parks[["pitch_mph", "launch_speed", "launch_angle", "Cover", "is_home_run"]]
py$model_data %>%
head() %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "striped"))
pitch_mph | launch_speed | launch_angle | Cover | is_home_run |
---|---|---|---|---|
89.5 | 58.5 | -47 | Dome | 0 |
86.4 | NaN | -8 | Outdoor | 0 |
89.7 | 71.9 | NaN | Outdoor | 0 |
89.7 | 71.5 | 10 | Outdoor | 0 |
82.2 | 97.2 | 23 | Outdoor | 0 |
97.4 | NaN | 54 | Outdoor | 0 |
Note: There are some really great predictors that we’ve left on the table here.
We’ll start by splitting our data into training and test sets. We’ll then break apart our features and response variable for each set.
train, test = train_test_split(model_data, train_size = 0.75, random_state = 434)
X_train = train.drop("is_home_run", axis = 1)
y_train = train["is_home_run"]
X_test = test.drop("is_home_run", axis = 1)
y_test = test["is_home_run"]
We’ll build pipelines for a nearest neighbor model and a decision tree model. Let’s start by just defining the model constructors.
knn_clf = KNeighborsClassifier()
dt_clf = DecisionTreeClassifier()
Now that we have those, let’s build a data transformation Pipeline(). The nearest neighbors classifier is distance-based, so we’ll need to scale any numeric features using either StandardScaler(), MinMaxScaler(), or some other scaling transformation. The decision tree classifier does not need to have scaled numeric features. We’ll need to one-hot encode the categorical features for each though.
#knn workflow
num_cols = ["pitch_mph", "launch_speed", "launch_angle"]
cat_cols = ["Cover"]
num_pipe_knn = Pipeline([
("num_imputer", SimpleImputer(strategy = "median")),
("norm", StandardScaler())
])
num_pipe_dt = Pipeline([
("num_imputer", SimpleImputer(strategy = "median"))
])
cat_pipe = Pipeline([
("cat_imputer", SimpleImputer(strategy = "most_frequent")),
("one-hot", OneHotEncoder())
])
preprocessor_knn = ColumnTransformer([
("num_cols", num_pipe_knn, num_cols),
("cat_cols", cat_pipe, cat_cols)
])
preprocessor_dt = ColumnTransformer([
("num_cols", num_pipe_dt, num_cols),
("cat_cols", cat_pipe, cat_cols)
])
pipe_knn = Pipeline([
("preprocessor", preprocessor_knn),
("model", knn_clf)
])
pipe_dt = Pipeline([
("preprocessor", preprocessor_dt),
("model", dt_clf)
])
Now that we’ve got our pipelines set up, we’re ready to evaluate them using cross-validation.
cv_accuracy_knn = cross_val_score(pipe_knn, X_train, y_train, cv = 10, scoring = "accuracy")
cv_accuracy_dt = cross_val_score(pipe_dt, X_train, y_train, cv = 10, scoring = "accuracy")
cv_rocauc_knn = cross_val_score(pipe_knn, X_train, y_train, cv = 10, scoring = "roc_auc")
cv_rocauc_dt = cross_val_score(pipe_dt, X_train, y_train, cv = 10, scoring = "roc_auc")
print("CV Accuracy (knn): ", cv_accuracy_knn.mean(), " CV ROC_AUC (knn): ", cv_rocauc_knn.mean())
## CV Accuracy (knn): 0.9539487179487178 CV ROC_AUC (knn): 0.8215945497238988
print("CV Accuracy (dt): ", cv_accuracy_dt.mean(), " CV ROC_AUC (dt): ", cv_rocauc_dt.mean())
## CV Accuracy (dt): 0.9400000000000001 CV ROC_AUC (dt): 0.7050942221710408
It looks like the nearest neighbor model outperforms the decision tree using both of our performance metrics. We’ll discuss more about model performance metrics at our next class meeting. For now, let’s fit the nearest neighbor model since it seems superior.
We’ll start by fitting our best model to our training data.
pipe_knn.fit(X_train, y_train)
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num_cols', Pipeline(steps=[('num_imputer', SimpleImputer(strategy='median')), ('norm', StandardScaler())]), ['pitch_mph', 'launch_speed', 'launch_angle']), ('cat_cols', Pipeline(steps=[('cat_imputer', SimpleImputer(strategy='most_frequent')), ('one-hot', OneHotEncoder())]), ['Cover'])])), ('model', KNeighborsClassifier())])
Now that the model has been fitted, we’ll use our model to make
predictions on our test
observations. The performance
metrics should align with our cross-validation estimate (if not, then
we’ll have concerns which will prevent us from putting the model into
production).
test_preds = pipe_knn.predict(X_test)
accuracy_score(y_test, test_preds)
## 0.9578461538461538
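Since we also tracked ROC AUC during cross-validation, we could check that metric on the test set too. A minimal sketch, scoring the predicted probabilities for the home run class:
#roc_auc_score expects scores/probabilities for the positive (home run) class
test_probs = pipe_knn.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, test_probs)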
The model was just about as accurate as expected given the cross-validation results. Since it doesn’t look like our model is overfit, let’s use it to make a prediction on whether a new batted ball would result in a home run.
new_hit = pd.DataFrame({
"pitch_mph" : 84,
"launch_speed" : 88,
"launch_angle" : 47,
"Cover" : "Outdoor"
}, index = [0])
pipe_knn.predict(new_hit)
## array([0], dtype=int64)
pipe_knn.predict_proba(new_hit)
## array([[1., 0.]])
Looks like our model is very certain that this won’t be a home run.
There it is – we’ve walked through a basic modeling process using the
{sklearn}
framework. We’ll discuss this in greater detail
and introduce more advanced techniques throughout MAT434.
The workflow in the example outlined above is a standard basic approach to modeling. We built two vanilla models (keeping all default parameters) and applied them to our home run prediction problem. Taking a closer look at these models, how sure are we that we have the “best” model? We identified that the KNN classifier outperformed the decision tree classifier, but that is for a specific value of \(k\) (nearest neighbors) and for an unconstrained decision tree. Even though we didn’t override any settings, we made modeling choices by accepting the default parameters.
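If you want to see exactly which defaults we accepted, every sklearn estimator reports its current hyperparameter settings through its get_params() method. A quick sketch, using the classifier objects constructed earlier:
#print the default hyperparameter settings we implicitly accepted
print(knn_clf.get_params())
print(dt_clf.get_params())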
These parameters, which must be set prior to your model seeing its training data, are called model hyperparameters. Each class of model has different hyperparameters available to it, and you’ll be introduced to them throughout our course. Ideally, we’d like the optimal settings for these hyperparameters to be learned from our data, but how do we make that happen if they need to be set prior to the model seeing any training data?
The {sklearn} module contains a few specialized functions for using cross-validation to search over combinations of hyperparameter settings. I’ll introduce one here: GridSearchCV() from the sklearn.model_selection sub-module. Let’s import it and then see it in action.
from sklearn.model_selection import GridSearchCV
We’ll attempt to optimize our KNeighborsClassifier() over its n_neighbors hyperparameter, and our DecisionTreeClassifier() over its max_depth hyperparameter. We’ll create the parameter grids below; these are essentially lists of candidate values for each hyperparameter. Note that when a grid targets a step within a pipeline, the dictionary keys must be of the form "StepName__ParameterName". We’ll keep these parameter grids separate since they correspond to different modeling pipelines. An important consideration is that, if we attempt to optimize a single model class over multiple hyperparameters, GridSearchCV() will test models over all combinations of the hyperparameter values in the lists. Be careful with this because it can result in cross-validating over many, many models (see the small illustration after the note below).
Note: You may want to check out RandomizedSearchCV if you have lots of hyperparameters to optimize over.
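To illustrate how quickly the combinations grow, here is a hypothetical grid (not used below) that pairs the n_neighbors values with KNN’s weights hyperparameter. Eight values of n_neighbors crossed with two values of weights gives 16 candidate models, and with 5-fold cross-validation that means 80 model fits.
#hypothetical grid over two KNN hyperparameters -- not used below
#8 values of n_neighbors x 2 values of weights = 16 combinations,
#so 5-fold cross-validation would fit 16 * 5 = 80 models
grid_knn_two_params = [{
    "model__n_neighbors" : [1, 3, 5, 7, 11, 15, 21, 41],
    "model__weights" : ["uniform", "distance"]
}]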
grid_neighbors = [{
"model__n_neighbors" : [1, 3, 5, 7, 11, 15, 21, 41]
}]
grid_depth = [{
"model__max_depth" : [2, 3, 4, 5, 8, 10, 12, 15, 20]
}]
Now that we have the grids, we’ll run GridSearchCV() to cross-validate over the hyperparameter values and identify the optimal setting for each model. We’ll reset our Pipeline()s, since pipe_knn was actually fit to data earlier.
pipe_knn = Pipeline([
("preprocessor", preprocessor_knn),
("model", knn_clf)
])
pipe_dt = Pipeline([
("preprocessor", preprocessor_dt),
("model", dt_clf)
])
grid_results_knn = GridSearchCV(pipe_knn, param_grid = grid_neighbors, scoring = "accuracy", cv = 5)
grid_results_knn.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num_cols', Pipeline(steps=[('num_imputer', SimpleImputer(strategy='median')), ('norm', StandardScaler())]), ['pitch_mph', 'launch_speed', 'launch_angle']), ('cat_cols', Pipeline(steps=[('cat_imputer', SimpleImputer(strategy='most_frequent')), ('one-hot', OneHotEncoder())]), ['Cover'])])), ('model', KNeighborsClassifier())]), param_grid=[{'model__n_neighbors': [1, 3, 5, 7, 11, 15, 21, 41]}], scoring='accuracy')
grid_results_dt = GridSearchCV(pipe_dt, param_grid = grid_depth, scoring = "accuracy", cv = 5)
grid_results_dt.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('num_cols', Pipeline(steps=[('num_imputer', SimpleImputer(strategy='median'))]), ['pitch_mph', 'launch_speed', 'launch_angle']), ('cat_cols', Pipeline(steps=[('cat_imputer', SimpleImputer(strategy='most_frequent')), ('one-hot', OneHotEncoder())]), ['Cover'])])), ('model', DecisionTreeClassifier())]), param_grid=[{'model__max_depth': [2, 3, 4, 5, 8, 10, 12, 15, 20]}], scoring='accuracy')
knn_gridsearch_results = pd.DataFrame(grid_results_knn.cv_results_)
dt_gridsearch_results = pd.DataFrame(grid_results_dt.cv_results_)
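Before digging into the full results tables, note that a fitted GridSearchCV object also stores the winning hyperparameter setting and its mean cross-validated score directly. A quick sketch:
#best hyperparameter setting and its mean cross-validation accuracy for each search
print(grid_results_knn.best_params_, grid_results_knn.best_score_)
print(grid_results_dt.best_params_, grid_results_dt.best_score_)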
Now that we’ve run our cross-validated grid search, let’s see the results for each.
py$knn_gridsearch_results %>%
arrange(-mean_test_score) %>%
select(param_model__n_neighbors, mean_test_score, std_test_score) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "striped"))
param_model__n_neighbors | mean_test_score | std_test_score |
---|---|---|
11 | 0.9564103 | 0.0028828 |
15 | 0.9563077 | 0.0018462 |
7 | 0.9557949 | 0.0024997 |
21 | 0.9557949 | 0.0024786 |
5 | 0.9536410 | 0.0045245 |
41 | 0.9520000 | 0.0023071 |
3 | 0.9490256 | 0.0041989 |
1 | 0.9308718 | 0.0025825 |
It looks like our best-performing nearest neighbor models use 11 and 15 neighbors. These models both sit at about 95.6% accuracy, so there isn’t much difference in estimated (average) performance. The standard deviation in fold accuracies is higher with 11 neighbors than with 15. The choice of which setting to consider optimal is yours, but it should be justified.
py$dt_gridsearch_results %>%
arrange(-mean_test_score) %>%
select(param_model__max_depth, mean_test_score, std_test_score) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "striped"))
param_model__max_depth | mean_test_score | std_test_score |
---|---|---|
5 | 0.9585641 | 0.0019297 |
4 | 0.9571282 | 0.0016410 |
8 | 0.9556923 | 0.0023745 |
3 | 0.9555897 | 0.0023523 |
2 | 0.9554872 | 0.0023253 |
10 | 0.9528205 | 0.0016217 |
12 | 0.9468718 | 0.0027825 |
15 | 0.9412308 | 0.0042857 |
20 | 0.9369231 | 0.0047557 |
Check this out! It looks like our optimized decision tree classifier (max_depth = 5) beats our best KNN classifier. The accuracy for this model is nearly 95.9%, with a standard deviation in fold accuracies comparable to the KNN models. Hyperparameter tuning helps improve your models on the margins, but sometimes these improvements can be quite significant.
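If we wanted to move forward with the tuned decision tree, one possible next step (not shown above) would be to confirm its performance on the test set. Since GridSearchCV() refits the best model on the full training set by default, its best_estimator_ attribute is ready to use. A sketch:
#the best pipeline was already refit on the full training set (refit = True by default)
best_tree = grid_results_dt.best_estimator_
tree_test_preds = best_tree.predict(X_test)
accuracy_score(y_test, tree_test_preds)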