{sklearn} Workflow Overview
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score
Typical Steps: In this notebook, our goal is to identify a standard modeling framework and workflow that can be applied across many scenarios. The main steps appear below, and a more detailed discussion of each follows.
While code will be exhibited in this notebook, no code will actually be run.
We always need to split our data into training and test sets. As a reminder, we can think of training data as observations that our models can practice on and learn from (say a practice exam) and test observations as “new” observations that our model(s) were unable to study from. These test observations play the role of future observations and ensure that our model performs as expected when applied to observations it hasn’t seen before.
The functionality for splitting data into training and test sets appears below.
train, test = train_test_split(my_data, train_size = 0.75, random_state = 434)
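One optional refinement for classification problems: train_test_split() accepts a stratify argument, which preserves the class proportions of the response in both the training and test sets. A minimal sketch, assuming the response column is named "response", appears below.
#Stratified split: class proportions of "response" are preserved in train and test
train, test = train_test_split(my_data, train_size = 0.75, random_state = 434, stratify = my_data["response"])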
Exploratory data analysis, including computing summary statistics and engaging in data visualization, is an important part of any modeling project. We must make sure to do this work with our training data (train) only, because otherwise information from our test data (test) leaks into our training process. When that happens, performance on the test data is no longer an unbiased estimate of how the model will behave on new observations – we, and therefore our model, know information about the observations in the test set that we should not know.
More on exploratory data analysis and data visualization is covered in the Day 2 and Day 4 notebooks. Since the {plotnine} module is a port of R's {ggplot2} package, you might also be interested in this review notebook on data visualization, even though it is written in R.
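As a small, hedged illustration of what this early exploration might look like (the categorical column name below is a placeholder, not one from a real dataset), a few common first steps in pandas are:
#Summary statistics for the numeric columns of the training set
train.describe()
#Counts of each level of a (hypothetical) categorical column
train["some_categorical_column"].value_counts()
#Number of missing values in each column
train.isna().sum()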
Our models are going to learn from their training data. If we allow the models to see the response variable that they are trying to predict, they'll quickly learn just to look at that column and report back its value. Unfortunately, such a model is only useful if we already know the value of the outcome variable – that is, such a model is not useful at all. For this reason, we'll need to separate our response variable from the available features (predictors). The code to do this is below.
X_train = train.drop("response", axis = 1)
y_train = train["response"]
X_test = test.drop("response", axis = 1)
y_test = test["response"]
Now we are ready to start thinking about models.
Notice that the code to import the DecisionTreeClassifier from sklearn.tree is included at the top of this notebook. It is polite to keep all of your import statements at the beginning of a notebook or script, and we'll always need to import any model constructors we wish to use. To create a model that can be fit to data, we define an instance of the model class, and we can pass parameter values to the constructor to override its default settings. Get in the habit of looking at the sklearn documentation for your model class to identify which parameters are available; you can typically find the documentation by searching for sklearn <model_class>. The code below creates two separate instances of decision tree classifiers – one tree is unconstrained while the other is limited to a maximum depth of 4. (We'll be talking more about tree-based models later in our course, so don't worry if you aren't quite sure what having a maximum depth of 4 means.)
#An unconstrained decision tree
dt_clf = DecisionTreeClassifier()
#A decision tree with maximum depth 4
dt_4_clf = DecisionTreeClassifier(max_depth = 4)
The result of the code above is two model instances, but they aren't yet fit to our data. They are just objects waiting to learn.
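If you want to confirm exactly which settings an instance will use, every sklearn estimator exposes a .get_params() method that lists its parameters and their current values. For example:
#Inspect the settings stored on the depth-limited tree (max_depth should show as 4)
dt_4_clf.get_params()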
Making use of Pipeline()s and ColumnTransformer()s is an important aspect of improving your modeling workflow. The difference between a Pipeline() and a ColumnTransformer() is that a column transformer can be applied to specific columns, while a pipeline will be applied to all columns passed to it. Making use of these structures will ensure that your training data, test data, and any new data go through the same transformations before being introduced to the model. They'll also cut down on the amount of work you need to do and will reduce the likelihood of errors. You should definitely use Pipeline()s in your workflow. The code required to construct an example Pipeline() appears below and will be explained directly afterwards.
cat_cols = ["list", "categorical", "columns", "here"]
num_cols = ["list", "numeric", "columns", "here"]
cat_pipe = Pipeline([
    ("cat_imputer", SimpleImputer(strategy = "most_frequent")),
    ("one_hot", OneHotEncoder())
])
num_pipe = Pipeline([
    ("num_imputer", SimpleImputer(strategy = "median")),
    ("norm", StandardScaler())
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])
pipe = Pipeline([
    ("preprocessing", preprocessor),
    ("model", dt_clf)
])
There's a lot to unpack here! We began by defining cat_cols and num_cols, lists of the categorical columns and numerical columns respectively. We did this because we'd like to apply different transformations to those columns. We created a multi-step Pipeline() for the transformation of categorical variables and a similar one for the transformation of our numerical variables. Both Pipeline()s consist of an imputer to fill in any missing values followed by another transformer. The OneHotEncoder() takes categorical columns and breaks them out into a series of binary indicator columns – one for each level of the categorical variable. The StandardScaler() takes a numerical variable and standardizes it by subtracting its mean and dividing by its standard deviation.
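To build some intuition for what these two transformers do, here is a small self-contained sketch on made-up toy data (the column names and values are purely illustrative):
#One-hot encoding: a single categorical column becomes one indicator column per level
toy_cat = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
OneHotEncoder().fit_transform(toy_cat).toarray()
#Standardizing: subtract the column mean and divide by the column standard deviation
toy_num = pd.DataFrame({"height": [60.0, 65.0, 70.0, 75.0]})
StandardScaler().fit_transform(toy_num)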
With those two separate Pipeline()s built, we combined them into a preprocessing ColumnTransformer() so that we could apply the Pipeline()s only to the columns for which they make sense. Finally, we constructed an overarching Pipeline() which contains all of our preprocessing steps as well as our model. Since the preprocessing steps and model are now all contained in a single Pipeline(), they are a single unit which can be tuned, fit, and assessed.
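As a brief aside on the "tuned" part of that statement (we won't pursue tuning further in this notebook), bundling preprocessing and model into one Pipeline() is what lets tools like GridSearchCV from sklearn.model_selection tune the whole unit at once, reaching nested parameters with the step__parameter naming convention. A hedged sketch, assuming the pipe object defined above:
#Tune the tree's maximum depth through the pipeline (note the extra import)
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(pipe, param_grid = {"model__max_depth": [2, 4, 6]}, cv = 5, scoring = "accuracy")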
Note: If we want to see what our training data will look like after preprocessing, we are able to do so by fitting the preprocessor and temporarily transforming that set. The columns in the training set will change during preprocessing though, so we'll need to extract the new column names from the preprocessor if we want to make sense of the resulting data frame. The following code will achieve all of this.
#Fit the preprocessing steps to the training features
preprocessor.fit(X_train)
#Recover the post-transformation column names from each fitted sub-pipeline
cat_columns = preprocessor.named_transformers_["cat"]["one_hot"].get_feature_names_out(cat_cols)
num_columns = preprocessor.named_transformers_["num"]["norm"].get_feature_names_out(num_cols)
columns = np.concatenate([num_columns, cat_columns])
#Compare the original training features to their preprocessed version
display(X_train)
display(pd.DataFrame(preprocessor.transform(X_train), columns = columns))
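In recent versions of sklearn (roughly 1.0 and later), the fitted ColumnTransformer can also produce the transformed column names directly, which accomplishes the same thing in a single call (the names it returns carry prefixes such as "num__" and "cat__"):
#More concise alternative for recovering the post-preprocessing column names
columns = preprocessor.get_feature_names_out()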
Okay, the good news is that the majority of the hard work is done. We'll now want to assess our model's performance using cross-validation. This is the work of the cross_val_score() function. That function will split the training data into folds, fit the pipeline on all but one fold, and score it on the held-out fold, repeating the process for each fold. As has been the case in this notebook, the code for that appears below.
cv_results = cross_val_score(pipe, X_train, y_train, cv = 5, scoring = "accuracy")
cv_results
cv_results.mean()
The first line in the code chunk above would run cross validation on five folds, recording the accuracy of each fitted model on the hold-out fold. The next line would print out the accuracies on the individual folds – we would have five accuracies printed out, one for each fold. The last line would compute the “cross validation accuracy” by averaging the accuracies across the five folds.
Note: One of the problems that cross-validation seeks to solve is that model assessments conducted on a single validation set are not stable, because the validation set could be “too easy” or “too hard” just by random chance. Averaging over the fold performances results in a more stable performance estimate.
Generally, your best model will be the model that optimizes your cross-validation performance metric. That may be minimizing cross-validation error, or maximizing a type of accuracy metric. You’ll cross-validate lots of modeling pipelines and identify your best one(s).
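For instance, the two decision tree instances created earlier could each be wrapped in its own pipeline and compared on cross-validation accuracy. The sketch below reuses the preprocessor defined above; the pipeline names are just illustrative.
#Wrap each tree in a pipeline with the same preprocessing
pipe_unconstrained = Pipeline([("preprocessing", preprocessor), ("model", dt_clf)])
pipe_depth_4 = Pipeline([("preprocessing", preprocessor), ("model", dt_4_clf)])
#Cross-validate each and compare the average accuracies
cross_val_score(pipe_unconstrained, X_train, y_train, cv = 5, scoring = "accuracy").mean()
cross_val_score(pipe_depth_4, X_train, y_train, cv = 5, scoring = "accuracy").mean()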
The problem with the cross-validation I've shown you above is that it won't leave you with a fitted model. You'll need to go back and fit the model to your training data before you can use it. You can fit your modeling pipeline by using the .fit() method with your training data (X_train and y_train) as the arguments. This is done using the code below.
pipe.fit(X_train, y_train)
We can use our entire modeling pipeline to make predictions on new data by using the .predict() method with the new data as the only argument. The new data, however, must have exactly the same columns as X_train (and in exactly the same order); otherwise errors are thrown. The code below makes predictions on your test set by passing the test features to your fitted modeling pipeline.
Aside. It is important to note that your test data will go through exactly the same transformations as the training data did. One small, but very important, aspect of this is that any preprocessing transformations will be made using parameters learned from the training set. For example, numeric columns will be normalized using the mean and standard deviations learned from the training set.
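As a quick check of the aside above, the fitted pipeline exposes its preprocessing step, and the fitted StandardScaler stores the means and standard deviations it learned from the training data. A hedged sketch of pulling them out, using the step and transformer names defined earlier:
#The training-set means and standard deviations that will be used to scale any future data
fitted_scaler = pipe.named_steps["preprocessing"].named_transformers_["num"]["norm"]
fitted_scaler.mean_
fitted_scaler.scale_
With that confirmed, the prediction call on the test set is a single line.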
pipe.predict(X_test)
In order to verify our model’s performance, however, we’ll need to compare these predictions to the known responses. The best way to do this is to store the model predictions and then to use your desired performance metric to compare the truth and your predictions. Again, the necessary code is below.
test_preds = pipe.predict(X_test)
model_accuracy = accuracy_score(y_test, test_preds)
model_accuracy
Now we can compare our model’s accuracy on the truly unseen test data to our results from cross validation. Comparing this result to the accuracy achieved on the individual folds can help us determine whether accidental overfitting has occurred. If your metric measured on the test data is much poorer than the measurement on the individual folds then something has gone wrong. You’ve leaked information to the model somehow and you’ll need to go back and fix that issue. If the metric measured on the test set seems in line with the measurements on the individual cross-validation folds, then we can have confidence that we aren’t overfitting and we can move forward with using this model on new data.
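Accuracy is not the only option here. The other metrics imported at the top of this notebook can be computed in the same way; the sketch below assumes a binary 0/1 response and, for ROC AUC, uses the predicted probabilities of the positive class (which decision trees provide via .predict_proba()).
#Precision and recall compare the predicted labels to the truth
precision_score(y_test, test_preds)
recall_score(y_test, test_preds)
#ROC AUC uses predicted probabilities for the positive class rather than hard labels
test_probs = pipe.predict_proba(X_test)[:, 1]
roc_auc_score(y_test, test_probs)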
Making predictions on new data is done in exactly the same way as making predictions on the test data was. Again, your new data must have exactly the same columns, in exactly the same order, as X_train; otherwise you'll be met with an error. Example code is below again; you can compare it to the code used to make predictions on the test set to see that it is the same.
new_preds = pipe.predict(new_data)
The detailed discussions above have broken up the {sklearn} workflow and perhaps made it seem daunting. Here is a condensed version:
#training and test sets
train, test = train_test_split(my_data, train_size = 0.75, random_state = 434)
#Split features and response
X_train = train.drop("response", axis = 1)
y_train = train["response"]
X_test = test.drop("response", axis = 1)
y_test = test["response"]
#Create an instance of a model constructor
dt_clf = DecisionTreeClassifier()
#Create a Data Transformation and Model Pipeline
cat_cols = ["list", "categorical", "columns", "here"]
num_cols = ["list", "numeric", "columns", "here"]
cat_pipe = Pipeline([
    ("cat_imputer", SimpleImputer(strategy = "most_frequent")),
    ("one_hot", OneHotEncoder())
])
num_pipe = Pipeline([
    ("num_imputer", SimpleImputer(strategy = "median")),
    ("norm", StandardScaler())
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])
pipe = Pipeline([
    ("preprocessing", preprocessor),
    ("model", dt_clf)
])
#Run cross-validation to obtain cross-validation performance estimate
cv_results = cross_val_score(pipe, X_train, y_train, cv = 5, scoring = "accuracy")
#Collect cross-validation results
cv_results
cv_results.mean()
#Fit model to training data
pipe.fit(X_train, y_train)
#Assess model on test data
test_preds = pipe.predict(X_test)
accuracy_score(y_test, test_preds)
#Use model to predict for new data
pipe.predict(new_data)
The steps in the code block above can be copied/pasted/adapted while you continue to build familiarity with the modeling process and with the {sklearn} framework. Generally, you'll build several modeling pipelines and assess those models using cross-validation. Doing this allows you to identify your best performing models.