import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score

Typical Steps: In this notebook, our goal is to identify a standard modeling framework and workflow that can be applied across many scenarios. The main steps appear below, and more detailed discussions of each follow.

While code will be exhibited in this notebook, no code will actually be run.

Splitting Data into Training and Test Sets

We always need to split our data into training and test sets. As a reminder, we can think of training data as observations that our models can practice on and learn from (like a practice exam) and test observations as “new” observations that our model(s) did not get to study from. These test observations play the role of future observations and help ensure that our model performs as expected when applied to observations it hasn’t seen before.

The functionality for splitting data into training and test sets appears below.

train, test = train_test_split(my_data, train_size = 0.75, random_state = 434)

Exploratory Analyses on Training Data

Exploratory data analysis (computing summary statistics and visualizing the data) is an important part of any modeling project. We must make sure to do this work with our training data (train) because otherwise, information from our test data (test) leaks into our training process. When that happens, performance measured on the test data is no longer an unbiased estimate of how the model will do on new observations, because we, and therefore our model, know information about the test set that we should not know.
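
As a minimal sketch, the usual pandas summaries can be computed on train only; the column name below is just a placeholder for a column in your own data.

#Exploratory summaries computed on the training data only
train.info()
train.describe()
train["some_column"].value_counts() #placeholder column name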

More on exploratory data analysis and data visualization is covered in the Day 2 and Day 4 notebooks. Since Python’s {plotnine} module is a port of R’s {ggplot2} package, you might also be interested in this review notebook on data visualization, even though it is written in R.

Separating Features and Response

Our models are going to learn from their training data. If we allow the models to see the response variable that they are trying to predict, they’ll quickly learn just to look at that column and report back its value. Unfortunately, such a model is only useful if we already know the value of the outcome variable – that is, such a model is not useful at all. For this reason, we’ll need to separate our response variable from the available features (predictors). The code to do this is below.

X_train = train.drop("response", axis = 1)
y_train = train["response"]
X_test = test.drop("response", axis = 1)
y_test = test["response"]

Now we are ready to start thinking about models.

Defining a Model Constructor

Notice that the code to import the DecisionTreeClassifier from sklearn.tree is included at the top of this notebook. It is polite to keep all of your import statements at the beginning of a notebook or script. We’ll always need to import any model constructors we wish to use. To create a model that can be fit to data, we define an instance of the model constructor class, and we can pass parameter values to it that override the default settings. Get in the habit of looking at the sklearn documentation for your model class to identify which parameters are available; you can typically find the documentation by searching for sklearn <model_class>. The code below creates two separate instances of decision tree classifiers: one tree is unconstrained, while the other is limited to a maximum depth of 4. (We’ll be talking more about tree-based models later in our course, so don’t worry if you aren’t quite sure what having a maximum depth of 4 means.)

#An unconstrained decision tree
dt_clf = DecisionTreeClassifier()

#A decision tree with maximum depth 4
dt_4_clf = DecisionTreeClassifier(max_depth = 4)

The result of the code above is two model instances, but neither has actually been fit to our data yet. They’re just objects that are waiting to learn.
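
If you want to double-check which settings an instance is using, sklearn estimators expose a get_params() method; a quick sketch appears below.

#Inspect the settings of each (unfitted) instance
dt_clf.get_params()
dt_4_clf.get_params()["max_depth"] #4; every other parameter is still at its default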

Creating a Transformation Pipeline

Making use of Pipeline()s and ColumnTransformer()s is an important part of improving your modeling workflow. The difference between them is that a ColumnTransformer() applies its transformations to specific columns, while a Pipeline() is applied to all columns passed to it. Making use of these structures will ensure that your training data, test data, and any new data go through the same transformations before being introduced to the model. They’ll also cut down on the amount of work you need to do and reduce the likelihood of errors. You should definitely use Pipeline()s in your workflow. The code required to construct an example Pipeline() appears below and is explained directly afterwards.

cat_cols = ["list", "categorical", "columns", "here"]
num_cols = ["list", "numeric", "columns", "here"]

cat_pipe = Pipeline([
    ("cat_imputer", SimpleImputer(strategy = "most_frequent")),
    ("one_hot", OneHotEncoder())
])
num_pipe = Pipeline([
    ("num_imputer", SimpleImputer(strategy = "median")),
    ("norm", StandardScaler())
])

preprocessor = ColumnTransformer([
  ("num", num_pipe, num_cols),
  ("cat", cat_pipe, cat_cols)
  ])

pipe = Pipeline([
    ("preprocessing", preprocessor),
    ("model", dt_clf)
])

There’s a lot to unpack here! We began by defining cat_cols and num_cols, lists of categorical and numeric column names, respectively. We did this because we’d like to apply different transformations to those columns. We then created a multi-step Pipeline() for the transformation of categorical variables and a similar one for the transformation of our numeric variables. Both Pipeline()s consist of an imputer to fill in any missing values followed by another transformer. The OneHotEncoder() takes categorical columns and breaks them out into a series of binary indicator columns, one for each level of the categorical variable. The StandardScaler() takes a numeric variable and standardizes it by subtracting its mean and dividing by its standard deviation.
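
If those two transformers are new to you, here is a tiny, made-up illustration of what each one does to a toy data frame (the .toarray() call simply converts the encoder’s sparse output so it prints nicely).

#A toy illustration of the two transformers
toy = pd.DataFrame({"color": ["red", "blue", "red"], "size": [10.0, 20.0, 30.0]})

#One indicator column per level, in alphabetical order: ["blue", "red"]
OneHotEncoder().fit_transform(toy[["color"]]).toarray()

#Each value becomes (value - column mean) / column standard deviation
StandardScaler().fit_transform(toy[["size"]])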

With those two separate Pipeline()s built, we combined them into a preprocessing ColumnTransformer() so that we could apply the Pipeline()s only to the columns for which they make sense. Finally, we constructed an overarching Pipeline() which contains all of our preprocessing steps as well as our model. Since the preprocessing steps and model are now all contained in a single Pipeline(), they form one unit which can be tuned, fit, and assessed.
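
As a hedged sketch of the “tuned” part (we have not covered tuning yet, and the max_depth values below are purely illustrative), GridSearchCV() can search over the parameters of any step in the pipeline using the "step_name__parameter" naming convention.

#In practice this import would live at the top of the notebook
from sklearn.model_selection import GridSearchCV

#Search over the max_depth parameter of the "model" step in the pipeline
param_grid = {"model__max_depth": [2, 4, 6, None]}
search = GridSearchCV(pipe, param_grid, cv = 5, scoring = "accuracy")

#search.fit(X_train, y_train)
#search.best_params_ and search.best_score_ would then hold the winning settings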

Note: If we want to see what our training data will look like after preprocessing, we are able to do so by fitting the preprocessor and temporarily transforming that set. The columns in the training set will change during preprocessing though, so we’ll need to extract the new column names from the preprocessor if we want to make sense of the resulting data frame. The following code will achieve all of this.

preprocessor.fit(X_train)

cat_columns = preprocessor.named_transformers_["cat"]["one_hot"].get_feature_names_out(cat_cols)
num_columns = preprocessor.named_transformers_["num"]["norm"].get_feature_names_out(num_cols)
columns = np.concatenate([num_columns, cat_columns])

display(X_train)
display(pd.DataFrame(preprocessor.transform(X_train), columns = columns))
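
On more recent versions of sklearn (roughly 1.0 and later), the fitted ColumnTransformer() can also report all of its output column names in one call, which is a shorter route to the same data frame; note that these names carry "num__" and "cat__" prefixes taken from the transformer names.

#A shorter alternative on recent sklearn versions
columns = preprocessor.get_feature_names_out()
pd.DataFrame(preprocessor.transform(X_train), columns = columns)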

Model Evaluation Using Cross-Validation Scoring

Okay, the good news is that the majority of the hard work is done. We’ll now want to assess our model’s performance using cross-validation. This is the work of the cross_val_score() function. That function will split the training data into folds, fit the modeling pipeline on all but one fold, score it on the held-out fold, and repeat until each fold has served as the hold-out set.

As has been the case throughout this notebook, the code for that appears below.

cv_results = cross_val_score(pipe, X_train, y_train, cv = 5, scoring = "accuracy")

cv_results
cv_results.mean()

The first line in the code chunk above would run cross-validation on five folds, recording the accuracy of each fitted model on its hold-out fold. The next line would print out the accuracies on the individual folds: we would have five accuracies printed out, one for each fold. The last line would compute the “cross-validation accuracy” by averaging the accuracies across the five folds.

Note: One of the problems that cross-validation seeks to solve is that model assessments conducted on a single validation set are not stable, because the validation set could be “too easy” or “too hard” just by random chance. Averaging over the fold performances results in a more stable performance estimate.
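
One quick way to gauge that stability for yourself is to look at the spread of the fold scores alongside their mean.

#Mean and spread of the fold accuracies
cv_results.mean()
cv_results.std() #a large spread suggests the estimate is less stable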

Identifying a Best Model

Generally, your best model will be the one that optimizes your cross-validation performance metric. That may mean minimizing a cross-validation error or maximizing an accuracy-style metric. You’ll cross-validate lots of modeling pipelines and identify your best one(s).
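
As a sketch of what that comparison might look like with the two trees defined earlier (building a second pipeline around dt_4_clf is just one reasonable way to organize this):

#Cross-validate both candidate trees and compare their mean accuracies
pipe_4 = Pipeline([
    ("preprocessing", preprocessor),
    ("model", dt_4_clf)
])

cv_unconstrained = cross_val_score(pipe, X_train, y_train, cv = 5, scoring = "accuracy")
cv_depth_4 = cross_val_score(pipe_4, X_train, y_train, cv = 5, scoring = "accuracy")

cv_unconstrained.mean(), cv_depth_4.mean() #the larger mean accuracy points to the better pipeline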

Fitting the Best Model to Training Data

The problem with the cross-validation I’ve shown you above is that it won’t leave you with a fitted model. You’ll need to go back and fit the model to your training data before you can use it. You can fit your modeling pipeline by calling the .fit() method with your training data (X_train and y_train) as the arguments. This is done using the code below.

pipe.fit(X_train, y_train)

Verifying Model Performance on Test Data

We can use our entire modeling pipeline to make predictions on new data by calling the .predict() method with the new data as the only argument. The new data, however, must have exactly the same columns as X_train (and in exactly the same order), otherwise errors are thrown. The code below uses your modeling pipeline to make predictions on your test set.

Aside. It is important to note that your test data will go through exactly the same transformations as the training data did. One small, but very important, aspect of this is that any preprocessing transformations will be made using parameters learned from the training set. For example, numeric columns will be standardized using the means and standard deviations learned from the training set.

pipe.predict(X_test)
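
To see the aside above in action, you can dig the fitted StandardScaler() back out of the pipeline; mean_ and scale_ are attributes it learned when pipe.fit() was called on the training data.

#The scaler's parameters come from X_train, not X_test
fitted_scaler = pipe.named_steps["preprocessing"].named_transformers_["num"]["norm"]
fitted_scaler.mean_ #per-column means learned from the training data
fitted_scaler.scale_ #per-column standard deviations learned from the training data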

In order to verify our model’s performance, however, we’ll need to compare these predictions to the known responses. The best way to do this is to store the model predictions and then to use your desired performance metric to compare the truth and your predictions. Again, the necessary code is below.

test_preds = pipe.predict(X_test)
model_accuracy = accuracy_score(y_test, test_preds)
model_accuracy

Now we can compare our model’s accuracy on the truly unseen test data to our results from cross-validation. Comparing this result to the accuracies achieved on the individual folds can help us determine whether accidental overfitting has occurred. If the metric measured on the test data is much poorer than the measurements on the individual folds, then something has gone wrong: you’ve leaked information to the model somehow, and you’ll need to go back and fix that issue. If the metric measured on the test set seems in line with the measurements on the individual cross-validation folds, then we can have confidence that we aren’t overfitting and we can move forward with using this model on new data.
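
A quick sketch of that side-by-side comparison:

#Put the cross-validation estimate and the test-set accuracy side by side
print("CV accuracy:  ", cv_results.mean())
print("Test accuracy:", model_accuracy)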

Making Predictions with a Fitted Model

Making predictions on new data is done in exactly the same way as making predictions on the test data was. Again, your new data must have exactly the same columns, in exactly the same order as X_train. Otherwise you’ll be met with an error. Example code is below again; you can compare it to the code used to make predictions on the test set to see that it is the same.

new_preds = pipe.predict(new_data)
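
If you are not certain that new_data is aligned with the training features, one hedged safeguard (assuming all of the training columns are present in new_data) is to reorder its columns to match X_train before predicting.

#Reorder the new data's columns to match the training features
new_preds = pipe.predict(new_data[X_train.columns])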

Summary

The detailed discussions above have broken up the {sklearn} workflow and perhaps made it seem daunting. Here is a condensed version:

#Training and test sets
train, test = train_test_split(my_data, train_size = 0.75, random_state = 434)

#Split features and response
X_train = train.drop("response", axis = 1)
y_train = train["response"]
X_test = test.drop("response", axis = 1)
y_test = test["response"]

#Create an instance of a model constructor
dt_clf = DecisionTreeClassifier()

#Create a Data Transformation and Model Pipeline
cat_cols = ["list", "categorical", "columns", "here"]
num_cols = ["list", "numeric", "columns", "here"]

cat_pipe = Pipeline([
    ("cat_imputer", SimpleImputer(strategy = "most_frequent")),
    ("one_hot", OneHotEncoder())
])
num_pipe = Pipeline([
    ("num_imputer", SimpleImputer(strategy = "median")),
    ("norm", StandardScaler())
])

preprocessor = ColumnTransformer([
  ("num", num_pipe, num_cols),
  ("cat", cat_pipe, cat_cols)
  ])

pipe = Pipeline([
    ("preprocessing", preprocessor),
    ("model", dt_clf)
])

#Run cross-validation to obtain cross-validation performance estimate
cv_results = cross_val_score(pipe, X_train, y_train, cv = 5, scoring = "accuracy")

#Collect cross-validation results
cv_results
cv_results.mean()

#Fit model to training data
pipe.fit(X_train, y_train)

#Assess model on test data
test_preds = pipe.predict(X_test)
accuracy_score(y_test, test_preds)

#Use model to predict for new data
pipe.predict(new_data)

The steps in the code block above can be copied/pasted/adapted while you continue to build familiarity with the modeling process and also with the {sklearn} framework. Generally, you’ll build several modeling pipelines and assess those models using cross-validation. Doing this allows you to identify your best performing models.