Purpose: In this notebook we’ll introduce the notion of Principal Component Analysis (PCA) for dimension reduction. In particular we’ll see

The Big Idea

We’ll start with a very simple, small example with just two features since that is easy to visualize.

Looking at the plot, you may notice that X1 and X2 are correlated with one another. If you’ve taken linear algebra, you may be able to recognize that basis vectors corresponding to X1 (that is, \(\left[\begin{array}{c} 1\\ 0\end{array}\right]\)) and X2 (that is, \(\left[\begin{array}{r} 0\\ 1\end{array}\right]\)) are perhaps not optimal for representing this data. Principal components analysis is a method for changing to a more appropriate basis. If you haven’t taken linear algebra, then all you need to know is that Principal Components are a way to more efficiently encode our data, while maintaining as much of the variability as possible.

The first principal component will identify the direction through the “scattercloud” of data which captures as much of the variability as possible. In some sense, you can think of this as the first principal component finding the direction of the longest “diameter” through the scattercloud. The second principal component will do the same thing, but with the restriction that the direction must be orthogonal to (at a 90-degree angle with) the first principal component. The third principal component will do the same thing, but must be orthogonal to the first two principal components, and so on.

Seeing this with our small example will help!

Note that even if we just kept the first principal component and dropped the second principal component (going from two variables to one variable), we would maintain over 99.5% of the variation in the original data. We can see this in the table below.

terms value component id
percent variance 99.6123102 1 pca_IDiSB
percent variance 0.3876898 2 pca_IDiSB

Now, can you imagine how helpful this might be if we had hundreds or thousands of available predictors? Perhaps we could encode much of the variability across 100 predictors in just five or ten principal components.

One drawback of using PCA is that the resulting model uses linear combinations of the original predictors. That is, our model is not directly interpretable with respect to the original variables.

The Curse of Dimensionality

Principal Component Analysis is often used to mediate what is referred to as the curse of dimensionality. The basic idea is this that, the region of the feature-space required to contain an expected proportion of observations grows exponentially in the number of dimensions. That is, data requirements explode as more features are utilized. An example will help us.

Example: Consider a collection of features X1, X2, …, Xp the observed values for which are all uniformly distributed over the interval \(\left[0, 1\right]\).

Put another way, if we had 1,000 observations, the expected number of observations falling into the interval of width \(0.1\) corresponding to X1 is 100. In the two variable case, the number of observations expected to fall into a square region in the two variable case with side length of \(0.1\) is 10. In the three-variable case, the cube is expected to contain 1 observations. Once we move to four or more variables there are no observations expected.

The visuals below may help. Note that in the plot on the left, the “y”-coordinate is meaningless – I’ve added some random noise there to make the observations discernible from one another. In that plot we are only interested in the X1 (horizontal) position of the observations. In the plot on the right, we’ve added a dimension by plotting the X1 and X2 location of the observations.

Principal components helps to mediate the curse of dimensionality by compressing the feature space back down to fewer variables.

How to Implement in {tidymodels}

Principal Component Analysis is a feature engineering step (that is, it is a step_ in a recipe()). We define our intention to use PCA as follows

my_rec <- recipe(<your_formula>, data = train) %>%
  step_pca()

By default, step_pca() will try to find principal components for all predictors. You may want to limit this step to include all_numeric_predictors() which can be done by passing this as an argument to the step. Additionally, PCA is a distance-sensitive process. For this reason, we should scale our numeric predictors prior to using step_pca(). That is, your recipe is more likely to look like:

my_rec <- recipe(<your_formula>, data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors())

with other feature engineering step_*() functions likely to be included in the pipeline as well.

Hyperparameters and Other Extras

There are several options and tunable hyperparameters associated with step_pca)(). You are likely to use:

  • num_comp, which determines the number of principal components to compute, or
  • threshold, which determines the proportion of total variance that should be covered by the principal components.

Both of these directly impact the number of principal components which will result from using the recipe step. If threshold is used, then num_comp will be ignored.

You can see the full {recipe} documentation for step_pca() here.

How to Implement in {sklearn}

Principal Component Analysis is a feature engineering step. This means that it will be a component in a preprocessing Pipeline(). We’ll import PCA from sklearn.decomposition and then use it as a column transformer applied to numeric columns.

Note. This requires that the necessary imports for Pipeline(), SimpleImputer(), and StandardScaler() are also included.

from sklearn.decomposition import PCA

num_pipe = Pipeline([
  ("num_impute", SimpleImputer(strategy = "median")),
  ("norm", StandardScaler()),
  ("pca", PCA())
])

Running principal components analysis will require that we have no missing values, which is why we have the imputation step. Additionally, PCA is distance-based, so scaling your numeric predictors is necessary prior to running the procedure.

Now num_pipe can be included in a ColumnTransformer() and overarching modeling Pipeline() as we’ve seen done in earlier notebooks.

Hyperparameters and Other Extras

There are several options and tunable hyperparameters associated with PCA(). You are likely to use:

  • n_components, which determines the number of principal components to compute. Note that this can be

    • an integer which directly defines the number of principal components to keep, or
    • a float between 0.0 and 1.0 which determines the target explained variance to achieve

You can see the full {sklearn} documentation for PCA here.


Summary

Principal Component Analysis is a technique using linear algebra to create a new set of synthetic features, principal components, from the original features present in a dataset. It is often the case that much of the variability in the original dataset can be encapsulated in a number of principal components much smaller than the size of the original feature set. In doing this, we decrease the number of features used in our model (reducing the risk of overfitting) and those principal components we’ve obtained are also uncorrelated with one another.

In our class meeting we’ll see how to use PCA to reduce the dimensionality in a dataset on gene-expressions in cancerous tumors.