Purpose: In this notebook we’ll introduce the notion of Principal Component Analysis (PCA) for dimension reduction. In particular we’ll see how PCA can help address the curse of dimensionality and how to include it as a feature engineering step in both {tidymodels} and {sklearn}.
We’ll start with a very simple, small example with just two features since that is easy to visualize.
Looking at the plot, you may notice that \(X_1\) and \(X_2\) are correlated with one another. If you’ve taken linear algebra, you may be able to recognize that the basis vectors corresponding to \(X_1\) (that is, \(\left[\begin{array}{c} 1\\ 0\end{array}\right]\)) and \(X_2\) (that is, \(\left[\begin{array}{c} 0\\ 1\end{array}\right]\)) are perhaps not optimal for representing this data. Principal Component Analysis is a method for changing to a more appropriate basis. If you haven’t taken linear algebra, then all you need to know is that principal components are a way to more efficiently encode our data, while maintaining as much of the variability as possible.
The first principal component will identify the direction through the “scattercloud” of data which captures as much of the variability as possible. In some sense, you can think of this as the first principal component finding the direction of the longest “diameter” through the scattercloud. The second principal component will do the same thing, but with the restriction that the direction must be orthogonal to (at a 90-degree angle with) the first principal component. The third principal component will do the same thing, but must be orthogonal to the first two principal components, and so on.
Seeing this with our small example will help!
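To make this concrete, here is a minimal sketch in base R. The simulated x1 and x2 below are stand-ins for the small example (not the notebook’s actual data), chosen only so the code runs on its own:

# simulate two correlated features, similar in spirit to the small example
set.seed(123)
x1 <- rnorm(100)
x2 <- 2 * x1 + rnorm(100, sd = 0.25)
toy_df <- data.frame(x1 = x1, x2 = x2)

# prcomp() computes the principal components; centering and scaling put the
# two features on comparable footing before finding the new directions
toy_pca <- prcomp(toy_df, center = TRUE, scale. = TRUE)

# each column of the rotation matrix is a principal component direction:
# the first points along the long axis of the scattercloud, the second is
# orthogonal to it
toy_pca$rotation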
Note that even if we just kept the first principal component and dropped the second principal component (going from two variables to one variable), we would maintain over 99.5% of the variation in the original data. We can see this in the table below.
| terms | value | component | id |
|---|---|---|---|
| percent variance | 99.6123102 | 1 | pca_IDiSB |
| percent variance | 0.3876898 | 2 | pca_IDiSB |
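Continuing the hedged sketch above (still using the simulated toy_df rather than the notebook’s data), the percent variance captured by each component can be computed from the standard deviations that prcomp() returns:

# percent of total variance attributable to each principal component
100 * toy_pca$sdev^2 / sum(toy_pca$sdev^2)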
Now, can you imagine how helpful this might be if we had hundreds or thousands of available predictors? Perhaps we could encode much of the variability across 100 predictors in just five or ten principal components.
One drawback of using PCA is that the resulting model uses linear combinations of the original predictors. That is, our model is not directly interpretable with respect to the original variables.
Principal Component Analysis is often used to mitigate what is referred to as the curse of dimensionality. The basic idea is this: the region of the feature space required to contain an expected proportion of observations grows exponentially in the number of dimensions. That is, data requirements explode as more features are utilized. An example will help.
Example: Consider a collection of features \(X_1, X_2, \ldots, X_p\), the observed values for which are all uniformly distributed over the interval \(\left[0, 1\right]\).

- In the single-variable case (\(X_1\)), an interval of width \(0.1\) in \(X_1\) would be expected to contain about 10% of all of the observations.
- In the two-variable case (\(X_1\) and \(X_2\)), intervals of width \(0.1\) in \(X_1\) and \(X_2\) would result in a square region that is only expected to contain about 1% of all of the observations. In order to build a region expected to contain 10% of all observations, we would need intervals of width over \(0.3\).
- In the three-variable case (\(X_1\), \(X_2\), and \(X_3\)), intervals of width \(0.1\) in \(X_1\), \(X_2\), and \(X_3\) would result in a cube region that is only expected to contain about 0.1% of all of the observations. In order to build a region expected to contain 10% of all observations, we would need intervals of width over \(0.46\).

Put another way, if we had 1,000 observations, the expected number of observations falling into the interval of width \(0.1\) corresponding to \(X_1\) is 100. In the two-variable case, the number of observations expected to fall into a square region with side length \(0.1\) is 10. In the three-variable case, the cube is expected to contain just 1 observation. Once we move to four or more variables, essentially no observations are expected to fall in such a region.
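A short sketch of this arithmetic in R (assuming, as above, 1,000 observations uniformly distributed on the unit interval in each feature):

n <- 1000                         # number of observations
p <- 1:4                          # number of features in use
frac_in_box <- 0.1^p              # expected fraction falling in a box with side 0.1
expected_obs <- n * frac_in_box   # expected counts: 100, 10, 1, 0.1
side_for_10pct <- 0.1^(1 / p)     # side length needed to capture 10% of observations

data.frame(p, frac_in_box, expected_obs, side_for_10pct)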
The visuals below may help. Note that in the plot on the left, the “y”-coordinate is meaningless – I’ve added some random noise there to make the observations discernible from one another. In that plot we are only interested in the \(X_1\) (horizontal) position of the observations. In the plot on the right, we’ve added a dimension by plotting the \(X_1\) and \(X_2\) location of the observations.
Principal components help to mitigate the curse of dimensionality by compressing the feature space back down to fewer variables.
{tidymodels}
Principal Component Analysis is a feature engineering step (that is, it is a step_*() in a recipe()). We define our intention to use PCA as follows:
my_rec <- recipe(<your_formula>, data = train) %>%
step_pca()
By default, step_pca()
will try to find principal
components for all predictors. You may want to limit this step to
include all_numeric_predictors()
which can be done by
passing this as an argument to the step. Additionally, PCA is a
distance-sensitive process. For this reason, we should scale our numeric
predictors prior to using step_pca()
. That is, your recipe
is more likely to look like:
my_rec <- recipe(<your_formula>, data = train) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors())
with other feature engineering step_*()
functions likely
to be included in the pipeline as well.
There are several options and tunable hyperparameters associated with step_pca(). You are likely to use:

- num_comp, which determines the number of principal components to compute, or
- threshold, which determines the proportion of total variance that should be covered by the principal components.

Both of these directly impact the number of principal components which will result from using the recipe step. If threshold is used, then num_comp will be ignored.
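As a hedged sketch (the outcome ~ . formula and the train data frame here are placeholders, not objects from this notebook), the two options might look like:

# keep exactly five principal components
my_rec <- recipe(outcome ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 5)

# or keep however many components are needed to cover 90% of the total variance
my_rec <- recipe(outcome ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = 0.90)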
You can see the full {recipes} documentation for step_pca() here.
{sklearn}
Principal Component Analysis is a feature engineering step. This
means that it will be a component in a preprocessing
Pipeline()
. We’ll import PCA
from
sklearn.decomposition
and then use it as a column
transformer applied to numeric columns.
Note. This requires that the necessary imports for Pipeline(), SimpleImputer(), and StandardScaler() are also included.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# preprocessing for numeric columns: impute missing values, scale, then PCA
num_pipe = Pipeline([
    ("num_impute", SimpleImputer(strategy = "median")),
    ("norm", StandardScaler()),
    ("pca", PCA())
])
Running principal components analysis will require that we have no missing values, which is why we have the imputation step. Additionally, PCA is distance-based, so scaling your numeric predictors is necessary prior to running the procedure.
Now num_pipe
can be included in a
ColumnTransformer()
and overarching modeling
Pipeline()
as we’ve seen done in earlier notebooks.
There are several options and tunable hyperparameters associated with PCA(). You are likely to use:

- n_components, which determines the number of principal components to compute. Note that this can be set as an integer (the number of components to keep) or as a float between 0 and 1, in which case it is treated as the proportion of total variance that the components should cover (similar to threshold above).
You can see the full {sklearn} documentation for PCA here.
Principal Component Analysis is a technique using linear algebra to create a new set of synthetic features, principal components, from the original features present in a dataset. It is often the case that much of the variability in the original dataset can be encapsulated in a number of principal components much smaller than the size of the original feature set. In doing this, we decrease the number of features used in our model (reducing the risk of overfitting) and those principal components we’ve obtained are also uncorrelated with one another.
In our class meeting we’ll see how to use PCA to reduce the dimensionality in a dataset on gene-expressions in cancerous tumors.