Purpose: In this notebook we introduce the \(k\) nearest neighbors classifier. In particular, we note that

- the \(k\) in the name of the classifier indicates the number of voting neighbors when a prediction is made, and
- nearest neighbors is a distance-based model.
A nearest neighbor model assumes that observations are most like those nearest to them. The nearest neighbor model makes predictions on a new observation in the following way:
1. Identify the \(k\) observations which are closest to the new observation. (Note: It is possible to use a variety of distance metrics, not just the Euclidean metric.)
2. Allow those \(k\) “nearest neighbors” to vote on the response for the new observation.
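To make the two-step procedure concrete, here is a minimal from-scratch sketch in Python. It is a toy illustration with made-up data and the Euclidean metric, not the implementation we will use in class.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from x_new to every stored training observation.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]          # indices of the k closest observations
    # Step 2: let those k neighbors vote on the response.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny hypothetical example: two predictors, two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([0.9, 1.1]), k=3))  # predicts "A"
```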
The following visual may be helpful.
Given this two-stage process, you might note that a nearest neighbor model isn’t actually trained. The “locations” and responses for the training data are simply stored, and when a prediction is to be made, the algorithm is run. This has several implications.

- The nearest neighbor approach is not really a model; it is an algorithm for obtaining predictions.
- The nearest neighbor approach is expensive in terms of memory requirements because the entire “training” set must be stored in perpetuity.
- The nearest neighbor approach is expensive in terms of compute time because the algorithm must be run in full every time a new prediction is to be made.
- The nearest neighbors algorithm is non-parametric; no functional form is estimated.
While nearest neighbor approaches have the drawbacks cited above, they do have some significant advantages. In particular, they can approximate very complex decision boundaries. We can also tune the parameter \(k\) to obtain an appropriately flexible boundary.
It is worth noting that \(k\) nearest neighbors can be used in both the classification and regression settings.
{tidymodels}
A KNN classifier is a model class (that is, a model specification). We define our intention to build a KNN classifier using:

```r
knn_clf_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
```
K Nearest Neighbors can be used for both regression and classification. For this reason, the call to `set_mode()` is required when declaring the model specification. The call to `set_engine()` above is unnecessary since `"kknn"` is the default engine, but there are other available engines.
Like other model classes, nearest neighbors has tunable hyperparameters. They are

- `neighbors`, which determines the number of voting neighbors used to determine the model’s ultimate prediction on a new observation.
- `weight_func`, which determines the kernel used to weight the neighbors’ votes by their distance from the new observation. Kernels beyond the unweighted (`"rectangular"`) one are available. (The distance calculation itself is governed by the `dist_power` argument, with `dist_power = 2` corresponding to Euclidean distance.)
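As an illustration only (the values below are arbitrary choices, not recommendations, and the object name is just illustrative), both arguments can be set directly in the specification; they could also be marked with `tune()` and chosen by cross-validation.

```r
# A sketch of a specification with explicit hyperparameter values:
# 10 voting neighbors and a triangular distance-weighting kernel.
knn_weighted_spec <- nearest_neighbor(neighbors = 10, weight_func = "triangular") %>%
  set_engine("kknn") %>%
  set_mode("classification")
```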
You can see the full {parsnip} documentation for `nearest_neighbor()` here.
{sklearn}
A KNN classifier is a model class. We first import `KNeighborsClassifier` from `sklearn.neighbors` and then create an instance of the model constructor using:

```python
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
```
Note that, since KNN is a distance-based model, all numerical predictors should be scaled. Additionally, using categorical predictors with KNN models makes assumptions that aren’t well-supported. That being said, we can keep calm and cross-validate to determine whether using them improves model performance.
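To make the scaling and cross-validation advice concrete, here is a minimal sketch on a small synthetic dataset (not data from this notebook). It standardizes the predictors inside a `Pipeline` and cross-validates the whole workflow.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

knn_pipeline = Pipeline([
    ("scale", StandardScaler()),       # put all numeric predictors on a common scale
    ("knn", KNeighborsClassifier())    # default of 5 voting neighbors
])

# Cross-validation estimates whether this workflow performs well.
scores = cross_val_score(knn_pipeline, X, y, cv=5)
print(scores.mean())
```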
Like other model classes, nearest neighbors has tunable hyperparameters. The ones you are most likely to utilize are

- `n_neighbors`, which determines the number of voting neighbors used to determine the model’s ultimate prediction on a new observation. The default is 5.
- `metric`, which determines how we’ll compute distances between observations. Metrics other than the Euclidean distance are available.
- `p`, an additional parameter associated with the metric which determines the powers used in the distance calculation. For example, `p = 2` corresponds to Euclidean distance \(\displaystyle{\sqrt{d_1^2 + d_2^2 + \cdots + d_k^2}}\), where the \(d_i\) are the coordinate-wise differences between two observations. In general, the `p` parameter governs the powers and the order of the root used: \(\displaystyle{\left(|d_1|^p + |d_2|^p + \cdots + |d_k|^p\right)^{1/p}}\).
- `weights`, which determines whether votes will be weighted or not. The possible values are `uniform` (all votes are equal) or `distance` (closer neighbors have more powerful votes).
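For illustration (the particular values are arbitrary, not recommendations), each of these is set directly in the constructor:

```python
from sklearn.neighbors import KNeighborsClassifier

# 10 voting neighbors, distance-weighted votes, and the Minkowski metric with
# p = 1 (Manhattan distance); p = 2 would recover the usual Euclidean distance.
knn_clf = KNeighborsClassifier(n_neighbors=10, weights="distance",
                               metric="minkowski", p=1)
```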
You can see the full {sklearn} documentation for `KNeighborsClassifier()` here.
In this notebook we learned a bit about the \(k\) nearest neighbors algorithm. In particular, we saw that this is a distance-based algorithm which can be used for either regression or classification. We also saw that it is a non-parametric method which does not result in a model and cannot be interpreted. The algorithm is run every time a new prediction is to be made, making nearest neighbors an expensive algorithm in terms of both space and time. While nearest neighbors has its drawbacks, it is a powerful technique which can fit very complex decision boundaries in the classification setting and can approximate complex functions well in the regression setting. Nearest neighbors also makes no distributional assumptions about the underlying data – only that observations are most similar to those around them.
We’ll implement \(k\) nearest neighbors using the `nearest_neighbor()` model class from {tidymodels} in our next class meeting.