Machine Learning – Difference Between K-Nearest Neighbor and Adaptive K-Nearest Neighbor Algorithms

artificial intelligence, classification, machine learning

While studying KNN, I came across the "Adaptive K-Nearest Neighbor algorithm". I understand KNN, but I wonder how Adaptive K-Nearest Neighbor works. I tried to find some materials on the Internet, but failed to find one I could reasonably understand. I hope to hear some explanations.

Best Answer

As you probably already know, in the classical $k$-Nearest Neighbor (kNN) algorithm, you choose a natural number $k$, and for any new example $X$, you assign to it the most represented class amongst its $k$ nearest neighbors $X_{(1)},\ldots,X_{(k)}$ in the dataset.
As briefly discussed in the Wikipedia article, however, the choice of $k$ is a big deal and will drastically impact the model's performance. There are different ways to choose $k$ in order to minimize the error made by the model, but no single choice is perfect (a small $k$ implies high variance, while a large $k$ implies high bias).
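
To make the classical rule concrete, here is a minimal sketch in plain NumPy (Euclidean distance, majority vote); the function name `knn_predict` and its interface are my own choices for illustration, not taken from any particular library:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Classify x_new by a majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training example
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Most represented class among their labels (ties go to the smaller label)
    return np.argmax(np.bincount(y_train[nearest]))
```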

That is the issue the adaptive $k$-NN algorithm tries to fix: instead of fixing one $\bf k$ with which predictions are made for all unseen $\bf X$, the value of $\bf k$ used to make the prediction depends on "where" in the feature space $\bf X$ is located.
In other words, if the $X$ for which we're trying to predict the class is in a "low-noise" region, the algorithm will choose a smaller value of $k$, whereas if $X$ is in a "very noisy" region, the adaptive $k$-NN will use a higher value of $k$ to make its prediction.

So that's the main idea. As for exactly how the rule for choosing $k$ is defined, different rules have been proposed by different authors. I refer you to the short (4 pages), sweet, and very clear paper by S. Sun and R. Huang for a more comprehensive introduction: An Adaptive $k$-Nearest Neighbor Algorithm (2010).


Edit: By "noisy region", I mean a region where both classes (in binary classification) are well represented, i.e. close to the decision boundary, as opposed to a "noiseless" region where all the points belong to one class.
To take your example, let's say you have recorded the height and weight of a bunch of mice, together with whether each was obese or not, and next you want to predict whether a new mouse is obese given its height and weight. So your dataset is made of pairs $(X_i,y_i)$, with $X_i = (h_i,w_i) \in \mathbb R^2$ the predictor and $y_i \in \{0,1\}$ indicating whether the mouse is obese or not. In this scenario, intuitively, it is easy to see that in the region where the weight $w$ is larger than (say) 100g, the mice will almost certainly be obese, so there is no need to compare a new point $X$ in that region to many other examples to decide. On the contrary, since height and weight are most likely highly correlated, for simultaneously large $w$ and $h$ you will encounter many examples of both classes, so you might want to compare to a greater number of examples if you're in that region (near the decision boundary).
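
If it helps to see this intuition on data, here is a small, entirely made-up simulation of such a mouse dataset; every constant, threshold, and noise level below is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mouse data: X_i = (h_i, w_i) with h_i in cm and w_i in grams,
# y_i = 1 if the mouse is obese. Every number below is made up.
n = 300
h = rng.normal(9.0, 1.5, n)                    # heights (cm)
w = 40 + 7 * h + rng.normal(0, 12, n)          # weights (g), correlated with height
# "Obese" means heavy relative to height, with some label noise so that
# the region near the decision boundary is genuinely ambiguous.
y = (w > 60 + 4 * h + rng.normal(0, 6, n)).astype(int)
X = np.column_stack([h, w])

# Far from the boundary the labels should be nearly pure (a small k suffices);
# near the boundary they are mixed (a larger k gives a more stable vote).
print("fraction obese among very heavy mice (w > 125g):", y[w > 125].mean())
print("fraction obese overall:", y.mean())
```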

As for how to adaptively choose $k$ and figure out that a region is noisy, that is done through data-driven procedures. Here is a sketch of the method the authors use in the paper mentioned above (see Section III.A and Table 1 in the paper):

  1. For each $X_i$ in the training set, compute the smallest $k=k(X_i) \in \{1,\ldots,9\}$ such that a kNN algorithm with that value of $k$ correctly classifies $X_i$.
  2. For a new datapoint $X$, find its nearest neighbor in the dataset $X_{i_0}$.
  3. Apply the kNN procedure to $X$ with $k = k(X_{i_0})$ computed in step 1.

Note that in Step 1, the value $9$ is an arbitrary choice made by the authors, and it is possible that no kNN algorithm with value $k\in\{1,\ldots,9\}$ correctly classifies $X_i$, in which case they just set $k(X_i) = 9$.
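
Here is a minimal sketch of those three steps in plain NumPy, not the authors' reference implementation. The function names are mine, and since the summary above does not say whether $X_i$ counts as its own neighbor in Step 1, this sketch leaves it out (otherwise $k=1$ would trivially always classify $X_i$ correctly):

```python
import numpy as np

def knn_label(X_train, y_train, x, k, exclude=None):
    """Majority vote among the k nearest training points to x.
    `exclude` lets a training point be left out of its own neighborhood."""
    dists = np.linalg.norm(X_train - x, axis=1)
    if exclude is not None:
        dists[exclude] = np.inf
    nearest = np.argsort(dists)[:k]
    return np.argmax(np.bincount(y_train[nearest]))

def fit_local_k(X_train, y_train, k_max=9):
    """Step 1: for each training point X_i, the smallest k in {1, ..., k_max}
    that classifies it correctly, falling back to k_max if none does."""
    local_k = np.full(len(X_train), k_max, dtype=int)
    for i, x in enumerate(X_train):
        for k in range(1, k_max + 1):
            if knn_label(X_train, y_train, x, k, exclude=i) == y_train[i]:
                local_k[i] = k
                break
    return local_k

def adaptive_knn_predict(X_train, y_train, local_k, x_new):
    """Steps 2-3: borrow k from the nearest training point, then run kNN."""
    i0 = np.argmin(np.linalg.norm(X_train - x_new, axis=1))
    return knn_label(X_train, y_train, x_new, local_k[i0])
```

With the toy mouse data above, you would call `local_k = fit_local_k(X, y)` once, then `adaptive_knn_predict(X, y, local_k, x_new)` for each new mouse; points deep inside a pure region typically end up with $k(X_i)=1$, while points near the decision boundary get larger values.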