While studying kNN, I came across the term "adaptive k-nearest neighbor algorithm". I understand kNN, but I wonder how the adaptive variant works. I tried to find some materials, but failed to find anything on the Internet that I could reasonably understand. I hope to hear some explanations.
Machine Learning – Difference Between K-Nearest Neighbor and Adaptive K-Nearest Neighbor Algorithms
artificial intelligence, classification, machine learning
Related Solutions
This is the best visualization I can come up with to describe multi-label kNN. Let me know if you disagree.
In the plot below, each individual carries one or more of the labels {blue, orange, green}. As you can see, some individuals are both blue and orange, some green and orange. For the test subject I point to with the red arrow, the 7 nearest neighbors are probed.
From examining those 7 nearest neighbors, you get the histogram below, yielding a final class ranking of Blue = Orange > Green, meaning this test subject is blue or orange before it is green. I don't know precisely how this translates to class probabilities; I'd love to learn more.
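If it helps, here is a minimal NumPy sketch of that vote; the points, label sets, and function name below are made up for illustration rather than taken from the plot:

```python
import numpy as np
from collections import Counter

def multilabel_knn_rank(X_train, Y_train, x, k=7):
    """Rank labels for x by how often they appear among its k nearest neighbors.

    X_train : (n, d) array of training points
    Y_train : list of n label sets, e.g. [{"blue"}, {"blue", "orange"}, ...]
    x       : (d,) query point
    """
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points

    # Histogram of labels over the k neighbors; a single neighbor can
    # contribute several labels, since this is a multi-label setting.
    counts = Counter(label for i in nearest for label in Y_train[i])
    return counts.most_common()

# Tiny made-up example: 2-D points carrying one or two labels each
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0],
                    [0.15, 0.15], [0.9, 1.1], [0.05, 0.1], [0.3, 0.2]])
Y_train = [{"blue"}, {"blue", "orange"}, {"orange"}, {"green"},
           {"blue", "orange"}, {"green", "orange"}, {"blue"}, {"orange"}]

print(multilabel_knn_rank(X_train, Y_train, np.array([0.1, 0.1]), k=7))
# [('blue', 4), ('orange', 4), ('green', 1)]  -> blue and orange tie ahead of green
```

One common heuristic is to divide each count by $k$ to turn the ranking into per-label scores in $[0, 1]$, but these are just neighborhood frequencies, not calibrated class probabilities.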
You should first check out this Basic question about Fisher Information matrix and relationship to Hessian and standard errors
Suppose we have a statistical model (family of distributions) $\{f_{\theta}: \theta \in \Theta\}$. In the most general case we have $dim(\Theta) = d$, so this family is parameterized by $\theta = (\theta_1, \dots, \theta_d)^T$. Under certain regularity conditions, we have
$$I_{i,j}(\theta) = -E_{\theta}\Big[\frac{\partial^2 l(X; \theta)}{\partial\theta_i\partial\theta_j}\Big] = -E_\theta\Big[H_{i,j}(l(X;\theta))\Big]$$
where $I_{i,j}(\theta)$ is the $(i,j)$ entry of the Fisher information matrix (as a function of $\theta$), $X$ is the observed value (sample), and
$$l(X; \theta) = \ln(f_{\theta}(X)),\text{ for some } \theta \in \Theta$$
is the log-likelihood.
So the Fisher information matrix is the negated expected value of the Hessian of the log-likelihood, evaluated under some $\theta$.
Now let's say we want to estimate some vector function of the unknown parameter $\psi(\theta)$. Usually it is desired that the estimator $T(X) = (T_1(X), \dots, T_d(X))$ should be unbiased, i.e.
$$\forall_{\theta \in \Theta}\ E_{\theta}[T(X)] = \psi(\theta)$$
The Cramér-Rao lower bound (CRLB) states that for every unbiased $T(X)$, the covariance matrix $cov_{\theta}(T(X))$ satisfies
$$cov_{\theta}(T(X)) \ge \frac{\partial\psi(\theta)}{\partial\theta}I^{-1}(\theta)\Big(\frac{\partial\psi(\theta)}{\partial\theta}\Big)^T = B(\theta)$$
where $A \ge B$ for matrices means that $A - B$ is positive semi-definite, and $\frac{\partial\psi(\theta)}{\partial\theta}$ is simply the Jacobian of $\psi$. Note that if we estimate $\theta$ itself, that is $\psi(\theta) = \theta$, the above simplifies to
$$cov_{\theta}(T(X)) \ge I^{-1}(\theta)$$
But what does it tell us really? For example, recall that
$$var_{\theta}(T_i(X)) = [cov_{\theta}(T(X))]_{i,i}$$
and that for every positive semi-definite matrix $A$, the diagonal elements are non-negative:
$$\forall_i\ A_{i,i} \ge 0$$
From the above we can conclude that the variance of each estimated component is bounded from below by the corresponding diagonal element of $B(\theta)$:
$$\forall_i\ var_{\theta}(T_i(X)) \ge [B(\theta)]_{i,i}$$
So the CRLB does not tell us the variance of our estimator; it gives a lower bound against which we can check whether our estimator is optimal, i.e. whether it attains the lowest covariance among all unbiased estimators.
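To see the bound in action, consider the standard textbook example of estimating a normal mean with known variance: let $X = (X_1, \dots, X_n)$ be i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known and $\psi(\mu) = \mu$. Then
$$l(X;\mu) = \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2, \qquad \frac{\partial^2 l(X;\mu)}{\partial\mu^2} = -\frac{n}{\sigma^2}, \qquad I(\mu) = \frac{n}{\sigma^2},$$
so every unbiased estimator of $\mu$ has variance at least $\sigma^2/n$. The sample mean $\bar X$ is unbiased with $var_{\mu}(\bar X) = \sigma^2/n$, so it attains the bound and is efficient in this model.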
Best Answer
As you probably already know, in the classical $k$-Nearest Neighbor (kNN) algorithm, you choose a natural number $k$, and for any new example $X$, you assign to it the most represented class amongst its $k$ nearest neighbors $X_{(1)},\ldots,X_{(k)}$ in the dataset.
As briefly discussed in the Wikipedia article, however, the choice of $k$ is a big deal and will drastically impact the model's performance. There are different ways to choose $k$ in order to minimize the error made by the model, but it can never be perfect (a small $k$ implies high variance, while a high $k$ implies high bias).
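For concreteness, here is a minimal NumPy sketch of the classical rule just described (an illustrative implementation of my own, assuming a Euclidean metric):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Classical kNN: majority vote among the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)  # class histogram over those neighbors
    return votes.most_common(1)[0][0]             # most represented class wins
```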
That is the issue the adaptive $k$-NN algorithm tries to fix: instead of fixing one $\bf k$ with which predictions are made for all unseen $\bf X$, the value of $\bf k$ used to make the prediction depends on "where" in the feature space $\bf X$ is located.
In other words, if the $X$ for which we're trying to predict the class is in a "low-noise" region, the algorithm will choose a smaller value of $k$, whereas if $X$ is in a "very noisy" region, the adaptive $k$-NN will use a higher value of $k$ to make its prediction.
So that's the main idea. As for how exactly the rule for the choice of $k$ is made, different rules have been proposed by different authors. I refer you to the short (4 pages), sweet, and very clear paper by S. Sun and R. Huang for a more comprehensive introduction: An Adaptive $k$-Nearest Neighbor Algorithm (2010).
Edit: By "noisy region", I mean a region where both classes (in binary classification) are well represented, i.e. a region close to the decision boundary, as opposed to a "noiseless" region where all the points belong to one class.
To take your example, let's say you have recorded the height and weight of a bunch of mice, together with whether each mouse was obese or not, and you now want to predict whether a new mouse is obese given its height and weight. So your dataset is made of pairs $(X_i,y_i)$, with $X_i = (h_i,w_i) \in \mathbb R^2$ the predictor and $y_i \in \{0,1\}$ indicating whether the mouse is obese or not. In this scenario, it is intuitively easy to see that in the region where the weight $w$ is larger than (say) 100g, the mice will almost certainly be obese, so there is no need to compare a new point $X$ in that region to many other examples to decide. On the contrary, since height and weight are most likely strongly correlated, for simultaneously large $w$ and $h$ you will encounter many examples of both cases, so you might want to compare against a greater number of examples if you're in that region (near the decision boundary).
As for how to adaptively choose $k$ and figure out that a region is noisy, that is done through data-driven procedures. Here is a rough sketch of the method the authors use in the paper I mentioned above (see Section III.A and Table 1 in the paper):

Step 1: For each training point $X_i$, run the classical kNN classifier on the remaining training points with $k = 1, 2, \ldots, 9$ and record the smallest $k$ that classifies $X_i$ correctly; call this value $k(X_i)$.

Step 2: To classify a new point $X$, find its nearest training point and run the classical kNN vote for $X$ using that point's stored value of $k$.
Note that in Step 1, the value $9$ is an arbitrary choice made by the authors, and it is possible that no kNN algorithm with value $k\in\{1,\ldots,9\}$ correctly classifies $X_i$, in which case they just set $k(X_i) = 9$.
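To make that concrete, here is a minimal NumPy sketch of this kind of locally adaptive rule, based on my reading of the steps above; the function names, the Euclidean metric, and the leave-one-out loop are my own choices rather than code from the paper:

```python
import numpy as np
from collections import Counter

def _knn_vote(X, y, x, k):
    """Plain kNN majority vote for point x against the labeled points (X, y)."""
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

def fit_adaptive_k(X_train, y_train, k_max=9):
    """Step 1: for each training point X_i, store the smallest k in {1, ..., k_max}
    that classifies it correctly against the remaining training points,
    falling back to k_max if no such k exists.
    X_train and y_train are NumPy arrays."""
    n = len(X_train)
    k_opt = np.full(n, k_max, dtype=int)
    for i in range(n):
        mask = np.arange(n) != i                      # leave X_i out
        X_rest, y_rest = X_train[mask], y_train[mask]
        for k in range(1, k_max + 1):
            if _knn_vote(X_rest, y_rest, X_train[i], k) == y_train[i]:
                k_opt[i] = k
                break
    return k_opt

def predict_adaptive(X_train, y_train, k_opt, x):
    """Step 2: reuse the k stored for x's nearest training point."""
    nearest = int(np.argmin(np.linalg.norm(X_train - x, axis=1)))
    return _knn_vote(X_train, y_train, x, k_opt[nearest])
```

With this setup, training points deep inside a one-class region typically end up with $k(X_i) = 1$, so a new point falling there is decided by very few neighbors, while points near the decision boundary inherit larger values of $k$.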