Machine Learning – Relationship Between K, Parameters, and Complexity in K-Nearest-Neighbor

hyperparameter, k-nearest-neighbour, machine-learning

I want to understand the relationship (if any) between parameters and hyperparameters in a k-nearest-neighbor (KNN) model and how they relate to complexity.

Assume

A model that should classify handwritten digits [0-9].

A dataset of N observations.

Below is my interpretation

Complexity in a KNN model is decided by the number of features (10 in this case), the size of our dataset (N), and the value of K. If we have K=1, we will have a very complex model that regards every datapoint and effectively takes into account N/1 = N parameters. Thereby a low K increases complexity by making the model regard every parameter.

Thinking about it as a linear problem, we would fit a very high-degree polynomial when classifying.

If we, however, increase K, we simultaneously lower the number of parameters the model has to take into account, thereby generating a less complex model. E.g. K=2 would result in fitting a model on N/2 parameters.

Again thinking about it as a linear problem, we would end up with a lower-degree polynomial.

If we set K = N, we would have an extremely simple model that regards N/N = 1 parameter. In linear terms this would mean a single line that tries to separate our data.
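To make the two extremes concrete, here is a minimal sketch (assuming scikit-learn and its bundled 8x8 digits dataset, neither of which is part of the question) comparing K=1 with K=N:

```python
# Minimal sketch (hypothetical setup): compare the two extremes K=1 and K=N
# on scikit-learn's bundled 8x8 digits dataset.
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)   # ~1800 samples, 64 pixel features, labels 0-9
N = len(X)

# K=1: every training point acts as its own "rule", so the training set is memorised.
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("train accuracy, K=1:", knn_1.score(X, y))      # 1.0

# K=N: every query is voted on by the entire training set,
# so the classifier always predicts the single most common class.
knn_N = KNeighborsClassifier(n_neighbors=N).fit(X, y)
print("train accuracy, K=N:", knn_N.score(X, y))      # roughly 1/10
```

With K=1 the training labels are reproduced exactly, while with K=N every query receives the same prediction.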

I would like some confirmation on whether I've understood this relationship between K, parameters, and complexity correctly.

Best Answer

As you noticed, defining complexity can be challenging. For models such as linear regression, complexity is a function of the number of parameters. It is much harder to define for other models: what would the complexity be for a random forest, a neural network, or a non-parametric model like $k$-NN? This is a non-trivial question. Questions like this were considered by researchers studying the (lack of a) bias-variance trade-off, and I recommend checking the papers mentioned in the linked answer. Those papers focus on the more abstract idea of model capacity, where models with more capacity are able to learn more complicated patterns, up to being able to fit completely random data.

As you noticed, $k$-NN with $k=1$ "learns" $N$ different "patterns" from the data, one per sample, while with $k=N$ it can only make one prediction (the overall majority class, or the average in regression). In that sense, we could say that $k$ controls the model's capacity, or complexity, even though the model doesn't have parameters.
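One way to see this empirically is to sweep $k$ and watch the training accuracy fall from perfect memorisation toward the majority-class baseline as $k$ grows. A minimal sketch, again assuming scikit-learn and its digits dataset (not part of the original answer):

```python
# Minimal sketch (hypothetical setup): training accuracy falls as k grows,
# which is one way to see k shrinking the effective capacity of k-NN.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 25, 125, len(X_tr)):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:4d}  train={knn.score(X_tr, y_tr):.3f}  test={knn.score(X_te, y_te):.3f}")
```

Small $k$ gives near-perfect training accuracy (high capacity, possible overfitting), while $k$ equal to the training-set size collapses the model to a single prediction (very low capacity).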
