Solved – Intuitions about KNN and linear regression in a regression setting

Tags: cross-validation, k-nearest-neighbour, linear-model, regression

I'd like to develop some intuitions about linear regression and KNN in a regression setting.

Let's say I have the 6 datasets below. Each has one independent variable X and a corresponding real-valued Y, and let's assume the X values within each dataset do not overlap. Imagine that I split each dataset into a training set and a test set, apply KNN (with K = 3 or 5) and linear regression (without nonlinear kernels), and evaluate on the test set using MSE. The red lines are the regression lines I would get from linear regression. What I want to think about is how KNN and linear regression perform across datasets with different shapes and variabilities.
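The experiment described above can be sketched from scratch with NumPy alone; the quadratic dataset below is synthetic and only stands in for one of the curved datasets in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, 200)
y = X**2 + rng.normal(0.0, 0.5, 200)   # curved truth plus Gaussian noise

idx = rng.permutation(200)             # simple train/test split
tr, te = idx[:140], idx[140:]

def knn_predict(x0, K=3):
    """Average the y-values of the K training points nearest to x0."""
    nearest = np.argsort(np.abs(X[tr] - x0))[:K]
    return y[tr][nearest].mean()

a, b = np.polyfit(X[tr], y[tr], 1)     # ordinary least squares line

mse_knn = np.mean([(knn_predict(x0) - y0) ** 2 for x0, y0 in zip(X[te], y[te])])
mse_lin = np.mean((a * X[te] + b - y[te]) ** 2)
print(f"KNN MSE: {mse_knn:.3f}  linear MSE: {mse_lin:.3f}")
```

On data with this much curvature, KNN's test MSE should come out far below the straight-line fit's.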

[Figure: scatter plots of the six datasets, 1-1 through 3-2, each with a red regression line]

Here are my conjectures.

With dataset 1-1, KNN should perform better: although there is some linearity, the data disperse widely, which makes the relationship effectively non-linear. With dataset 1-2, I think KNN and linear regression will perform about the same, because the linearity and the variability of the data offset each other.

With 2-1, KNN should outperform linear regression because the data look very non-linear. KNN should also outperform it on 2-2, since that dataset is also very non-linear.

Dataset 3-1 seems to have a strong linear trend with Gaussian noise. Here linear regression should perform better: when KNN predicts for test points near the line, it averages neighbours that may lie far from the line. Dataset 3-2 seems to have the same linear trend, but with uniformly distributed noise. In this case I think KNN will do better, because the points are not concentrated around the line, which makes it hard for linear regression to predict well.
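As a quick numerical check of the 3-1 conjecture, here is a minimal NumPy-only sketch on synthetic linear-plus-Gaussian-noise data (not the actual dataset 3-1; the slope, intercept, and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, 400)
y = 2.0 * X + 1.0 + rng.normal(0.0, 1.0, 400)   # truly linear truth + Gaussian noise

idx = rng.permutation(400)                       # simple train/test split
tr, te = idx[:280], idx[280:]

def knn_predict(x0, K=3):
    """Average the y-values of the K training points nearest to x0."""
    nearest = np.argsort(np.abs(X[tr] - x0))[:K]
    return y[tr][nearest].mean()

a, b = np.polyfit(X[tr], y[tr], 1)               # ordinary least squares line

mse_knn = np.mean([(knn_predict(x0) - y0) ** 2 for x0, y0 in zip(X[te], y[te])])
mse_lin = np.mean((a * X[te] + b - y[te]) ** 2)
print(f"KNN MSE: {mse_knn:.3f}  linear MSE: {mse_lin:.3f}")
```

When the truth really is a line, the K = 3 average carries roughly (1 + 1/K) times the noise variance, so linear regression should win here.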

I think my reasoning is weak and somewhat unsystematic. Could you let me know how I should interpret these datasets in relation to KNN and linear regression?

Best Answer

Well, first you need to fix the settings of your models: the value of K for your KNN model and the functional form of your linear regression.

Dataset 2-2 has clear curvature, so a plain first-order linear regression will obviously perform badly from an MSE perspective.

With K between 6 and 9 and a good training set, I would say KNN will beat linear regression, because by nature it approximates the response from a local neighbourhood. If that neighbourhood contains 6-9 points, the approximation will be good enough.

In most of these cases the MSE of KNN is smaller than that of linear regression, and in the remaining cases it is only slightly worse (so KNN seems almost always better here). But KNN performs poorly in high-dimensional spaces: its MSE degrades much faster than that of linear regression. The intuition is that if a region of the space contains no training examples, a test point there must be estimated from distant examples.
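That high-dimensional intuition can be checked with a small NumPy-only sketch (the linear truth, the dimensions, and K below are illustrative assumptions, not from the datasets in the question):

```python
import numpy as np

rng = np.random.default_rng(2)

def compare(d, n=300, K=5):
    """Test MSE of KNN vs. OLS on a linear truth in d dimensions."""
    w = np.ones(d)                                  # assumed linear truth y = sum(x) + noise
    X = rng.uniform(-1.0, 1.0, (n, d))
    y = X @ w + rng.normal(0.0, 0.2, n)
    tr, te = np.arange(200), np.arange(200, n)

    # KNN: average the K nearest training responses (Euclidean distance)
    mse_knn = 0.0
    for i in te:
        dist = np.linalg.norm(X[tr] - X[i], axis=1)
        mse_knn += (y[tr][np.argsort(dist)[:K]].mean() - y[i]) ** 2
    mse_knn /= len(te)

    # OLS with an intercept column, via least squares
    A = np.c_[X[tr], np.ones(len(tr))]
    coef, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
    pred = np.c_[X[te], np.ones(len(te))] @ coef
    mse_ols = np.mean((pred - y[te]) ** 2)
    return mse_knn, mse_ols

for d in (1, 20):
    k, o = compare(d)
    print(f"d={d:2d}  KNN MSE={k:.3f}  OLS MSE={o:.3f}")
```

With a fixed number of training points, the nearest "neighbours" in 20 dimensions are no longer local, so KNN's MSE blows up while OLS stays near the noise floor.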

ISLR gives a good explanation of both cases in Chapter 3: http://www-bcf.usc.edu/~gareth/ISL/
