Solved – How to choose appropriate bandwidth for kernel regression

kernel-smoothing, nadaraya-watson, regression

I'm trying to understand how to choose an appropriate bandwidth for kernel regression. Note that this is NOT about kernel density estimation (unless someone can convince me that the same techniques can be used).

Here's my thinking: the bandwidth should be allowed to decrease as:

  1. more data are gathered;
  2. there are known variations/oscillations of a certain scale in the data (e.g. a sine wave with a period of roughly 0.5 units of the predictor variable).

These concepts are the same whether I'm talking about LOWESS or Nadaraya-Watson: both use a bandwidth during estimation.

I'm aware of Silverman's rule for KDE, but is there an equivalent for kernel regression that captures my intuition above?

Of course I can determine it experimentally using a brute-force grid search, but this is very computationally expensive and won't scale beyond two dimensions. Thank you.

Best Answer

I would recommend reading the excellent article by Racine and Li published in the Journal of Econometrics in 2004. They develop a framework for estimating regression functions nonparametrically using kernel methods with mixed covariate types (categorical and continuous regressors). Among other results, they show consistency of the cross-validated estimates. It is a classic article in the nonparametric econometrics literature.

The main method for choosing bandwidth parameters is, undoubtedly, cross-validation. However, other methods exist, such as bootstrapping (a quick Google search turns up a PhD thesis on choosing bandwidths for nonparametric kernel regression).
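To make the cross-validation idea concrete, here is a minimal sketch of leave-one-out CV for a Nadaraya-Watson estimator. The Gaussian kernel, the toy noisy-sine data, and the bandwidth grid are arbitrary choices of mine for illustration, not part of any reference implementation:

```python
import numpy as np

def nw_predict(x_train, y_train, x_eval, h):
    """Nadaraya-Watson estimate of E(Y|X=x) with a Gaussian kernel."""
    # Kernel weights between every evaluation point and every training point
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / h) ** 2)
    return w @ y_train / w.sum(axis=1)

def loocv_score(x, y, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    n = len(x)
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i           # drop observation i
        errs[i] = y[i] - nw_predict(x[mask], y[mask], x[i:i + 1], h)[0]
    return np.mean(errs ** 2)

# Toy data: a noisy sine with period 0.5, loosely matching the question
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 200)
y = np.sin(2 * np.pi * x / 0.5) + rng.normal(0, 0.3, len(x))

candidates = np.geomspace(0.05, 1.0, 20)   # arbitrary bandwidth grid
h_best = min(candidates, key=lambda h: loocv_score(x, y, h))
print(f"LOOCV-selected bandwidth: {h_best:.3f}")
```

Note that each bandwidth evaluation costs $O(n^2)$ kernel evaluations here, which is exactly the computational burden the next paragraph warns about.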

If you have a large sample, leave-one-out CV may not be the way to go for computational reasons. Moreover, if the data form a time series, the CV method may no longer be valid. What you can do instead is hold-out validation (a code sketch follows the steps below). Assume you have $T$ observations.

  1. Split the sample into two parts: an estimation sample (observations $1$ to $T-k$) and a hold-out sample (observations $T-k+1$ to $T$).
  2. Compute the estimator on the estimation sample (first $T-k$ observations) as a function of $h$.
  3. Compute the out-of-sample predictions for the hold-out sample (last $k$ observations) as a function of $h$.
  4. Minimize the squared prediction error with respect to $h$.
  5. Recompute $h$ by the same procedure as new data come in.
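Here is a minimal sketch of steps 1 through 4, under the same assumptions as the sketch above (Gaussian kernel, arbitrary bandwidth grid):

```python
import numpy as np

def nw_predict(x_train, y_train, x_eval, h):
    """Nadaraya-Watson estimate of E(Y|X=x) with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / h) ** 2)
    return w @ y_train / w.sum(axis=1)

def holdout_bandwidth(x, y, k, candidates):
    """Steps 1-4: estimate on the first T-k observations, score on the last k."""
    x_est, y_est = x[:-k], y[:-k]    # estimation sample (obs 1 to T-k)
    x_out, y_out = x[-k:], y[-k:]    # hold-out sample (obs T-k+1 to T)

    def sq_pred_error(h):
        # Out-of-sample squared prediction error as a function of h
        return np.mean((y_out - nw_predict(x_est, y_est, x_out, h)) ** 2)

    return min(candidates, key=sq_pred_error)

# Example: hold out the last 50 of 200 observations
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, 200))
y = np.sin(2 * np.pi * x / 0.5) + rng.normal(0, 0.3, len(x))
h = holdout_bandwidth(x, y, k=50, candidates=np.geomspace(0.05, 1.0, 20))
# Step 5 amounts to re-running holdout_bandwidth as new data come in.
```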

Also, while the Nadaraya-Watson estimator is indeed a nonparametric kernel estimator, the same is not true of LOWESS, which is a local polynomial regression method. You could also fit your regression function using sieves (i.e. through a basis expansion of the function), based on wavelets for example, given the structure of your data.
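If you want to see the sieve idea in code, here is a toy least-squares sketch. I use a cosine basis for simplicity where the answer suggests wavelets, but the pattern is the same: build a design matrix of basis functions, fit by least squares, and tune the truncation level $J$ instead of a bandwidth:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x / 0.5) + rng.normal(0, 0.3, len(x))

# Design matrix of the first J basis functions (a cosine basis here; a
# wavelet basis would follow the same pattern). The truncation level J
# plays the role that the bandwidth plays in kernel methods.
J = 15
B = np.cos(np.pi * np.outer(x, np.arange(J)))   # shape (n, J); column 0 is the intercept

coef, *_ = np.linalg.lstsq(B, y, rcond=None)    # least-squares sieve fit
fitted = B @ coef
```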

Finally, nonparametric kernel estimation of densities is extremely similar to estimation of the conditional mean, which is what you have in mind when you talk about 'regression'. A nonparametric kernel regression estimates $E(Y|X)$, where $Y$ is the dependent variable and $X$ is a (hopefully) exogenous predictor. Replacing $Y$ by $I(Y\leq y)$, where $I$ denotes the indicator function that equals $1$ when the event inside the brackets occurs, gives the conditional mean $E(I(Y\leq y)|X) = P(Y\leq y|X)$. Now run a batch of nonparametric kernel regressions of $I(Y\leq y)$ on $X$ for various values of $y$: this yields an estimate of the conditional cumulative distribution function of $Y$ given $X$. Taking the derivative with respect to $y$ then gives a density. So what's the difference after all? Just the choice of dependent variable. The method is the same, unless perhaps you want to rescale and impose restrictions so that the estimate is a valid CDF.
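As a quick illustration of the indicator trick (again with an arbitrary Gaussian kernel, toy data, and a bandwidth fixed by hand rather than chosen by CV):

```python
import numpy as np

def nw_predict(x_train, y_train, x_eval, h):
    """Nadaraya-Watson estimate of E(Y|X=x) with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x_eval[:, None] - x_train[None, :]) / h) ** 2)
    return w @ y_train / w.sum(axis=1)

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 500)
y = np.sin(x) + rng.normal(0, 0.3, len(x))

x0 = np.array([2.0])                  # condition on X = 2
y_grid = np.linspace(-1.0, 2.0, 100)  # values of y to scan
h = 0.3                               # bandwidth fixed by hand for the example

# One regression of I(Y <= y) on X per grid value: F(y|x0) = E[I(Y <= y)|X = x0]
cdf = np.array([nw_predict(x, (y <= yv).astype(float), x0, h)[0] for yv in y_grid])

# Differentiate the estimated conditional CDF to get a conditional density.
# Note the raw estimate need not be monotone in y, hence the caveat above
# about rescaling and imposing restrictions.
pdf = np.gradient(cdf, y_grid)
```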
