I am taking a class in data mining and we have recently been introduced to bin smoothing in regression analysis, but I cannot seem to understand how the method works, why it works, or what makes it useful. Basically, we were shown an illustration of a data set that cannot be fit with a linear model, and the bin smooth was presented as a better model. See this lecture slide for the data set. It is not from my class, but the bin smooth on the slide is identical. Could someone explain the rationale behind using averages, as is done in a bin smooth?
Solved – Rationale for the use of Regressogram (Bin-Smooth)
binning, data visualization, regression, smoothing
Related Solutions
When you have many p-values, QQ plots of -log10 of them are useful. In R you can make these with e.g.
madeup.pvalues <- runif(10000, 0, 1)   # simulated null p-values
plot(x=-log10(ppoints(10000)),
     y=-log10(sort(madeup.pvalues)),
     xlab="Expected (-log10)", ylab="Observed (-log10)")
abline(0, 1, lty=2)   # dashed reference line
More deviation from the dashed line indicates more deviation from $U(0,1)$ p-values. In your example, you could try overlaying results from different choices of your tuning parameter in different colors, to see which one looks "best".
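For instance, a second, made-up batch of p-values (standing in here for results from another choice of the tuning parameter) could be overlaid on the same plot:

other.pvalues <- rbeta(10000, 0.8, 1)   # made up: mildly non-uniform
points(x=-log10(ppoints(10000)),
       y=-log10(sort(other.pvalues)),
       col="Red")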
It's actually efficient and accurate to smooth the response with a moving-window mean: this can be done on the entire dataset with a fast Fourier transform in a fraction of a second. For plotting purposes, consider subsampling both the raw data and the smooth. You can further smooth the subsampled smooth. This will be more reliable than just smoothing the subsampled data.
Control over the strength of smoothing is achieved in several ways, adding flexibility to this approach:

- A larger window increases the amount of smoothing.
- Values in the window can be weighted to create a continuous smooth.
- The lowess parameters for smoothing the subsampled smooth can be adjusted.
Example
First let's generate some interesting data. They are stored in two parallel arrays, times and x (the binary response).
set.seed(17)
n <- 300000
# Irregularly spaced observation times, rescaled to [0, 25]
times <- cumsum(sort(rgamma(n, 2)))
times <- times/max(times) * 25
# Binary response whose underlying probability varies slowly with time
x <- 1/(1 + exp(-seq(-1, 1, length.out=n)^2/2 - rnorm(n, -1/2, 1))) > 1/2
Here is the running mean applied to the full dataset. A fairly sizable window half-width (of $1172$) is used; this can be increased for stronger smoothing. The kernel has a Gaussian shape to make the smooth reasonably continuous. The algorithm is fully exposed: here you see the kernel explicitly constructed and convolved with the data to produce the smoothed array y.
k <- min(ceiling(n/256), n/2)   # Window half-width
kernel <- c(dnorm(seq(0, 3, length.out=k)))
# Wrap the kernel around to length n for circular convolution
kernel <- c(kernel, rep(0, n - 2*length(kernel) + 1), rev(kernel[-1]))
kernel <- kernel / sum(kernel)  # Normalize to unit total weight
y <- Re(convolve(x, kernel))    # FFT-based weighted running mean
Let's subsample the data at intervals of a fraction of the kernel half-width to ensure nothing gets overlooked:
j <- floor(seq(1, n, k/3)) # Indexes to subsample
In the example, j has only $768$ elements representing all $300,000$ original values.
The rest of the code plots the subsampled raw data, the subsampled smooth (in gray), a lowess smooth of the subsampled smooth (in red), and a lowess smooth of the subsampled data (in blue). The last, although very easy to compute, will be much more variable than the recommended approach because it is based on a tiny fraction of the data.
plot(times[j], x[j], col="#00000040", xlab="times", ylab="x")
a <- times[j]; b <- y[j]   # Subsampled smooth
lines(a, b, col="Gray")
f <- 1/6                   # Strength of the lowess smooths
lines(lowess(a, b, f=f), col="Red", lwd=2)
lines(lowess(times[j], x[j], f=f), col="Blue")
The red line (lowess smooth of the subsampled windowed mean) is a very accurate representation of the function used to generate the data. The blue line (lowess smooth of the subsampled data) exhibits spurious variability.
Best Answer
So we're saying "Y is related to X", but we don't know the form. We want to estimate $E(Y|X=x)$. But at any observation, we have noise, and we need to be able to estimate it between observations.
One (fairly naive) way we can approximate it is to assume that while the average changes, it doesn't change too rapidly (i.e. that it's 'slowly varying' in a particular sense). Consequently, we could slice the domain up into sections (bins) and calculate an estimate of $E(Y)$ - here just the sample mean of the $y$'s - for all the $x$-values in each section (bin).
That is to say, the average of the $y$'s in a narrow strip of $x$'s will typically be closer to the true mean at those $x$'s than an average taken over a wider strip (because the mean of $y$ varies over the range of $x$ - that is, lower bias), but much smoother than just taking, say, the $y$ at the nearest $x$-value (because you're typically averaging a whole bunch of observations - all the ones in the bin).
The graph then is just a horizontal line drawn within the bin, at the mean of the observations in that bin. Those horizontal bars will 'follow' the data simply because they're local means - within each bin they're the average of the observations in the bin. Obviously as the relationship moves up and down the local bin-means will also.
This is fairly simple and can cope with complicated relationships, if they don't change too rapidly.
But it's not smooth! There are discontinuities at the bin boundaries.
There's a tradeoff between bias (in the sense that wider bins mean that we move further from accuracy in approximating the mean at a particular $x$ as we get near the ends of the bin) and variance (in the sense that narrower bins mean we have less data, so the noise dominates more).
A slightly more sophisticated version of this kind of general idea would be Nadaraya-Watson kernel smoothing, in the same way that kernel density estimation is related to the histogram.
http://en.wikipedia.org/wiki/Kernel_regression
and then we can work from there up to say local linear or local polynomial smoothing - fitting local lines or curves rather than local means.
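As a quick illustration, R's built-in ksmooth computes a Nadaraya-Watson estimate and lowess fits local weighted linear regressions; here is a minimal sketch on made-up data (the bandwidth and span are arbitrary illustrative choices):

set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd=0.3)
plot(x, y)
# Nadaraya-Watson kernel regression (normal kernel)
lines(ksmooth(x, y, kernel="normal", bandwidth=2), col="Red", lwd=2)
# Local linear smoothing, for comparison
lines(lowess(x, y, f=1/3), col="Blue", lwd=2)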
Edit: Here's an outline of the basic steps in the example data you pointed to.
Here are the first few observations in the motorcycle data (which is in mcycle in the MASS package in R):

times  accel
  2.4    0.0
  2.6   -1.3
  3.2   -2.7
  3.6    0.0
  4.0   -2.7
  6.2   -2.7
  6.6   -2.7
  6.8   -1.3
  7.8   -2.7

Härdle (who I said hello to just a few days ago when he visited) does a regressogram of this data in "Applied Nonparametric Regression". He says he uses a bin width of 4 and that his bin origin is 0. This is plainly false, since he gets a lot more than 5 bins between 0 and 20. But let's take his suggestion of a bin width of 4 and an origin of 0. I'm going to run my bins to exclude the left limit and include the right one (simply because that's the way the R cut function does it; otherwise I'd be inclined to run the other way).

So that means our first bin runs from time 0 to time 4 ($(0,4]$). What are the accelerations in that bin?
The first 5 times (circled in red) are in the first bin. The next 4 times (in blue) are in the second. We average the accelerations in those time periods.
So $\hat{m}(x)$ is -1.34 for $x$ in $(0,4]$ and then it's -2.35 for $x$ in $(4,8]$ and then it's -2.844 ...
In R:
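One way to do it (a sketch, using cut to form the bins and tapply to average within them):

library(MASS)   # for the mcycle data

# Bins of width 4 with origin 0; cut() makes them left-open, right-closed
breaks <- seq(0, 60, by=4)
bins <- cut(mcycle$times, breaks)

# Regressogram: the mean acceleration within each bin
m.hat <- tapply(mcycle$accel, bins, mean)
round(m.hat[1:3], 3)   # the first three bin means quoted above

# Draw a horizontal segment at each bin mean
plot(accel ~ times, data=mcycle)
segments(head(breaks, -1), m.hat, tail(breaks, -1), m.hat, col="Red", lwd=2)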
So if we plot those values across those ranges:
Incidentally, Härdle's actual binwidth appears to be 2.1 - if I use that binwidth with bin origin 0, I reproduce his bin smooth:
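For instance, reusing the approach above with a width of 2.1 (a sketch; any empty bins yield NA means, which segments() simply skips):

breaks2 <- seq(0, 58.8, by=2.1)   # width 2.1, origin 0
bins2 <- cut(mcycle$times, breaks2)
m.hat2 <- tapply(mcycle$accel, bins2, mean)
plot(accel ~ times, data=mcycle)
segments(head(breaks2, -1), m.hat2, tail(breaks2, -1), m.hat2,
         col="Red", lwd=2)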