R – Scatterplot Smoothing with Large Dataset: Exploring Different Methods

generalized-additive-model, loess, r, smoothing

I have a large dataset (>300,000 rows) with two variables: y is binary and x is continuous and numeric. I'd like to plot y against x and add a smooth curve. I understand that loess(y ~ x) is a solution, but with such a big dataset it takes too long to run, even when I set the 'cell' parameter to 500.

scatter.smooth runs much faster, and I think it also uses loess, but I have trouble understanding the parameter 'evaluation = 50'. Does this mean it only uses 1/50 of the data to produce the smooth curve?

I also tried geom_smooth; it automatically switches to method = 'gam' since I have more than 1,000 data points, but the curve looks different from the one I got using scatter.smooth (I guess that's normal, as they are different models).
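For reference, ggplot2 documents that with 1,000 or more observations geom_smooth() defaults to mgcv::gam() with a cubic regression spline. Here is a sketch of that call made explicit; the data frame dat is hypothetical, standing in for the real data:

library(ggplot2)

# Equivalent to geom_smooth()'s large-n default; 'dat' is a hypothetical
# data frame holding the continuous x and binary y from the question.
ggplot(dat, aes(x, as.numeric(y))) +   # code the binary response as 0/1
  geom_point(alpha = 0.05) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))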

My goal is just to see the pattern in the data. Which smoothing method should I use? Can I trust scatter.smooth? What's the difference between using loess and gam?

Below is the plot from scatter.smooth. It looks good, but it runs so much faster than the regular loess() that I'm not sure how it works.

[plot: scatter.smooth output]

Using the method whuber provided:
[plot: smooth produced using whuber's method]

Any help would be highly appreciated!

Thanks

Best Answer

It's actually efficient and accurate to smooth the response with a moving-window mean: this can be done on the entire dataset with a fast Fourier transform in a fraction of a second. For plotting purposes, consider subsampling both the raw data and the smooth. You can further smooth the subsampled smooth. This will be more reliable than just smoothing the subsampled data.

Control over the strength of smoothing is achieved in several ways, adding flexibility to this approach:

  • A larger window produces stronger smoothing.

  • Values in the window can be weighted to create a continuous smooth.

  • The lowess parameters for smoothing the subsampled smooth can be adjusted.


Example

First let's generate some interesting data. They are stored in two parallel arrays, times and x (the binary response).

set.seed(17)
n <- 300000
times <- cumsum(sort(rgamma(n, 2)))  # Irregularly spaced, increasing times
times <- times/max(times) * 25       # Rescale so the times span (0, 25]
# Binary response: a noisy logistic function of a quadratic trend, thresholded at 1/2
x <- 1/(1 + exp(-seq(-1,1,length.out=n)^2/2 - rnorm(n, -1/2, 1))) > 1/2
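As a quick, illustrative check (not part of the original recipe): times should span the interval up to 25, and since x is logical, its mean gives the overall proportion of TRUE responses.

range(times)  # Spans roughly (0, 25] by construction
mean(x)       # Overall proportion of TRUE responses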

Here is the running mean applied to the full dataset. A fairly sizable window half-width (of $1172$) is used; this can be increased for stronger smoothing. The kernel has a Gaussian shape to make the smooth reasonably continuous. The algorithm is fully exposed: here you see the kernel explicitly constructed and convolved with the data to produce the smoothed array y.

k <- min(ceiling(n/256), n/2)  # Window half-width (1172 here)
kernel <- dnorm(seq(0, 3, length.out=k))  # Half of a Gaussian kernel
kernel <- c(kernel, rep(0, n - 2*length(kernel) + 1), rev(kernel[-1]))  # Mirror and pad to length n
kernel <- kernel / sum(kernel)  # Normalize the weights to sum to 1
y <- Re(convolve(x, kernel))    # FFT-based circular convolution
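If you want to verify the speed claim yourself, here is a hedged timing sketch; exact figures depend on your machine, but the FFT keeps this fast even at n = 300,000:

sum(kernel)                                # Sanity check: the weights sum to 1
system.time(y <- Re(convolve(x, kernel)))  # Typically a small fraction of a second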

Let's subsample the data at intervals of a fraction of the kernel half-width to ensure nothing gets overlooked:

j <- floor(seq(1, n, k/3)) # Indexes to subsample

In the example j has only $768$ elements representing all $300,000$ original values.
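You can confirm that count directly:

length(j)  # 768 subsample points standing in for all 300,000 values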

The rest of the code plots the subsampled raw data, the subsampled smooth (in gray), a lowess smooth of the subsampled smooth (in red), and a lowess smooth of the subsampled data (in blue). The last, although very easy to compute, will be much more variable than the recommended approach because it is based on a tiny fraction of the data.

plot(times[j], x[j], col="#00000040", xlab="x", ylab="y")  # Subsampled raw data
a <- times[j]; b <- y[j]   # Subsampled times and windowed-mean smooth
lines(a, b, col="Gray")    # The subsampled smooth
f <- 1/6                   # Strength (span) of the lowess smooths
lines(lowess(a, f=f)$y, lowess(b, f=f)$y, col="Red", lwd=2)      # Lowess of the smooth
lines(lowess(times[j], f=f)$y, lowess(x[j], f=f)$y, col="Blue")  # Lowess of the raw subsample
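A side note on the lowess calls above: they smooth the x- and y-coordinates separately against the observation index and then plot one against the other. Because times is monotone this works well; the more conventional two-argument form, which smooths the second argument against the first, should give a very similar red curve (a sketch, not the original code):

lines(lowess(a, b, f=f), col="Red", lwd=2)  # Two-argument lowess: smooth b against a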

[Figure: subsampled raw data with the gray, red, and blue smooths]

The red line (lowess smooth of the subsampled windowed mean) is a very accurate representation of the function used to generate the data. The blue line (lowess smooth of the subsampled data) exhibits spurious variability.