Solved – How to decide what span to use in LOESS regression in R

loess · r · regression

I am running LOESS regression models in R, and I want to compare the outputs of 12 different models with varying sample sizes. I can describe the actual models in more detail if it helps with answering the question.

Here are the sample sizes:

Fastballs vs RHH 2008-09: 2002
Fastballs vs LHH 2008-09: 2209
Fastballs vs RHH 2010: 527 
Fastballs vs LHH 2010: 449

Changeups vs RHH 2008-09: 365
Changeups vs LHH 2008-09: 824
Changeups vs RHH 2010: 201
Changeups vs LHH 2010: 330

Curveballs vs RHH 2008-09: 488
Curveballs vs LHH 2008-09: 483
Curveballs vs RHH 2010: 213
Curveballs vs LHH 2010: 162

The LOESS regression model is a surface fit, where the X location and the Y location of each baseball pitch are used to predict sw, the swinging strike probability. I'd like to compare all 12 of these models, but setting the same span (e.g. span = 0.5) will yield different results because the sample sizes vary so widely.
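For concreteness, a surface fit like the one described might look like the sketch below. The column names px (horizontal location), pz (vertical location), and sw (swinging strike indicator) are hypothetical stand-ins, and the data are synthetic:

```r
# Minimal sketch of a 2-predictor LOESS surface fit; px, pz, sw are
# assumed column names, and the data here are random placeholders.
set.seed(1)
pitches <- data.frame(px = runif(300, -1.5, 1.5),  # horizontal location
                      pz = runif(300, 1, 4))       # vertical location
pitches$sw <- rbinom(300, 1, 0.2)                  # 0/1 swinging strike

fit <- loess(sw ~ px + pz, data = pitches, span = 0.5, degree = 2)
head(predict(fit, newdata = pitches))  # fitted swinging strike probabilities
```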

My basic question is: how do you determine the span of your model? A higher span smooths out the fit more, while a lower span captures more local trends but introduces noise if there is too little data. I currently use a higher span for smaller sample sizes and a lower span for larger sample sizes.

What should I do? What's a good rule of thumb when setting span for LOESS regression models in R? Thanks in advance!

Best Answer

Cross-validation, for example k-fold, is often used if the aim is to find the fit with the lowest RMSEP (root mean squared error of prediction). Split your data into k groups and, leaving each group out in turn, fit a loess model using the remaining k-1 groups of data and a chosen value of the smoothing parameter, then use that model to predict for the left-out group. Store the predicted values for the left-out group and repeat until each of the k groups has been left out once. Using the full set of predicted values, compute RMSEP. Then repeat the whole procedure for each value of the smoothing parameter you wish to tune over, and select the smoothing parameter that gives the lowest RMSEP under CV.
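The procedure above can be sketched in R roughly as follows. The data are synthetic, and the column names (px, pz, sw) are assumed; `surface = "direct"` is used so `predict()` can score points outside the convex hull of the training folds:

```r
set.seed(1)
# Synthetic stand-in for one pitch dataset (hypothetical columns px, pz, sw)
n  <- 500
px <- runif(n, -1.5, 1.5)
pz <- runif(n, 1, 4)
sw <- rbinom(n, 1, plogis(-1 + px^2 - abs(pz - 2.5)))
d  <- data.frame(px, pz, sw)

spans <- seq(0.3, 0.9, by = 0.1)  # candidate smoothing parameters
k     <- 10
folds <- sample(rep(1:k, length.out = n))  # random fold assignment

rmsep <- sapply(spans, function(sp) {
  pred <- rep(NA_real_, n)
  for (i in 1:k) {
    # Fit on k-1 folds, predict the held-out fold
    fit <- loess(sw ~ px + pz, data = d[folds != i, ],
                 span = sp, degree = 2,
                 control = loess.control(surface = "direct"))
    pred[folds == i] <- predict(fit, newdata = d[folds == i, ])
  }
  sqrt(mean((d$sw - pred)^2))  # RMSEP over all held-out predictions
})

best_span <- spans[which.min(rmsep)]
```

With 12 datasets, you would run this tuning loop once per dataset, letting each model pick its own span.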

This is, as you can see, fairly computationally heavy. I would be surprised if there weren't a generalised cross-validation (GCV) alternative to true CV that you could use with LOESS - Hastie et al. (section 6.2) indicate this is quite simple to do and is covered in one of their exercises.
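As a cheaper alternative to true CV, a GCV-style score can be computed from a single fit per span, using the smoother trace that `loess()` reports in `fit$trace.hat`. This is a sketch on synthetic data with assumed column names, not a drop-in replacement for proper CV:

```r
set.seed(1)
# Synthetic stand-in data (hypothetical columns px, pz, sw)
n  <- 500
px <- runif(n, -1.5, 1.5)
pz <- runif(n, 1, 4)
sw <- rbinom(n, 1, plogis(-1 + px^2 - abs(pz - 2.5)))
d  <- data.frame(px, pz, sw)

# GCV = n * RSS / (n - tr(S))^2, with tr(S) taken from fit$trace.hat
gcv <- function(fit) fit$n * sum(residuals(fit)^2) / (fit$n - fit$trace.hat)^2

spans  <- seq(0.3, 0.9, by = 0.1)
scores <- sapply(spans, function(sp)
  gcv(loess(sw ~ px + pz, data = d, span = sp, degree = 2)))

best_gcv_span <- spans[which.min(scores)]
```

Because each candidate span requires only one fit on the full data, this scales much better across your 12 datasets than k-fold CV.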

I suggest you read sections 6.1.1, 6.1.2, and 6.2, plus the sections on regularisation of smoothing splines in Chapter 5 (the content applies here too), in Hastie et al. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. Springer. The PDF can be downloaded for free.