Solved – Simple linear regression vs. partial least squares (PLS)

partial least squares, r, stepwise regression

I want to build a predictive model of a spring event based on the winter weather (which varies every year) and the soil characteristics (which are fixed) of many different sites.

Although I started with a fairly large dataset, by the time I averaged the values over each year (to get one "event" value per year) and dropped the sites that didn't fit the right profile, I had only 26 response values in the training set.

Looking at the weather data, there are many ways I can define what happened in the winter: how cold it was (lowest temp, mean minimum temp, frequency of days under -x degrees, etc.), how warm (similar), how wet (mean precipitation, cumulative precipitation, etc.). The soil characteristics are more straightforward, but I can choose the depth from which I draw them.

It was recommended that I start with linear stepwise regression and "play around" with models until I find a good fit. So I chose a few uncorrelated predictor variables that made the most scientific sense, wrote a simple linear model, and stepwise selection kicked almost everything out: one variable explained everything. I also ran the same model through LOOCV and got different results. None of the fits are great, but I don't expect them to be.
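For concreteness, the stepwise-plus-LOOCV workflow above can be sketched as follows. The column names and the simulated data are invented for illustration; this is not my actual dataset.

```r
# Sketch of backward stepwise selection followed by LOOCV, in base R.
# Variable names (mean_min_temp, cum_precip, soil_ph) are hypothetical.
set.seed(1)
n <- 26
d <- data.frame(
  mean_min_temp = rnorm(n),
  cum_precip    = rnorm(n),
  soil_ph       = rnorm(n)
)
d$event <- 0.8 * d$mean_min_temp + rnorm(n, sd = 0.5)

# Backward stepwise selection starting from the full linear model
full <- lm(event ~ ., data = d)
sel  <- step(full, direction = "backward", trace = 0)

# Leave-one-out cross-validation of the selected model:
# refit on n-1 rows, predict the held-out row, collect squared errors
sq_err <- sapply(seq_len(n), function(i) {
  fit <- lm(formula(sel), data = d[-i, ])
  (d$event[i] - predict(fit, newdata = d[i, , drop = FALSE]))^2
})
rmse_loocv <- sqrt(mean(sq_err))
```

Note that because `step()` sees all the data before the LOOCV loop, the selection itself is not cross-validated, which is part of why the in-sample fit and the LOOCV results can disagree.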

If you consider the different weather and soils variables I can come up with, plus transformations and/or interactions of these, plus the order I list them for stepwise regression, I could be "playing around" with models forever.

I started reading Applied Predictive Modeling and came across PLS, and it sounds really good. What I really like is that it handles correlated predictors well. What I am thinking of doing is creating as many predictor variables as I can think of and running them all through PLS to find which variables explain the response the most. Is that really how it works?

Then would I take these top predictor variables and use them in a simple linear regression model that I test on my validation set?

It just feels like there are so many things I could do here, but writing out hundreds of models by hand and testing each one does not seem like one of them.

I am doing all of this in R.

Best Answer

It was recommended that I start with linear stepwise regression and "play around" with models until I find a good fit. So I chose a few uncorrelated predictor variables that made the most scientific sense, wrote a simple linear model, and stepwise selection kicked almost everything out: one variable explained everything.

I think this is a recipe for overfitting. If you are after a predictive model and apply this methodology, you will end up with the variable(s) that explain your training set very well; however, this set of variables is NOT guaranteed to perform well on other data, such as an independent test set that was not used for training. Also, if you try to select variables based on their performance on both the training and validation sets, you will end up overfitting to both.

If you scale your data (mean 0 and standard deviation 1 for each variable) and apply PLS, the resulting beta vector (the vector/matrix of coefficients for each variable) reflects the contribution of the corresponding variable; this logic applies to most regression models. PLS also has the advantage of avoiding overfitting, provided you choose the right number of components (also called latent variables), for example by comparing the RMSEP values obtained with LOOCV for each candidate number of components. You can also compare VIP scores for each variable. Here is an article about that:

https://doi.org/10.1016/j.chemolab.2012.07.010
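In R this workflow maps closely onto the pls package (assumed installed). A minimal sketch with simulated data and invented variable names; `scale = TRUE` standardizes each predictor before fitting, as recommended above:

```r
# PLS regression with LOOCV-based component selection, using the
# pls package. Data and column names are simulated for illustration.
library(pls)
set.seed(2)
n <- 26
X <- matrix(rnorm(n * 10), n, 10)
colnames(X) <- paste0("pred", 1:10)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n, sd = 0.5)
d <- data.frame(event = y, X)

# Fit with up to 8 components; validation = "LOO" computes
# leave-one-out RMSEP for each number of components
fit <- plsr(event ~ ., data = d, ncomp = 8,
            scale = TRUE, validation = "LOO")

# Pick the smallest number of components whose RMSEP is within one
# standard error of the minimum (see also RMSEP(fit) for the full curve)
ncomp_best <- selectNcomp(fit, method = "onesigma", plot = FALSE)

# Coefficients at the chosen size indicate each scaled variable's
# contribution to the prediction
b <- coef(fit, ncomp = max(ncomp_best, 1))
```

A caveat on the questioner's plan: the coefficients (or VIP scores) do rank variables, but re-selecting a subset and refitting a plain linear model partly reintroduces the selection problem; it is often better to keep the PLS model itself and evaluate it once on the validation set.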

There are alternatives, namely ridge and LASSO; the former mainly aims to avoid overfitting, whereas the latter is used for variable selection. PLS can be particularly useful when some combination of the correlated variables carries a meaning, which is usually reflected in the number of components. All in all, I would avoid stepwise regression and stick with one of these methods.
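Both alternatives are available in R through the glmnet package (assumed installed); `alpha = 0` gives ridge and `alpha = 1` gives LASSO. A sketch on simulated data:

```r
# Ridge vs. LASSO with glmnet; lambda is chosen by cross-validation
# (nfolds = n with grouped = FALSE gives leave-one-out CV).
# Data are simulated for illustration only.
library(glmnet)
set.seed(3)
n <- 26
X <- matrix(rnorm(n * 10), n, 10)
colnames(X) <- paste0("pred", 1:10)
y <- X[, 1] + rnorm(n, sd = 0.5)

ridge <- cv.glmnet(X, y, alpha = 0, nfolds = n, grouped = FALSE)
lasso <- cv.glmnet(X, y, alpha = 1, nfolds = n, grouped = FALSE)

# LASSO shrinks some coefficients exactly to zero, i.e. it selects
# variables; ridge keeps all of them but shrinks their magnitudes
beta_lasso <- coef(lasso, s = "lambda.min")[-1]  # drop intercept
selected   <- which(beta_lasso != 0)
```

With only 26 observations and many candidate predictors, any of these methods should still be judged on a held-out set, not on the training fit.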