Solved – Improving a linear regression prediction

model-selection, predictive-models, regression

I'm trying to create a prediction model. I have about 10,000 samples, each comprising about 100 predictor variables and the response variable. At the moment I'm using a method which is apparently known as principal component regression (PCR). It's working OK, but I'm wondering whether we can do better. I'm not a statistician, though, and I'm having trouble keeping afloat in the sea of acronyms. I don't care about what the model looks like; I just want good predictions. Can you help direct me to where I should be looking?

Some information about my data:

  • A scatter plot of each predictor against the response generally gives an ellipse of points (some long and thin, some almost circular). Does this imply that a linear model is the right choice?
  • The predictors are measuring wildly different things, and some are strongly correlated with one another. This is why I've normalised them and run them through a PCA (a rough sketch of what I mean follows this list).
  • Some predictors are probably completely useless.
  • There are errors and outliers in the predictors and the response. I've tried to manually get rid of outliers, but not too aggressively.
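
For concreteness, here's roughly what my normalise-and-PCA step looks like, sketched with scikit-learn. The data here are random placeholders; `X` stands in for my real 10,000 × 100 predictor matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))  # placeholder for my real predictor matrix

# Put predictors measured on wildly different scales onto a common one,
# then extract principal components to absorb the inter-correlations.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA()
scores = pca.fit_transform(X_scaled)  # PC scores, used downstream as regression inputs
print(pca.explained_variance_ratio_[:5])
```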

Best Answer

I think we'd like the subject narrowed down a bit too; same concerns here about our time and effectiveness. Do edit in more detail if you can. That being said, here's a first attempt at narrowing things down for you somewhat. I hate to do this to you, but I'm still going to end up tossing jargon at you that you might have to read into a bit to understand. (Hovering your cursor over our tags may be enough, hopefully!)

With that much data all aimed at predicting one thing, overfitting may pose one of the bigger problems for your predictive model, especially given high multicollinearity among some of your predictors. Using principal component regression (PCR) should be a good way to handle multicollinearity, assuming you or your software exclude(s) the principal components with eigenvalues that are trivially small relative to the total sum of eigenvalues. "What's trivial?" may be a difficult question, but if you're lucky, you'll find natural gaps. Rank order your principal components by eigenvalue, and look for sharp drops in the size of each eigenvalue compared to the next smallest. In a relatively simple scenario, you'd want to use all the principal components before the last big drop in eigenvalue. You'll probably have a lot of relatively high eigenvalues that drop off gradually, but if you're lucky, it'll look something like this:

[Scree plot: three large eigenvalues followed by a sharp drop to a long, flat tail.] That's a relatively clear case for retaining three factors (bifactor analysis aside). Note @Scortchi's comment on this answer though: you'll want to be careful about throwing out PCs that are doing some real predictive work for your model, even if they have really small eigenvalues. A numeric version of this gap-hunting heuristic is sketched below.
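
If it helps, here's one way to hunt for that gap programmatically, sketched in Python with scikit-learn. The data here are random placeholders with no real factor structure, so with your data you'd hope to see a much clearer winner:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))  # placeholder predictors

eigenvalues = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_

# Compare each eigenvalue to the next smallest; a large ratio marks a sharp drop.
ratios = eigenvalues[:-1] / eigenvalues[1:]
k = int(np.argmax(ratios)) + 1
print(f"Biggest drop comes after component {k}; consider retaining the first {k} PCs")

# Or just eyeball a scree plot:
plt.plot(np.arange(1, eigenvalues.size + 1), eigenvalues, marker="o")
plt.xlabel("Component")
plt.ylabel("Eigenvalue")
plt.show()
```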

Linearity is important in fitting a linear model, because curvilinear data generally require nonlinear regression. If, as it sounds, your DV is continuous, and your ellipses aren't curvilinear, discontinuous, pear-shaped, etc., but just smoothly elliptical like variably elongated footballs (in the American sense), you're probably right to go with a linear model, though I don't know that you can really just eyeball this sort of thing. Run some basic regression diagnostics if you know or can figure out how (one sketch follows). If any of your predictors are ordinal or otherwise discrete, PCR probably won't know this, so you'll effectively be using them as approximations of continuous dimensions, which may not be safe, especially if there are fewer than five (approximate rule of thumb) categories, or you don't actually have any reason to expect a normal distribution underlies your system of ordinal categories.
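
To make "basic regression diagnostics" a bit more concrete, here's one minimal sketch using statsmodels; `X` and `y` are random placeholders standing in for your predictors and response:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))                      # placeholder predictors
y = X @ rng.normal(size=100) + rng.normal(size=10_000)  # placeholder response

fit = sm.OLS(y, sm.add_constant(X)).fit()

# Residuals vs. fitted values: curvature hints at nonlinearity,
# a funnel shape hints at non-constant variance.
plt.scatter(fit.fittedvalues, fit.resid, s=2, alpha=0.3)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```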

You may want to throw out the relatively useless predictors, which are conventionally identified by $t$-testing the regression coefficients. If the coefficients don't differ significantly from zero, they may be adding more noise than information about the DV to your model's predictions. Lots of better ideas on how exactly to test which predictors to retain when you've got so many can be found in the discussion "Is adjusting p-values in a multiple regression for multiple comparisons a good idea?" @whuber's suggestion there to hold out some data for model validation is particularly straightforward, and convincing in ways I think you'll find appealing (a minimal sketch follows).
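
Here's a minimal sketch of that hold-out idea, with PCR expressed as a scikit-learn pipeline. The data and the choice of 10 retained components are placeholders to tune, not recommendations:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))                      # placeholder predictors
y = X @ rng.normal(size=100) + rng.normal(size=10_000)  # placeholder response

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# PCR as a pipeline: standardize, keep the first k components, regress on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=10), LinearRegression())
pcr.fit(X_train, y_train)
print("Held-out R^2:", pcr.score(X_test, y_test))  # scored on unseen data only
```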

If you were to care enough about what your model looks like, particularly in terms of how those principal components of your predictors organize themselves, you might consider piecing together a structural equation model (SEM) of your own design. If you could model the latent factor structure of your set of predictors manually and accurately before using the latent factors to predict your DV, you could remove measurement error from the factors in advance of doing the predictive modeling with them, and probably gain a better understanding of your model in the process. This could also let you identify relationships among your model's predictors, depending on how you organize it. I don't suppose you'd be inclined to care about that if you're only interested in prediction at the moment (and I don't mean to assume that you're wrong not to care), but if you ever find you need to explain how you're getting those predictions and why you think they're valid, you might have to revisit a lot of this when/if you do start caring. Therefore a little preemptive caring might be advisable, even if there's really no immediate reason! Then again, maybe you'll have more time later and be better able to afford starting over then, if necessary. Your call, your risk. Happy modeling, and may the trashy fiction be ever the result of someone else's work!
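
If you ever do go the SEM route in Python, semopy is one option (lavaan in R is the classic choice). The two-factor structure below is entirely hypothetical, purely to show the shape of such a model:

```python
import numpy as np
import pandas as pd
import semopy  # pip install semopy

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 7)),
                  columns=["x1", "x2", "x3", "x4", "x5", "x6", "y"])

# lavaan-style model description: two hypothetical latent factors,
# each measured by three observed predictors, jointly predicting y.
desc = """
F1 =~ x1 + x2 + x3
F2 =~ x4 + x5 + x6
y ~ F1 + F2
"""
model = semopy.Model(desc)
model.fit(df)
print(model.inspect())  # loadings, structural paths, and standard errors
```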