Solved – Multiple regression problem: S-shaped residuals

regression

I have posted this question before, but I wanted to give a more thorough explanation of my problem in the hope of getting some assistance. I am also posting it on Talk Stats under the same title.

I have a multiple regression model with over 10,000 records. Although the model has an R^2 of 91%, the lack-of-fit tests and a slightly S-shaped normal probability plot of the residuals suggest curvature in the data. My objective is to identify and correct/transform the curvature so that the model fits appropriately.

I have tried every method I have researched that could possibly help, though it is possible that I still missed something. Here is what I have done:

1) Added quadratic and cubic terms of each x. This only seemed to exacerbate the curvature. I also tried adding these terms for one x at a time, and the results were the same.

2) Used the natural log of y and of all the x's. This again exacerbated the S-shaped residual plot.

3) I have 3 categorical variables and, thinking that variance between their levels was an issue, I weighted each category by 1/(that category's variance). This also yielded no benefit to the residual plot or the fit of the model.
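For completeness, the 1/variance weighting idea in (3) can be sketched in a few lines of Python. The store names and numbers below are made up for illustration; they are not my actual data:

```python
# Sketch of the 1/variance weighting from attempt (3): each category's
# weight is the reciprocal of that category's variance, so noisier
# categories get downweighted. Values are illustrative only.
from statistics import pvariance

by_store = {
    "store_A": [3.0, -2.0, 1.5, -1.0],      # small spread -> large weight
    "store_B": [40.0, -35.0, 20.0, -25.0],  # large spread -> small weight
}
weights = {store: 1.0 / pvariance(vals) for store, vals in by_store.items()}
```

These weights would then be supplied to a weighted-least-squares fit (in Minitab, the Weights field in the regression dialog).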

UPDATED PER MICHELLE'S REQUEST:
–I am trying to predict sales volume from the amount by which our price is above or below our main competitor's. I am in a high-volume, commoditized industry, so I expect to be able to get good results.

–My response is sales volume (y), and my predictors are the 4-week average of sales volume for that same day and half-hour period (x1) and the amount in cents by which we are above or below our main competitor (x2). The categorical variable is store, and I have modeled both with and without it.

— Here are my model results:

The regression equation is

Gallons = 4.19 - 293 MUSAPriceMinusLowKey + 0.983 MUSA4wkgal

Predictor                 Coef   SE Coef       T      P
Constant                4.1932    0.7771    5.40  0.000
MUSAPriceMinusLowKey   -292.75     13.07  -22.41  0.000
MUSA4wkgal            0.982941  0.002580  380.92  0.000

S = 47.1163   R-Sq = 91.0%   R-Sq(adj) = 91.0%

Analysis of Variance
Source             DF         SS         MS         F      P
Regression          2  323496065  161748033  72861.37  0.000
Residual Error  14352   31860609       2220
Total           14354  355356674

Source                DF     Seq SS
MUSAPriceMinusLowKey   1    1380837
MUSA4wkgal             1  322115228

Source                DF   Seq SS
 PriceMinusLowKey      1   177122
 4wkgal                1  8611828

Lack of fit test

Possible curvature in variable MUSAPric  (P-Value = 0.000 )
Possible interaction in variable MUSA4wkg  (P-Value = 0.000 )

Overall lack of fit test is significant at P = 0.000
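(Not part of the Minitab output above.) For anyone who wants to replicate a two-predictor fit like this outside Minitab, here is a minimal pure-Python sketch that solves the normal equations; the data are synthetic and exact, purely for illustration, not my actual MUSA data:

```python
# Minimal OLS via the normal equations (X'X)b = X'y, solved by
# Gauss-Jordan elimination. Illustrative only; real software (Minitab,
# R's lm(), etc.) uses numerically safer decompositions.

def ols(xs, y):
    """Return [b0, b1, ...] for y = b0 + b1*x1 + ... by least squares."""
    n = len(y)
    X = [[1.0] + list(row) for row in xs]          # prepend intercept column
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    A = [XtX[a][:] + [Xty[a]] for a in range(p)]   # augmented matrix
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]            # partial pivoting
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [A[r][k] - f * A[col][k] for k in range(p + 1)]
    return [A[a][p] / A[a][a] for a in range(p)]

# Exact synthetic data: y = 2 + 3*x1 - 1*x2, so the fit recovers (2, 3, -1)
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3)]
y = [2 + 3 * a - b for a, b in xs]
coefs = ols(xs, y)
```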

So I am at my wit's end now and am not sure what to do next. Is it possible that an S-shaped residual plot is common for a dataset with this many observations and is not a concern? Thanks in advance for any help you can give.

Best Answer

Assuming a fixed-effects model, with that number of records I would be amazed if you had a "textbook" normal probability plot. In my experience, analyses of large datasets, certainly once they reach around 10K observations, have these types of issues. Have you:

  1. tested for outliers or influential points, e.g. with Cook's d?
  2. compared the results of your various regressions with a measure like AIC or BIC? If you're using R, you can use anova() to compare nested models (if you've saved each model as an object) to see which one is best.
  3. checked for heteroscedasticity in the residuals by plotting them against the fitted values?
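To illustrate point 1, Cook's distance can be computed by hand for a simple one-predictor regression, where the leverage has a closed form. Your model has two predictors, so this is a simplified sketch with made-up data, not your actual analysis:

```python
# Cook's distance for simple linear regression y = b0 + b1*x.
# D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2, with leverage
# h_i = 1/n + (x_i - xbar)^2 / Sxx. Data are illustrative only.

def cooks_distance(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    p = 2                                    # parameters: intercept + slope
    mse = sum(e * e for e in resid) / (n - p)
    out = []
    for xi, ei in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / sxx   # leverage of point i
        out.append(ei * ei / (p * mse) * h / (1 - h) ** 2)
    return out

d = cooks_distance([1, 2, 3, 4, 5], [1, 2, 3, 4, 10])  # last y is an outlier
```

Points whose D stands well above their peers (rules of thumb vary; some people use D > 4/n or D > 1) deserve a closer look.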

This is a little difficult to answer without actually seeing your normal probability plot, and the rest of the information from the points I list above.

Update based on comments below: Sometimes you just can't get a better result from your data. I would simply include the residual plot in your report/presentation, along with the other information, so that people can judge your results for themselves. You've put some effort into improving the model; you should mention this, and the impact it had, in your report/presentation as well. On your question about a better approach: one analyses the data one has in order to test a hypothesis. You say you have months of data; does that mean you have a time series, in which case would a time-series analysis be more appropriate than a regression?

Can you update your question with:

  1. the research question you have
  2. what your outcome/dependent variable is
  3. what your data are
  4. the model that you analysed (you can just copy and paste out of your model statement in your software)

2nd update: More questions/thoughts:

  1. you've used an average sales volume for a half-hour period; is your dependent variable (gallons) for that matched half-hour period on the day of interest? So each row represents half-hour gallons, linked to the preceding 4-week average for that day/half-hour?
  2. for the variance in price from competitors, does that relate to the half-hour gallons, or have you done a 4-week average for that as well?
  3. given that day of the week probably influences sales, have you thought of adding either a "weekend" indicator or a "day of week" set of 6 indicators to your model? Because you've only got a 4-week average, you shouldn't need to worry too much about seasonality.

3rd update: I had assumed a volume product like petrol for point 3 above, but maybe you are looking at beverages, in which case there may be a different "day" or "time of day" effect, e.g. lunch-time sales or "movies" sales.

Also, if your gallons measure is sequential, there may be autocorrelation, which I don't think you would notice in any plot because you simply have so much data; most plots I have seen demonstrating this for students have 100, or even only 10, points so that people can see the autocorrelation easily. As noted at the end of that page, you should run the Durbin-Watson test for autocorrelation. The only situation I can think of where autocorrelation is less likely is if your gallons dates and times were randomly sampled, because that should break any autocorrelation effect.
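The Durbin-Watson statistic itself is trivial to compute from the time-ordered residuals; a minimal sketch with illustrative residual values (a value near 2 suggests no autocorrelation, near 0 positive, near 4 negative):

```python
# Durbin-Watson statistic: DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2,
# computed over residuals in time order. Residuals below are made up.

def durbin_watson(resid):
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den
```

Formal significance still needs the Durbin-Watson tables (or software), since the critical values depend on n and the number of predictors.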

Sorry to hit you with another test, but the computer will calculate it really fast.
