I have posted this question before but wanted to give a more thorough explanation of my problem so as to hopefully garner some assistance. I am also posting it on Talk Stats under the same title.
I have a multiple regression model with over 10,000 records, and though the model has an R^2 of 91%, the lack-of-fit tests and the slightly S-shaped normal probability plot of the residuals suggest curvature in the data. My objective here is to identify the curvature and correct or transform it so that the model fits appropriately.
I have tried every method I have researched that could possibly help, though it is possible that I still missed something. Here is what I have done:
1) Added quadratic and cubic terms of each x. This only seemed to exacerbate the curvature. I also tried adding these terms for only one x at a time, and the results were the same.
2) Used the natural log of y and all x's. This again exacerbated the S shaped residuals plot.
3) I have 3 categorical variables and, thinking that differing variance between categories was an issue, I weighted each category by 1 / (that category's variance). This also yielded no benefit to the residual plot or the fit of the model.
UPDATED PER MICHELLE'S REQUEST:
–I am trying to predict sales volume from the amount by which our price is above or below that of our main competitor. I am in a high-volume, commoditized industry, so I expect to be able to get some good results.
–My response is sales volume (y) and my predictors are the 4 week average of the sales volume for that same day and half-hour period (x1) and the amount we are above or below (in cents) our main competitor (x2). Also the categorical variable is store and I have modeled both using this as a variable and without.
–Here are my model results:
The regression equation is
Gallons = 4.19 - 293 MUSAPriceMinusLowKey + 0.983 MUSA4wkgal
Predictor                 Coef   SE Coef       T      P
Constant                4.1932    0.7771    5.40  0.000
MUSAPriceMinusLowKey   -292.75     13.07  -22.41  0.000
MUSA4wkgal            0.982941  0.002580  380.92  0.000
S = 47.1163 R-Sq = 91.0% R-Sq(adj) = 91.0%
Analysis of Variance
Source             DF         SS         MS         F      P
Regression          2  323496065  161748033  72861.37  0.000
Residual Error  14352   31860609       2220
Total           14354  355356674
Source                DF     Seq SS
MUSAPriceMinusLowKey   1    1380837
MUSA4wkgal             1  322115228
Source                DF     Seq SS
PriceMinusLowKey       1     177122
4wkgal                 1    8611828
Lack of fit test
Possible curvature in variable MUSAPric (P-Value = 0.000 )
Possible interaction in variable MUSA4wkg (P-Value = 0.000 )
Overall lack of fit test is significant at P = 0.000
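As a quick sanity check on the output above, the summary figures can be reproduced from the ANOVA table alone. This is a minimal Python sketch: the variable names are illustrative, and the numbers are copied from the regression, residual, and total rows of the table.

```python
# Figures copied from the Analysis of Variance table above.
ss_regression = 323496065
ss_residual = 31860609
ss_total = 355356674
df_regression = 2
df_residual = 14352

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

r_squared = ss_regression / ss_total   # share of total variation explained
s = ms_residual ** 0.5                 # residual standard error (Minitab's S)
f_stat = ms_regression / ms_residual   # overall F statistic

print(f"R-Sq = {r_squared:.1%}")
print(f"S    = {s:.4f}")
print(f"F    = {f_stat:.2f}")
```

The three values agree with the reported R-Sq, S, and F, so the table is internally consistent. That also shows the lack-of-fit message comes from the shape of the residuals rather than from anything odd in the fit statistics themselves.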
So I am at my wit's end now and am not sure what to do next. Is it possible that the S shape residual plot is common for a dataset with so many observations and it is not a concern? Thanks in advance for any help you can give.
Best Answer
Assuming a fixed effects model, with that number of records I would be amazed if you had a "textbook" normal probability plot. In my experience, analyses of large datasets, certainly once they hit around 10K observations, have these types of issues. Have you:
anova() to compare the models (if you've saved the models as objects) to see which one is best.

This is a little difficult to answer without actually seeing your normal probability plot, and the rest of the information from the points I list above.
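To make the model-comparison idea concrete: for two nested least-squares models, a comparison of this kind boils down to an F statistic on the drop in residual sum of squares. Below is a minimal Python sketch of that textbook formula, not the implementation of any particular `anova()` function. The name `nested_f` is made up for illustration, and the numbers are taken from the output in the question, where the sequential SS lets us recover the residual SS of the price-only model.

```python
def nested_f(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic for a reduced model against a full model that nests it.

    rss_* are residual sums of squares, df_* are residual degrees of freedom.
    """
    extra_ss = rss_reduced - rss_full   # variation explained by the added term(s)
    extra_df = df_reduced - df_full     # number of added parameters
    return (extra_ss / extra_df) / (rss_full / df_full)


# Values from the question's output.
ss_total = 355356674
seq_ss_price = 1380837      # Seq SS for MUSAPriceMinusLowKey (entered first)
rss_full = 31860609         # residual SS with both predictors
df_full = 14352

# Residual SS of the model with the price difference only.
rss_reduced = ss_total - seq_ss_price
df_reduced = df_full + 1

f_added = nested_f(rss_reduced, df_reduced, rss_full, df_full)
print(f"F for adding MUSA4wkgal: {f_added:.1f}")
print(f"sqrt(F): {f_added ** 0.5:.2f}")
```

When a single term is added, the square root of this F equals the absolute t value of that term in the full model, which is why it lines up with the T reported for MUSA4wkgal. The same function applies to comparing the linear model with one that adds quadratic or cubic terms, given each model's residual SS and degrees of freedom.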
Update based on comments below: Sometimes you just can't get a better result from your data. I would simply include the residuals plot with your report/presentation as well as the other information so that people can judge your results for themselves. You've put some effort into improving the model - you should mention this and the impacts it had in your report/presentation as well. On your question about a better approach, one analyses the data one has in order to test a hypothesis. You say you have months of data, does that mean you have a time series, in which case would a time series analysis be more appropriate than a regression?
Can you update your question with:
2nd update: More questions/thoughts:
gallons) for that matched half hour period on the day of interest? So you've got rows that all represent half-hour gallons, which you've linked to the preceding 4 week average for that day/half hour?
gallons or have you done a 4 week average for that?

3rd update: I had assumed a volume product like petrol for point 3 above, but maybe you are looking at beverages, so there may be a different "day" or "time of day" effect for that, e.g. lunch time sales, "movies" sales.
Also, if your gallons measure is sequential, there may be autocorrelation, which I don't think you would notice in any plots because you simply have so much data; most plots I have seen demonstrating this for students have 100, or even only 10, points so that people can see the autocorrelation easily. As noted at the end of that page, you should do the Durbin-Watson test for autocorrelation. The only situation I can think of where autocorrelation in gallons has a lower probability of occurring is if your gallons dates and times have been randomly sampled, because this should break any autocorrelation effect.

Sorry to hit you with another test, but the computer will calculate it really fast.
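For reference, the Durbin-Watson statistic is simple enough to compute by hand from the residuals. Here is a minimal Python sketch; the function name and the example residuals are made up for illustration, and it assumes the residuals are listed in time order (with several stores, that would presumably mean ordering within each store).

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order.

    Values near 2 suggest no first-order autocorrelation, values toward 0
    suggest positive autocorrelation, and values toward 4 suggest negative.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Sum of squared differences between consecutive residuals.
    numerator = sum((curr - prev) ** 2
                    for prev, curr in zip(residuals, residuals[1:]))
    # Sum of squared residuals.
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator


# Made-up residuals, for illustration only.
smooth = [1.0, 0.8, 0.6, 0.3, -0.2, -0.5, -0.8, -1.0]  # neighbours alike
zigzag = [1.0, -1.0, 1.0, -1.0]                         # neighbours opposite

print(durbin_watson(smooth))   # well below 2: positive autocorrelation
print(durbin_watson(zigzag))   # above 2: negative autocorrelation
```

Most statistics packages report this statistic directly, so in practice you would request it from the regression output rather than code it yourself.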