Solved – Heteroskedasticity – residual plot interpretation

heteroscedasticityregressionresiduals

I am plotting a residual plot to test for heteroskedasticity. The Breusch-Pagan test is significant and therefore I am suspecting there is evidence on heteroskedasticity. The question is:

(a) How can I interpret such graph? I know some points seem to be on top of each other etc. is it as simple as that?

(b) Also the data has a lot of binary dummy variables. Could this be the reason for heteroskedasticity?

Residual plot

Best Answer

Concerning heteroscedasticity, you are interested in understanding how the vertical spread of the points varies with the fitted values. To do this, you must slice the plot into thin vertical sections, find the central elevation (y-value) in each section, evaluate the spread around that central value, then connect everything up. Here are some possible slices:

Ordinarily this would be done using robust estimates of location and spread, such as a median and interquartile range. If we had the data, we might generate a wandering schematic plot. Data are difficult to extract numerically from a graphical image with overplotted points. However, in this case the vertical spreads tend to be compact, symmetrical, and without outliers, so we are safe using means and standard deviations instead--and these are easily computed using image processing software. In fact, what I did was to smear the dots horizontally and then compute the mean and variance of their locations for every vertical column of pixels in the image. (This processing will be a little inaccurate due to overplotting of some points, but it's not likely to bias the relative SDs much.)

There is a definite wedge shape to the smeared points, narrowing from left to right. (Squinting at a graphic can sometimes help bring out such an overall gestalt impression of a scatterplot, provided it has many points.)

The mean (shown below in blue) and the mean plus or minus a suitable multiple of the square root of the variance (in red and gold) will trace out the location and typical limits of the residuals.

I chose a multiple designed to place about 5% of the points above the upper trace and another 5% below the lower trace.

With practice you can see such traces by closely examining the plot itself--no calculations are necessary. Scanning across from left to right, estimate the middle of each vertical column of dots. Estimate their spread. Inflate your estimates of spread a little where there are relatively fewer dots--they haven't had a chance to show the full amount of their dispersion. At the same time, discount your estimates (that is, don't give them much credence) in areas where there are very few dots, because your estimates are highly uncertain there.

Look for clear consistent patterns of changes in spread. In the preceding figure, the upper trace (red) and lower trace (gold) appear to draw a little closer together from left to right, as the fitted value increases. This can be made more apparent by plotting the standard deviation. The units don't matter, but the vertical axis should start at zero to give an accurate rendition of relative sizes of the spreads:

This confirms the initial impression of a decreasing SD with increasing fitted value. Overall, the SD is halved as we scan from left to right. (The slight upward increase at the very right can be discounted since it is associated with few data points.) This is a classic form of heteroscedasticity: the spread changes systematically with the fitted value.

The use of dummy variables in a multiple regression will not introduce heteroscedasticity. Often it will reduce it, by resolving overlapping groups of residuals into separate ones.

Whether heteroscedasticity is actually a problem depends on the purpose of the analysis, the regression method employed, what information is being extracted from the results, and the nature of the data.

Best Answer

Related Solutions

Solved – Detecting heteroscedasticity – Can I use Breusch-Pagan Test on binary logistic regression

Solved – Autocorrelation and heteroskedasticity in time series data

Related Question