Solved – Linear regression with unbalanced dummy variables + not normally distributed residuals

categorical data, normal distribution, regression, spss, unbalanced-classes

I am conducting a multiple linear regression analysis in SPSS.

My DV is a score between 0 and 6, and my predictors are:

  • one dichotomous nominal variable (native vs. non-native speakers)
  • one continuous variable (%)
  • one nominal variable with 5 levels (five different countries of origin) coded as dummy variables.

The frequencies for three of the dummy variables are below the recommended 15% of the whole sample (namely 12.5%, 5% and 5%). Is this a serious problem? Can I correct for it with some procedure?

Moreover, judging from my P-P plot / residuals histogram:
[Figure: residuals histogram versus normal]

I think the residuals of my data are not normally distributed. Can I correct for this with some procedure?

Best Answer

As your DV can only take integer values it isn't really a continuous variable, so it's not surprising that your plot of residuals has a set of peaks and valleys around an underlying normal curve. Technically such data are best handled by ordinal regression, which allows for an ordered set of discrete categories in the DV.

Depending on the stage of your study, you may be able to learn enough from your multiple regression as you have performed it. You would typically want more categories than this before treating a set of integer values as a continuous DV, but the residuals don't seem to be skewed, and if they also don't depend on the predicted values then you might be doing well enough. You will have to be careful to acknowledge that your data might not strictly meet the assumptions behind the standard interpretation of p-values.
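The two checks mentioned here (skewness of the residuals and whether they depend on the predicted values) can be summarized numerically as well as graphically. The sketch below is one possible way to do that with NumPy only; the helper names `fit_ols` and `residual_checks` and the simulated data are hypothetical, and the numbers are meant to be read alongside the plots, not to replace them.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients, fitted values, residuals."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    return beta, fitted, y - fitted

def residual_checks(fitted, resid):
    """Two rough summaries: residual skewness and a trend in residual spread."""
    z = (resid - resid.mean()) / resid.std()
    skewness = float(np.mean(z ** 3))
    # Correlation of |residual| with the fitted value. A value clearly away
    # from zero hints that the spread changes with the prediction.
    spread_trend = float(np.corrcoef(np.abs(resid), fitted)[0, 1])
    return {"skewness": skewness, "spread_trend": spread_trend}

# Hypothetical data: an integer score from 0 to 6 (assumption, not the asker's data).
rng = np.random.default_rng(42)
n = 200
native = rng.integers(0, 2, n).astype(float)
pct = rng.uniform(0, 100, n)
raw = 1.5 + 1.0 * native + 0.02 * pct + rng.normal(0, 1, n)
score = np.clip(np.rint(raw), 0, 6)

beta, fitted, resid = fit_ols(np.column_stack([native, pct]), score)
checks = residual_checks(fitted, resid)
print(checks)
```

Note that the plain correlation between residuals and fitted values is zero by construction in least squares, which is why the sketch looks at the absolute residuals instead.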

With respect to the imbalanced distribution among levels of your nominal independent variable, one problem is that you will have less precise estimates of the effects of the levels that have the smaller numbers of cases. The estimates of the effects of the different countries of origin will also be correlated. Furthermore, the imbalance can complicate some types of ANOVA tests in which you try to apportion variance between, say, the native/non-native and country-of-origin predictors. Finally, it will be hard to be sure that your results will generalize well to other samples. Short of obtaining more data there isn't much you can do about the imbalance. You might consider bootstrapping to get a bit more confidence in the generalizability of your model than the initial multiple regression can provide on its own.
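To make the bootstrapping suggestion concrete, here is a minimal sketch of one common variant: resampling whole cases with replacement and taking percentile intervals for each coefficient. It uses NumPy only. The data are simulated with group shares chosen to mimic the question (50%, 27.5%, 12.5%, 5%, 5%); the country labels A to E, the choice of A as reference level, the effect sizes, and the helper names are all hypothetical.

```python
import numpy as np

def dummy_code(levels, reference):
    """Return 0/1 indicator columns for every level except the reference."""
    cats = [c for c in sorted(set(levels)) if c != reference]
    cols = np.column_stack([(levels == c).astype(float) for c in cats])
    return cols, cats

def bootstrap_coefs(design, y, n_boot=1000, seed=0):
    """Case-resampling bootstrap of least-squares coefficients."""
    rng = np.random.default_rng(seed)
    n, p = design.shape
    draws = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)
        # Redraw if a sparse level dropped out and the design lost rank.
        while np.linalg.matrix_rank(design[idx]) < p:
            idx = rng.integers(0, n, n)
        draws[b], *_ = np.linalg.lstsq(design[idx], y[idx], rcond=None)
    return draws

# Hypothetical sample of 200 with unbalanced countries (assumption).
rng = np.random.default_rng(1)
country = np.repeat(np.array(["A", "B", "C", "D", "E"]), [100, 55, 25, 10, 10])
n = len(country)
native = rng.integers(0, 2, n).astype(float)
pct = rng.uniform(0, 100, n)
dummies, cats = dummy_code(country, reference="A")
raw = 1.5 + 1.0 * native + 0.02 * pct + rng.normal(0, 1, n)
score = np.clip(np.rint(raw), 0, 6)

names = ["intercept", "native", "pct"] + cats
design = np.column_stack([np.ones(n), native, pct, dummies])
draws = bootstrap_coefs(design, score)

# 95% percentile intervals; expect wider ones for the small groups.
lower, upper = np.percentile(draws, [2.5, 97.5], axis=0)
width = dict(zip(names, upper - lower))
for name in names:
    print(f"{name:>9}: width of 95% interval = {width[name]:.2f}")
```

The intervals for the sparsely populated countries should come out noticeably wider than for the larger groups, which is the imprecision described above. The bootstrap describes that uncertainty; it does not remove it.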
