Solved – Linear regression and ordinal data

Tags: likert, ordinal-data, regression

I am running a multiple regression looking at whether TV viewing predicts waist circumference (WC). When I ran through the tests with my tutor, we placed WC as the dependent variable and TV as the independent variable, then ran it again with some of the potential confounders. However, the TV variable is ordinal (1 = never, 2 = 0–59 mins, 3 = 1–2 hours, 4 = 2–3 hours, 5 = 3–4 hours, 6 = 4–5 hours, 7 = 5+ hours). Should I be recoding these as dummy variables? I also plan to run regressions on similarly collected ordinal data for snack food consumption. I'm hesitant to do it differently from my tutor (who is away for 2 weeks), and I worry it may mean too many dummy variables, but is it still OK to interpret the results as I would a continuous dataset (in terms of mins/hours)?
Many thanks, Nick.

Best Answer

Entering your 7 levels as dummy variables would be more appropriate to the ordinal level of measurement, but there are a few caveats to consider:

  1. This would mean estimating more parameters, which means fewer degrees of freedom and more risk of overfitting. These risks depend partly on your sample size. If you need to add other predictors and their interactions, that could compound these problems.
  2. It would complicate interpretation and testing of any simple linear or curvilinear hypotheses you might have in mind for the relationship between TV and WC.
  3. Your data aren't as messy as ordinal data can be. It seems you have a continuous time variable that may be zero-inflated (if there are plenty of "never"s) and is right-censored (if time > 5h, you have no information on how much greater, right?) and binned. The bins are evenly spaced though, and you have at least five levels that correspond neatly to a continuum, so you may not lose much of the information you would otherwise have with continuous measurements. Bollen and Barb (1981) found that pentachotomizing two continuous variables (i.e., splitting them into five groups, and yeah, I just made up that word) attenuated their simple correlation by about 9%, which isn't terrible compared to binning with fewer levels. If your WC measurements are continuous, then treating just this one binned variable as a continuous predictor might work okay.
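To make the two coding options concrete, here's a minimal numpy sketch (with simulated data, not anything from the question) comparing a continuous coding of the 7-level TV variable against dummy coding with "never" as the reference level:

```python
import numpy as np

# Simulated data: 7-level ordinal TV variable, WC increasing with TV.
# All numbers here are illustrative, not from the question.
rng = np.random.default_rng(0)
n = 300
tv = rng.integers(1, 8, size=n)            # ordinal codes 1..7
wc = 80 + 1.5 * tv + rng.normal(0, 5, n)   # monotonic "truth" plus noise

# (a) Continuous coding: intercept + one slope parameter.
X_cont = np.column_stack([np.ones(n), tv])
beta_cont, *_ = np.linalg.lstsq(X_cont, wc, rcond=None)

# (b) Dummy coding: level 1 ("never") as reference, 6 dummy columns.
X_dum = np.column_stack(
    [np.ones(n)] + [(tv == k).astype(float) for k in range(2, 8)]
)
beta_dum, *_ = np.linalg.lstsq(X_dum, wc, rcond=None)

print("continuous slope:", beta_cont[1])
print("dummy coefficients (levels 2-7 vs. never):", beta_dum[1:])
```

The continuous version spends 1 degree of freedom on TV where the dummy version spends 6, which is the trade-off described in point 1 above.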

Overfitting can be resisted and detection of a simple trend facilitated somewhat by using penalized regression to smooth the dummy coefficients (i.e., reducing differences between adjacent levels of TV). E.g., if the regression coefficients for the dummy variables corresponding to levels {3,4,5} are {.2,.5,.3} in an OLS model, they might be {.25,.43,.315} in a penalized regression model. If the reference level really is more different from level 4 than from levels 3 and 5, smoothing might not improve your predictions, but if the relationship between TV and WC in the population is actually monotonic, predictions would improve with reduction (if not elimination) of the spurious bump in the trend at level 4 (in this example). I always suggest Gertheiss and Tutz (2009) for an overview of penalized regression.
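A minimal sketch of that idea: ridge-penalize the squared differences between adjacent dummy coefficients (with the reference level counting as 0), implemented via an augmented least-squares system. The simulated data and the penalty weight `lam` are illustrative assumptions, not values from Gertheiss and Tutz:

```python
import numpy as np

# Simulated data, as before: illustrative only.
rng = np.random.default_rng(1)
n = 200
tv = rng.integers(1, 8, size=n)
wc = 80 + 1.5 * tv + rng.normal(0, 5, n)

# Design: intercept + 6 dummies (level 1 = reference).
X = np.column_stack(
    [np.ones(n)] + [(tv == k).astype(float) for k in range(2, 8)]
)
y = wc

# Difference-penalty matrix over the 6 dummy coefficients. The
# reference effect is fixed at 0, so the first row penalizes the
# level-2 effect itself; later rows penalize adjacent differences.
D = np.zeros((6, 7))
for j in range(6):
    D[j, j + 1] = 1.0
    if j > 0:
        D[j, j] = -1.0
lam = 10.0  # penalty weight; would be chosen by cross-validation

# Augmented system solves min ||y - Xb||^2 + lam * ||Db||^2.
X_aug = np.vstack([X, np.sqrt(lam) * D])
y_aug = np.concatenate([y, np.zeros(6)])
beta_pen, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("OLS dummy coefficients:      ", np.round(beta_ols[1:], 2))
print("penalized dummy coefficients:", np.round(beta_pen[1:], 2))
```

The penalized fit trades a little in-sample fit for smaller jumps between adjacent level effects, which is exactly the smoothing of the dummy coefficients described above.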


References
Bollen, K. A., & Barb, K. H. (1981). Pearson's $r$ and coarsely categorized measures. American Sociological Review, 46, 232–239. Retrieved from http://www.statpt.com/correlation/bollen_barb_1981.pdf.
Gertheiss, J., & Tutz, G. (2009). Penalized regression with ordinal predictors. International Statistical Review, 77(3), 345–365. Retrieved from http://epub.ub.uni-muenchen.de/2100/1/tr015.pdf.
