Solved – Linear regression and ordinal data

Tags: likert, ordinal-data, regression

I am running a multiple regression looking at whether TV viewing predicts waist circumference (WC). When I ran through the tests with my tutor, we placed WC as the dependent variable and TV as the independent variable, then ran it again with some of the potential confounders. However, the TV variable is ordinal (1 = never, 2 = 0–59 mins, 3 = 1–2 hours, 4 = 2–3 hours, 5 = 3–4 hours, 6 = 4–5 hours, 7 = 5+ hours). Should I be recoding these as dummy variables? I also plan to run regressions on similarly collected ordinal data for snack food consumption. I'm hesitant to do it differently from my tutor (who is away for 2 weeks), and I worry it may mean too many dummy variables, but is it still OK to interpret the results as I would a continuous dataset (in terms of mins/hours)?
Many thanks, Nick.

Best Answer

Entering your 7 levels as dummy variables would be more appropriate to the ordinal level of measurement, but there are a few caveats to consider:

  1. This would mean estimating more parameters, which means fewer degrees of freedom and more risk of overfitting. These risks depend partly on your sample size. If you need to add other predictors and their interactions, that could compound these problems.
  2. It would complicate interpretation and testing of any simple linear or curvilinear hypotheses you might have in mind for the relationship between TV and WC.
  3. Your data aren't as messy as ordinal data can be. It seems you have a continuous time variable that may be zero-inflated (if there are plenty of "never"s) and is right-censored (if time > 5h, you have no information on how much greater, right?) and binned. The bins are evenly spaced though, and you have at least five levels that correspond neatly to a continuum, so you may not lose much of the information you would otherwise have with continuous measurements. Bollen and Barb (1981) found that pentachotomizing two continuous variables (i.e., splitting them into five groups, and yeah, I just made up that word) attenuated their simple correlation by about 9%, which isn't terrible compared to binning with fewer levels. If your WC measurements are continuous, then treating just this one binned variable as a continuous predictor might work okay.
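To make the two coding options concrete, here's a minimal numpy sketch (with simulated data, not anything from the question) comparing a continuous coding of the 7-level TV variable against dummy coding with "never" as the reference level:

```python
import numpy as np

# Simulated data: 7-level ordinal TV variable, WC increasing with TV.
# All numbers here are illustrative, not from the question.
rng = np.random.default_rng(0)
n = 300
tv = rng.integers(1, 8, size=n)            # ordinal codes 1..7
wc = 80 + 1.5 * tv + rng.normal(0, 5, n)   # monotonic "truth" plus noise

# (a) Continuous coding: intercept + one slope parameter.
X_cont = np.column_stack([np.ones(n), tv])
beta_cont, *_ = np.linalg.lstsq(X_cont, wc, rcond=None)

# (b) Dummy coding: level 1 ("never") as reference, 6 dummy columns.
X_dum = np.column_stack(
    [np.ones(n)] + [(tv == k).astype(float) for k in range(2, 8)]
)
beta_dum, *_ = np.linalg.lstsq(X_dum, wc, rcond=None)

print("continuous slope:", beta_cont[1])
print("dummy coefficients (levels 2-7 vs. never):", beta_dum[1:])
```

The continuous version spends 1 degree of freedom on TV where the dummy version spends 6, which is the trade-off described in point 1 above.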

Overfitting can be resisted and detection of a simple trend facilitated somewhat by using penalized regression to smooth the dummy coefficients (i.e., reducing differences between adjacent levels of TV). E.g., if the regression coefficients for the dummy variables corresponding to levels {3,4,5} are {.2,.5,.3} in an OLS model, they might be {.25,.43,.315} in a penalized regression model. If the reference level really is more different from level 4 than from levels 3 and 5, smoothing might not improve your predictions, but if the relationship between TV and WC in the population is actually monotonic, predictions would improve with reduction (if not elimination) of the spurious bump in the trend at level 4 (in this example). I always suggest Gertheiss and Tutz (2009) for an overview of penalized regression.
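A minimal sketch of that idea: ridge-penalize the squared differences between adjacent dummy coefficients (with the reference level counting as 0), implemented via an augmented least-squares system. The simulated data and the penalty weight `lam` are illustrative assumptions, not values from Gertheiss and Tutz:

```python
import numpy as np

# Simulated data, as before: illustrative only.
rng = np.random.default_rng(1)
n = 200
tv = rng.integers(1, 8, size=n)
wc = 80 + 1.5 * tv + rng.normal(0, 5, n)

# Design: intercept + 6 dummies (level 1 = reference).
X = np.column_stack(
    [np.ones(n)] + [(tv == k).astype(float) for k in range(2, 8)]
)
y = wc

# Difference-penalty matrix over the 6 dummy coefficients. The
# reference effect is fixed at 0, so the first row penalizes the
# level-2 effect itself; later rows penalize adjacent differences.
D = np.zeros((6, 7))
for j in range(6):
    D[j, j + 1] = 1.0
    if j > 0:
        D[j, j] = -1.0
lam = 10.0  # penalty weight; would be chosen by cross-validation

# Augmented system solves min ||y - Xb||^2 + lam * ||Db||^2.
X_aug = np.vstack([X, np.sqrt(lam) * D])
y_aug = np.concatenate([y, np.zeros(6)])
beta_pen, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("OLS dummy coefficients:      ", np.round(beta_ols[1:], 2))
print("penalized dummy coefficients:", np.round(beta_pen[1:], 2))
```

The penalized fit trades a little in-sample fit for smaller jumps between adjacent level effects, which is exactly the smoothing of the dummy coefficients described above.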


References
Bollen, K. A., & Barb, K. H. (1981). Pearson's $r$ and coarsely categorized measures. American Sociological Review, 46, 232–239. Retrieved from http://www.statpt.com/correlation/bollen_barb_1981.pdf.
Gertheiss, J., & Tutz, G. (2009). Penalized regression with ordinal predictors. International Statistical Review, 77(3), 345–365. Retrieved from http://epub.ub.uni-muenchen.de/2100/1/tr015.pdf.
