I am trying to build a regression model with 25 independent variables (predictors), all of which are 5-point Likert items, and 1 dependent variable, which is the mean score of a 7-point Likert scale (an aggregated score). I need to select the best predictors from these 25 (variable selection). What type of regression should I use, linear or ordinal?
Solved – What type of regression analysis should I use for data generated by using Likert-scale question items
likert, modeling, ordinal-data, regression
Related Solutions
Your decision appears inconsistent. By summing 6 rating items into a total score, you stop treating those six items as ordinal scales and treat them as interval ones instead. On the other hand, you do not believe the total score is interval, as shown by your wish to categorize it and feed it into an ordinal regression. That means you fully disregard the summing you have already done. Why, then, did you choose summing rather than some other way of combining the items?
It is generally fine to use predictor and outcome variables that use different metrics when performing multiple regression.
To demonstrate the point, you can rescale predictor or dependent variables using a linear transformation (e.g., z-scores, centering, and so on) and this will not influence your $R^2$ or your standardised regression coefficients (note that I'm not saying you should do this, I'm just pointing out that this aspect of scaling is not the issue). Of course, using a 4 or 7 point response scale is more than just rescaling, but in my experience, correlations and $R^2$ won't change much based on whether you use a 4 or 7 point scale.
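The invariance claim can be checked numerically. The following is a minimal sketch on simulated data, not the asker's data: the helper `r_squared`, the three made-up 5-point predictors, and the outcome coefficients are all assumptions chosen for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit that includes an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1.0 - residuals.var() / y.var()

rng = np.random.default_rng(0)
n = 500
# Simulated 5-point predictors and a noisy outcome (illustrative data only).
X = rng.integers(1, 6, size=(n, 3)).astype(float)
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=n)

r2_raw = r_squared(X, y)

# Linear transformations: z-score the predictors, shift and stretch the outcome.
X_z = (X - X.mean(axis=0)) / X.std(axis=0)
y_rescaled = 10.0 * y + 3.0
r2_rescaled = r_squared(X_z, y_rescaled)

print("R^2 on raw scales:     ", round(r2_raw, 6))
print("R^2 after rescaling:   ", round(r2_rescaled, 6))
```

The two printed values should agree up to floating-point error, which is the sense in which the metric of each variable "is not the issue".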
That said, there are several issues to consider when you have predictor or dependent variables that are single item variables with a small number of ordered response options:
- What is the best response scale for measuring the variable of interest? If you are designing a study, then you may want to think about the optimal number of response options. There are a range of debates about this. Some people argue that you should have more response options (e.g., a 7 or 10 point scale). Others suggest that you should align the set of response options to the meaningful distinctions that respondents are able to make, and that too many response options can lead to more person-specific anchoring effects; such arguments are often used to justify 5 point scales.
- What is the best way to measure the variable of interest? If you truly have a single item measure on a four or seven point scale, you would often be better served by developing a scale with multiple items that you then sum to form an overall measure. This will tend to be more reliable and lead to more discrimination. Both of these factors may result in improved prediction.
- Can you include an item with four ordered response options as a dependent variable in a linear regression? There are different answers to this. Certainly, it is possible, and many people do this. Of course, the residuals won't be normally distributed, and it assumes that you are happy treating the response categories as equally spaced. There are alternative techniques that attempt to model ordinal data more explicitly (such as ordinal logistic regression). In practice, as the number of categories increases, people are generally more willing to perform linear regression. Thus, if your dependent variable were the sum of a few items all on a four point scale, linear regression would seem more appropriate. Four options on a single item is on the low side.
- Can you include an item with seven ordered response options as a predictor variable in linear regression? Yes, this is fine. There are many options regarding how you numerically code the variable. The standard approach would be to treat the categories as equally spaced. Of course, you could explore other codings (there's even optimal scaling, which attempts to optimise the coding of the variable subject to constraints such as ordinality). Or you could include both a linear and a quadratic coding for the variable to incorporate non-linearity of effect.
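The last option in the list, adding a quadratic term to the equally spaced coding, could be sketched as follows. This assumes simulated data with a deliberately curved relationship; the variable names, the coefficients, and the helper `fit_r2` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
item = rng.integers(1, 8, size=n).astype(float)  # 7-point item coded 1..7
# Simulated outcome with a curved relationship to the item (illustrative only).
y = 2.0 + 0.8 * item - 0.1 * (item - 4.0) ** 2 + rng.normal(scale=0.5, size=n)

# Centre before squaring so the linear and quadratic columns are less collinear.
centered = item - item.mean()
design_linear = np.column_stack([np.ones(n), centered])
design_quad = np.column_stack([np.ones(n), centered, centered ** 2])

def fit_r2(design, y):
    """Return OLS coefficients and R^2 for a design matrix with an intercept column."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return coef, 1.0 - resid.var() / y.var()

coef_lin, r2_lin = fit_r2(design_linear, y)
coef_quad, r2_quad = fit_r2(design_quad, y)

print("linear coding only, R^2:     ", round(r2_lin, 3))
print("linear + quadratic, R^2:     ", round(r2_quad, 3))
print("estimated quadratic coefficient:", round(coef_quad[2], 3))
```

If the quadratic coefficient is close to zero and the gain in $R^2$ is negligible, the plain equally spaced coding is probably adequate for that item.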
Note that most of the above was written on the initial assumption that your predictor and outcome variables were single items. If you have multi-item scales that just happen to use different response scales, then there's not too much to think about. Most people treat such scales as standard numeric variables in their multiple regressions.
Best Answer
I think I get it, too many questions. However, obtaining answers to them is important for a good recommendation.
One approach to answering your regression question would be to use the Lasso, a regularization method, for variable selection. That said, every statistician and their sibling has a "favorite" variable selection method. The Lasso has the advantage of being called out by Larry Wasserman on his defunct Normal Deviate blog as one of the 10 best contributions to statistics in the last 10 or 20 years. The Lasso would reduce the 25 variables to a smaller, more manageable number.
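To make the idea concrete, here is a minimal sketch of Lasso variable selection on simulated data with 25 five-point predictors. Everything in it is an assumption for illustration: the data are made up, only three predictors truly matter, the penalty `alpha=0.25` is fixed by hand, and the small coordinate-descent solver is a teaching version rather than a production implementation.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X are standardised and y is centred (no intercept).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of column j with the residual that excludes column j.
            rho = X[:, j] @ residual / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, alpha) / col_sq[j]
            residual += X[:, j] * (beta[j] - new_beta)
            beta[j] = new_beta
    return beta

rng = np.random.default_rng(2)
n, p = 300, 25
X_raw = rng.integers(1, 6, size=(n, p)).astype(float)  # 25 five-point items
# Hypothetical truth: only items 0, 5 and 10 influence the outcome.
y_raw = (0.6 * X_raw[:, 0] - 0.5 * X_raw[:, 5] + 0.4 * X_raw[:, 10]
         + rng.normal(size=n))

X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y = y_raw - y_raw.mean()

beta = lasso_coordinate_descent(X, y, alpha=0.25)
selected = [j for j in range(p) if beta[j] != 0.0]
print("predictors kept by the Lasso:", selected)
```

In practice the penalty would usually be chosen by cross-validation rather than by hand, for example with scikit-learn's `LassoCV` or R's `glmnet`.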
Then, there are plenty of heuristics for ranking variables by their relative importance, i.e., identifying the "drivers." A bad choice to avoid is using the unstandardized regression coefficients (betas), since they are not scale invariant. A better choice would be to rank the variables by the absolute values of their t-statistics. An "optimal" choice for assessing relative variable importance would be to read Ulrike Groemping's papers on this area of statistical modeling and use her own approach, called RELAIMPO... https://prof.beuth-hochschule.de/groemping/relaimpo/.
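The t-statistic heuristic can be illustrated with a short sketch. It assumes simulated data and a hypothetical helper `ols_t_stats`; it is not RELAIMPO, only the simpler ranking by absolute t-statistics, and it also shows why raw coefficients are a poor ranking tool: rescaling a predictor changes its coefficient but not its t-statistic.

```python
import numpy as np

def ols_t_stats(X, y):
    """Return OLS coefficients and t-statistics (intercept first)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = n - design.shape[1]
    sigma2 = resid @ resid / dof  # unbiased residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coef, coef / np.sqrt(np.diag(cov))

rng = np.random.default_rng(3)
n, p = 300, 4
X = rng.integers(1, 6, size=(n, p)).astype(float)
# Simulated outcome: x0 matters most, x2 a little, x1 and x3 not at all.
y = 1.0 + 0.7 * X[:, 0] + 0.2 * X[:, 2] + rng.normal(size=n)

coef, t_stats = ols_t_stats(X, y)
order = np.argsort(-np.abs(t_stats[1:]))  # skip the intercept
print("ranking by |t|:", [f"x{j}" for j in order])

# Express x0 in different units: the coefficient changes, the t-statistic does not.
X_scaled = X.copy()
X_scaled[:, 0] *= 100.0
coef_scaled, t_scaled = ols_t_stats(X_scaled, y)
print("x0 coefficient, raw vs rescaled:", coef[1], coef_scaled[1])
print("x0 t-statistic, raw vs rescaled:", t_stats[1], t_scaled[1])
```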