Solved – OLS with ordinal dependent variable – do the coefficients mean anything

Tags: categorical-data, interpretation, ordinal-data, regression, standardization

I am currently reading a paper in which the author asked people three different questions about their life satisfaction, each rated on a four-point scale: 1) very low, 2) low, 3) high, 4) very high. The author then averages the answers to the three questions for each individual and uses this individual average as the dependent variable in an OLS regression with binary and continuous explanatory variables.

This does not make sense to me from an interpretation point of view. What does $\beta = 0.12$ tell me in this case given the nature of the dependent variable?

So here are my other questions:

  1. Is OLS even unbiased and consistent for such outcome variables?
  2. Would it be possible to first standardize the answers into the unit interval and then take the average to form a measure of life satisfaction?

For the second question, I thought it might make sense to standardize answer $j$ of individual $i$ as
$$\tilde{X}_{ij} = \frac{X_{ij}-X_{min}}{X_{max}-X_{min}}$$
and then take the average over the $N = 3$ questions, such that
$$\overline{\tilde{X}}_{i} = \frac{1}{N}\sum^N_{j=1}\tilde{X}_{ij}$$
could be used as the dependent variable. Given that this measure of life satisfaction lies between 0 and 1, this should give more interpretable OLS parameters, right?

Thanks in advance.

Best Answer

Interpretive issues for the OLS estimator notwithstanding, the real issue here is in the treatment of an ordinal variable as if it were a variable on the ratio scale. By using standard linear regression analysis, the researchers are essentially treating the ordinal response as if it were a continuous quantity. By averaging three ratings they are also implicitly treating these life satisfaction measures as continuous measures of equal weighting in a continuous aggregated measure. This involves a lot of potentially dubious assumptions about the nature of the rating scale, so you could reasonably be skeptical of the legitimacy of this measure. At a minimum, such a treatment obscures a great deal of information in the specific effects of the explanatory variables on the ordinal categories in the individual response measures.

In any case, if we let $\bar{Y}$ denote the response variable in this case (i.e., the average of the three ratings for life satisfaction) then we have a model of the form:

$$\bar{Y}_i = u(\boldsymbol{\beta}, \mathbf{x}_i) + \varepsilon_i,$$

where the true regression function has the linear form:

$$u(\boldsymbol{\beta}, \mathbf{x}_i) = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_K x_{i,K}.$$

As usual, each slope coefficient $\beta_k$ (with $k=1,...,K$) is the rate-of-change of the conditional expected response with respect to the corresponding explanatory variable:

$$\beta_k = \frac{\partial u}{\partial x_{i,k}} (\boldsymbol{\beta}, \mathbf{x}_i).$$
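
As a worked illustration using the value from the question (assuming the linear model above holds and that $x_{i,k}$ is a binary variable), $\beta_k = 0.12$ would mean:

$$\mathbb{E}[\bar{Y}_i \mid x_{i,k} = 1] - \mathbb{E}[\bar{Y}_i \mid x_{i,k} = 0] = \beta_k = 0.12,$$

with the other explanatory variables held fixed. Since $\bar{Y}_i$ is one third of the sum of the three ratings, this is the same as a difference of $3 \times 0.12 = 0.36$ rating points in the expected sum of the three answers.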

As you can see, the coefficient values in the regression look at rates-of-change of the conditional expected value of the averaged life-satisfaction rating, which you may or may not regard as a dubious measure. The fact that all individual life-satisfaction ratings are ordinal integer values means that the averaged value is restricted to the support $\{ 1, \tfrac{4}{3}, \tfrac{5}{3}, \cdots , \tfrac{11}{3}, 4 \}$, and so the expected value is a convex combination of these possible values.
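
To make the support concrete, here is a minimal Python sketch that enumerates every combination of three ratings and collects the distinct averages. The names `ratings` and `support` are illustrative only and do not come from the paper.

```python
from fractions import Fraction
from itertools import product

# Each of the three questions is rated on the four-point scale 1..4.
ratings = range(1, 5)

# Average every possible triple of ratings exactly, using rational arithmetic,
# and keep only the distinct values.
support = sorted({Fraction(sum(triple), 3) for triple in product(ratings, repeat=3)})

print([str(v) for v in support])
print(len(support))
```

The sum of three ratings can take any integer value from 3 to 12, so the average takes ten distinct values spaced one third apart.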


With regard to your follow-on questions: (1) provided the linear model above holds, the OLS estimator is unbiased and consistent (under broad limiting conditions on the explanatory variables) for the true coefficient values in the model, which in this case may be of dubious meaning to begin with; and (2) standardisation of the response values merely applies a linear transformation to them, which changes the coefficients by the corresponding linear transformation. Mapping the scale from $[1, 4]$ to $[0, 1]$ divides every slope coefficient by 3 and changes the intercept from $\beta_0$ to $(\beta_0 - 1)/3$; it does not fundamentally change the information coming out of the model.
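
Point (2) can be checked numerically. The sketch below uses simulated data: every variable name and every value in the data-generating step is hypothetical and chosen only for illustration, not taken from the paper. It fits OLS once to the raw average and once to the average of the min-max scaled answers, then compares the two coefficient vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical explanatory variables: one binary, one continuous.
x_bin = rng.integers(0, 2, size=n)
x_cont = rng.normal(size=n)

# Hypothetical ratings: three answers per person on the 1..4 scale,
# loosely driven by the explanatory variables (simulation only).
latent = 2.5 + 0.4 * x_bin + 0.3 * x_cont
answers = np.clip(np.rint(latent[:, None] + rng.normal(size=(n, 3))), 1, 4)

y_raw = answers.mean(axis=1)                     # average on the 1..4 scale
y_unit = ((answers - 1) / (4 - 1)).mean(axis=1)  # min-max scale, then average

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x_bin, x_cont])
b_raw, *_ = np.linalg.lstsq(X, y_raw, rcond=None)
b_unit, *_ = np.linalg.lstsq(X, y_unit, rcond=None)

# Slopes are divided by 3; the intercept is shifted by 1 and then divided by 3.
expected = b_raw / 3
expected[0] = (b_raw[0] - 1) / 3
print(np.allclose(b_unit, expected))
```

Because the rescaling is linear, scaling each answer before averaging gives the same response as scaling the average itself, so the two fits carry the same information and differ only in units.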
