Solved – Logistic regression for a continuous dependent variable

beta-regression, generalized linear model, logistic, regression

I am trying to model a response variable, the weight of a variable, that always lies between 0 and 1. It can't be treated as binomial, since it involves no successes or failures. Here are my alternatives:

  1. Do a logit transformation of the response variable and fit a linear regression:

    $${\rm logit}(Y) \sim \beta_0 + \beta_1 X_1 + \ldots$$

  2. GLM with a logit link function

What is the difference between the two approaches? In this case, the predictive accuracy of the model is more important than the interpretability of the variables.

Best Answer

Your first option could work. It assumes that the residuals from the model on the transformed data are normally distributed. You need to check this. If it's true, you will be OK.
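To make the check concrete, here is a minimal sketch of option 1 in plain Python. The data are simulated and the coefficients (0.5, 1.2) and noise level are made up for illustration; with real data you would replace `xs` and `ys` with your own values. It logit-transforms the response, fits a one-predictor least-squares line, and computes the skewness of the residuals as one rough symptom of non-normality (a Q-Q plot or formal test would be the fuller check).

```python
import math
import random


def logit(p):
    """Log-odds of p; only defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def expit(z):
    """Inverse of logit; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical data: a response strictly inside (0, 1).
random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(200)]
ys = [expit(0.5 + 1.2 * x + random.gauss(0, 0.4)) for x in xs]

# Option 1: transform the response, then ordinary least squares.
zs = [logit(y) for y in ys]
n = len(xs)
x_bar = sum(xs) / n
z_bar = sum(zs) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxz = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
b1 = sxz / sxx
b0 = z_bar - b1 * x_bar

# Residuals on the logit scale: these are what should look normal.
residuals = [z - (b0 + b1 * x) for x, z in zip(xs, zs)]
m2 = sum(r ** 2 for r in residuals) / n
m3 = sum(r ** 3 for r in residuals) / n
skewness = m3 / m2 ** 1.5  # near 0 for symmetric residuals

# Back-transform fitted values to the original (0, 1) scale.
preds = [expit(b0 + b1 * x) for x in xs]

print(f"b0={b0:.3f}  b1={b1:.3f}  residual skewness={skewness:.3f}")
```

Two caveats follow from the transformation itself: `logit` is undefined if any observed value is exactly 0 or 1, and the back-transformed fitted values estimate the conditional median of Y rather than its mean when the residuals are symmetric.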

Option 2 depends on how you set up the GLM. Using a logit link function does not commit you to any particular response distribution. The logit link is most commonly paired with a binomial distribution, but it doesn't have to be. I assume you are thinking of something like a normal distribution for the response with a logit link. If so, that probably wouldn't be a great choice, as the normal distribution assumes the data are unbounded, but yours are not. For example, the positive residuals could only lie in the interval $(0,\ 1-\hat\mu)$, whereas the negative residuals could only lie in $(-\hat\mu,\ 0)$; it is very likely you would have heteroscedastic residuals with differing skews. Even if not, they could not possibly be normal. The effect that will have on the predictive ability of the model is unclear to me, but I just wouldn't go this route.
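A small sketch of the bounding argument: since a residual is $y - \hat\mu$ and $y$ lies in $(0, 1)$, every residual must fall in $(-\hat\mu,\ 1-\hat\mu)$. The helper name `residual_bounds` is made up for this illustration.

```python
def residual_bounds(mu_hat):
    """Return the (lower, upper) limits of y - mu_hat when 0 < y < 1."""
    if not 0.0 < mu_hat < 1.0:
        raise ValueError("mu_hat must lie strictly between 0 and 1")
    return (-mu_hat, 1.0 - mu_hat)


# The interval always has width 1, but it is only symmetric around 0
# when the fitted mean is 0.5; near the edges it is heavily lopsided.
for mu_hat in (0.1, 0.5, 0.9):
    lower, upper = residual_bounds(mu_hat)
    print(f"mu_hat={mu_hat}: residuals must lie in ({lower}, {upper})")
```

Because the admissible range shifts with the fitted mean, the residual distribution has to change shape across observations, which is where the heteroscedasticity and differing skews come from.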

My guess is that your best bet may be to use Beta regression. The Beta distribution is very flexible and should typically be the best choice for continuous proportions. Note, however, that it is possible to have data bounded by 0 and 1 that do not fit any Beta distribution, so you again need to check whether it's sensible.
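As a rough sketch of what a Beta regression fits, the code below writes out the log-likelihood under the common mean/precision parameterization, with a logit link for the mean: $\mu_i = {\rm logit}^{-1}(\beta_0 + \beta_1 x_i)$ and $Y_i \sim {\rm Beta}(\mu_i\phi,\ (1-\mu_i)\phi)$. That parameterization, the simulated data, and the constants `TRUE_B0`, `TRUE_B1`, and `TRUE_PHI` are assumptions made for this illustration. A real fit would maximize this function over all three parameters with a numerical optimizer, which is not shown here.

```python
import math
import random


def expit(z):
    """Inverse logit; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def beta_loglik(b0, b1, phi, xs, ys):
    """Beta regression log-likelihood with a logit link for the mean.

    Each y is modeled as Beta(mu * phi, (1 - mu) * phi), where
    mu = expit(b0 + b1 * x) is the mean and phi > 0 is the precision.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        mu = expit(b0 + b1 * x)
        a = mu * phi
        b = (1.0 - mu) * phi
        total += (
            math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y)
            + (b - 1.0) * math.log(1.0 - y)
        )
    return total


# Simulated data with made-up parameters, purely for illustration.
random.seed(1)
TRUE_B0, TRUE_B1, TRUE_PHI = -0.3, 1.0, 20.0
xs = [random.uniform(-2, 2) for _ in range(500)]
ys = []
for x in xs:
    mu = expit(TRUE_B0 + TRUE_B1 * x)
    ys.append(random.betavariate(mu * TRUE_PHI, (1.0 - mu) * TRUE_PHI))

# Compare the generating parameters with a model that ignores x.
ll_true = beta_loglik(TRUE_B0, TRUE_B1, TRUE_PHI, xs, ys)
ll_flat = beta_loglik(TRUE_B0, 0.0, TRUE_PHI, xs, ys)
print(f"log-likelihood with slope: {ll_true:.1f}, without slope: {ll_flat:.1f}")
```

Unlike the normal-response GLM, this likelihood places no mass outside $(0, 1)$ and lets the variance shrink naturally as the mean approaches either bound. As with the logit transform, observed values of exactly 0 or 1 would need separate handling, because the log terms are undefined there.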
