Solved – Regression when the dependent variable is between 0 and 1

generalized linear model, logistic, proportion, regression, scikit-learn

I am using the scikit-learn library to perform regression. However, in my case I need the dependent variable to be constrained to the range 0 to 1. The dependent variable represents count proportions (counts in some category divided by a total count) and is therefore not continuous. I can see two ways to achieve this.

  1. Transform the dependent variable to the full real number line and perform normal regression.
  2. Transform the regression problem into a categorical one by selecting n classes, each representing the range i/n to (i+1)/n.

My guess is that the first option wouldn't work well in practice and the second looks like an ugly kludge (which might work).

What is a good way to constrain the dependent variable in regression (in Python)?


The question "Regression for an outcome (ratio or fraction) between 0 and 1" suggested using Beta regression, but I don't fully understand this option. Could anyone set out what Beta regression is in technical detail for those who don't use R?

Best Answer

Beta regressions are used for continuous proportions (like the proportion of land with a particular soil type).

For count proportions, the most common models would be binomial regression models, a particular type of generalized linear model (GLM).

Of those, logistic regression is the most widely used, though there are a number of other link functions in use.
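As a rough illustration of what a link function does here, the sketch below evaluates the inverses of three common links (logit, probit, complementary log-log) using only the Python standard library. The function names are made up for this example; the point is just that each one maps any real-valued linear predictor into the interval (0, 1).

```python
import math

def inv_logit(eta):
    """Inverse logit link (logistic function)."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    """Inverse probit link: the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    """Inverse complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))

# Evaluate each inverse link on a few values of the linear predictor.
etas = [-3.0, -1.0, 0.0, 1.0, 3.0]
for eta in etas:
    print(eta, inv_logit(eta), inv_probit(eta), inv_cloglog(eta))
```

The logit and probit curves are symmetric about a proportion of 0.5, while the complementary log-log curve is not, which is one reason a different link is sometimes preferred.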

The estimated fit is automatically constrained to lie within the bounds.

The model doesn't transform the response; it relies on fitting a function of the predictors that stays inside the limits.
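To make this concrete, here is a minimal sketch of a binomial regression with a logit link and a single predictor, fitted by Newton-Raphson on the binomial log-likelihood. It is written in plain Python so the mechanics are visible; the data are invented for illustration and the name `fit_binomial_logit` is hypothetical, not a library function. In practice one would normally use a GLM routine from a statistics library (for example, the GLM class in statsmodels with a binomial family) rather than hand-rolled code.

```python
import math

def inv_logit(eta):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def fit_binomial_logit(x, successes, totals, n_iter=25):
    """Fit logit(p_i) = b0 + b1 * x_i to counts successes_i out of totals_i."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate the gradient (g) and the negated Hessian (h)
        # of the binomial log-likelihood at the current coefficients.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, successes, totals):
            p = inv_logit(b0 + b1 * xi)
            resid = yi - ni * p          # observed minus expected count
            w = ni * p * (1.0 - p)       # binomial variance weight
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system h * delta = g.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Invented data: successes out of 20 trials at each value of x.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
totals = [20, 20, 20, 20, 20, 20]
successes = [1, 3, 8, 12, 17, 20]

b0, b1 = fit_binomial_logit(x, successes, totals)

# Fitted proportions, including predictions far outside the observed x range.
fitted = [inv_logit(b0 + b1 * xi) for xi in x]
extrapolated = [inv_logit(b0 + b1 * xi) for xi in (-10.0, 15.0)]
print(fitted, extrapolated)
```

Note that the raw counts and totals go into the fit directly. Nothing is done to the observed proportions, including the one that equals exactly 1, and every fitted or extrapolated value stays strictly between 0 and 1.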

[Numerous questions on this site discuss logistic regression. A few discuss other models, such as probit regression and complementary log-log regression.]
