Regression – Model for Continuous Dependent Variable Bounded Between 0 and 1

logistic, regression

I'm attempting a multiple regression model where the predicted variable is runoff ratio – the ratio of watershed discharge to the precipitation input. This should generally be bounded within [0,1]; however, due to measurement error, some values > 1 occur.

Originally, I modeled this with the predicted variable untransformed, but logistic regression has been suggested to me, and I have also heard Beta regression suggested. I'm not sure how to proceed, or whether these transformations are appropriate for my data:
[Figure: image from the original post, not available]

My questions are:
1) Is a logistic regression appropriate for these data? and
2) If I were to proceed with logistic regression, would I need to convert the runoff ratios to proportions, or would I apply the logit to the values as they are?

Sorry if these are obtuse questions – I'm new to logit and most of the information I have found is for binary response variables.

Edited for suggested additions:
As a simple version: I am modeling runoff ratio (rr) as a function of precipitation (pcp) and antecedent water table position (ant):

rr ~ pcp + ant

rr is a continuous variable. I am not interested in the probability of specific values, rather I'm interested in the values themselves – both to assess the significance of the predictors and as a predictive model.

Conceptually, I was fine modeling it un-transformed. However, a simple linear regression allows predicted values outside of the physical range of [0,1]. As mentioned above, measurement error does lead to values >1, which I'll eventually have to deal with.
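To make the out-of-range concern concrete, here is a minimal sketch using NumPy. Everything in it is an assumption for illustration: the data are synthetic (not the measurements behind this question), the saturating "true" relationship and its coefficients are invented, and the new storm size of 300 is deliberately larger than anything in the fitted sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption: not the measurements from the question).
n = 150
pcp = rng.uniform(5.0, 100.0, n)   # precipitation
ant = rng.uniform(-1.0, 0.0, n)    # antecedent water table position

# Assumed saturating relationship that keeps the true ratio inside (0, 1).
rr_true = 1.0 / (1.0 + np.exp(-(-2.0 + 0.04 * pcp + 1.5 * ant)))
rr = rr_true + rng.normal(0.0, 0.05, n)   # additive measurement error

# Ordinary least squares for rr ~ pcp + ant.
X = np.column_stack([np.ones(n), pcp, ant])
beta, *_ = np.linalg.lstsq(X, rr, rcond=None)

# Predict for a storm larger than any in the fitted sample (extrapolation).
x_new = np.array([1.0, 300.0, 0.0])
rr_pred = x_new @ beta
print(f"predicted runoff ratio: {rr_pred:.2f}")
```

Because the fitted surface is a plane, nothing stops the prediction from leaving [0,1] once the predictors move far enough.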

Best Answer

Let "$run$" be the runoff, as measured with error, so that the measured runoff ratio $rr$ is $run/pcp$. The stated model and its alternatives appear to be in the form

$$rr = \frac{run}{pcp} \sim F(\beta_{pcp} (pcp) + \beta_{ant} (ant) + \beta_0)$$

where $F$ is some family of distributions (such as Beta distributions) and the $\beta_{*}$ are coefficients to be estimated. The main problem with this is that unless the dispersion of the measurement error in $run$ is directly proportional to $pcp$, the structure of $F$ will be unnecessarily complicated. Why not algebraically rewrite the relationship as

$$run = \beta_{pcp} (pcp)^2 + \beta_{ant} (ant)(pcp) + \beta_0(pcp) + \varepsilon$$

where $\varepsilon$ represents the measurement error? The absence of several simple terms in this formula (such as one depending directly on $ant$ as well as a constant term) suggests that the proposed model may be artificially limited. Thus, ordinary regression (using $run$ or some re-expression thereof, such as a square or cube root, as the dependent variable) to fit a model like

$$run = \alpha_0 + \alpha_{pcp}(pcp) + \alpha_{ant}(ant) + \alpha_{pcp2}(pcp)^2 + \alpha_{ant,pcp}(ant)(pcp) + \varepsilon$$

would be a good way to begin an analysis. And if indeed the variance of $\varepsilon$ depends on $pcp$, that can be modeled in various straightforward ways. This approach seems more natural, realistic, and interpretable than hoping the ratio $rr$ would satisfy the more restrictive assumptions of Beta or Logistic regression.
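As a rough sketch of how this could be started, the code below fits the expanded model for $run$ by ordinary least squares and then, as one possible way to handle a variance that depends on $pcp$, by weighted least squares. The data are synthetic, the coefficient values in `alpha_true` are invented, and the error standard deviation is simply assumed to be proportional to $pcp$; none of this comes from the original measurements, and `design_matrix` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data (assumption: not the poster's measurements).
n = 200
pcp = rng.uniform(5.0, 100.0, n)   # precipitation input
ant = rng.uniform(-1.0, 0.0, n)    # antecedent water table position


def design_matrix(pcp, ant):
    """Columns: 1, pcp, ant, pcp^2, ant*pcp (the terms of the expanded model)."""
    return np.column_stack([np.ones_like(pcp), pcp, ant, pcp**2, ant * pcp])


# Invented coefficients, in the same order as the design-matrix columns.
alpha_true = np.array([0.5, 0.1, 0.4, 0.003, 0.05])

X = design_matrix(pcp, ant)

# Assumed error model: standard deviation proportional to pcp.
sigma = 0.01 * pcp
run = X @ alpha_true + rng.normal(0.0, sigma)

# Ordinary least squares: ignores the unequal error variances.
alpha_ols, *_ = np.linalg.lstsq(X, run, rcond=None)

# Weighted least squares: scale each row by 1/pcp so the rescaled
# errors have constant variance under the assumed error model.
w = 1.0 / pcp
alpha_wls, *_ = np.linalg.lstsq(X * w[:, None], run * w, rcond=None)

# A fitted runoff ratio is recovered by dividing predicted runoff by pcp.
run_hat = X @ alpha_wls
rr_hat = run_hat / pcp

print("OLS coefficients:", np.round(alpha_ols, 4))
print("WLS coefficients:", np.round(alpha_wls, 4))
```

If the $\alpha_0$ and $\alpha_{ant}$ estimates turn out to be negligible, the fit reduces to the ratio-style model, so the restricted form can be checked rather than assumed.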
