Solved – Dealing with 0,1 values in a beta regression

beta distribution, beta-regression, generalized linear model, regression, zero inflation

I have some data in [0,1] which I would like to analyze with a beta regression.
Of course, something needs to be done to accommodate the exact 0 and 1 values.
I dislike modifying data to fit a model. I also don't think that zero-and-one
inflation is a good idea, because in this case I believe the 0's should be
regarded as very small positive values (though I don't want to say exactly what
value is appropriate). A reasonable choice, I believe, would be to pick cutoffs
like .001 and .999 and to fit the model using the cumulative distribution
function of the beta for observations beyond those cutoffs.
So for observation y_i, the log-likelihood contribution LL_i would be

 if      y_i < .001   LL += log(cum_beta(.001))
 else if y_i > .999   LL += log(1.0 - cum_beta(.999))
 else                 LL += log(beta_density(y_i))
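
A minimal Python sketch of this censored log likelihood, assuming SciPy is available. The function name `censored_beta_loglik` and the mean/precision parameterization (shape parameters `a = mu * phi`, `b = (1 - mu) * phi`) are assumptions made for illustration; they are not part of the description above. In a regression, `mu` would come from a link function applied to the linear predictor.

```python
import numpy as np
from scipy.stats import beta


def censored_beta_loglik(y, mu, phi, lower=0.001, upper=0.999):
    """Beta log likelihood with observations outside [lower, upper] censored.

    y   : observations in [0, 1]
    mu  : mean(s) in (0, 1), scalar or one per observation (assumed parameterization)
    phi : scalar precision > 0 (assumed parameterization)
    """
    y = np.asarray(y, dtype=float)
    mu = np.broadcast_to(np.asarray(mu, dtype=float), y.shape)
    a, b = mu * phi, (1.0 - mu) * phi

    low = y < lower          # treated as "somewhere below lower"
    high = y > upper         # treated as "somewhere above upper"
    mid = ~(low | high)      # ordinary density contribution

    ll = beta.logcdf(lower, a[low], b[low]).sum()      # log P(Y < lower)
    ll += beta.logsf(upper, a[high], b[high]).sum()    # log P(Y > upper)
    ll += beta.logpdf(y[mid], a[mid], b[mid]).sum()    # log f(y_i)
    return float(ll)


# Example: exact 0 and 1 contribute finite terms instead of breaking the fit.
y = np.array([0.0, 0.2, 0.5, 0.97, 1.0])
print(censored_beta_loglik(y, mu=0.5, phi=2.0))
```

Using `logsf` rather than `log(1 - cdf)` avoids loss of precision when the upper tail probability is tiny. The negative of this function could be handed to a general-purpose optimizer to estimate the regression coefficients and `phi`.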

What I like about this model is that if the beta regression model is valid,
this model is also valid, but it removes some of the sensitivity to the
extreme values. However, this seems such a natural approach that
I wonder why I can't find any obvious references to it in the literature.
So my question is: instead of modifying the data, why not modify the model?
Modifying the data biases the results (assuming the original model is valid), whereas modifying the model by binning the extreme values does not.

Maybe there is a problem I am overlooking?

Best Answer

According to Smithson & Verkuilen (2006)$^1$, an appropriate transformation is

$$ x' = \frac{x(N-1) + s}{N} $$

"where N is the sample size and s is a constant between 0 and 1. From a Bayesian standpoint, s acts as if we are taking a prior into account. A reasonable choice for s would be .5."

This will squeeze data that lie in $[0,1]$ into $(0,1)$. The above quote and the mathematical justification for the transformation are available in the [paper's supplementary notes].
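
As a rough illustration, the transformation can be written in a few lines of Python with NumPy. The function name `squeeze_unit_interval` is hypothetical, and `N` is taken to be the number of observations passed in, which is an assumption about how the data are supplied.

```python
import numpy as np


def squeeze_unit_interval(x, s=0.5):
    """Map values in [0, 1] into (0, 1) via x' = (x * (N - 1) + s) / N."""
    x = np.asarray(x, dtype=float)
    n = x.size  # N: sample size, assumed to be the length of x
    return (x * (n - 1) + s) / n


# With N = 3 and s = 0.5, the endpoints 0 and 1 move to 1/6 and 5/6.
print(squeeze_unit_interval([0.0, 0.5, 1.0]))
```

The shift shrinks as N grows, so in large samples the transformed endpoints sit very close to 0 and 1.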


Reference:
  1. Smithson, M. & Verkuilen, J. A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychol. Methods 11, 54–71 (2006). DOI: 10.1037/1082-989X.11.1.54