I have some data in $[0,1]$ which I would like to analyze with a beta regression.
Of course something needs to be done to accommodate the 0 and 1 values. I dislike
modifying data to fit a model. I also don't believe that zero-and-one inflation
is a good idea, because in this case I believe the 0's should be considered
very small positive values (though I don't want to say exactly which
value is appropriate). A reasonable choice, I believe, would be to pick small values
like .001 and .999 and to fit the model using the cumulative distribution of the beta.
So for each observation y_i, the contribution to the log-likelihood LL would be:

    if y_i < .001:       LL += log(cum_beta(.001))
    else if y_i > .999:  LL += log(1.0 - cum_beta(.999))
    else:                LL += log(beta_density(y_i))
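The censoring scheme above can be sketched in Python with `scipy.stats.beta`. The cutoffs `lo`/`hi` and the shape parameters `a`, `b` are placeholders; in an actual beta regression `a` and `b` would be functions of covariates:

```python
import numpy as np
from scipy.stats import beta

def censored_beta_loglik(y, a, b, lo=0.001, hi=0.999):
    """Censored beta log-likelihood: observations below `lo` contribute
    log F(lo), observations above `hi` contribute log(1 - F(hi)), and
    interior observations contribute the log density."""
    ll = 0.0
    for yi in np.asarray(y, dtype=float):
        if yi < lo:
            ll += beta.logcdf(lo, a, b)    # mass of the left tail
        elif yi > hi:
            ll += beta.logsf(hi, a, b)     # mass of the right tail
        else:
            ll += beta.logpdf(yi, a, b)    # ordinary density term
    return ll
```

Note that exact 0's and 1's get a finite likelihood contribution here, so no data modification is needed.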
What I like about this model is that if the beta regression model is valid,
this model is also valid, but it removes some of the sensitivity to the
extreme values. However, this seems such a natural approach that
I wonder why I don't find any obvious references to it in the literature.
So my question is: instead of modifying the data, why not modify the model?
Modifying the data biases the results (under the assumption that the original model is valid), whereas modifying the model by binning the extreme values does not.
Is there a problem I am overlooking?
Best Answer
According to Smithson & Verkuilen (2006)$^1$, an appropriate transformation is
$$ x' = \frac{x(N-1) + s}{N} $$
where $N$ is the sample size and $s$ is a small constant (they suggest $s = 0.5$). This squeezes data lying in $[0,1]$ into $(0,1)$. The transformation and its mathematical justification are given in the [paper's supplementary notes].
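As a quick sketch, the transformation is straightforward to apply (using $s = 0.5$ as the paper suggests):

```python
import numpy as np

def squeeze(x, s=0.5):
    """Map data in [0,1] into (0,1) via x' = (x*(N-1) + s) / N,
    where N is the sample size of x."""
    x = np.asarray(x, dtype=float)
    N = x.size
    return (x * (N - 1) + s) / N
```

After this transformation every observation lies strictly between 0 and 1, so an ordinary beta regression can be fit directly.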
Reference:

1. Smithson, M., & Verkuilen, J. (2006). A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. *Psychological Methods*, 11(1), 54–71.