Solved – Fitting a truncated normal distribution between 0 and 1 with a logit link function in R

generalized linear model; proportion; r; regression; truncated normal distribution

I am modeling data that are proportions between 0 and 1, with many values of 0 and 1 in the data set. I'd like to model this as a truncated normal between 0-1, using a logit link in glm for R. I have two questions:

  1. Is a truncated normal with logit link via glm appropriate for proportion data with many real 0 and 1s?

  2. How would I go about specifying a truncated normal family within glm for R?

Note: Beta regression is not appropriate, as it does not like 0s or 1s.

Best Answer

Given that these data are proportions of two integers, it makes much more sense to use binomial logistic regression on them. Logistic regression is not just for the Bernoulli distribution (binary 0/1).

The advantage of this approach is especially apparent if you have different denominators: the proportion 20/100 carries much more information about the binomial parameter $p$ than the proportion 2/10, even though both have the same decimal value. Binomial is also the natural family for integer-valued proportions. In R this is accomplished by providing your response variable as a two-column matrix of successes and failures. Here x is the numerator and y is the denominator of each proportion:

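x <- c(3, 12, 6, 8, 10)      # numerators (successes)
y <- c(6, 17, 14, 20, 23)    # denominators (trials)
bin <- cbind(x, y - x)       # two-column response: successes, failures
glm(bin ~ 1, family = "binomial")   # intercept-only model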
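To make the point about denominators concrete, here is a minimal sketch that fits an intercept-only binomial model to 2/10 and to 20/100 separately. The object names (small, large, se_small, se_large) are made up for illustration. Both fits should give the same estimated log-odds, but the standard error should be noticeably smaller for the larger denominator.

```r
# Two proportions with the same value (0.2) but different denominators
small <- glm(cbind(2, 8) ~ 1, family = binomial)     # 2 successes out of 10
large <- glm(cbind(20, 80) ~ 1, family = binomial)   # 20 successes out of 100

# Both intercepts estimate the same log-odds, log(0.2 / 0.8)
coef(small)
coef(large)

# The standard error of the intercept shrinks as the denominator grows
se_small <- sqrt(diag(vcov(small)))
se_large <- sqrt(diag(vcov(large)))
se_small
se_large
```

Passing only the decimal proportion 0.2 would throw this difference away, which is why the counts are supplied as successes and failures.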

However, there is one additional consideration when using the binomial distribution in logistic regression. For the binomial family, the variance is tied to the mean through the binomial probability parameter, $p$. It is possible, even common, that your residuals will show more variation than expected from the binomial distribution. (This is not something you have to worry about with binary inputs to logistic regression.) This overdispersion can arise because there is not enough covariate information to explain the data (e.g. there are missing predictors) or simply because nature doesn't like to follow the rules. The consequence of not accounting for overdispersion is that your standard errors will be too small. You can accommodate overdispersion in your model by using quasi-likelihood and choosing the "quasibinomial" family in glm. E.g.:

glm(bin ~ NULL, family =  "quasibinomial")

We can check for overdispersion by running the summary function on the above model. If we do that, we'll see a line that says:

(Dispersion parameter for quasibinomial family taken to be 1.078017)

That's pretty close to 1 for our made-up data, so here we would choose to use family = "binomial" and not worry about overdispersion.
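As a rough sketch of where that number comes from: the dispersion reported for the quasibinomial family is the Pearson chi-square statistic divided by the residual degrees of freedom, so it can also be computed by hand from the plain binomial fit. The object names below (fit_b, fit_qb, dispersion, se_b, se_qb) are made up for illustration, and the data are the same toy values used above.

```r
x <- c(3, 12, 6, 8, 10)      # numerators (successes)
y <- c(6, 17, 14, 20, 23)    # denominators (trials)

fit_b  <- glm(cbind(x, y - x) ~ 1, family = binomial)
fit_qb <- glm(cbind(x, y - x) ~ 1, family = quasibinomial)

# Pearson chi-square divided by the residual degrees of freedom
dispersion <- sum(residuals(fit_b, type = "pearson")^2) / df.residual(fit_b)
dispersion
summary(fit_qb)$dispersion   # the value reported by summary()

# Quasibinomial standard errors are the binomial ones scaled by sqrt(dispersion)
se_b  <- sqrt(diag(vcov(fit_b)))
se_qb <- sqrt(diag(vcov(fit_qb)))
se_qb / se_b
sqrt(dispersion)
```

The point estimates are identical under both families; only the standard errors change, which is why a dispersion near 1 means the choice makes little practical difference.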
