Logistic Regression – Validity of Fitting Logistic Regression with Proportional Dependent Variable

logisticproportion;regression

Several posts (here and here) suggest that beta regression is more appropriate when the dependent variable is naturally bounded between 0 and 1. My question is, leaving appropriateness aside, is it technically incorrect to fit a logistic regression to proportional response variable? R will throw a warning but still produce a result.

It seems to me that the likelihood function will not be a valid likelihood when the response variable is proportional instead of binary, but mathematically speaking, it can still be minimized to give a solution. I wonder what violation/mistake, if any, is made when fitting a logistic regression to proportional data.

Best Answer

What you propose is sometimes called a fractional logit. It certainly has its merits, as long as you remember to use robust standard errors. In 2010 I gave a talk at the German Stata Users' meeting comparing among other things beta regression and fractional logit. The slides can be found here: http://www.maartenbuis.nl/presentations/berlin10.pdf

Related Solutions

Solved – Rare event logistic regression bias: how to simulate the underestimated p’s with a minimal example

This is an interesting question - I'm have done a few simulations that I post below in the hope that this stimulates further discussion.

First of all, a few general comments:

The paper you cite is about rare-event bias. What was not clear to me before (also with respect to comments that were made above) is if there is anything special about cases where you have 10/10000 as opposed to 10/30 observations. However, after some simulations, I would agree there is.
A problem I had in mind (I have encountered this often, and there was recently a paper in Methods in Ecology and Evolution on that, I couldn't find the reference though) is that you can get degenerate cases with GLMs in small-data situations, where the MLE is FAAAR away from the truth, or even at - / + infinity (due to the nonlinear link I suppose). It's not clear to me how one should treat these cases in the bias estimation, but from my simulations I would say they seem key for the rare-event bias. My intuition would be to remove them, but then it's not quite clear how far out they have to be to be removed. Maybe something to keep in mind for bias-correction.
Also, these degenerate cases seem prone to cause numerical problems (I have therefore increased maxit in the glm function, but one could think about increasing epsilon as well to make sure one actually reports the true MLE).

Anyway, here some code that calculates the difference between estimates and truth for intercept, slope and predictions in a logistic regression, first for a low sample size / moderate incidence situation:

set.seed(123)
replicates = 1000
N= 40
slope = 2 # slope (linear scale)
intercept = - 1 # intercept (linear scale)

bias <- matrix(NA, nrow = replicates, ncol = 3)
incidencePredBias <- rep(NA, replicates)

for (i in 1:replicates){
  pred = runif(N,min=-1,max=1) 
  linearResponse = intercept + slope*pred
  data = rbinom(N, 1, plogis(linearResponse))  
  fit <- glm(data ~ pred, family = 'binomial', control = list(maxit = 300))
  bias[i,1:2] = fit$coefficients - c(intercept, slope)
  bias[i,3] = mean(predict(fit,type = "response")) - mean(plogis(linearResponse))
}

par(mfrow = c(1,3))
text = c("Bias intercept", "Bias slope", "Bias prediction")

for (i in 1:3){
  hist(bias[,i], breaks = 100, main = text[i])
  abline(v=mean(bias[,i]), col = "red", lwd = 3)  
}

apply(bias, 2, mean)
apply(bias, 2, sd) / sqrt(replicates)

The resulting bias and standard errors for intercept, slope and prediction are

-0.120429315  0.296453122 -0.001619793
 0.016105833  0.032835468  0.002040664

I would conclude that there is pretty good evidence for a slight negative bias in the intercept, and a slight positive bias in the slope, although a look at the plotted results shows that the bias is small compared the the variance of the estimated values.

If I'm setting the parameters to a rare-event situation

N= 4000
slope = 2 # slope (linear scale)
intercept = - 10 # intercept (linear scale)

I'm getting a larger bias for the intercept, but still NONE on the prediction

   -1.716144e+01  4.271145e-01 -3.793141e-06
    5.039331e-01  4.806615e-01  4.356062e-06

In the histogram of the estimated values, we see the phenomenon of degenerate parameter estimates (if we should call them like that)

Let's remove all rows for which intercept estimates are <20

apply(bias[bias[,1] > -20,], 2, mean)
apply(bias[bias[,1] > -20,], 2, sd) / sqrt(length(bias[,1] > -10))

The bias decreases, and things become a bit more clear in the figures - parameter estimates are clearly not normally distributed. I wonder that that means for the validity of the CIs that are reported.

-0.6694874106  1.9740437782  0.0002079945
1.329322e-01 1.619451e-01 3.242677e-06

I would conclude the rare event bias on the intercept is driven by rare events itself, namely those rare, extremely small estimates. Not sure if we want to remove them or not, not sure what the cutoff would be.

An important thing to note though is that, either way, there seems to be no bias on predictions at the response scale - the link function simply absorbs these extremely small values.

R Regression – Using Percentage Response Variable When Counts Are Unavailable

One of the following three solutions might work for you. However, I am curious what other will suggest:

You can use simple linear regression. However, that procedure might violate some assumptions of linear regression (depends on your actual data). Inferential statistics such as p-values and/or confidence-bands might not be trustworthy. Moreover, your model might predict scores outside of the boundary, which makes interpretations difficult.
You can transform the percentage scores into logits and use them as outcome for the linear regression. Here ist the transformation formula: ln(p/(1-p)) By doing that, you adapt the link-function of logistic regression to the linear regression. That might solve some of the previous problems especially the last one because logits boundaries are -infinite and +infinite. However, you loose interpretability. (Edit: A short discussion of this approach and why it is not recommended can be found in the source linked for the third solution)
Beta-Regression might be the model you are looking for. The following vignette shows how to apply beta regression in R using Cribari-Neto's and Zeileis's "betareg"-Package: ftp://cran.r-project.org/pub/R/web/packages/betareg/vignettes/betareg.pdf

I hope that some of my suggestions might help you!

Best Answer

Related Solutions

Solved – Rare event logistic regression bias: how to simulate the underestimated p’s with a minimal example

R Regression – Using Percentage Response Variable When Counts Are Unavailable

Related Question