I've only just started learning logistic regression so bear with me. As I understand it, the logistic regression model is as follows:
$$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X $$
where p(X) = the proportion of successes.
Please tell if I'm wrong but isn't the logistic regression model effectively a linear regression model with a logit link transformation of the binomial response?
I ran into a problem when I was playing around with a data frame.
I'm trying to fit a logistic regression model to a data set with two explanatory variables, water (binary) & concentration (continuous) and a binomial response variable (no. of successes from no. of trials).
> trout_new
Concentration Water Total Successes prop logit
47 2880 0 143 105 0.73426573 1.0163742
5 180 1 68 4 0.05882353 -2.7725887
7 180 1 109 11 0.10091743 -2.1870722
41 1440 0 100 14 0.14000000 -1.8152900
35 360 0 61 1 0.01639344 -4.0943446
34 360 0 145 21 0.14482759 -1.7757591
44 1440 0 113 3 0.02654867 -3.6018681
39 720 0 99 40 0.40404040 -0.3886580
4 90 1 122 9 0.07377049 -2.5301632
14 720 1 87 3 0.03448276 -3.3322045
11 360 1 129 9 0.06976744 -2.5902672
17 1440 1 140 60 0.42857143 -0.2876821
Column prop
is the proportion of successes to Total and column logit
is the log odds of success which was calculated as log(prop/(1-prop))
.
I fit a logistic regression model as such:
> trout_model <- glm(cbind(Successes,Total-Successes) ~ Water +
Concentration, family = binomial,
data = trout_new)
> summary(trout_model)
Call:
glm(formula = cbind(Successes, Total - Successes) ~ Water +
Concentration,
family = binomial, data = trout_new)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.9030 -2.4959 -0.5221 1.5977 6.8773
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.820e+00 1.856e-01 -15.193 <2e-16 ***
Water 2.404e-01 1.728e-01 1.391 0.164
Concentration 1.245e-03 9.557e-05 13.029 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 375.75 on 11 degrees of freedom
Residual deviance: 141.66 on 9 degrees of freedom
AIC: 195.51
Number of Fisher Scoring iterations: 5
Then I fit a linear model to log odds of successes, I tried glm(…, family = gaussian) and lm().
> trout_model1 <- glm(logit~Water + Concentration,
family = gaussian, data = trout_new)
> summary(trout_model1)
Call:
glm(formula = logit ~ Water + Concentration, family = gaussian,
data = trout_new)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.12774 -0.57066 0.08564 0.70799 1.99285
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.2888858 0.8145197 -4.038 0.00294 **
Water 0.3817345 0.8213712 0.465 0.65315
Concentration 0.0012602 0.0005232 2.409 0.03932 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 1.615836)
Null deviance: 24.689 on 11 degrees of freedom
Residual deviance: 14.543 on 9 degrees of freedom
AIC: 44.361
Number of Fisher Scoring iterations: 2
> trout_model2<- lm(logit~Water + Concentration, data = trout_new)
> summary(trout_model2)
Call:
lm(formula = logit ~ Water + Concentration, data = trout_new)
Residuals:
Min 1Q Median 3Q Max
-2.12774 -0.57066 0.08564 0.70799 1.99285
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.2888858 0.8145197 -4.038 0.00294 **
Water 0.3817345 0.8213712 0.465 0.65315
Concentration 0.0012602 0.0005232 2.409 0.03932 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.271 on 9 degrees of freedom
Multiple R-squared: 0.411, Adjusted R-squared: 0.2801
F-statistic: 3.14 on 2 and 9 DF, p-value: 0.09239
The coefficients from the linear regression of the log odds (logit transformation) is different to that of logistic regression. Can someone explain why they are different? Is there more to logistic regression that I'm overlooking? Does the difference lie in the Maximum Likelihood vs Least Squares functions?
If the log odds of a response equate to a linear combination of parameters as suggested by textbooks. How can I model this in R using the lm()
function or is that not possible?
Best Answer
Three reasons come to mind.
For that second one, consider how you would define the log-odds in a table like the one generated by the following code.