Simplified version of my model:
glm(cbind(young, adults) ~ as.factor(month) + effort, family = "binomial")
i.e., I model the proportion of young as a function of month (or season), taking observer effort into account. However, observer effort itself depends on the month:
How can I solve this problem? I want to take both variables into account.
I have looked in the literature but haven't found a solution. My naive idea is to compute the mean effort for each month and, instead of using effort directly, use the difference between effort and this monthly mean. But this is just a naive idea; I would like to hear your advice. Thanks!
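For what it's worth, the within-month centering described above can be sketched in a couple of lines with `ave()`. The data frame and values here are made up for illustration; the real data would use the `young`, `adults`, `month`, and `effort` columns from the model:

```r
set.seed(1)
d <- data.frame(
  month  = rep(1:12, each = 3),     # three surveys per month (illustrative)
  effort = rpois(36, lambda = 50)   # illustrative effort values
)
# subtract each month's mean effort from the raw effort values
d$effort_centered <- d$effort - ave(d$effort, d$month)
# by construction, effort_centered averages to zero within every month
```

Note that this is exactly the residual of `effort` after regressing it on `as.factor(month)`, which connects to the answer below.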
EDIT – in response to Scortchi's question – no, not really:
> m = glm(cbind(young, adults) ~ as.factor(month) + effort, family = "quasibinomial")
> summary(m)
Call:
glm(formula = cbind(young, adults) ~ as.factor(month) + effort, family = "quasibinomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6829 -1.1138 0.0000 0.9717 4.0090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7868764 0.2170738 -3.625 0.000306 ***
as.factor(month)2 0.8857561 0.2606780 3.398 0.000710 ***
as.factor(month)3 0.7055741 0.2918895 2.417 0.015843 *
as.factor(month)4 0.3943665 0.3269973 1.206 0.228138
as.factor(month)5 0.4831113 0.3730987 1.295 0.195713
as.factor(month)6 -0.5217349 0.5027560 -1.038 0.299676
as.factor(month)7 0.1612901 0.4333682 0.372 0.709851
as.factor(month)8 0.5114890 0.3545159 1.443 0.149444
as.factor(month)9 0.7741060 0.3126087 2.476 0.013466 *
as.factor(month)10 0.6601093 0.2609937 2.529 0.011608 *
as.factor(month)11 0.4891778 0.2647303 1.848 0.064967 .
as.factor(month)12 0.4743091 0.2565709 1.849 0.064849 .
effort 0.0032506 0.0007976 4.075 5.02e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasibinomial family taken to be 1.270962)
Null deviance: 1518.0 on 878 degrees of freedom
Residual deviance: 1447.8 on 866 degrees of freedom
(750 observations deleted due to missingness)
AIC: NA
Number of Fisher Scoring iterations: 4
Warning message:
In summary.glm(m) :
observations with zero weight not used for calculating dispersion
>
>
> m = glm(cbind(young, adults) ~ as.factor(month), family = "quasibinomial")
> summary(m)
Call:
glm(formula = cbind(young, adults) ~ as.factor(month), family = "quasibinomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1142 -1.1266 0.0000 0.9235 3.6484
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2296 0.1590 -1.444 0.14890
as.factor(month)2 0.8610 0.2059 4.181 3.09e-05 ***
as.factor(month)3 1.0887 0.2062 5.280 1.51e-07 ***
as.factor(month)4 0.4184 0.2374 1.762 0.07822 .
as.factor(month)5 0.1495 0.3086 0.485 0.62802
as.factor(month)6 -0.6177 0.3872 -1.595 0.11091
as.factor(month)7 -0.4636 0.3666 -1.265 0.20622
as.factor(month)8 0.1089 0.2976 0.366 0.71440
as.factor(month)9 0.4932 0.2490 1.980 0.04787 *
as.factor(month)10 0.6322 0.2096 3.016 0.00261 **
as.factor(month)11 0.4919 0.2152 2.286 0.02243 *
as.factor(month)12 0.2296 0.2127 1.079 0.28071
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasibinomial family taken to be 1.309215)
Null deviance: 2400.1 on 1345 degrees of freedom
Residual deviance: 2310.4 on 1334 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 4
Warning message:
In summary.glm(m) :
observations with zero weight not used for calculating dispersion
>
>
>
> m = glm(cbind(young, adults) ~ effort, family = "quasibinomial")
> summary(m)
Call:
glm(formula = cbind(young, adults) ~ effort, family = "quasibinomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-2.718 -1.119 0.000 1.011 4.236
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.326899 0.102473 -3.190 0.00147 **
effort 0.003827 0.000688 5.563 3.52e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasibinomial family taken to be 1.268467)
Null deviance: 1518.0 on 878 degrees of freedom
Residual deviance: 1475.4 on 877 degrees of freedom
(750 observations deleted due to missingness)
AIC: NA
Number of Fisher Scoring iterations: 4
Warning message:
In summary.glm(m) :
observations with zero weight not used for calculating dispersion
Best Answer
I'm converting a previous comment into an answer, expanding a bit based on a follow-up comment from the OP. The original, unedited comment was:

There is no silver bullet for decomposing variation in that situation. One thing you can do with two collinear predictors, $x_1, x_2$, is fit the model $x_1 \sim x_2$, take the residuals from that model, $\eta$, and replace $x_1$ with $\eta$ in the model $y \sim x_1 + x_2$. This way you will, by construction, have uncorrelated predictors, and the contribution of $\eta$ can be thought of as the variance explained by $x_1$ that is not subsumed by $x_2$. Of course, which variable plays the role of $x_1$ and which $x_2$ is a judgment call (though the overall model fit will be identical either way).
In response to the OP's comment:
@Macro, this is a nice idea... maybe worth posting as an answer, so we can discuss it in more detail? This is very interesting, because then $x_1 = x_2 + \eta$, and if you replace $x_1$ with $\eta$ in the original model, you get $y \sim \eta + x_2 = x_1$, which means you lose $x_2$ from the overall fit of the model! This is strange, a paradox! Please post your comment as an answer so we can discuss it in more detail.
Be careful here, because $x_1 \sim x_2$ is R pseudo-code for the model $x_1 = \beta_0 + \beta_1 x_2 + \eta$, not $x_1 = x_2 + \eta$. So, substituting $\eta = x_1 - \beta_0 - \beta_1 x_2$, the model $y \sim \eta + x_2$, which is shorthand for $y = \alpha_0 + \alpha_1 \eta + \alpha_2 x_2 + \varepsilon$, can be written as
$$ y = (\alpha_0 - \alpha_1 \beta_0) + \alpha_1 x_1 + (\alpha_2 - \alpha_1 \beta_1) x_2 + \varepsilon $$
So $x_2$ does not drop out of the model. Indeed the model $y \sim \eta + x_2$ can be seen to have identical degrees of freedom, fit statistics, etc. to the model $y \sim x_1 + x_2$, but the predictors are now uncorrelated.
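This equivalence is easy to verify numerically. A small check with simulated data (ordinary least squares here, for simplicity, but the same logic applies to the GLM):

```r
set.seed(2)
n  <- 200
x2 <- rnorm(n)
x1 <- 1 + 2 * x2 + rnorm(n)      # x1 collinear with x2
y  <- 0.5 + x1 - x2 + rnorm(n)

eta <- resid(lm(x1 ~ x2))        # eta = x1 - (b0 + b1 * x2)

fit1 <- lm(y ~ x1 + x2)
fit2 <- lm(y ~ eta + x2)

all.equal(fitted(fit1), fitted(fit2))  # TRUE: identical fitted values
cor(eta, x2)                           # essentially zero: uncorrelated predictors
```

The two models span the same column space, so fitted values, residuals, and overall fit statistics coincide; only the coefficients and their interpretation change.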