Multicollinearity Between Categorical and Continuous Variables

Tags: generalized-linear-model, multicollinearity, regression

A simplified version of my model:

glm(cbind(young, adults) ~ as.factor(month) + effort, family = "binomial")

That is, I model the proportion of young as a function of month (or season), while taking observer effort into account. However, observer effort itself depends on the month:

[Figure: observer effort plotted against month, showing that effort varies systematically with month.]

How can I solve this problem? I want to take both variables into account.

I have looked into the literature but haven't found a solution. My naive idea is to compute the mean effort for each month and, instead of using effort directly, use the difference between effort and this monthly mean. But this is just a naive idea; I would like to hear your advice. Thanks!
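For concreteness, a minimal sketch of that centering idea, assuming a data frame `d` with columns `young`, `adults`, `month`, and `effort` (the data frame and its column names are my assumptions, not shown above; `na.rm = TRUE` is used because the output below reports missing values):

# Center effort on its within-month mean; ave() returns the monthly
# mean of effort for each row.
d$effort_c <- d$effort - ave(d$effort, d$month,
                             FUN = function(x) mean(x, na.rm = TRUE))

m <- glm(cbind(young, adults) ~ as.factor(month) + effort_c,
         family = "binomial", data = d)
summary(m)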

EDIT – in response to Scortchi's question – no, not really:

> m = glm(cbind(young, adults) ~ as.factor(month) + effort, family = "quasibinomial")
> summary(m)

Call:
glm(formula = cbind(young, adults) ~ as.factor(month) + effort, family = "quasibinomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6829  -1.1138   0.0000   0.9717   4.0090  

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -0.7868764  0.2170738  -3.625 0.000306 ***
as.factor(month)2   0.8857561  0.2606780   3.398 0.000710 ***
as.factor(month)3   0.7055741  0.2918895   2.417 0.015843 *  
as.factor(month)4   0.3943665  0.3269973   1.206 0.228138    
as.factor(month)5   0.4831113  0.3730987   1.295 0.195713    
as.factor(month)6  -0.5217349  0.5027560  -1.038 0.299676    
as.factor(month)7   0.1612901  0.4333682   0.372 0.709851    
as.factor(month)8   0.5114890  0.3545159   1.443 0.149444    
as.factor(month)9   0.7741060  0.3126087   2.476 0.013466 *  
as.factor(month)10  0.6601093  0.2609937   2.529 0.011608 *  
as.factor(month)11  0.4891778  0.2647303   1.848 0.064967 .  
as.factor(month)12  0.4743091  0.2565709   1.849 0.064849 .  
effort              0.0032506  0.0007976   4.075 5.02e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for quasibinomial family taken to be 1.270962)

    Null deviance: 1518.0  on 878  degrees of freedom
Residual deviance: 1447.8  on 866  degrees of freedom
  (750 observations deleted due to missingness)
AIC: NA

Number of Fisher Scoring iterations: 4

Warning message:
In summary.glm(m) :
  observations with zero weight not used for calculating dispersion
> 
> 
> m = glm(cbind(young, adults) ~ as.factor(month), family = "quasibinomial")
> summary(m)

Call:
glm(formula = cbind(young, adults) ~ as.factor(month), family = "quasibinomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.1142  -1.1266   0.0000   0.9235   3.6484  

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -0.2296     0.1590  -1.444  0.14890    
as.factor(month)2    0.8610     0.2059   4.181 3.09e-05 ***
as.factor(month)3    1.0887     0.2062   5.280 1.51e-07 ***
as.factor(month)4    0.4184     0.2374   1.762  0.07822 .  
as.factor(month)5    0.1495     0.3086   0.485  0.62802    
as.factor(month)6   -0.6177     0.3872  -1.595  0.11091    
as.factor(month)7   -0.4636     0.3666  -1.265  0.20622    
as.factor(month)8    0.1089     0.2976   0.366  0.71440    
as.factor(month)9    0.4932     0.2490   1.980  0.04787 *  
as.factor(month)10   0.6322     0.2096   3.016  0.00261 ** 
as.factor(month)11   0.4919     0.2152   2.286  0.02243 *  
as.factor(month)12   0.2296     0.2127   1.079  0.28071    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for quasibinomial family taken to be 1.309215)

    Null deviance: 2400.1  on 1345  degrees of freedom
Residual deviance: 2310.4  on 1334  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

Warning message:
In summary.glm(m) :
  observations with zero weight not used for calculating dispersion
> 
> 
> 
> m = glm(cbind(young, adults) ~ effort, family = "quasibinomial")
> summary(m)

Call:
glm(formula = cbind(young, adults) ~ effort, family = "quasibinomial")

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.718  -1.119   0.000   1.011   4.236  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.326899   0.102473  -3.190  0.00147 ** 
effort       0.003827   0.000688   5.563 3.52e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for quasibinomial family taken to be 1.268467)

    Null deviance: 1518.0  on 878  degrees of freedom
Residual deviance: 1475.4  on 877  degrees of freedom
  (750 observations deleted due to missingness)
AIC: NA

Number of Fisher Scoring iterations: 4

Warning message:
In summary.glm(m) :
  observations with zero weight not used for calculating dispersion

Best Answer

I'm converting a previous comment to an answer, expanding a bit based on a follow-up comment from the OP. The original, unedited comment was:

There is no silver bullet for decomposing variation in that situation. One thing you can do with two collinear predictors, $x_1, x_2$, is fit a model $x_1 \sim x_2$, take the residuals from that model, $\eta$, and replace $x_1$ with $\eta$ in the model $y \sim x_1 + x_2$. This way you will, by construction, have uncorrelated predictors, and the contribution of $\eta$ can be thought of as the variance explained by $x_1$ that is not subsumed by $x_2$. Of course, which variable is $x_1$ and which is $x_2$ is a judgment call (though the overall model fit will be identical).
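Applied to the model in the question (with $x_1$ = effort and $x_2$ = month), a minimal sketch of this residualization, again assuming a data frame `d` as above:

# Residualize effort on month; na.exclude keeps the residual vector
# aligned with the rows of d (the output above shows missing values).
fit.effort <- lm(effort ~ as.factor(month), data = d,
                 na.action = na.exclude)
d$eta <- residuals(fit.effort)  # part of effort not explained by month

# Same model as before, but with eta in place of effort; the overall
# fit is identical, and eta is uncorrelated with the month dummies.
m2 <- glm(cbind(young, adults) ~ as.factor(month) + eta,
          family = "quasibinomial", data = d)
summary(m2)

Note that because month enters as a factor, these residuals are exactly effort minus the within-month mean effort, so this coincides with the "naive" centering idea in the question.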

In response to the OP's comment:

@Macro, this is a nice thing... maybe worth posting as an answer, so we can discuss it in more detail? This is very interesting, because then $x_1 = x_2 + \eta$, and if you replace $x_1$ with $\eta$ in the original model, you get $y \sim \eta + x_2 = x_1$, which means you lose $x_2$ from the overall fit of the model! And this is a strange paradox! Please post your comment as an answer so we can discuss it in more detail.

Be careful here, because $x_1 \sim x_2$ is R pseudo-code for the model $x_1 = \beta_0 + \beta_1 x_2 + \eta$, not $x_1 = x_2 + \eta$. So, substituting $\eta = x_1 - \beta_0 - \beta_1 x_2$, the model $y \sim \eta + x_2$, which is shorthand for $y = \alpha_0 + \alpha_1 \eta + \alpha_2 x_2 + \varepsilon$, can be written as

$$ y = (\alpha_0 - \alpha_1 \beta_0) + \alpha_1 x_1 + (\alpha_2 - \alpha_1 \beta_1) x_2 + \varepsilon $$

So $x_2$ does not drop out of the model. Indeed, the model $y \sim \eta + x_2$ has identical degrees of freedom, fit statistics, etc. to the model $y \sim x_1 + x_2$; the only difference is that the predictors are now uncorrelated.
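The equivalence is easy to verify numerically. A small self-contained sketch with simulated data (all names and the data-generating process are made up purely for illustration):

set.seed(1)
n  <- 200
x2 <- rnorm(n)
x1 <- 1 + 2 * x2 + rnorm(n)    # x1 deliberately collinear with x2
y  <- 0.5 + x1 - x2 + rnorm(n)

eta <- residuals(lm(x1 ~ x2))  # part of x1 orthogonal to x2

m.a <- lm(y ~ x1 + x2)
m.b <- lm(y ~ eta + x2)

all.equal(fitted(m.a), fitted(m.b))  # TRUE: identical fits
cor(eta, x2)                         # ~0: predictors now uncorrelated
coef(m.a)["x1"]; coef(m.b)["eta"]    # same coefficient

The two design matrices span the same column space, which is why the fitted values agree exactly; only the parameterization of the predictors changes.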
