Solved – The dummy variable trap

categorical-encoding, linear, multiple-regression, regression

I find a lot of resources online that explain the dummy variable trap and say that you should remove one category of your dummy variable before fitting a multiple linear regression model, in order to avoid multicollinearity.
While I understand what you should do, I don't understand why you should do it, in terms of a mathematical explanation.
Let's take a concrete example: I have a variable Gender with values Male or Female. If I write out the multiple linear regression equation, I get:

$$y = B_0 + x_1B_1 + x_2B_2$$

with $x_1 = 1$ and $x_2 = 0$. Taking, say, $B_0 = 0$ and $B_1 = B_2 = 1$, I get $y=0+1\times1 + 0\times1$, so how is that different from $y=0+1\times1$ (with the second dummy variable removed)?
Could someone give me a concrete mathematical example of how this "trap" works?
Thanks

Best Answer

We can see from Wikipedia that:

Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related.

In your case, every observation is either Male or Female, so exactly one of the two dummies equals 1 and
$$x_1 = 1-x_2$$
Hence your equation becomes
\begin{align} y &= B_0 + x_1B_1 + x_2B_2 \\ &= B_0 + B_1(1-x_2) + B_2 x_2 \\ &= (B_0+B_1) + (B_2-B_1)x_2 \end{align}

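The expression $(B_0+B_1) + (B_2-B_1)x_2$ has the form $\alpha + \beta x_2$ with $\alpha = B_0+B_1$ and $\beta = B_2-B_1$, so the model really involves only one variable. The data can determine only $\alpha$ and $\beta$. Infinitely many triples $(B_0, B_1, B_2)$ give the same $\alpha$ and $\beta$, so the three coefficients cannot be estimated uniquely. Dropping one dummy removes this redundancy.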
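As a rough numerical check of the algebra above, here is a minimal NumPy sketch. The toy data and the names `X_full`, `X_drop`, `b_a` and `b_b` are made up for illustration and do not come from the question. It shows that the design matrix with an intercept and both dummies has rank 2 rather than 3, that two different coefficient vectors give identical predictions, and that dropping one dummy leaves a model with a unique least-squares fit.

```python
import numpy as np

# Hypothetical toy data: x1 = 1 for Male, x2 = 1 for Female (assumed coding).
x1 = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
x2 = 1.0 - x1                      # the two dummies always sum to one
y = np.array([3.0, 3.2, 4.1, 3.9, 2.8, 4.0])

# Design matrix with an intercept column AND both dummies.
X_full = np.column_stack([np.ones_like(x1), x1, x2])
rank_full = np.linalg.matrix_rank(X_full)   # 3 columns, but only rank 2

# Two different (B0, B1, B2) vectors with the same B0+B1 and B2-B1.
b_a = np.array([1.0, 2.0, 3.0])
b_b = np.array([3.0, 0.0, 1.0])
same_predictions = np.allclose(X_full @ b_a, X_full @ b_b)

# Drop one dummy: intercept plus x2 only, i.e. y = alpha + beta * x2.
X_drop = np.column_stack([np.ones_like(x2), x2])
rank_drop = np.linalg.matrix_rank(X_drop)   # full column rank
coef, *_ = np.linalg.lstsq(X_drop, y, rcond=None)

print("rank with both dummies:", rank_full)
print("same predictions from b_a and b_b:", same_predictions)
print("rank after dropping x1:", rank_drop)
print("alpha, beta:", coef)
```

In the reduced model, `coef[0]` is the mean of `y` for the dropped category and `coef[1]` is the difference between the two group means, which is why the dropped category is often called the reference level.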

Reference: Dummy Variable Trap
