I have found a lot of resources online that explain the dummy variable trap and say that you should remove one category of your dummy variable before fitting a multiple linear regression model, to avoid multicollinearity.
While I understand what you should do, I don't understand why you should do it, in terms of a mathematical explanation.
I mean, let's take a concrete example: I have a variable Gender with values Male or Female. If I take the multilinear model equation I get:
$$y = B_0 + x_1B_1 + x_2B_2$$
with $x_1 = 1$ and $x_2 = 0$. So I get $y = 0 + 1\times1 + 0\times1$, so how is it different from $y = 0 + 1\times1$ (with the second dummy variable removed)?
Could someone give me a concrete mathematical example of how this "trap" works?
Thanks
Best Answer
We can see from Wikipedia that:
In your case, that means that
$$x_1 = 1-x_2$$ and hence, your equation becomes \begin{align} y &= B_0 + x_1B_1 + x_2B_2 \\ &= B_0 + B_1(1-x_2) + B_2 x_2 \\ &= (B_0+B_1) + (B_2-B_1)x_2 \end{align}
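So $(B_0+B_1) + (B_2-B_1)x_2$ has the form $\alpha + \beta x_2$, with $\alpha = B_0+B_1$ and $\beta = B_2-B_1$, which involves only one variable. The data can determine only $\alpha$ and $\beta$, so the three coefficients $B_0$, $B_1$, $B_2$ cannot be identified uniquely: infinitely many combinations of them give exactly the same fitted values.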
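The same point can be checked numerically. The following is only a minimal sketch using NumPy, with a made-up six-row Gender column (the data and the names `X_full`, `X_reduced`, `b_a`, `b_b` are assumptions for illustration, not part of the question). It shows that the design matrix with an intercept and both dummies has three columns but only rank 2, and that two different coefficient vectors sharing the same $B_0+B_1$ and $B_2-B_1$ give identical predictions.

```python
import numpy as np

# Hypothetical data: x1 = 1 for Male, 0 for Female; x2 is its complement.
x1 = np.array([1, 0, 1, 1, 0, 0])
x2 = 1 - x1
intercept = np.ones_like(x1)

# Design matrix with the intercept and BOTH dummies (the "trap").
X_full = np.column_stack([intercept, x1, x2])
# Design matrix after dropping one dummy.
X_reduced = np.column_stack([intercept, x2])

# The intercept column equals x1 + x2, so X_full is rank-deficient.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print("columns / rank (full):   ", X_full.shape[1], rank_full)
print("columns / rank (reduced):", X_reduced.shape[1], rank_reduced)

# Two different (B0, B1, B2) vectors with the same B0+B1 and B2-B1.
b_a = np.array([0.0, 1.0, 3.0])
b_b = np.array([1.0, 0.0, 2.0])
pred_a = X_full @ b_a
pred_b = X_full @ b_b
print("same predictions:", np.allclose(pred_a, pred_b))
```

Because both coefficient vectors fit any response equally well, least squares has no unique solution for the full model; dropping one dummy restores a full-rank design matrix.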
Reference: Dummy Variable Trap