Regression – How to Handle Multicollinearity in Linear Regression with Dummy Variables

multicollinearity, regression

First, a little background:

I'm a college paintball coach and I'm trying to identify which of my players have the biggest impact on various statistics (e.g., winning percentage).

In order to do this I've put together a CSV file with the following columns for each point that we play (you can use hockey as a mental model, with each point as a "shift"):

  • Win/Loss/Draw (-1 for loss, 0 for draw, 1 for win)
  • Dummy variable for team A (1 if we are playing them, 0 if we are not)
  • Dummy variables for each other team we play
  • Dummy variable for each player on our side during that point

The dummy variable for each team is so that we can separate the strength of the opposing team from the impact of each player.

For example, the headers would look like this:

WinLoss,P_1,P_2,P_3,P_4,P_5,T_1,T_2,T_3

If Game 1 vs Team 1 had the following outcome:

Point 1: Players 1, 2 and 3 lost
Point 2: Players 1, 2 and 4 won
Point 3: Players 3, 4 and 5 lost

then the data would look like this:

WinLoss,P_1,P_2,P_3,P_4,P_5,T_1,T_2,T_3
-1,1,1,1,0,0,1,0,0
1,1,1,0,1,0,1,0,0
-1,0,0,1,1,1,1,0,0

In the actual data set the players are in groups of 5 but the above gives the general format. We try to keep players together on the same "lines" as we assume that helps build both team rapport and communication.
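For concreteness, the three example points above could be built directly as an R data frame (a sketch only; the real file has more players and teams, and is read from CSV instead):

```r
# The three example points as a data frame, using the coding stated above
# (-1 loss, 0 draw, 1 win); the real data set has five players per point
mydata <- data.frame(
  WinLoss = c(-1, 1, -1),
  P_1 = c(1, 1, 0), P_2 = c(1, 1, 0), P_3 = c(1, 0, 1),
  P_4 = c(0, 1, 1), P_5 = c(0, 0, 1),
  T_1 = 1, T_2 = 0, T_3 = 0   # all three example points were against Team 1
)
```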

I then ran the below:

mydata <- read.csv('lines_data.csv')
wins2 <- lm(WinLoss ~ P_1 + P_2 + P_3 + P_4 + P_5 + T_1 + T_2 + T_3 + T_4,
            data = mydata)
summary(wins2)

I noticed recently that if I change the order of the players in the formula, I get different coefficient values for each player.

Some searching here on Cross Validated led me to this question.

It makes sense that there is a high degree of multicollinearity among the player dummy variables, since the players take the field in the "lines"/"shifts" mentioned above.

My question is: how do I account for this when running the regression? Do I just need more data? Do I need more dummy variables?

Thanks in advance.

Best Answer

Since you don't include any portion of your data set, it's hard to be sure precisely what is going on, but here's what I'm guessing:

If you include all the possible categories as dummy variables plus an intercept, as R does by default, then you have a perfectly multicollinear system. A unique set of coefficients can't be identified in this case, so R excludes one of the dummy variables from your regression. That excluded category becomes the reference group, now represented by the intercept, and all other coefficients are measured relative to it. Which dummy variable R decides to exclude depends upon the order; that's why you get different results based upon the ordering. If you add and subtract the right combinations of coefficients, you can move from one regression to another and see that you get exactly the same results---see here, for example.
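A small sketch with made-up toy data (not the paintball set) shows this behavior, using one three-level categorical variable expanded into all of its dummy columns:

```r
set.seed(1)

# Toy data: a three-level factor expanded into ALL three dummy columns
g <- factor(sample(c("A", "B", "C"), 30, replace = TRUE))
y <- rnorm(30)
d <- data.frame(y, model.matrix(~ g - 1))  # columns gA, gB, gC

# Intercept + all three dummies: gA + gB + gC = 1 in every row, so the
# design matrix is rank-deficient and lm() silently drops one column
fit <- lm(y ~ gA + gB + gC, data = d)
coef(fit)    # the dropped dummy's coefficient is reported as NA
alias(fit)   # shows which column is a linear combination of the others
```

`alias()` is a quick way to check which of your player/team dummies R found to be linearly dependent.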

No, you don't need more data, and you certainly don't need more dummy variables; in fact, you need fewer. Just exclude one of the categories from each dummy-variable group and you'll be fine. But do note that the reported coefficients depend upon which group you exclude (again, when you add the pieces correctly, you get exactly the same results).
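A sketch of that second point, again with made-up toy data: which level you drop changes the reported coefficients but not the fitted model.

```r
set.seed(1)

# Toy data: three-level factor, dummy columns gA, gB, gC
g <- factor(sample(c("A", "B", "C"), 30, replace = TRUE))
y <- rnorm(30)
d <- data.frame(y, model.matrix(~ g - 1))

fit_refA <- lm(y ~ gB + gC, data = d)  # level A is the reference group
fit_refC <- lm(y ~ gA + gB, data = d)  # level C is the reference group

# Different coefficients, but exactly the same fitted values:
all.equal(fitted(fit_refA), fitted(fit_refC))

# Adding the pieces correctly recovers the same contrast, e.g. B vs. A:
coef(fit_refA)["gB"]                         # estimates B - A directly
coef(fit_refC)["gB"] - coef(fit_refC)["gA"]  # (B - C) - (A - C) = B - A
```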
