Solved – Dummy variables with overlapping categories

categorical data, categorical-encoding, modeling, regression

I'm quite familiar with traditional dummy variable coding – code 1 for presence of the attribute and 0 for absence. A multi-category variable is then represented by a series of dummy variables while omitting 1 category as the reference, so for a variable with n categories I would include n-1 dummy variables.

Simple.

But what happens if I have overlap in my categories?

Here's a simple (slightly contrived) example to illustrate.

Let's say I'm looking at the effect of different sports on injury (a dichotomous outcome). There are 6 sports – football, baseball, basketball, soccer, lacrosse, and hockey.

Now, I know what you're thinking: these ARE mutually exclusive, so there is no overlap. True, I could represent these sports with 5 dummy variables and use one, say football, as the reference.

But instead I want to look at some facet of the sport that is related to injury. Put differently, it's not the 'sport' per se, but the actions involved in playing each sport. Some sports involve the same actions, so there is overlap.

I would like to have dummies such as the following:

  1. 'ball' is hard (baseball, hockey, lacrosse)
  2. all players wear helmets (lacrosse, hockey, football)
  3. floor/ground is hard (hockey, basketball)
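To make the coding concrete, here is a minimal sketch of how the three facets above could be turned into 0/1 columns. The names `FACETS`, `SPORTS`, and `facet_row` are hypothetical and chosen only for illustration; the sport-to-facet assignments are copied from the list above.

```python
# Hypothetical facet definitions, copied from the list in the question.
FACETS = {
    "hard_ball": {"baseball", "hockey", "lacrosse"},
    "helmets": {"lacrosse", "hockey", "football"},
    "hard_floor": {"hockey", "basketball"},
}

SPORTS = ["football", "baseball", "basketball", "soccer", "lacrosse", "hockey"]


def facet_row(sport):
    """Return one 0/1 indicator per facet, in the order the facets are defined."""
    return [int(sport in members) for members in FACETS.values()]


# One row of indicators per sport; a sport may have several 1s or none at all.
facet_table = {sport: facet_row(sport) for sport in SPORTS}

for sport, row in facet_table.items():
    print(f"{sport:<11}", row)
```

Unlike ordinary dummy coding, a row here can contain more than one 1 (hockey has all three facets) or none (soccer), so no reference category is being dropped.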

Now, I think I can do this so long as the dummy variables are not highly collinear. High collinearity would be tantamount to the so-called 'dummy variable trap', right? How would I check this? VIFs for the dummies? Condition number?
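Both checks can be sketched with NumPy alone. The example below assumes, purely for illustration, one observation per sport and the three facets listed above; `vifs` and `condition_number` are hypothetical helper names. The VIFs are read off the diagonal of the inverse correlation matrix, which equals 1 / (1 - R²) from regressing each column on the others. The condition number is taken on the unscaled design matrix with an intercept, which is a simplification.

```python
import numpy as np

# Columns: hard_ball, helmets, hard_floor.
# Rows (assumed, one per sport): football, baseball, basketball,
# soccer, lacrosse, hockey.
X = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
], dtype=float)


def vifs(X):
    """Variance inflation factor for each column of X."""
    corr = np.corrcoef(X, rowvar=False)
    # The j-th diagonal entry of the inverse correlation matrix is VIF_j.
    return np.diag(np.linalg.inv(corr))


def condition_number(X):
    """Condition number of the design matrix with an intercept column added."""
    design = np.column_stack([np.ones(len(X)), X])
    return np.linalg.cond(design)


print("VIFs:", vifs(X))
print("Condition number:", condition_number(X))
```

With one row per sport this gives VIFs of 1.125 for the first two facets and 1.0 for the third, so the overlap by itself does not create a collinearity problem in this toy layout. Real data with unequal numbers of observations per sport will give different values.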

Is there anything else I need to look out for? Anything I'm missing?

In my actual application I'm thinking of around 5 'facets' and there are well over 50 different categories. I can collapse these categories down into 5 or so catchall categories but I'd rather not do that for theoretical reasons that we don't need to get into at this point.

I could let the machine choose the 'dimensions' or 'facets' via exploratory factor analysis, but I have a very specific set of theoretical 'facets' that I wish to test, hence the preference for dummy variables of my choosing.

Best Answer

Question:

Can Dummy variables have overlapping categories?

Answer:

No.

Explanation:

Dummy variables arise when you try to recode Categorical variables with more than two categories into a series of binary variables. Since these categories partition your dataset (i.e. each observation can be assigned to one and only one of these 'k' categories), there is no way that there can be any "overlapping".

Now, with respect to the actual example you provide, there are two issues you should be aware of since they probably would otherwise screw up your analysis entirely:

  1. The binary variables which you describe are based, more or less, on arbitrary distinctions (for instance, would astroturf--more or less a rug covering concrete--really qualify as "soft" ground?).
  2. There's a good chance your model (as described in the OP) suffers from multicollinearity (that is, one or more of your independent variables is highly correlated with a linear combination of the others).
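Whether the second issue actually bites can be tested directly. The following is a rough sketch, assuming one row per sport and the three facets from the question; `has_exact_collinearity` is a hypothetical helper, and the extra "no helmets" column is invented here only to show what an exact dependence looks like.

```python
import numpy as np

# Columns: hard_ball, helmets, hard_floor.
# Rows (assumed, one per sport): football, baseball, basketball,
# soccer, lacrosse, hockey.
X = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
])


def has_exact_collinearity(X):
    """True if the intercept plus the columns of X are linearly dependent."""
    design = np.column_stack([np.ones(len(X)), X])
    return np.linalg.matrix_rank(design) < design.shape[1]


# Hypothetical redundant facet: "no helmets" is exactly 1 - helmets,
# so together with the intercept it recreates the dummy variable trap.
X_bad = np.column_stack([X, 1 - X[:, 1]])

print(has_exact_collinearity(X))      # the three facets from the question
print(has_exact_collinearity(X_bad))  # with the redundant facet added
```

A rank check only detects exact dependence. Near-dependence, which is the more likely problem with many categories and few facets, would still need something like VIFs or a condition number.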

Just something you should keep in mind the next time you run a regression... Anyway, hope this helps.
