Solved – Dummy variables for linear models with multiple levels

categorical-data, multilevel-analysis, regression

I'm currently working with data that has continuous variables and a hierarchical structure attached to it. Think of measuring the blood pressure, size and weight of different domestic animals (cats, dogs, birds), and also recording their species, family and order.

All data is measured on the level of the individuals, so there are no predictors on higher levels (although they could be generated by taking, e.g. the inter-level mean).

Let's say I want to predict the blood pressure ($y$) with the help of the weight ($x_1$) and the size ($x_2$).

Ignoring the hierarchical information, I could use a linear model $y= \beta_0 + x_1\beta_1 + x_2\beta_2$, which might be a very bad idea.

If I only had two different species measured, I could use one binary dummy variable $d$ and consider $y= \beta_0 + x_1\beta_1 + x_2\beta_2 + d \beta_3$.

If I wanted to consider $n$ species instead, would I need $n$ variables $d_1, \dots, d_n$ and the model
$$y= x_1\beta_1 + x_2\beta_2 + \sum_{i=1}^n d_i \beta_{i+2}?$$

Is this the right approach for dummy variables if there are more than two categories?

The last model doesn't account for the hierarchical structure. If I wanted to include the family as well (assuming for simplicity that there are only 2 families, but multiple species), would adding another binary predictor $\tilde d$ suffice?

$$y= x_1\beta_1 + x_2\beta_2 + \sum_{i=1}^n d_i \beta_{i+2} + \tilde d \beta_{n+3}$$

Is this the right approach to implement hierarchical dummy variables?

Edit: Removed the intercept, as suggested in Maarten Buis's answer.

Best Answer

Your first question: Close but not quite: you either need to leave one of your species out of the model as a reference category or you need to leave out the constant. In the former case each $\beta_{i+2}$ measures the difference between species $i$ and the reference species, so the parameter of the reference species is necessarily $0$, which means that the indicator variable for the reference species drops out of the model. In the latter case each $\beta_{i+2}$ is the constant for each species, which means that there is nothing left to do for the overall constant $\beta_0$ so it should drop out.
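To make the two options concrete, here is a minimal sketch in Python with NumPy. The data are simulated and every number in it (three species, the per-species constants, the slopes) is a made-up assumption for illustration, not something from the question. It builds the design matrix both ways and also shows what goes wrong when the constant and all $n$ indicators are included together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 species, 20 animals each (all values are illustrative).
n_species, n_per = 3, 20
species = np.repeat(np.arange(n_species), n_per)
x1 = rng.normal(size=species.size)          # weight (standardised)
x2 = rng.normal(size=species.size)          # size (standardised)
species_const = np.array([1.0, 2.5, -0.5])  # assumed per-species constants
y = (species_const[species] + 0.8 * x1 - 0.3 * x2
     + rng.normal(scale=0.1, size=species.size))

# One indicator column per species.
D = np.eye(n_species)[species]
ones = np.ones(species.size)

# (a) Constant plus indicators for all species except the reference (species 0).
X_ref = np.column_stack([ones, x1, x2, D[:, 1:]])
# (b) No constant, indicators for all species.
X_all = np.column_stack([x1, x2, D])
# (c) Constant AND all indicators: the indicator columns sum to the constant.
X_bad = np.column_stack([ones, x1, x2, D])

b_ref, *_ = np.linalg.lstsq(X_ref, y, rcond=None)
b_all, *_ = np.linalg.lstsq(X_all, y, rcond=None)

fitted_ref = X_ref @ b_ref
fitted_all = X_all @ b_all
rank_bad = np.linalg.matrix_rank(X_bad)

print("same fitted values:", np.allclose(fitted_ref, fitted_all))
print("columns in (c):", X_bad.shape[1], "rank of (c):", rank_bad)
```

Parameterisations (a) and (b) give identical fitted values; only the interpretation of the coefficients changes. In (a) the constant is the reference species' constant and each indicator coefficient is a difference from it, while in (b) each indicator coefficient is that species' own constant. Design (c) has one more column than its rank, which is why one term has to go.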

Your second question: No, all the information is already captured by the species indicator variables (a term I prefer over dummy variable), so there is nothing left for the family indicator variables to explain, and they will be automatically dropped from the model due to perfect multicollinearity.
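The redundancy can be checked numerically. The following is a small sketch in Python with NumPy under assumed data: four species nested in two families, where the mapping `family_of_species` is hypothetical. Because each species belongs to exactly one family, the family indicator is the sum of the indicators of the species in that family, so adding it does not raise the rank of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical nesting: species 0 and 1 belong to family 0, species 2 and 3 to family 1.
species = np.repeat(np.arange(4), 15)
family_of_species = np.array([0, 0, 1, 1])
family = family_of_species[species]

x1 = rng.normal(size=species.size)  # weight (illustrative)
x2 = rng.normal(size=species.size)  # size (illustrative)

D = np.eye(4)[species]                  # species indicators d_1, ..., d_4
d_family = (family == 1).astype(float)  # family indicator (d tilde in the question)

# No-constant model with all species indicators, then the same plus the family indicator.
X_species = np.column_stack([x1, x2, D])
X_both = np.column_stack([X_species, d_family])

rank_species = np.linalg.matrix_rank(X_species)
rank_both = np.linalg.matrix_rank(X_both)

# The family indicator is an exact linear combination of species indicators.
family_is_redundant = np.allclose(d_family, D[:, 2] + D[:, 3])

print("columns:", X_both.shape[1], "rank:", rank_both)
print("family indicator = d_3 + d_4:", family_is_redundant)
```

The rank stays the same after adding the family column, so its coefficient is not identified separately from the species coefficients. Regression software typically reacts by dropping one of the collinear columns or reporting a singular design.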
