Solved – How many parameters are in this model, and what should the sample size be

ancova, interaction, multiple regression, statistical-power

Q1 I have a multiple regression model with two continuous variables (Cont1 and Cont2) and one categorical variable (Cat1) with three levels.
If I run a model containing all possible interactions (so two and three way), how many parameters would that be in my maximal model?

e.g. Dependent~Cont1*Cont2*Cat1

I suppose my confusion is to do with how many parameters the interactions consist of. For example, since I want to include all two and three way interactions in the model, would the number of parameters be:

  • 5 for the direct effects (1 for each of Cont1 and Cont2, and 3 for Cat1 since it has three levels)
  • 3 for each of the two-way interactions with Cat1 (so Cont1*Cat1 = 3 AND Cont2*Cat1 = 3, as again there are three levels for Cat1)
  • 1 for the two-way interaction between Cont1 and Cont2
  • 6 for the three-way interaction (this is the one I am most unsure about)
  • 1 for the intercept term

?

Or is that incorrect?

Q2 I read that n=30 is a reasonable proxy for the sample size needed per variable in a model. Would I need n=30 for every parameter in the model, or simply for every direct effect (taking account of the fact that the categorical variable has three levels)? If it is per parameter, does this include n=30 for the intercept term too?

Apologies for the simple questions, am just getting myself a bit confused with the interaction terms/levels of the categorical!

Let me know if you need further info. (No data at the moment, this is hypothetical…). Thanks.

Best Answer

This is how it works. First, your categorical variable has to be entered as a set of dummy variables. Unless you have a very good reason to do otherwise, I recommend splitting the categorical variable into three dummies, one for each category. These dummies should be mutually exclusive (only one can be 1 for each data point) and exhaustive (at least one must be 1 for each data point).

It's very important that you do not fall into the dummy variable trap. It arises because, in your data matrix, the column for the constant and the columns for your 3 dummy variables are linearly dependent: the three dummies sum to the constant column. That perfect multicollinearity makes the determinant of $X'X$ equal to $0$, and a singular matrix cannot be inverted. To avoid this, include only two of the three dummies in the regression, which means their coefficients are interpreted relative to the category left out (if you have further doubts about using dummy variables for categorical variables, I recommend checking this: http://analyticstraining.com/2014/understanding-dummy-variable-traps-regression/).
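As a rough illustration of why the columns are linearly dependent, here is a minimal Python sketch. The category labels and the six observations are made up for the example (the question has no data); only the structure of the check matters.

```python
# Hypothetical category labels for six observations (not from the original post).
levels = ["a", "b", "c"]
cat1 = ["a", "b", "c", "a", "c", "b"]

# One row per observation: [constant, D_a, D_b, D_c].
rows = [[1] + [int(value == level) for level in levels] for value in cat1]

# Because the dummies are mutually exclusive and exhaustive, they sum to the
# constant column in every row, so the four columns are linearly dependent.
dependent = all(row[0] == sum(row[1:]) for row in rows)
print(dependent)

# Dropping one dummy (here D_c, which becomes the reference category)
# removes that exact dependence.
reduced_rows = [row[:-1] for row in rows]
```

With `D_c` dropped, the coefficients on `D_a` and `D_b` are read as differences from category `c`.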

So you would have a constant, 2 continuous variables, 2 dummy variables (representing your categorical variable) and an error term. Or:

$y = \gamma_0 + \rho_1 x_1 + \rho_2 x_2 + \gamma_1 D_1 + \gamma_2 D_2 + \epsilon$

Now, if for some reason you want to include all possible interaction terms, then the equation would look something like this:

$y = \gamma_0 + \rho_1 x_1 + \rho_2 x_2 + \gamma_1 D_1 + \gamma_2 D_2 + \theta_1 x_1 x_2 + \theta_2 x_1 D_1 + \theta_3 x_1 D_2 + \theta_4 x_2 D_1 + \theta_5 x_2 D_2 + \theta_6 x_1 x_2 D_1 + \theta_7 x_1 x_2 D_2 + \epsilon$

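So 12 parameters: the intercept plus 11 slope coefficients. Clearly $D_1D_2$ wouldn't make any sense, and neither would $x_1D_1D_2$ or $x_2D_1D_2$: the dummies are mutually exclusive, so $D_1D_2$ is always $0$.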
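The count can be double-checked with a short Python sketch. It assumes reference-level coding, so a categorical variable with k levels contributes k - 1 columns, and each interaction contributes the product of the column counts of the variables in it. The function name `count_parameters` is just an illustrative helper, not part of any library.

```python
from itertools import combinations

# Columns each predictor contributes once the reference level is dropped.
# Names follow the question; Cat1 has 3 levels, so it gives 2 dummy columns.
columns_per_predictor = {"Cont1": 1, "Cont2": 1, "Cat1": 3 - 1}


def count_parameters(columns):
    """Count coefficients per term in a full factorial model, intercept included."""
    counts = {"intercept": 1}
    names = list(columns)
    for order in range(1, len(names) + 1):
        for term in combinations(names, order):
            n = 1
            for name in term:
                n *= columns[name]  # an interaction multiplies the column counts
            counts[":".join(term)] = n
    return counts


counts = count_parameters(columns_per_predictor)
total = sum(counts.values())
print(counts)
print(total)  # 12
```

So the three-way interaction the question asks about takes 2 parameters, and each two-way interaction with Cat1 takes 2 as well, matching $\theta_2$ to $\theta_7$ in the equation.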

I don't know how much data you would need for this regression to be valid, but if sample size is a concern I would put the effort into reducing the number of parameters. For instance, do you have a reason to include all the interaction terms, or are you just trying to cover every possibility? Maybe some of these interaction terms don't make much sense given your model.

It's always important to be parsimonious when dealing with data constraints.