Solved – How many parameters are in this model, and what should the sample size be

ancova, interaction, multiple regression, statistical-power

Q1 I have a multiple regression model with two continuous variables (Cont1 and Cont2) and one categorical variable (Cat1) with three levels.
If I run a model containing all possible interactions (so two and three way), how many parameters would that be in my maximal model?

e.g. Dependent~Cont1*Cont2*Cat1

I suppose my confusion is to do with how many parameters the interactions consist of. For example, since I want to include all two and three way interactions in the model, would the number of parameters be:

5 for the direct effects (1 for each of Cont1 and Cont2, and 3 for Cat1 since it has three levels)
3 for each of the two-way interactions with Cat1 (so Cont1*Cat1 = 3 AND Cont2*Cat1 = 3, as again there are three levels for Cat1)
1 for the two-way interaction between Cont1 and Cont2
6 for the three-way interaction (this is the one I am most unsure about)
1 for the intercept term

Or is that incorrect?

Q2 I read that n=30 is a reasonable proxy for the sample size needed per variable in a model. Would I need n=30 for every parameter in the model, or simply for every direct effect (taking account of the fact that the categorical variable has three levels)? If it is per parameter, does this include n=30 for the intercept term too?

Apologies for the simple questions, am just getting myself a bit confused with the interaction terms/levels of the categorical!

Let me know if you need further info. (No data at the moment, this is hypothetical…). Thanks.

Best Answer

This is how it works. First, your categorical variable has to be entered as a set of dummy variables. Unless you have a very good reason to do otherwise, I recommend splitting the categorical variable into three dummies, one for each category. These dummies should be mutually exclusive (at most one can be 1 for each data point) and exhaustive (at least one must be 1 for each data point).

It's very important that you do not fall into the dummy variable trap. It arises because, in your data matrix, the columns corresponding to the constant and your 3 dummy variables are linearly dependent (the three dummies sum to the constant column). That perfect multicollinearity makes the determinant of $X'X$ equal to $0$, and a singular matrix cannot be inverted. To avoid this, include only two of the three dummies in the regression, which means that their coefficients are interpreted relative to the omitted category (if you have further doubts about using dummy variables for categorical variables, see: http://analyticstraining.com/2014/understanding-dummy-variable-traps-regression/).
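The linear dependence can be checked numerically. The following is a minimal sketch in base R, assuming a made-up three-level factor; the names `cat1`, `X_trap` and `X_ok` are hypothetical and not part of the question.

```r
# Hypothetical three-level categorical variable (12 observations)
cat1 <- factor(rep(c("a", "b", "c"), times = 4))

# All three dummies (no intercept), then a constant column added by hand
D <- model.matrix(~ cat1 - 1)
X_trap <- cbind(Intercept = 1, D)

ncol(X_trap)      # 4 columns ...
qr(X_trap)$rank   # ... but rank 3: the dummies sum to the constant column

# Default treatment coding: intercept plus two dummies, full column rank
X_ok <- model.matrix(~ cat1)
qr(X_ok)$rank == ncol(X_ok)   # TRUE
```

With the rank below the number of columns, $X'X$ is singular, which is the situation the paragraph above warns about.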

So you would have a constant, 2 continuous variables, 2 dummy variables (representing your categorical variable) and an error term. Or:

$$y = \gamma_0 + \rho_1 x_1 + \rho_2 x_2 + \gamma_1 D_1 + \gamma_2 D_2 + \epsilon$$

Now, if for some reason you want to include all possible interaction terms, then the equation would look something like this:

$$y = \gamma_0 + \rho_1 x_1 + \rho_2 x_2 + \gamma_1 D_1 + \gamma_2 D_2 + \theta_1 x_1 x_2 + \theta_2 x_1 D_1 + \theta_3 x_1 D_2 + \theta_4 x_2 D_1 + \theta_5 x_2 D_2 + \theta_6 x_1 x_2 D_1 + \theta_7 x_1 x_2 D_2 + \epsilon$$

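So 12 parameters (regression coefficients, including the constant). Clearly $D_1D_2$ wouldn't make any sense, and neither would $x_1D_1D_2$ or $x_2D_1D_2$: because $D_1$ and $D_2$ are mutually exclusive, their product is always $0$.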
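One way to double-check the count is to let R expand the formula from the question and count the columns of the design matrix. This is only a sketch on simulated data; the data frame `d` and its values are invented for illustration, and only the variable names come from the question.

```r
set.seed(1)
n <- 60

# Simulated stand-in data: two continuous predictors, one 3-level factor
d <- data.frame(
  Cont1 = rnorm(n),
  Cont2 = rnorm(n),
  Cat1  = factor(rep(c("a", "b", "c"), each = n / 3))
)

# Design matrix for the maximal model (intercept, main effects,
# all two-way interactions and the three-way interaction)
X <- model.matrix(~ Cont1 * Cont2 * Cat1, data = d)

colnames(X)   # one name per coefficient
ncol(X)       # 12
```

The column names make the breakdown visible: 1 intercept, 2 continuous main effects, 2 dummies, 1 + 2 + 2 two-way terms and 2 three-way terms.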

I don't know how much data you would need for this regression to be valid, but if sample size is a concern I would put effort into reducing the number of parameters. For instance, do you have a reason to include all the interaction terms, or are you just trying to cover every possibility? Maybe some of these interaction terms don't make much sense given your model.

It's always important to be parsimonious when dealing with data constraints.

Related Solutions

Linear Model – How Many Parameters in a Linear Model with Interaction

The model has 7 parameters because the 3-category categorical variable contributes 2 "main effect" parameters to the model (1 of the categories is omitted as the reference category). There will also be a parameter for the interaction between each of the two non-reference levels of the categorical variable and the continuous variable:

Continuous variable main effect
Quadratic effect
Category 1 main effect
Category 2 main effect
Continuous variable $\times$ category 1 interaction effect
Continuous variable $\times$ category 2 interaction effect
Intercept

Using your notation, the regression equation should be $$y=\beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 d_1 + \beta_4 d_2 + \beta_5d_1x_1 + \beta_6d_2x_1$$
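The same count can be reproduced in R by building the design matrix for this model. This is a sketch under assumed names: `x1` is the continuous variable and `g` a hypothetical three-level factor whose two non-reference levels play the role of $d_1$ and $d_2$.

```r
set.seed(2)
n <- 30

# Simulated stand-in data
dat <- data.frame(
  x1 = rnorm(n),
  g  = factor(rep(c("ref", "d1", "d2"), each = n / 3),
              levels = c("ref", "d1", "d2"))
)

# Linear and quadratic terms in x1, factor main effects,
# and the x1-by-factor interaction
X7 <- model.matrix(~ x1 + I(x1^2) + g + x1:g, data = dat)

colnames(X7)
ncol(X7)   # 7
```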

Solved – How should I model interactions between explanatory variables when one of them may have quadratic and cubic terms

None of those approaches will work properly. Approach 3 came close, but then you said you would prune out insignificant terms. This is problematic because collinearities make it impossible to find which terms to remove, and because it would give you the wrong degrees of freedom in hypothesis tests if you want to preserve type I error.

Depending on the effective sample size and signal:noise ratio in your problem I'd suggest fitting a model with all product and main effect terms, and interpreting the model using plots and "chunk tests" (multiple d.f. tests of related terms, i.e., a test for overall interaction, test for nonlinear interaction, test for overall effect including main effect + interaction, etc.). The R rms package makes this easy to do for standard univariate models and for longitudinal models when $Y$ is multivariate normal. Example:

library(rms)   # provides ols, rcs, Predict, bplot
# Note: Predict() needs a datadist object set up for the data beforehand
# Fit a model with splines in x1 and x2 and tensor spline interaction surface
# for the two.  Model is additive and linear in x3.
# Note that splines typically fit better than ordinary polynomials
f <- ols(y ~ rcs(x1, 4) * rcs(x2, 4) + x3)
anova(f)   # get all meaningful hypothesis tests that can be inferred
           # from the model formula
bplot(Predict(f, x1, x2))    # show joint effects
plot(Predict(f, x1, x2=3))   # vary x1 and hold x2 constant

When you see the anova table you'll see lines labeled All Interactions which for the whole model tests the combined influence of all interaction terms. For an individual predictor this is only helpful when the predictor interacts with more than one variable. There is an option in the print method for anova.rms to show by each line in the table exactly which parameters are being tested against zero. All of this works with mixtures of categorical and continuous predictors.

If you want to use ordinary polynomials use pol instead of rcs.

Unfortunately I haven't implemented mixed effect models.
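For readers without rms, the idea of a chunk test can be approximated in base R by comparing nested `lm` fits. This is a sketch on simulated data, not the rms `anova` output described above; the data frame `d`, the factor `g` and the coefficients used to simulate `y` are all made up.

```r
set.seed(42)
n <- 120

# Simulated data with no true interaction
d <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = factor(rep(c("a", "b", "c"), each = n / 3))
)
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(n)

fit_main <- lm(y ~ x1 + x2 + g, data = d)   # main effects only: 5 coefficients
fit_full <- lm(y ~ x1 * x2 * g, data = d)   # all interactions: 12 coefficients

# A single F test of all 7 interaction coefficients jointly
chunk <- anova(fit_main, fit_full)
chunk
```

Testing the interaction terms as one block avoids picking them off one at a time, which is the pruning problem raised at the start of this answer.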
