Solved – creating interaction term for dummy variables and categorical variables


I want to create an interaction term from a dummy variable and a categorical variable.

For example, suppose I want to create an interaction term between gender (0 = male, 1 = female) and
education level (0 = less than elementary, 1 = middle and high school, 2 = college or more).

Is it right to multiply these two terms?
0(male) x 0 (less than elementary) = 0
0(male) x 1 (middle and high) = 0
0(male) x 2 (college or more) = 0

All of the examples above give the same value, 0.

What is the right way to create the interaction term?

Best Answer

The multiplication scheme only works if you want to treat the education variable as a continuous or ordinal variable that is linearly related to your dependent variable after controlling for sex. In my experience, this linearity assumption seldom holds, because there is too much heterogeneity within some of the education categories.
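To make the linearity assumption concrete, here is a sketch of the model that the multiplication implies. The symbols $edu \in \{0, 1, 2\}$ (the numeric education code from the question), $\beta_0, \dots, \beta_3$ and $\varepsilon$ are notation introduced here for illustration only:

$$y = \beta_0 + \beta_1\, female + \beta_2\, edu + \beta_3\,(female \times edu) + \varepsilon$$

Under this model the female-minus-male difference at a given education level is $\beta_1 + \beta_3\, edu$, that is, $\beta_1$, $\beta_1 + \beta_3$ and $\beta_1 + 2\beta_3$ for the three levels. The gap is forced to change by the same amount $\beta_3$ from one level to the next. The product being 0 for every male is expected: males are the reference group, and their education effect is carried by $\beta_2$ alone.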

If you treat education as a categorical variable, the computation of interaction terms is a bit tricky. Generally, if you have two categorical variables: $x_1$ with $j$ levels and $x_2$ with $k$ levels, to completely model their interactions you'll need $(j-1)\times (k-1)$ dummies. Here are the possible schemes:

[Table of possible dummy-coding schemes; image not available]

Variable $female$ has two levels and variable $education$ has three, so to model the interaction you'll need $(2-1)\times(3-1) = 2$ more dummies on top of the dummies used for main effects.

For instance, if college-educated people are your reference group, then to model the main effects you need $female$, $D_{Elementary}$ and $D_{Middle}$. To model the interaction as well, you then need to add the products $female\times D_{Elementary}$ and $female\times D_{Middle}$, which is Scheme 1 in the table.
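The coding just described can be sketched in a few lines of Python. This is only an illustration: the function name `scheme1_row` and the dictionary keys are hypothetical, and the numeric education codes are assumed to be the ones given in the question.

```python
# Education codes follow the question: 0 = less than elementary,
# 1 = middle and high school, 2 = college or more (the reference group here).

def scheme1_row(female, education):
    """Return the five binary regressors for one person (college as reference)."""
    d_elementary = 1 if education == 0 else 0
    d_middle = 1 if education == 1 else 0
    return {
        "female": female,
        "D_Elementary": d_elementary,
        "D_Middle": d_middle,
        # Interaction terms are products of dummies, not of the raw 0/1/2 code.
        "female_x_D_Elementary": female * d_elementary,
        "female_x_D_Middle": female * d_middle,
    }

# One row per combination of sex and education level.
rows = {(f, e): scheme1_row(f, e) for f in (0, 1) for e in (0, 1, 2)}
for cell, row in rows.items():
    print(cell, list(row.values()))
```

Each of the six sex-by-education combinations gets a distinct pattern of zeros and ones, which is what lets the model tell the groups apart.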

Alternatively, if you use another level of the education variable as the reference group, you can change the scheme accordingly. Either way, you should end up with 5 binary independent variables. Together with the intercept, these five dummies allow you to estimate all 6 cell means (2 sexes by 3 education levels = 6 possible combinations).
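A small numeric check may help show why six parameters are enough for six means. The cell means below are made-up numbers used purely for illustration, and the variable names are hypothetical. The sketch assumes college-educated males as the reference cell and the education codes from the question.

```python
# Hypothetical cell means of the outcome, keyed by (female, education);
# education: 0 = less than elementary, 1 = middle and high school, 2 = college or more.
cell_means = {
    (0, 0): 20, (0, 1): 30, (0, 2): 50,
    (1, 0): 18, (1, 1): 31, (1, 2): 45,
}

# With college-educated males as the reference cell, every coefficient
# is a difference (or a difference of differences) of cell means.
b0 = cell_means[(0, 2)]                      # intercept: reference cell mean
b_female = cell_means[(1, 2)] - b0           # sex gap among the college-educated
b_elem = cell_means[(0, 0)] - b0             # education effect among males
b_mid = cell_means[(0, 1)] - b0
b_female_elem = (cell_means[(1, 0)] - cell_means[(0, 0)]) - b_female
b_female_mid = (cell_means[(1, 1)] - cell_means[(0, 1)]) - b_female

def predict(female, education):
    """Model prediction from the intercept and the five dummies."""
    d_elem = int(education == 0)
    d_mid = int(education == 1)
    return (b0 + b_female * female + b_elem * d_elem + b_mid * d_mid
            + b_female_elem * female * d_elem
            + b_female_mid * female * d_mid)

# The six parameters reproduce all six cell means.
predicted = {cell: predict(*cell) for cell in cell_means}
print(predicted == cell_means)
```

The two interaction coefficients measure how much the sex gap in each non-reference education level differs from the sex gap in the reference level.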


In practice, we rarely do this by hand. Most software packages let you declare the type of each variable so that the regression analysis handles it appropriately.

In SAS, look into the class statement in proc glm; in SPSS, check the factor and covariate panel in the glm module; in R, use the factor() or as.factor() functions to change the variable's type; in Stata, add the prefix i. before your independent variable.
