Solved – Dummy variables in two-way ANOVA

anova, mediation, two-way

I do not understand how to use dummy variables and the statistics underlying them. I don't need to apply this or use it in SPSS or any software.

I do know that they can be used for categorical variables. I know I need to use them for a research proposal because my independent variable is an intervention (treatment) with three conditions (experimental, active control and control). The other independent variable is the time of testing (pre-test, post-test, follow-up). Both need to be coded using dummy variables. I have two dependent variables, and one of them is a mediator. I am using a 3×3 two-way mixed repeated measures ANOVA and a multiple regression for partial mediation. This is a research proposal so I have no data except for what I wrote here.

Where can I find the formulas for this?

Best Answer

In regression analysis without categorical variables, it's straightforward to include numeric predictors in a regression model. For example, if we wanted to predict the weights of children in a school as a linear function of age, we could specify the following model:

\begin{eqnarray*} Y_{i} & = & \beta_{0}+\beta_{age}X_{age,i}+\epsilon_{i} \end{eqnarray*}

where $Y_i$ is the weight of the $i$th child, $X_{age,i}$ is the age of the $i$th child, $\epsilon_i$ is a random error term associated with the $i$th child, $\beta_0$ is an intercept parameter and $\beta_{age}$ is the slope parameter associated with the variable age. A fitted model for this might look like this:

\begin{eqnarray*} E[Y] & = & 35+3X_{age} \end{eqnarray*}

This model says that the expected weight of any student is $35$ plus $3$ times the student's age. So if the student's age is $7$, the expected weight of this student is $35 + 3\times 7 = 56$.
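As a quick check of the arithmetic, the fitted line can be evaluated directly. This is only a sketch: the function name `expected_weight` is made up for illustration, and the coefficients are the example values $35$ and $3$ from the fitted model above.

```python
def expected_weight(age, intercept=35.0, slope=3.0):
    """Expected weight under the example fitted line E[Y] = 35 + 3 * age."""
    return intercept + slope * age


# A 7-year-old student: 35 + 3 * 7
print(expected_weight(7))
```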

Now, let's say, instead of using age to predict weight, we were interested in predicting a student's weight based on his race. How could this be represented in mathematical terms? After all, we can't multiply categories of race with estimated regression coefficients. For example, how would this function make any sense if a student were black: $E[Y]=35 + 3 \times Black$, since it doesn't make sense to multiply "3 times black" or "3 times white" or any category for that matter?

Dummy coding is a way to handle this. Dummy variables are a simple way to "code" (or map or translate) categorical information or categorical representations in our dataset so that categorical groups can be represented in mathematical terms in a regression model. They also facilitate interpretation. If we have a categorical variable, for example, "Race," then we may have several different categories/levels of this variable, say, Black, White, and Asian. How can we create a regression model and include race as a predictor similar to the way we worked with age?

With dummy coding, we create new variables (the dummy variables) that are coded either zero (0) or one (1) to represent the categories. In general, when $c$ categories are present, we need $c-1$ dummy variables. In the race example, we have three race categories, so $c=3$. This means we need $c-1=3-1=2$ dummy variables to represent the $3$ racial categories in our regression model. We'll call the new variables $X_{black,i}$ and $X_{white,i}$ (if you are wondering where the Asian category went, hold tight: I'll explain shortly). Then we'll code the information in our dataset as follows:

\begin{eqnarray*} X_{black,i} & = & \begin{cases} 1 & \text{if the $i$th student is black}\\ 0 & \text{otherwise} \end{cases} \end{eqnarray*}

and

\begin{eqnarray*} X_{white,i} & = & \begin{cases} 1 & \text{if the $i$th student is white}\\ 0 & \text{otherwise} \end{cases} \end{eqnarray*}
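The two definitions above amount to a simple recoding of the data. The sketch below shows one way this could look in plain Python; the helper `dummy_code` and the five-student list are hypothetical, invented only to illustrate the coding, and Asian is treated as the category that gets no column of its own.

```python
def dummy_code(values, reference):
    """Return one 0/1 column per category, skipping the reference category."""
    levels = [lvl for lvl in sorted(set(values)) if lvl != reference]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}


# Hypothetical race labels for five students (illustrative only).
races = ["Black", "White", "Asian", "White", "Black"]
dummies = dummy_code(races, reference="Asian")

for level, column in dummies.items():
    print(level, column)
```

Three categories produce two columns, matching the $c-1$ rule. The Asian student in the third position has a $0$ in both columns.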

Using this dummy coding, a regression model for weight $Y_i$ would be:

\begin{eqnarray*} Y_{i} & = & \beta_{0}+\beta_{black}X_{black,i}+\beta_{white}X_{white,i}+\epsilon_i \end{eqnarray*}

and the corresponding response function might be something like:

\begin{eqnarray*} E[Y] & = & 35+5X_{black}+3X_{white} \end{eqnarray*}

where $\hat{\beta}_0=35$, $\hat{\beta}_{black}=5$, and $\hat{\beta}_{white}=3$. To interpret this model, it's instructive to write out the model that would be estimated for a black student. When a student is black, $X_{black}=1$ and $X_{white}=0$, so the response function becomes:

\begin{eqnarray*} E[Y] & = & 35+5\times1+3\times0=35+5=40 \end{eqnarray*}

Now, when a student is white, $X_{black}=0$ and $X_{white}=1$, so the response function becomes:

\begin{eqnarray*} E[Y] & = & 35+5\times0+3\times1=35+3=38 \end{eqnarray*}

If the student is Asian, then both $X_{black}=0$ and $X_{white}=0$, and the response function becomes:

\begin{eqnarray*} E[Y] & = & 35+5\times0+3\times0=35+0+0=35 \end{eqnarray*}

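As you can see, the Asian category is represented by the intercept alone, so we don't need any $X$-value coded $1$ to represent it. By coding $X_{white,i}=0$ and $X_{black,i}=0$, we are representing the Asian racial category, which serves as the reference (baseline) category. Each slope is then the difference between that group's expected weight and the expected weight of the reference group.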

So, with the dummy coding, a black student is expected to have a mean weight of $40$ pounds, a white student a mean weight of $38$ pounds, and an Asian student a mean weight of $35$ pounds.
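The three cases can be reproduced by plugging the dummy codes into the example response function. This is a minimal sketch using the illustrative coefficients from the text ($35$, $5$, and $3$); the function name `expected_weight_by_race` is an assumption made for this example.

```python
# Example coefficients from the response function E[Y] = 35 + 5*X_black + 3*X_white.
INTERCEPT = 35.0
BETA = {"Black": 5.0, "White": 3.0}  # Asian has no coefficient: both dummies are 0


def expected_weight_by_race(race):
    """Evaluate the example response function for one race category."""
    x_black = 1 if race == "Black" else 0
    x_white = 1 if race == "White" else 0
    return INTERCEPT + BETA["Black"] * x_black + BETA["White"] * x_white


for race in ["Black", "White", "Asian"]:
    print(race, expected_weight_by_race(race))
```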

As you can see, dummy coding allows the regression model's prediction to change depending on the category an observation belongs to. If you'd like to see some additional examples of how dummy coding works, this website has some excellent examples and explanations.

Lastly, it should be noted that you can use this type of coding universally in regression modeling, so it can be used with ANOVA models, mixed models, $3\times3$ factorial models, etc.
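For the $3\times3$ design in the question, the same idea gives two dummies for condition, two for time, and four interaction terms formed as products of a condition dummy and a time dummy. The sketch below builds those fixed-effect columns for each of the nine cells. Treating "control" and "pre-test" as the reference levels is an assumption made here, and the helper names are hypothetical. It does not model the repeated-measures (within-subject) structure, which a mixed model would handle with additional subject terms.

```python
from itertools import product

# The first level of each factor is treated as the reference (an assumption).
CONDITIONS = ["control", "active control", "experimental"]
TIMES = ["pre-test", "post-test", "follow-up"]


def dummies(value, levels):
    """0/1 indicators for every level except the first (the reference)."""
    return [1 if value == lvl else 0 for lvl in levels[1:]]


def design_row(condition, time):
    """Intercept, 2 condition dummies, 2 time dummies, 4 interaction products."""
    c = dummies(condition, CONDITIONS)
    t = dummies(time, TIMES)
    interactions = [ci * tj for ci, tj in product(c, t)]
    return [1] + c + t + interactions


rows = {(c, t): design_row(c, t) for c, t in product(CONDITIONS, TIMES)}
for cell, row in rows.items():
    print(cell, row)
```

Each of the nine cells gets a distinct row of nine entries, so a model with one coefficient per column can represent every cell mean separately.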
