Solved – How to control for categorical variable in regression

categorical data, categorical-encoding, regression

I'm trying to analyze two negatively-correlated variables, A and B (where A is the independent variable) while somehow taking into account a categorical variable C, with the intention of highlighting data that deviates above expected values.

For example, in the following subset of my data:

#, A, B, C
1, 14, 55, "X"
2, 12, 75, "X"
3, 10, 65, "X"
4, 14, 40, "Y"
5, 12, 30, "Y"
6, 10, 35, "Y"

Average:
A, B
14, 55
12, 60
10, 65    

I'd like to be able to highlight data point 2 because it deviates above the average value, but I'd also like to highlight data point 4, because although it deviates below the average value, it deviates above the expected value within its category.

I know how to do a simple linear regression on A and B, but I don't know how to account for the categorical variable.

Best Answer

In your example, the qualitative variable "C" has only two levels (possible values), so we can incorporate it into a regression model by creating an indicator, or dummy, variable that takes one of two numerical values.

Since B is the response and A is the numeric predictor, let $y = B$, $x_{1} = A$, and let $x_{2}$ be the indicator built from C. The linear regression model for observation $i$ is $$y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \epsilon_{i}$$

The indicator is defined from the variable "C" as

$$x_{i2} = \left\{ \begin{array}{l l} 1 & \quad \text{if observation $i$ is in category X}\\ 0 & \quad \text{if observation $i$ is in category Y} \end{array} \right.$$

Using it as a predictor gives one regression line per category:

$$y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \epsilon_{i} = \left\{ \begin{array}{l l} (\beta_{0} + \beta_{2}) + \beta_{1}x_{i1} + \epsilon_{i} & \quad \text{if observation $i$ is in category X}\\ \beta_{0} + \beta_{1}x_{i1} + \epsilon_{i} & \quad \text{if observation $i$ is in category Y} \end{array} \right.$$

Now $\beta_{0} + \beta_{1}x_{i1}$ is the expected value of B for category Y, while $(\beta_{0} + \beta_{2}) + \beta_{1}x_{i1}$ is the expected value of B for category X. The two lines share the slope $\beta_{1}$, and $\beta_{2}$ is the constant shift between them.
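To connect this back to the goal of highlighting points, here is a minimal sketch in Python using NumPy. It assumes, as the question states, that B is the response and A is the numeric predictor, and it uses only the six rows shown in the question. The variable names (`is_x`, `design`, `flagged`) are illustrative, and the rule "flag a point when its residual is positive" is one plausible reading of "deviates above the expected value within its category", not the only one.

```python
import numpy as np

# Subset of the data from the question: A (predictor), B (response), C (category).
A = np.array([14, 12, 10, 14, 12, 10], dtype=float)
B = np.array([55, 75, 65, 40, 30, 35], dtype=float)
C = np.array(["X", "X", "X", "Y", "Y", "Y"])

# Indicator (dummy) variable: 1 for category X, 0 for category Y.
is_x = (C == "X").astype(float)

# Design matrix with columns: intercept, A, indicator.
design = np.column_stack([np.ones_like(A), A, is_x])

# Ordinary least squares fit of B on A and the indicator.
beta, *_ = np.linalg.lstsq(design, B, rcond=None)

# Residual = observed B minus the value expected within the point's category.
residuals = B - design @ beta

# 1-based numbers of the points lying above their category's fitted line
# (assumption: "deviates above expected" means a positive residual).
flagged = [i + 1 for i, r in enumerate(residuals) if r > 0]

print("intercept, slope for A, shift for X:", beta)
print("points above their category's expected value:", flagged)
```

On this six-row subset the fit gives roughly an intercept of 42.5, a slope of -0.625 for A, and a shift of 30 for category X, and the flagged points are 2 and 4. With more data, a threshold on the residual (for example, a multiple of the residual standard deviation) may be a better highlighting rule than a plain sign check.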
