I am conducting research with a cross-sectional data set (1 year, multiple countries). I am studying the relationship between the likelihood of supportive leadership and firm size. Firm size is a categorical variable (1-4), and my teacher has advised me to create binary variables (0 or 1) for each category, excluding one (adding firm fixed effects). However, I don't understand how to do this.

Should I add i.firmsize? Or is there a better way to do this?

Thank you in advance!

## Best Answer

Your teacher is giving you a correct suggestion.

For categorical predictors, one usually defines a dummy variable which encodes the fact that one observation belongs to one category or another. This is a general approach, so I'll show it on a simple model with one fixed effect. Let's say that we have $N$ observations of one categorical predictor $X$ that may assume 3 values, $x_i \in \{X_1,X_2,X_3\}, \, i=1,\ldots,N$. We want to fit a linear model between $X$ and a continuous random variable $Y$. Then a linear model would be represented by the equation

$$ Y=X\beta+\epsilon $$

or

$$ y_i = \beta_0 + \beta_1 x_i + \epsilon, \quad i=1,\ldots,N $$

Now, since $X$ has three possible categories, plugging the category values directly into $x_i$ doesn't make sense: $X_1, X_2, X_3$ can be arbitrary numbers or labels, and we don't know how to multiply a category by a slope coefficient.

Instead, since we are interested in modelling the difference in average of $Y$ between the different categories, we can use a mathematical trick introducing the dummy variables. In our case, since we have 3 possible categories, we set one as reference (this will be modelled as the intercept of the model) and the other two as variables shifted by one unit:

$$ X_{dummy} = \pmatrix {0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ \vdots & \vdots & \vdots \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1} $$

where the first column is always equal to 0, the second column is equal to 1 for the samples (rows) with $x_i=X_2$ and the third column is equal to 1 for the samples with $x_i=X_3$. With this encoding the model becomes this:

$$ y_i=\beta_0 + \beta_1 * x^{(dummy)}_{i,1} + \beta_2 * x^{(dummy)}_{i,2} + \beta_3 * x^{(dummy)}_{i,3} + \epsilon $$

which means that if $x_i=X_1$, then $x^{(dummy)}_{i,1}=0, x^{(dummy)}_{i,2}=0, x^{(dummy)}_{i,3}=0$, giving the equation:

$$ y_i^{(X1)}=\beta_0 + \beta_1 * 0 + \beta_2 * 0 + \beta_3 * 0 + \epsilon = \beta_0 + \epsilon $$

if $x_i=X_2$, then $x^{(dummy)}_{i,1}=0, x^{(dummy)}_{i,2}=1, x^{(dummy)}_{i,3}=0$, giving the equation:

$$ y_i^{(X2)}=\beta_0 + \beta_1 * 0 + \beta_2 * 1 + \beta_3 * 0 + \epsilon = \beta_0 + \beta_2 + \epsilon $$

and if $x_i=X_3$, then $x^{(dummy)}_{i,1}=0, x^{(dummy)}_{i,2}=0, x^{(dummy)}_{i,3}=1$, giving the equation:

$$ y_i^{(X3)}=\beta_0 + \beta_1 * 0 + \beta_2 * 0 + \beta_3 * 1 + \epsilon = \beta_0 + \beta_3 + \epsilon $$
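The three cases above can be reproduced with a small sketch. It is plain Python rather than R, the function names `dummy_row` and `predict` are hypothetical, and the coefficient values are made up purely for illustration; the noise term $\epsilon$ is left out.

```python
# Category labels; the first one (X1) plays the role of the reference category.
LEVELS = ["X1", "X2", "X3"]


def dummy_row(category):
    """Return one row of X_dummy for a single observation.

    The first column (reference category X1) is always 0; the second and
    third columns are 1 when the observation belongs to X2 or X3.
    """
    return [0] + [1 if category == level else 0 for level in LEVELS[1:]]


def predict(category, beta0, betas):
    """Model mean beta0 + sum_j beta_j * dummy_j (noise term omitted)."""
    return beta0 + sum(b * d for b, d in zip(betas, dummy_row(category)))


# Hypothetical coefficients (beta_1, beta_2, beta_3). The value of beta_1
# never matters, because its dummy column is identically zero.
beta0 = 10.0
betas = [99.0, 2.0, 5.0]

for category in LEVELS:
    print(category, dummy_row(category), predict(category, beta0, betas))
# X1 [0, 0, 0] 10.0
# X2 [0, 1, 0] 12.0
# X3 [0, 0, 1] 15.0
```

Whatever value is chosen for $\beta_1$, the three predictions stay the same, since the first dummy column contributes nothing to the model.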

We can simplify by dropping the $\beta_1$ term, because its dummy column is always equal to 0, and then renaming the remaining coefficients ($\beta_2 \to \beta_1$ and $\beta_3 \to \beta_2$), getting the linear model

$$ y_i=\beta_0 + \beta_1 * x^{(dummy)}_{i,2} + \beta_2 * x^{(dummy)}_{i,3} + \epsilon $$

After seeing how the model encodes the 3 categories, it becomes easy to see how the parameters can be interpreted:

$$ y_i^{(X1)} = \beta_0 + \epsilon $$

the intercept $\beta_0$ represents the average $Y$ for the samples belonging to category $X_1$.

$$ y_i^{(X2)} = \beta_0 + \beta_1 + \epsilon $$

the first coefficient $\beta_1$ represents the average difference between the $Y$ of category $X_2$ and $X_1$ samples. And finally,

$$ y_i^{(X3)} = \beta_0 + \beta_2 + \epsilon $$

the second coefficient $\beta_2$ represents the average difference between the $Y$ of category $X_3$ and $X_1$ samples.

Important: is this the only way to encode the categories into a linear model? No. There are other ways, each of them requiring an appropriate change in the interpretation of the model parameters.

Practical aspects: if you use R, this is done automatically by setting the categorical variable as a factor. The resulting design matrix has an intercept column with all entries equal to 1, for the formulation $Y=X\beta$, with $\beta=(\beta_0, \beta_1, \beta_2)$.
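The interpretation of the coefficients can also be checked numerically. The following is a minimal sketch in plain Python (not R), using made-up toy data and hypothetical helper names (`dummy_row`, `solve`, `ols`, `group_mean`). It builds a design matrix with an intercept column of ones plus one dummy column per non-reference category, solves the least-squares normal equations, and compares the estimates with the group means.

```python
def dummy_row(category, levels):
    """Intercept (always 1) plus a 0/1 indicator for each non-reference level.

    levels[0] is treated as the reference category and gets no column.
    """
    return [1.0] + [1.0 if category == lvl else 0.0 for lvl in levels[1:]]


def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(XtX, Xty)


# Made-up toy data: two observations per category.
levels = ["X1", "X2", "X3"]
x = ["X1", "X1", "X2", "X2", "X3", "X3"]
y = [1.0, 3.0, 4.0, 6.0, 9.0, 11.0]

X = [dummy_row(c, levels) for c in x]
beta = ols(X, y)  # (beta_0, beta_1, beta_2)


def group_mean(level):
    """Average of y over the observations belonging to one category."""
    values = [yi for c, yi in zip(x, y) if c == level]
    return sum(values) / len(values)


print("beta_0 vs mean(X1):", round(beta[0], 6), group_mean("X1"))
print("beta_1 vs mean(X2) - mean(X1):", round(beta[1], 6),
      group_mean("X2") - group_mean("X1"))
print("beta_2 vs mean(X3) - mean(X1):", round(beta[2], 6),
      group_mean("X3") - group_mean("X1"))
```

With this encoding the intercept estimate coincides with the mean of the reference category, and each remaining coefficient coincides with the difference between a category mean and the reference mean, up to floating-point rounding. The same construction extends to a four-level variable such as firm size by passing four levels, which yields three dummy columns.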