Regression Analysis – Controlling for a Categorical Variable

controlling-for-a-variable, regression

I am running a regression with several control variables: Religion, Caste, Highest Education of an Adult in the Household, and Income Source. This might be an easy question, but I am confused about how to actually include them in the regression. For example, if I want to control for education, do I make a dummy variable for each grade, i.e. one for grade 1, one for grade 2, and so on up to university? And for income source, do I make one dummy variable for farmer, another for shopkeeper, and so on?

Also, is it better to create a dummy variable for each category, or to assign each category an integer code and keep a single 'income source' variable as the control?

I hope this question makes sense! Thank you!

Best Answer

You are right that categorical data can be encoded with dummy variables, but you only need $C-1$ dummy variables for $C$ levels, because a dummy for the remaining level would be redundant once the model has an intercept. There are different methods of encoding the levels with dummy variables, but the easiest to understand is "treatment coding". If you have, e.g., three income sources "capital", "labour", and "welfare", treatment coding uses two dummy variables as follows:

        labour welfare
capital      0       0
labour       1       0
welfare      0       1
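
The table above can be reproduced with a few lines of code. This is only a minimal sketch in plain Python; the function name `treatment_code` and the sample data are made up for illustration, and in practice statistical software builds these columns for you.

```python
def treatment_code(values, reference=None):
    """Return (dummy_names, rows) using treatment coding with C-1 dummies."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # default baseline: first level in sorted order
    # One dummy per non-reference level; the reference level gets all zeros.
    dummy_names = [lvl for lvl in levels if lvl != reference]
    rows = [[1 if v == name else 0 for name in dummy_names] for v in values]
    return dummy_names, rows


# Hypothetical observations of the "income source" variable.
income_source = ["capital", "labour", "welfare", "labour"]
names, rows = treatment_code(income_source, reference="capital")

print(names)
for source, row in zip(income_source, rows):
    print(source, row)
```

Each observation gets a 1 in at most one dummy column, and "capital" is identified by zeros in both.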

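In a linear regression, the intercept then describes the expected response for "capital" (the reference level, with all other predictors at zero), the intercept plus the coefficient of the dummy variable "labour" describes the expected response for "labour", and the intercept plus the coefficient of the dummy variable "welfare" describes the expected response for "welfare". Each dummy coefficient is therefore the difference between its level and the reference level.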
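
Written out for a model with only this one predictor, the interpretation looks as follows. The symbols $\beta_0, \beta_1, \beta_2$ and $D_{\text{labour}}, D_{\text{welfare}}$ are notation introduced here for illustration:

$$
y = \beta_0 + \beta_1 D_{\text{labour}} + \beta_2 D_{\text{welfare}} + \varepsilon
$$

$$
\begin{aligned}
E[y \mid \text{capital}] &= \beta_0 \\
E[y \mid \text{labour}] &= \beta_0 + \beta_1 \\
E[y \mid \text{welfare}] &= \beta_0 + \beta_2
\end{aligned}
$$

So $\beta_1$ and $\beta_2$ measure how far "labour" and "welfare" lie from "capital".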

Education can be treated the same way, although it can also be considered an ordinal variable. Encoding it as a category allows this variable to have a non-linear effect, i.e. the steps between successive levels need not be equal. In most situations this does no harm, unless you explicitly want to impose a linear effect.
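
To see the difference between integer codes and dummy variables, here is a small sketch in plain Python. The data are invented for illustration, and the model has education as its only predictor, so the dummy-coded fit is simply the mean of each group.

```python
from statistics import mean

# Hypothetical data: education level (0 = primary, 1 = secondary,
# 2 = university) and some outcome y. The numbers are made up.
education = [0, 0, 1, 1, 2, 2]
y = [9.0, 11.0, 11.0, 13.0, 19.0, 21.0]

# Integer coding: a single slope, so every step up in education
# shifts the fitted value by the same amount.
x_bar, y_bar = mean(education), mean(y)
slope = sum((x - x_bar) * (v - y_bar) for x, v in zip(education, y)) / sum(
    (x - x_bar) ** 2 for x in education
)
intercept = y_bar - slope * x_bar
linear_fit = [intercept + slope * level for level in (0, 1, 2)]

# Dummy coding: with no other predictors, the least-squares fitted value
# for each level is that level's group mean.
dummy_fit = [
    mean(v for x, v in zip(education, y) if x == level) for level in (0, 1, 2)
]

print(linear_fit)  # [9.0, 14.0, 19.0]
print(dummy_fit)   # [10.0, 12.0, 20.0]
```

With integer codes the fitted values are forced onto a straight line, while the dummy coding picks up the larger jump between the second and third level. For a variable with no natural order, such as income source, integer codes would impose an arbitrary ordering, so dummy variables are the usual choice there.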

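If you also want to model different slopes per category with respect to other variables, you can use "interaction terms", which statistical software lets you specify with a dedicated syntax. In R's formula notation, a * b is shorthand for a + b + a:b, i.e. the main effects plus the interaction, which gives level-dependent intercepts and level-dependent slopes, whereas a:b on its own adds only the interaction, i.e. level-dependent slopes with a common intercept.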
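
As a rough illustration of what an interaction adds to the model, the sketch below builds one row of the design matrix for a numeric predictor x crossed with income source under treatment coding. The function name `design_row` and the predictor x are hypothetical; this mirrors the column layout of a formula such as y ~ x * source, but it is not how R is implemented.

```python
def design_row(x, source, levels=("capital", "labour", "welfare")):
    """Columns: intercept, x, treatment dummies, and x-by-dummy products."""
    # The first level is the reference and gets no dummy column.
    dummies = [1 if source == lvl else 0 for lvl in levels[1:]]
    # Interaction columns: x multiplied by each dummy. Their coefficients
    # are the differences in slope relative to the reference level.
    interactions = [x * d for d in dummies]
    return [1, x] + dummies + interactions


# Column order: [1, x, labour, welfare, x:labour, x:welfare]
print(design_row(2.5, "capital"))
print(design_row(2.5, "labour"))
```

For a "capital" row only the intercept and x columns are non-zero, so it uses the baseline intercept and slope. For a "labour" row the labour dummy shifts the intercept and the x:labour column shifts the slope.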
