I have a data set that contains both categorical and continuous variables. I was advised to transform each categorical variable into binary variables, one per level (i.e., A_level1: {0,1}, A_level2: {0,1}); I think some call these "dummy variables".
With that said, would it be misleading to then center and scale the entire data set with the new variables? It seems as if I would lose the "on/off" meaning of the variables.
If it is misleading, does that mean I should center and scale the continuous variables separately and then re-add them to my data set?
TIA.
Best Answer
When constructing dummy variables for use in regression analyses, each category of a categorical variable except one should get a binary variable, so you should have e.g. A_level2, A_level3, etc. The category without a binary variable serves as the reference category. If you don't omit one of the categories and your model includes an intercept, the dummy variables always sum to 1, which makes them perfectly collinear with the intercept, and the regression coefficients cannot be uniquely estimated.
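As a rough illustration of the reference-category coding described above, here is a minimal Python sketch using only the standard library. The helper `dummy_code`, the variable `A`, and its level names are hypothetical and only meant to show the idea, not how any particular package does it.

```python
def dummy_code(values, reference=None):
    """Return (column_names, rows) of 0/1 indicators, omitting the reference level."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: default to the first level in sorted order
    kept = [lv for lv in levels if lv != reference]
    names = [f"A_{lv}" for lv in kept]
    rows = [[1 if v == lv else 0 for lv in kept] for v in values]
    return names, rows

# Hypothetical categorical variable A with three levels.
A = ["level1", "level2", "level3", "level2", "level1"]
names, rows = dummy_code(A, reference="level1")

# If every level kept its own indicator, each row would sum to 1,
# which duplicates the intercept column of a regression design matrix.
all_levels = sorted(set(A))
full_rows = [[1 if v == lv else 0 for lv in all_levels] for v in A]
row_sums = [sum(r) for r in full_rows]
```

Observations in the reference category (here `level1`) are the rows where all kept indicators are 0.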
If you use SPSS or R, I don't think scaling and centering the entire data set will generally be a problem when the categorical variables are stored as factors, since those software packages typically treat factors differently from numeric variables, but it may depend on how the variables are coded and on the specific statistical methods used. In any case, it makes no sense to scale and center binary (or categorical) variables, so if you must center and scale, apply it to the continuous variables only.
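To make the "continuous variables only" advice concrete, here is a small Python sketch, again using only the standard library. The column names, the data values, and the explicit `continuous` list are assumptions made up for illustration.

```python
from statistics import mean, stdev

def standardize(column):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

# Hypothetical data: one continuous column and two dummy columns.
data = {
    "income": [30.0, 45.0, 60.0, 52.0, 38.0],
    "A_level2": [0, 1, 0, 1, 0],
    "A_level3": [0, 0, 1, 0, 0],
}
continuous = ["income"]  # assumption: continuous columns are listed explicitly

# Standardize the continuous columns; pass the dummy columns through unchanged.
scaled = {
    name: (standardize(col) if name in continuous else list(col))
    for name, col in data.items()
}

# For comparison: standardizing a dummy column replaces 0/1 with two other
# values, so the "on/off" reading is no longer visible in the data.
scaled_dummy = standardize(data["A_level2"])
```

The dummy columns in `scaled` keep their 0/1 coding, while `scaled_dummy` shows what would happen if the whole data set were centered and scaled.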