Solved – Should I use dummy variables or just assign numerical values to categorical predictors in regression / PCA

categorical data, data mining, pca, regression

To apply PCA or linear regression to a data set, I understand that the explanatory variables (features) should have numerical values.

My current situation is that 43 of my 79 variables are categorical (factor) variables, which I intend to transform.

  1. 1st option: convert each categorical variable into a set of dummy variables, which of course gives many more features.

  2. 2nd option: assign a numerical value to each factor level, with no meaning attached to the order, e.g. convert a feature with the levels {"Interior", "Exterior"} into one with the levels {0, 1}.

What do I need to consider before using each of the options?

Or, how do I determine which option serves me best, or whether this is the correct approach at all?

My end goal is to predict a continuous variable using the explanatory variables (if that's something to consider).

Best Answer

For your example, where the variable has only two levels, the two approaches are the same: a single 0/1 variable is a dummy variable. When there are more than two levels, you will usually (almost always) want dummy variables, because there will not be a sensible numerical coding of the levels; arbitrary codes such as 0, 1, 2 impose an order and equal spacing that the categories do not have. And you don't need to code it yourself; every decent statistical package (R, SAS, SPSS, etc.) will do it for you, and most offer a few choices on how to do it (dummy coding, effect coding, Helmert coding).
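To make the difference concrete, here is a minimal sketch in plain Python of dummy (reference-level) coding next to arbitrary integer coding. The helper `dummy_code` and the three-level `material` variable are hypothetical, made up for illustration only; in practice you would let your statistics package build these columns for you.

```python
def dummy_code(values, drop_first=True):
    """Return (kept_levels, rows) of 0/1 indicators for a categorical list.

    With drop_first=True the first level (in sorted order) is the reference
    level and gets no column, so k levels yield k - 1 dummy variables.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Two levels: the single dummy column is just a 0/1 coding of the variable.
location = ["Interior", "Exterior", "Interior"]
cols2, rows2 = dummy_code(location)

# Three levels (hypothetical variable): two dummy columns, no implied order.
material = ["Brick", "Wood", "Stone", "Wood"]
cols3, rows3 = dummy_code(material)

# Arbitrary integer coding of the same variable. A regression would treat
# these codes as numbers, i.e. as if Wood were "twice" Stone.
integer_codes = {level: i for i, level in enumerate(sorted(set(material)))}

print(cols2, rows2)
print(cols3, rows3)
print(integer_codes)
```

With two levels the dummy column and the {0, 1} recoding carry the same information, so the model fit is the same either way. With three or more levels, the integer codes force a single slope across categories, while the dummy columns let each level have its own effect relative to the reference level.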
