Solved – Why convert categorical data into numerical using one hot encoding

dimensionality-reduction, pca, r

I don't have a very strong statistical background, and I'm new to data science…

Now, I am practicing PCA (Principal Component Analysis) for dimension reduction. This tutorial looks very complete, but one step confused me.
PCA Dimension Reduction Tutorial

Before using PCA in R or Python, the tutorial converts all categorical data to numerical data. It uses one-hot encoding, so that a column with several distinct values is split into separate columns. For example, if a column called Outlet_TypeSupermarket originally has 3 values (Type 1, Type 2, Type 3), after one-hot encoding it becomes 3 columns: Outlet_TypeSupermarket Type 1, Outlet_TypeSupermarket Type 2, Outlet_TypeSupermarket Type 3. They do this for each categorical column, then run PCA on all the generated columns.
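To make the encoding step concrete, here is a minimal sketch in Python with NumPy. The sample values are made up for illustration; only the column name comes from the tutorial. In practice a library helper such as pandas' `get_dummies` does the same job.

```python
import numpy as np

# Hypothetical values for the categorical column described in the question.
outlet_type = np.array(["Type 1", "Type 2", "Type 3", "Type 1", "Type 2"])

# levels: the sorted distinct categories; codes: the index of each row's category.
levels, codes = np.unique(outlet_type, return_inverse=True)

# Row i of the identity matrix is the 0/1 indicator vector for category i,
# so indexing by codes yields one indicator row per observation.
one_hot = np.eye(len(levels), dtype=int)[codes]

# One new column per category, named after the original column plus the level.
column_names = [f"Outlet_TypeSupermarket {lvl}" for lvl in levels]

print(column_names)
print(one_hot)
```

Each row contains exactly one 1, marking the category that row belonged to.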

Finally, in this case, even if PCA chooses the 30 most important components (important columns), it may be using only part of the original columns. For example, it may use only Outlet_TypeSupermarket Type 1 and Outlet_TypeSupermarket Type 2 from the original Outlet_TypeSupermarket.

Is this the right way to do dimension reduction? I thought the chosen columns would at least be complete columns from the original data set… If this is the correct way, could you tell me why?

Best Answer

PCA uses all original variables by design: each individual PC is a linear combination of all original dimensions, not a selection of some of the columns. Therefore, even after discarding some of the PCs obtained from PCA, the remaining PCs still contain information from all original variables.
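The following is a rough sketch of that point, assuming Python with NumPy and a small synthetic data set (one numeric column plus a one-hot encoded 3-level category, all invented for illustration). PCA is computed here via an SVD of the centered data, which is one standard way to do it and not necessarily what the tutorial uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: a 3-level category and a numeric column that depends on it.
levels = rng.integers(0, 3, size=n)              # category codes 0, 1, 2
dummies = np.eye(3)[levels]                      # one-hot columns, shape (n, 3)
shift = np.array([0.0, 1.0, 3.0])[levels]        # level-dependent offset
numeric = (rng.normal(size=n) + shift)[:, None]  # shape (n, 1)
X = np.hstack([numeric, dummies])                # shape (n, 4)

# PCA via SVD: rows of Vt are the principal directions (loadings).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                            # keep only the first 2 PCs
components = Vt[:k]                              # shape (k, 4)
scores = Xc @ components.T                       # shape (n, k)

# Each row holds one weight per input column, dummy columns included.
print(np.round(components, 3))
```

Every retained component carries a weight for each of the four input columns, so keeping 2 PCs out of 4 does not correspond to keeping 2 of the columns.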

From a mathematical point of view, PCA requires numerical data. Categorical variables don't have the required properties: relations between categories are not defined the way they are for numeric values (e.g. for a variable with possible levels A, B, and C, it is not defined whether A is, say, twice as big as B). Creating dummy variables from the categories solves this problem: each dummy variable can be treated as numeric (e.g. coded -1/1 or 0/1), which allows the PCA computation.
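As a small illustration of why the coding matters, the sketch below (Python with NumPy; the integer codes 1, 2, 3 are an arbitrary assumption) compares distances between the levels A, B, and C under a naive integer coding and under 0/1 dummy coding.

```python
import numpy as np

levels = ["A", "B", "C"]

# Naive integer coding: imposes an order and a spacing the categories do not have.
int_code = {"A": 1.0, "B": 2.0, "C": 3.0}
d_ab = abs(int_code["A"] - int_code["B"])
d_ac = abs(int_code["A"] - int_code["C"])

# 0/1 dummy coding: each level gets its own indicator vector.
one_hot = dict(zip(levels, np.eye(3)))
h_ab = np.linalg.norm(one_hot["A"] - one_hot["B"])
h_ac = np.linalg.norm(one_hot["A"] - one_hot["C"])
h_bc = np.linalg.norm(one_hot["B"] - one_hot["C"])

print("integer coding:", d_ab, d_ac)
print("dummy coding:  ", h_ab, h_ac, h_bc)
```

Under the integer coding, C ends up twice as far from A as B is, purely because of the chosen labels. Under dummy coding all pairs of levels are equally far apart, so no artificial ordering is fed into PCA.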