Solved – the advantage of transforming variables from nominal to ordinal/numeric when it reduces variance explained in CatPCA

Tags: categorical-data, optimal-scaling, pca

Context

I have a dataset of 8 categorical variables, and I want to apply Categorical Principal Component Analysis (CatPCA).

Before doing that, I was advised to look at the transformation plots of all these variables after scaling them at the nominal level. These plots show that some variables should be treated as ordinal (the transformation is non-decreasing) and some as numeric (the transformation follows a linear trend).

Now, when I compare the CatPCA with all variables scaled as nominal to the CatPCA with the newly transformed variables, there is a slight decrease in variance explained.

Question

  • If variance explained decreases after the transformations, what is the advantage of applying them?

Best Answer

The main reason applying the transformation is important is to avoid overfitting. In some contexts, there is also the related issue of model parsimony.

Nominal transformations permit a variable to be scaled in any way that maximises the variance explained in the sample. Thus, any constraint on how that scaling can occur will reduce the variance explained in the sample. Ordinal scaling is one such constraint: the transformed values have to preserve the order of the values of the untransformed variable. Numeric scaling tightens the constraint further by requiring equal numeric distances between adjacent categories.
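To make the hierarchy concrete, here is a minimal Python sketch of one optimal-scaling step for a single variable, assuming a given component score. The variable `x`, the score `z`, and the use of scikit-learn's isotonic regression for the monotone fit are illustrative assumptions, not the actual CatPCA implementation (which alternates such quantification steps with re-estimating the components):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=200)        # one categorical variable, 5 categories (0..4)
z = 0.8 * x + rng.normal(size=200)      # hypothetical component score to be fitted

categories = np.unique(x)
counts = np.array([(x == c).sum() for c in categories])

# Nominal: each category gets whatever quantification best fits z,
# i.e. the category mean of z -- no constraint at all.
nominal = np.array([z[x == c].mean() for c in categories])

# Ordinal: quantifications must preserve the category order. Weighted
# isotonic regression of the category means gives the least-squares
# monotone solution.
ordinal = IsotonicRegression().fit(categories, nominal,
                                   sample_weight=counts).predict(categories)

# Numeric: categories are treated as equally spaced numbers, so only a
# linear rescaling (slope and intercept) remains free.
slope, intercept = np.polyfit(x, z, deg=1)
numeric = intercept + slope * categories

print("nominal:", nominal.round(2))
print("ordinal:", ordinal.round(2))
print("numeric:", numeric.round(2))
```

Each step down the list has fewer free parameters, so the in-sample fit can only stay the same or drop, which is exactly the pattern in the question.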

However, all this freedom in nominal transformations can lead to serious overfitting. When overfitting occurs, the model will not predict well on data that were not used to fit it.

A simple way to train your intuition about overfitting is to split your sample in two, build the model in one half, and examine the fit in the other half. Given the large amount of freedom in most optimal scaling models, some form of cross-validation is particularly important when evaluating model fit.
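As a hedged illustration of that split-half check, here is a sketch using scikit-learn's ordinary PCA on one-hot (nominal) encoded data as a stand-in for CatPCA, which scikit-learn does not provide; the simulated data and the `variance_explained` helper are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
X_cat = rng.integers(0, 6, size=(300, 8))   # 8 categorical variables, 6 categories each

# One-hot encoding plays the role of unconstrained nominal scaling here.
X = OneHotEncoder().fit_transform(X_cat).toarray()
X_train, X_test = train_test_split(X, test_size=0.5, random_state=1)

pca = PCA(n_components=2).fit(X_train)      # build the model in one half only

def variance_explained(pca, X_half):
    """Share of this half's variance captured by the already-fitted components."""
    scores = pca.transform(X_half)          # project onto the fitted components
    return scores.var(axis=0).sum() / X_half.var(axis=0).sum()

print("in-sample VAF: ", round(variance_explained(pca, X_train), 3))
print("out-of-sample VAF:", round(variance_explained(pca, X_test), 3))
```

With purely random data, as here, any structure found in the training half is noise, so the out-of-sample variance accounted for drops noticeably; that gap is the overfitting this answer describes, and the same comparison can be run for nominal versus ordinal or numeric scalings of real data.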

Combining theory with an initial nominal-level analysis allows you to add constraints to the scaling. While these constraints may reduce the variance explained in the sample data, they should increase the variance explained in new samples.

The degree to which overfitting is a problem in optimal scaling depends on several factors:

  • Smaller samples have more problems with overfitting.
  • Models with many variables have more problems with overfitting.
  • Variables with more categories have more problems with overfitting.
  • Transformations with fewer constraints have more problems with overfitting (ordered from least to most constrained: nominal < ordinal < spline ordinal < numeric).