Solved – In general, does normalization mean to normalize the samples or features

categorical-data, machine-learning, nonlinearity, normalization, regression-strategies

I'm just getting into machine learning, and I have seen two conflicting practices for normalization. To be concrete, let's suppose that we have an $n \times d$ matrix containing our training data, where $n$ is the number of samples and $d$ is the number of features.

When people say that they normalize their data before running whatever algorithm, I have seen them do one of the following two things (illustrated in the sketch after the list):

  • normalize the columns of the data matrix so that $A_{1,i}^2 + A_{2, i}^2 + \cdots + A_{n, i}^2 = 1$ for each feature $i$
  • normalize the rows of the matrix so that each sample vector has the same norm
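A minimal numpy sketch of the two conventions (the random matrix is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))  # n = 5 samples, d = 3 features

# Option 1: scale each column so its sum of squares over the samples is 1
A_cols = A / np.linalg.norm(A, axis=0, keepdims=True)
print((A_cols ** 2).sum(axis=0))  # -> [1. 1. 1.]

# Option 2: scale each row so every sample vector has unit L2 norm
A_rows = A / np.linalg.norm(A, axis=1, keepdims=True)
print((A_rows ** 2).sum(axis=1))  # -> [1. 1. 1. 1. 1.]
```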

In general, when someone refers to normalization of data, which of these two are they referring to?

I was under the impression that it was the first one (seems to make the most sense to me), but looking at the documentation for sklearn's preprocessing library, it appears that the default behavior is the second one. This doesn't make sense to me.
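For reference, the distinction is easy to check directly: `sklearn.preprocessing.normalize` acts on rows (samples) by default, while per-feature scaling is handled by the scaler classes such as `StandardScaler`. A quick sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# normalize() defaults to axis=1: each row (sample) gets unit L2 norm
X_rows = normalize(X)  # equivalent to normalize(X, norm='l2', axis=1)
print(np.linalg.norm(X_rows, axis=1))  # -> [1. 1. 1.]

# Column-wise (per-feature) scaling is done with a scaler instead,
# e.g. StandardScaler: zero mean and unit variance per column
X_cols = StandardScaler().fit_transform(X)
print(X_cols.mean(axis=0), X_cols.std(axis=0))  # -> ~[0. 0.] [1. 1.]
```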

Best Answer

Normalization is much trickier than most people think. Consider categorical and nonlinear predictors. Categorical (multinomial; polytomous) predictors are represented by indicator variables and should not be normalized. For continuous predictors, most relationships are nonlinear, and we fit them by expanding the predictor with nonlinear basis functions. The simplest case is perhaps a quadratic relationship $\beta_{1}x + \beta_{2}x^2$. Do we normalize $x$ by its standard deviation and then square the normalized value for the second term? Or do we normalize the second term by the standard deviation of $x^2$?
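To make the ambiguity concrete, here is a small sketch (the skewed exponential predictor is just an assumed example) comparing the two choices; they produce genuinely different variables:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(size=1000)  # a continuous (and skewed) predictor

# Choice 1: standardize x, then square the standardized value
z = (x - x.mean()) / x.std()
quad_a = z ** 2

# Choice 2: form x^2 first, then standardize it by its own mean and SD
x2 = x ** 2
quad_b = (x2 - x2.mean()) / x2.std()

# The two "normalized" quadratic terms are correlated but not the same variable
print(np.corrcoef(quad_a, quad_b)[0, 1])  # high, but not 1
```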

Merely normalizing so that a column's sum of squares equals one, or dividing by the standard deviation, assumes that the predictor is one for which squaring deviations is the right thing to do. In general this works correctly only when the predictor has a symmetric distribution; for asymmetric distributions, the standard deviation is not an appropriate summary of dispersion. One might just as easily use Gini's mean difference or the interquartile range. It's all arbitrary.
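As a quick illustration of how arbitrary the choice is (the lognormal sample is just an assumed example), the three dispersion summaries can disagree substantially on skewed data:

```python
import numpy as np
from scipy.stats import iqr

rng = np.random.default_rng(2)
x = rng.lognormal(size=2000)  # strongly right-skewed

# Standard deviation: dominated by the long right tail
sd = x.std()

# Gini's mean difference: mean |x_i - x_j| over all pairs
# (the O(n^2) pairwise matrix is fine at this sample size)
gmd = np.abs(x[:, None] - x[None, :]).sum() / (len(x) * (len(x) - 1))

# Interquartile range: ignores the tails entirely
print(sd, gmd, iqr(x))  # three quite different answers for "the" dispersion of x
```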
