For what kind of features will standardization be helpful


I have found that for some datasets, mean removal and variance scaling help to fit a better model to the data, while for other datasets they do not.

On what kind of data will standardization be helpful?

Are there some guidelines for applying this?

Best Answer

First off, standardization usually is taken to be

  1. subtraction of the mean

  2. division by the standard deviation.

The result has mean 0 and standard deviation 1.
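As a minimal sketch of those two steps, using only Python's standard library: the function name `standardize` and the sample values are made up for illustration, and I have assumed the population SD (`statistics.pstdev`); swap in `statistics.stdev` if you want the sample SD.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: population SD. Use statistics.stdev for the sample SD.
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant variable (SD is zero)")
    return [(v - mean) / sd for v in values]


# Hypothetical data, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)

# The standardized values have mean 0 and SD 1 (up to rounding error).
print(statistics.fmean(z), statistics.pstdev(z))
```

Note that a constant variable has SD zero and cannot be standardized this way, which is why the sketch raises an error in that case.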

Dividing by the variance will be wrong for any variable that is not a pure number. One of the reasons for standardization is to remove any influence of the units of measurement. The standard deviation always has the same units of measurement as the variable itself, so division washes out those units. The variance has squared units (e.g. cm² for heights in cm), so dividing by it leaves a result that still depends on the units chosen.
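A quick way to convince yourself of the units point is to express the same hypothetical heights in centimetres and in inches and scale both ways. This is only a sketch; the helper `scale_by` and the data are invented for the illustration.

```python
import math
import statistics


def scale_by(values, divisor):
    """Subtract the mean, then divide by the given divisor."""
    mean = statistics.fmean(values)
    return [(v - mean) / divisor for v in values]


# Hypothetical heights, the same people measured in two units.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_in = [h / 2.54 for h in heights_cm]

# Dividing by the SD: the units cancel.
z_cm = scale_by(heights_cm, statistics.pstdev(heights_cm))
z_in = scale_by(heights_in, statistics.pstdev(heights_in))

# Dividing by the variance: the result still carries units (1/cm vs 1/inch).
v_cm = scale_by(heights_cm, statistics.pvariance(heights_cm))
v_in = scale_by(heights_in, statistics.pvariance(heights_in))

same_with_sd = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(z_cm, z_in))
same_with_variance = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(v_cm, v_in))
print(same_with_sd, same_with_variance)
```

With the SD as divisor the two sets of results agree; with the variance as divisor they differ by the conversion factor between the units.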

There is no reason in principle why e.g. subtraction of the median and division by the interquartile range or in general any scaling

(value - measure of level) / measure of scale

might not be useful, but using mean and SD is by far the most common procedure. The idea that the Gaussian or normal is a reference distribution often underlies this, but using measures of level and scale other than the mean and standard deviation would often be useful, especially if you were interested in simple methods for identifying outliers (a very big topic covered by many threads in this forum).
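For instance, the median and interquartile range version of that general scaling could be sketched as below. The name `robust_scale` and the data are hypothetical, and the quartiles come from `statistics.quantiles` with its default method; other quartile conventions will give slightly different IQRs.

```python
import statistics


def robust_scale(values):
    """Return (value - median) / interquartile range for each value."""
    median = statistics.median(values)
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("cannot scale: interquartile range is zero")
    return [(v - median) / iqr for v in values]


# Hypothetical data with one obvious outlier, for illustration only.
data = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 100.0]
scaled = robust_scale(data)
print(scaled)
```

Because the median and IQR are barely affected by the one extreme value, the bulk of the data stays within a narrow range after scaling and the outlier stands out clearly, which is the appeal of this variant for outlier identification.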

The answer to your general question is pretty much tautologous: standardization is useful whenever differences in level, scale or units of measurement would obscure what you want to see. If you are interested in relative variations, standardize first.

If you wanted to compare the heights of men and women, the units of measurement should be the same (metres or inches, whatever), and standardization is not required. But if the scientific or practical question requires comparing values relative to the mean, subtract the mean first. If it requires adjusting for different amounts of variability, divide by the standard deviation too.

Freedman, D., Pisani, R. and Purves, R. Statistics. New York: W.W. Norton (any edition) is good on this topic.