Solved – Rescaling vs Standardization of features

feature-selection, feature-engineering, normalization, standardization

Is there any general rule of thumb, or any justified rule, for choosing whether to scale a dataset using Rescaling (for each feature, subtract the min value and divide by the max – min) or Standardization (for each feature, subtract the mean value and divide by the standard deviation)?

Is there any simple test that I can run on my matrix X (where rows are instances and columns are features) to choose between Rescaling and Standardization?
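For concreteness, here is a minimal NumPy sketch of the two transforms applied column-wise to such a matrix X (the toy values are arbitrary, purely to pin down the formulas being asked about):

```python
import numpy as np

# Toy matrix: rows are instances, columns are features
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Rescaling (min-max): subtract the per-feature min, divide by (max - min)
X_rescaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: subtract the per-feature mean, divide by the per-feature std
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_rescaled)      # each column now lies in [0.0, 1.0]
print(X_standardized)  # each column now has mean 0 and standard deviation 1
```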

Best Answer

Disclaimer: the same question just popped into my mind and information is hard to find, so please excuse me if the following is incorrect. It is derived from reading a few resources and then developing the reasoning further myself (so there are not many references to give, but I think the reasoning below is quite sound statistically speaking).

Apparently it depends on your goal: frequentist (use standardization) or Bayesian (use feature scaling, aka normalization). Concretely, feature scaling normalizes the values into the range 0.0 to 1.0, so you get a probability-like value with clear boundaries. Standardization, on the other hand, "standardizes" your values using the standard deviation as the reference unit: the values have no fixed boundary (they can range from minus infinity to plus infinity), but a value of 1.0 means one standard deviation above the mean, 2.0 means two standard deviations, and so on. So standardization is a way to convert your values to z-scores (and if I am not mistaken, standardization is also called the z-transform in some fields).
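To make the z-score interpretation concrete, here is a small sketch using scikit-learn's MinMaxScaler and StandardScaler (assuming scikit-learn is available; the same can be done by hand as in the question):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single feature, as a column of 5 samples
x = np.array([[0.0], [5.0], [10.0], [15.0], [20.0]])

x_norm = MinMaxScaler().fit_transform(x)   # feature scaling: bounded in [0.0, 1.0]
x_std = StandardScaler().fit_transform(x)  # standardization: unbounded z-scores

print(x_norm.ravel())  # [0.   0.25 0.5  0.75 1.  ]
print(x_std.ravel())   # ~[-1.41 -0.71  0.    0.71  1.41]; one unit = one standard deviation
```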

This seems to matter when you want to do correlation analyses (a frequentist setting, where standardization is generally preferred), but for machine learning I have not found any real documentation about the impact, although normalization is more common in my experience.

If I had to guess the impact of normalization vs standardization, I think it would depend on what you want to measure:

  • if you have features with clearly defined boundaries, you should prefer normalization, as this equalizes the influence (weight) of each feature.
  • but if you have values with possibly extreme outliers, standardization may be more useful, so the common values still get reasonable scaled values (otherwise the common values get "squished" by the outliers' extreme values; e.g., if you have the values [0 1 2 1E100] and you normalize, you will get something around [0.0 0.0 0.0 1.0], which is probably not what you want; see the sketch after this list). It would also make it easy to apply winsorizing to trim the outliers first.
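
A quick sketch of that outlier effect, together with a simple clip-based winsorizing step (plain NumPy; the clipping cap of 2.0 is arbitrary and only illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 1e100])

# Min-max rescaling: the extreme outlier squishes the common values towards 0.0
x_rescaled = (x - x.min()) / (x.max() - x.min())
print(x_rescaled)  # ~[0.0 0.0 0.0 1.0]: only the outlier keeps a usable value

# Winsorize first by clipping to an (arbitrary, illustrative) upper cap, then rescale:
# the common values keep a meaningful spread
x_clipped = np.clip(x, None, 2.0)
x_rescaled_clipped = (x_clipped - x_clipped.min()) / (x_clipped.max() - x_clipped.min())
print(x_rescaled_clipped)  # [0.  0.5 1.  1. ]
```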

For other points of view on the subject, please see this answer for a more statistically oriented view, and the very interesting discussion here.
