Modeling – When to Adjust Variables Before Creating a Model

feature selection, mathematical-statistics, modeling, predictive-models, standardization

In what circumstances would you want, or not want, to scale or standardize a variable prior to model fitting? And what are the advantages and disadvantages of scaling a variable?

Best Answer

Standardization is all about the weights of the different variables in the model. If you standardize "only" for the sake of numerical stability, there may be other transformations that yield very similar numerical properties but have a different physical meaning that could be much more appropriate for interpretation. The same is true for centering, which is usually part of standardization.

Situations where you probably want to standardize:

  • the variables are different physical quantities
  • and the numeric values are on very different scales of magnitude
  • and there is no "external" knowledge that the variables with high (numeric) variation should be considered more important.
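As a minimal sketch of that case, the snippet below standardizes two variables that are different physical quantities on very different numeric scales. The data and the names (`standardize`, `body_mass_kg`, `tail_length_mm`) are made up for illustration, and the sample standard deviation (n - 1) is just one common choice of scale.

```python
import statistics


def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("constant variable: cannot standardize")
    return [(v - mean) / sd for v in values]


# Hypothetical data: two different physical quantities on very different scales.
body_mass_kg = [0.021, 0.025, 0.019, 0.030, 0.024]
tail_length_mm = [78.0, 85.0, 74.0, 92.0, 83.0]

# Before standardization the variances differ by several orders of magnitude,
# so a variance-driven model would be dominated by the tail length.
raw_var = [statistics.variance(body_mass_kg), statistics.variance(tail_length_mm)]

z_mass = standardize(body_mass_kg)
z_tail = standardize(tail_length_mm)

# After standardization both variables have unit variance, i.e. equal weight.
z_var = [statistics.variance(z_mass), statistics.variance(z_tail)]

print("raw variances:         ", raw_var)
print("standardized variances:", z_var)
```

Whether equal weight is what you want is exactly the modeling decision discussed here; the code only makes the effect visible.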

Situations where you may not want to standardize:

  • if the variables are the same physical quantity, and are (roughly) of the same magnitude, e.g.
    • relative concentrations of different chemical species
    • absorbances at different wavelengths
    • emission intensity (otherwise same measurement conditions) at different wavelengths
  • you definitely do not want to standardize variables that do not change between the samples (baseline channels): you'd just blow up measurement noise (you may want to exclude them from the model instead)
  • if you have such physically related variables, your measurement noise may be roughly the same for all variables while the signal intensity varies much more between them. That is, variables with low values have higher relative noise, and standardizing them would blow up that noise. In other words, you may have to decide whether you want relative or absolute noise to be standardized.
  • There may be physically meaningful reference values to which you can relate your measured values, e.g. instead of transmitted intensity use the percentage of transmitted intensity (transmittance T).
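To put a rough number on the noise argument, here is a small sketch with invented absorbance readings: one channel that carries signal and one baseline channel that is essentially noise. All values and names are hypothetical, and the equal noise level in both channels is an assumption.

```python
import statistics

# Hypothetical absorbance readings for 6 samples at two wavelengths.
# Assumption: the measurement noise is of similar absolute size in both channels.
signal_channel = [0.10, 0.35, 0.62, 0.88, 1.15, 1.40]           # varies with the analyte
baseline_channel = [0.011, 0.009, 0.012, 0.008, 0.010, 0.010]   # essentially noise only


def scale_factor(values):
    """Factor by which standardization multiplies every deviation (1 / sd)."""
    return 1.0 / statistics.stdev(values)


signal_gain = scale_factor(signal_channel)
baseline_gain = scale_factor(baseline_channel)

# How much more strongly the baseline channel (pure noise) is amplified
# than the channel that actually carries signal.
noise_amplification = baseline_gain / signal_gain

print(f"gain on signal channel:   {signal_gain:.1f}")
print(f"gain on baseline channel: {baseline_gain:.1f}")
print(f"relative amplification of the noise-only channel: {noise_amplification:.0f}x")
```

After standardization both channels have unit variance, so the noise-only channel enters the model with the same weight as the informative one.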

You may do something "in between", and transform the variables or choose the unit so that the new variables still have physical meaning but the variation in the numerical value is not that different, e.g.

  • if you work with mice, use body weight in g and length in cm (expected range of variation about 5 for both) instead of the base units kg and m (expected range of variation 0.005 kg and 0.05 m, which differ by one order of magnitude).
  • for the transmittance T above, you may consider using the absorbance $A = -\log_{10} T$
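Both "in between" options can be sketched in a few lines. The helper name `absorbance` and the example percentages are assumptions for illustration; the ranges of variation are the ones quoted above.

```python
import math

# Unit choice: the same ranges of variation expressed in different units.
weight_range_kg, length_range_m = 0.005, 0.05   # one order of magnitude apart
weight_range_g = weight_range_kg * 1000.0       # kg -> g
length_range_cm = length_range_m * 100.0        # m -> cm


def absorbance(transmittance):
    """A = -log10(T), with T given as a fraction in (0, 1]."""
    if not 0.0 < transmittance <= 1.0:
        raise ValueError("transmittance must be in (0, 1]")
    return -math.log10(transmittance)


# Hypothetical transmitted intensities in percent of the incident intensity.
transmitted_percent = [100.0, 50.0, 10.0, 1.0]
absorbances = [absorbance(p / 100.0) for p in transmitted_percent]

print("ranges in g and cm:", weight_range_g, length_range_cm)
print("absorbances:", [round(a, 3) for a in absorbances])
```

In both cases the transformed variable keeps a physical interpretation, which a z-score would not.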

Similar considerations apply to centering:

  • There may be (physically/chemically/biologically/...) meaningful baseline values available (e.g. controls, blinds, etc.)
  • Is the mean actually meaningful? (The average human has one ovary and one testicle)
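A small sketch of the difference between the two centering choices, using invented readings: centering on a measured control (blank) keeps the result interpretable as "signal above baseline", whereas centering on the sample mean only expresses values relative to this particular data set. All names and numbers are hypothetical.

```python
import statistics

# Hypothetical readings: blanks (controls) and actual samples.
control_readings = [0.052, 0.048, 0.050]
sample_readings = [0.31, 0.47, 0.75, 0.92]

# Centering on a physically meaningful baseline: the mean of the blanks.
baseline = statistics.fmean(control_readings)
centered_on_control = [r - baseline for r in sample_readings]

# Centering on the sample mean: values sum to zero, sign depends on the data set.
sample_mean = statistics.fmean(sample_readings)
centered_on_mean = [r - sample_mean for r in sample_readings]

print("centered on control:", [round(v, 3) for v in centered_on_control])
print("centered on mean:   ", [round(v, 3) for v in centered_on_mean])
```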