Solved – centering and scaling (standardizing) a variable: use population or sample standard deviation

standardization

For centering and scaling a variable (e.g. prior to a regression, or to a visualization), the standard procedure, of course, is to subtract the mean then divide by the standard deviation.

But is it considered preferable to use the population standard deviation (i.e. dividing by n) or the sample standard deviation (dividing by n-1)? Does it depend on the use case?

Interestingly, the standard R and Python functions seem to make different choices here. Python's sklearn.preprocessing.scale() uses population standard deviation; R's scale() uses sample standard deviation.
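The difference between the two conventions can be checked directly in NumPy, where the ddof argument controls the divisor (a small sketch; sklearn itself is not imported here, only its ddof=0 convention is mimicked):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(x)

# Population SD (divide by n) -- the convention sklearn.preprocessing.scale() uses
z_pop = (x - x.mean()) / x.std(ddof=0)

# Sample SD (divide by n-1) -- the convention R's scale() uses
z_samp = (x - x.mean()) / x.std(ddof=1)

# The two standardizations differ only by the constant factor sqrt(n / (n - 1))
assert np.allclose(z_pop, z_samp * np.sqrt(n / (n - 1)))
```

Since the two results differ only by a constant factor (which shrinks toward 1 as n grows), the choice is largely cosmetic for most downstream uses.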

(NOTE: there's a prior question here, but it pertains to a very specific psychological method, and the one answer isn't actually substantiated by anything.)

Best Answer

The short answer is 'it does not matter' in most cases. The goal of standardization is to put variables on (roughly) comparable scales. This is usually necessary because many statistical learning methods assume the inputs are comparably scaled, and without it some variables may numerically overwhelm others during model fitting.

The reason for dividing by the standard deviation is that many methods assume the variables are normally distributed, so the standard normal distribution $N(0,1)$ (variance of 1) happens to be a convenient ideal. But in most cases this target is arbitrary: you could scale to any sensible variance (i.e. to $N(0,a)$), and it would make no difference to your model's performance.
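For a scale-equivariant method like ordinary least squares, this is easy to demonstrate: rescaling a predictor by any constant leaves the fitted values unchanged, because the coefficient simply absorbs the reciprocal of that constant (a minimal sketch with simulated data; the constant 3.7 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(size=50)

def fit_predict(X, y):
    # Ordinary least squares with an intercept column
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

# Rescale the predictors by an arbitrary constant: predictions are identical,
# because the slope coefficients shrink by the same factor.
preds_orig = fit_predict(X, y)
preds_scaled = fit_predict(X * 3.7, y)
assert np.allclose(preds_orig, preds_scaled)
```

Note that this equivalence does not hold for regularized models (ridge, lasso), where the scale of each predictor does affect the penalty, which is exactly why those methods standardize first; but even there, the constant you standardize to is shared across variables and so does not matter.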

Thus, the choice of standard deviation estimator rarely matters, as noted in the scikit-learn documentation and in the answer to that prior question.

In addition, even if you are in a situation where the choice of estimator could make a slight difference (e.g. multiple samples standardized separately against different distributions), there is no such thing as the 'best' standard deviation estimate. The uncorrected estimate (divided by n) is actually the maximum likelihood estimate under normality, while the corrected estimate (divided by n-1) gives an unbiased estimate of the variance but is still biased for the standard deviation, because taking the square root reintroduces bias. (See the Wikipedia article on unbiased estimation of standard deviation for details.) As such, you should consult papers/guides on your method for their choice of estimator.
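The residual bias of the corrected estimator is easy to see by simulation: for normal data with n = 5, the expected value of the n-1 estimate of the standard deviation is about 0.94σ, not σ (a Monte Carlo sketch; the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.0
n = 5

# Draw many samples of size n and average the Bessel-corrected SD estimate
samples = rng.normal(0.0, sigma, size=(200_000, n))
mean_s = samples.std(axis=1, ddof=1).mean()

# Even with the n-1 correction, E[s] < sigma (Jensen's inequality on sqrt);
# under normality the exact factor for n = 5 is c4 ~= 0.94.
assert mean_s < sigma
```

The bias shrinks quickly as n grows, which is another reason the choice of divisor is immaterial for typical dataset sizes.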