Normalization of power-law distributed variables: Z-scores or Min-Max?

normalization, power law, standardization

I need to make a composite index from the sum of three power-law distributed variables, which vary on different scales and have different variances.
For each variable there are many observations with very low scores and few observations with high scores.

I need to normalize the variables to a common scale before summing them into a single score for the final index.
I'm considering two possibilities:

Min-Max Normalization

(Xi - min(X)) / (max(X) - min(X))

Standardization (Z-scores)

(Xi - mean(X)) / std(X)
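
For concreteness, the two candidate transformations can be sketched in plain Python as below. This is only an illustration: the function names and the sample `x` are made up, and `std(X)` is assumed to mean the sample standard deviation.

```python
import statistics

def min_max(values):
    """Rescale values to [0, 1] using the sample minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the sample mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample (n - 1) standard deviation
    return [(v - mean) / sd for v in values]

# Made-up heavy-tailed sample: many small values and one very large one.
x = [1, 1, 2, 2, 3, 5, 8, 1000]

x_minmax = min_max(x)
x_z = z_score(x)
```

With a sample like this, both transformations leave the small observations squeezed close together, because the single extreme value dominates max(X), mean(X) and std(X).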

Which solution is appropriate, given the power-law distribution of the three variables? Or are they both wrong? Why?

EDIT
Please have a look at an example of the distribution I am referring to:

[Image: example distribution of X]

I have three variables distributed like X and I need to normalize them before summing the three.

Best Answer

Since you wish to use the final sum score for forecasting, I recommend that you cut your dataset into a training and a testing sample. If there is a time dimension, put the last 20% into the test sample; if not, just take a random 20%.
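
A minimal sketch of such a split, assuming the observations sit in a plain Python list (sorted by time if there is a time dimension). The helper name `split_train_test` and its parameters are hypothetical, not from any library.

```python
import random

def split_train_test(rows, test_fraction=0.2, time_ordered=False, seed=0):
    """Return (train, test).

    If time_ordered is True, rows are assumed to be sorted by time and the
    last test_fraction of them becomes the test sample. Otherwise a random
    test_fraction is held out.
    """
    n_test = max(1, round(len(rows) * test_fraction))
    if time_ordered:
        return rows[:-n_test], rows[-n_test:]
    shuffled = rows[:]                      # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    return shuffled[:-n_test], shuffled[-n_test:]

rows = list(range(10))  # stand-in for ten observations in time order
train, test = split_train_test(rows, time_ordered=True)
train_rand, test_rand = split_train_test(rows, time_ordered=False)
```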

Then fit your model on the training sample twice: once with the variables min-max normalized and once with them standardized as z-scores. Predict into the test sample with each model. (Be careful to transform the test sample using the parameters, such as min, max, mean or SD, that you derived from the training sample, to mimic actual "production" use.) Finally, assess which model gives better forecasts, using whatever quality measure or loss function is relevant. You may well find that the differences in forecast quality between the two options are negligible.
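
The comparison could look roughly like the sketch below. Everything in it is an assumption for illustration: the data are made up, the "model" is a one-variable least-squares line on the composite index, and the loss is mean absolute error. Substitute your own model and loss function.

```python
import statistics

def fit_min_max(train):
    """Return a scaler that uses the training min and max."""
    lo, hi = min(train), max(train)
    return lambda v: (v - lo) / (hi - lo)

def fit_z_score(train):
    """Return a scaler that uses the training mean and sample SD."""
    mean, sd = statistics.mean(train), statistics.stdev(train)
    return lambda v: (v - mean) / sd

def composite_index(columns, fit_scaler, train_columns):
    """Sum the scaled variables row by row; scalers are fitted on training data only."""
    scalers = [fit_scaler(col) for col in train_columns]
    return [sum(s(v) for s, v in zip(scalers, row)) for row in zip(*columns)]

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def mean_abs_error(x, y, slope, intercept):
    return statistics.mean([abs(b - (slope * a + intercept)) for a, b in zip(x, y)])

# Made-up data: three heavy-tailed variables and a target to forecast.
train_cols = [[1, 2, 1, 3, 50, 2], [10, 12, 11, 400, 15, 10], [0.1, 0.2, 0.1, 0.3, 0.2, 9.0]]
train_y = [1.0, 1.5, 1.1, 6.0, 5.5, 7.0]
test_cols = [[2, 40], [11, 14], [0.2, 0.1]]
test_y = [1.2, 5.0]

results = {}
for name, fit_scaler in [("min-max", fit_min_max), ("z-score", fit_z_score)]:
    train_idx = composite_index(train_cols, fit_scaler, train_cols)
    test_idx = composite_index(test_cols, fit_scaler, train_cols)  # training parameters
    slope, intercept = fit_line(train_idx, train_y)
    results[name] = mean_abs_error(test_idx, test_y, slope, intercept)

print(results)  # test-sample error for each normalization option
```

Note that a min-max scaler fitted on the training sample can map test values outside [0, 1] when the test sample contains a value beyond the training range; that is expected when mimicking production use.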
