Solved – How to combine two measurements at different scales without affecting outliers

normalization, standardization, z-score

I have two sets of measurements on different scales:

  • Unemployment: range (1000 – 100,000)
  • Unemployment growth in a year: range (0 – 100)

I want to see the combined effect of these two measurements, as a kind of unemployment index.

My first idea was to normalise the values in both measurement sets using the min-max formula and then take the average of the normalised/scaled values. However, min-max scaling would affect the outliers in the sets, wouldn't it?

Another approach I was considering was to calculate the z-scores, determine the percentiles from the z-scores, and then average the percentiles across both sets.

Can anyone suggest a better way to combine measurements on different scales?

Best Answer

There are various possibilities that do/measure essentially different things. It's not so much a question of right or wrong but rather of what you want to achieve.

If you use percentiles (or ranks, which would be equivalent), it doesn't matter whether you compute them from the original values, from z-scores or from any other order-preserving 1:1 transformation. Percentiles will not be affected by outliers; however, they will reduce information.
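To illustrate the point about percentiles, here is a minimal sketch using only the Python standard library. The unemployment figures are made up, and `percentile_ranks` is a hypothetical helper written for this example (ties share their average rank). It checks that the percentile ranks are the same whether they are computed from the raw values, from z-scores or from logs, and that the extreme value is simply "the largest", however far away it is.

```python
import math
from statistics import mean, pstdev


def percentile_ranks(values):
    """Percentile rank of each value in (0, 100]; tied values share the average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [100.0 * r / n for r in ranks]


# Made-up unemployment counts with one extreme value.
unemployment = [1200, 3500, 4100, 5000, 98000]

mu, sigma = mean(unemployment), pstdev(unemployment)
z_scores = [(x - mu) / sigma for x in unemployment]
logs = [math.log(x) for x in unemployment]

raw_pct = percentile_ranks(unemployment)
print(raw_pct)                              # percentile ranks from raw values
print(percentile_ranks(z_scores) == raw_pct)  # same ranks from z-scores
print(percentile_ranks(logs) == raw_pct)      # same ranks from logs
```

The outlier of 98000 ends up at the top percentile, exactly where a value of 5001 would have ended up; that is the information loss mentioned above.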

If quantitative differences are important, you may lose that information by using percentiles. On the other hand, outliers may have undue influence if you use the quantitative information. To some extent this depends on how the data are distributed, what kind of outliers you have and how far they are from the rest. I suppose these are valid, correct values that are just atypical, rather than erroneous observations? Ultimately it is a subject-matter decision to what extent their outlyingness should actually count in the aggregation. As I wrote, you may want to ignore the quantitative information and use ranks or percentiles; however, if you use the quantitative information, you may just want to treat the outliers as appropriately outlying. Maybe your distributions are just skewed, in which case a transformation (such as log or square root) followed by standardisation could help?
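As a rough sketch of the "transform, then standardise" idea, the snippet below compares z-scores of made-up raw unemployment counts with z-scores of their logarithms. The data and the helper name `z_scores` are assumptions for illustration only; whether a log (or square root) is appropriate depends on the actual distribution.

```python
import math
from statistics import mean, pstdev


def z_scores(values):
    """Standardise to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


# Made-up, right-skewed unemployment counts; the last value is far from the rest.
unemployment = [1200, 3500, 4100, 5000, 98000]

z_raw = z_scores(unemployment)
z_log = z_scores([math.log(x) for x in unemployment])

# How much the four non-outlying observations differ from each other
# after standardisation, without and with the log transformation.
spread_raw = max(z_raw[:4]) - min(z_raw[:4])
spread_log = max(z_log[:4]) - min(z_log[:4])

print([round(z, 2) for z in z_raw])
print([round(z, 2) for z in z_log])
print(round(spread_raw, 2), round(spread_log, 2))
```

On these numbers the log transformation leaves the extreme value clearly the largest but lets the four typical values differ visibly from each other, whereas on the raw scale they are squeezed into almost the same z-score.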

The thing with min/max-standardisation is that extreme outliers will basically nullify the differentiation of the non-outlying values (you write "affect the outliers" but actually this rather affects the other observations). With z-scores (what I'd call unit variance standardisation) this still happens, but somewhat less. You could also standardise to unit MAD (mean absolute deviation from the median) and zero median, which gives the outliers smaller influence on the other observations (and makes the outliers even more outlying). But without knowing your data I don't know how much of a problem the outliers actually are.
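The comparison above can be sketched on made-up data: 19 typical unemployment values plus one extreme value, and a growth measure with no outliers. All names and numbers here are assumptions for illustration. MAD follows the definition given above (mean absolute deviation from the median); note that the abbreviation is also commonly used for the median absolute deviation, which would give different numbers. For each scaling, the sketch reports how large the differences among the typical unemployment values are relative to those among the growth values (a ratio near 1 would mean both measures contribute similarly to an average), and how far out the extreme value lands.

```python
from statistics import mean, median, pstdev


def min_max(values):
    """Scale to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def z_score(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


def robust(values):
    """Scale to zero median and unit mean absolute deviation from the median."""
    med = median(values)
    mad = mean([abs(x - med) for x in values])
    return [(x - med) / mad for x in values]


# Made-up data: 19 typical regions plus one extreme one (last position).
unemployment = [1000 + 500 * i for i in range(19)] + [100_000]
growth = [5 * i for i in range(20)]  # evenly spread, no outliers


def typical_spread(scaled):
    """Range of the scaled values over the 19 non-outlying rows."""
    part = scaled[:19]
    return max(part) - min(part)


ratios = {}
outlier_score = {}
for name, scale in [("min_max", min_max), ("z_score", z_score), ("robust", robust)]:
    u, g = scale(unemployment), scale(growth)
    ratios[name] = typical_spread(u) / typical_spread(g)
    outlier_score[name] = u[-1]
    print(f"{name:8s} ratio={ratios[name]:.3f} outlier={outlier_score[name]:.2f}")

# One possible index: the plain average of the two robustly scaled measures.
index = [(a + b) / 2 for a, b in zip(robust(unemployment), robust(growth))]
```

On this toy data the ratio rises from min-max to z-scores to the median/MAD scaling, while the scaled value of the extreme observation grows as well, which matches the description above. With very few observations the ordering between min-max and z-scores need not hold, because a single extreme value then dominates the standard deviation.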
