I also wonder what exact questions you have. I am guessing that you are wondering several things:
- Does your questionnaire (& associated scoring scheme) seem reasonable?
- Can you treat the score derived from the questionnaire as interval/ratio data?
- Is Spearman's correlation an appropriate test / measure of effect size?
If these are not your real questions, leave a comment or edit your question, and I will try to update this answer so that it is more helpful. I will address each possible question in turn.
Unfortunately, I have no substantive knowledge of your subject area, and not too much about psychometrics, either. One thing to think about when making a questionnaire is whether all questions are actually getting at the same underlying issue. I'm guessing your goal is to tap different aspects of whether therapists are aware of, and employ, correct procedures for working with bilingual clients. To the extent that all items (i.e., questions) are measuring that, combining them is quite reasonable. This can be checked with factor analysis, although I don't know if that's within your skill set. Another possibility is to make several different 2-way cross-tabulation tables and see if most of the counts lie on or near the diagonal; this would mean that therapists who give a higher rating in response to one question also give a higher rating in response to another. Spearman's correlations could serve this purpose as well; again, you are looking for strong, positive correlations. Finding that the items all correlate with each other suggests that treating them as one construct (i.e., combining them) is reasonable. Others on CV have more expertise with psychometrics; perhaps they will weigh in with better advice.
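To show what the correlation check looks like in practice, here is a minimal R sketch. The data are simulated stand-ins for your item responses (four items scored 1-5, built to share a common "latent" rating), not your actual questionnaire:

```r
# Simulated stand-in data: 30 respondents, 4 Likert items scored 1-5,
# all driven by a shared underlying rating plus small per-item wobble
set.seed(1)
base  = sample(1:5, 30, replace = TRUE)
noise = function() sample(-1:1, 30, replace = TRUE)
items = data.frame(q1 = base,
                   q2 = pmin(pmax(base + noise(), 1), 5),
                   q3 = pmin(pmax(base + noise(), 1), 5),
                   q4 = pmin(pmax(base + noise(), 1), 5))
# Inter-item Spearman correlations; strong positive values suggest
# the items tap one construct and can reasonably be combined
round(cor(items, method = "spearman"), 2)
```

With real data you would replace the simulated columns with one column per question and look for a matrix of clearly positive correlations.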
Regarding the scoring scheme, I think it makes perfect sense to combine the last three questions first, so that that issue doesn't outweigh the others. One suggestion I would make is to take the average of the scores. In other words, divide the final sum by 3 (giving a final score of 4 for your example; n.b., this means averaging twice or you would lose the downweighting of the last 3 items). This shouldn't have any substantive effect, but it does put the score back in the original scale, which can make it clearer and more interpretable.
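To make the double averaging concrete, here is a small R sketch with made-up responses (five standalone items plus three related items; the numbers are invented, not from your questionnaire):

```r
# Made-up responses: five standalone items and three related items
standalone = c(4, 5, 3, 4, 5)
related    = c(2, 3, 4)
# First average the three related items so they count as a single item...
combined = c(standalone, mean(related))
# ...then average again to put the total back on the original 1-to-5 scale
score = mean(combined)
score  # 4
```

Averaging the related items first keeps that one issue from counting three times; averaging again returns the final score to the original response scale.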
Whether or not Likert items (i.e., questions) and Likert scales (i.e., questionnaires) can be treated as interval data has long been a controversial issue (they are most certainly not ratio). As I understand it, there are a few principles: (a) items with more response levels are likely to be closer to equal-interval; (b) combining multiple items into a scale makes it likelier to be equal-interval; (c) using numbers instead of just words primes respondents to think of the intervals as equal and respond accordingly; and (d) whether or not the values are interval is mostly a matter of theoretical belief, not something that can be tested and determined in practice. My guess would be that your scale may be interval enough for your purposes. Ultimately, it's still the case that using ordinal logistic regression would be ideal, but I don't know if you have that in your toolbox. At least two different questions on CV have treated issues pertaining to Likert scales and may be worth your time.
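If ordinal logistic regression is within reach, it is only a few lines in R. Here is a sketch with invented scores (the group labels, difference, and scale are all made up for illustration), using `MASS::polr`, which fits a proportional-odds model:

```r
library(MASS)  # ships with R; polr() fits a proportional-odds model
set.seed(42)
group = factor(rep(c("bilingual", "monolingual"), each = 25))
# Made-up ordinal questionnaire scores on a 1-5 scale,
# slightly higher on average for one group
raw   = round(rnorm(50, mean = ifelse(group == "bilingual", 3.8, 3.0), sd = 1))
score = ordered(pmin(pmax(raw, 1), 5))
# Hess = TRUE stores the Hessian so summary() can give standard errors
fit = polr(score ~ group, Hess = TRUE)
summary(fit)  # the group coefficient is on the log-odds scale
```

The single `group` coefficient is the (proportional) log odds of giving a higher response in one group versus the other.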
I'm not sure that I would use Spearman's correlation for your main analysis. I suspect you have two groups: monolingual therapists and bilingual therapists. (If you have some therapists who speak more than two languages, you would have to decide what to do with them, but if you just have one or two, I would just put them in a '>1 language' group.) This suggests that something like a t-test, or a Mann-Whitney U-test if possible (which I believe is equivalent to ordinal logistic regression when there are only two groups), makes the most sense. (If you'll be using SPSS, this tutorial may help with the U-test.)
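If R is an option instead of SPSS, the U-test is one line; here is a sketch with invented scores for the two groups (replace them with your computed questionnaire scores):

```r
# Hypothetical questionnaire scores for the two groups of therapists
mono = c(3.2, 2.8, 4.0, 3.5, 3.0, 2.5)
bi   = c(4.2, 3.8, 4.5, 3.6, 4.8, 4.1)
# Mann-Whitney U-test (R calls it the Wilcoxon rank-sum test)
wilcox.test(bi, mono)
```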
To assess the magnitude of the effect, I would suggest you use a standardized mean difference. This is often referred to as 'Cohen's d'. It is simply the difference between the two means, divided by the pooled standard deviation:
$$
d=\frac{\bar{x}_2-\bar{x}_1}{SD_{pooled}}
$$
$d$ gives you a measure of how far apart the two means are, in units of the population's standard deviation.
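A sketch of the computation with invented scores for two groups (swap in your own data); the pooled SD weights each group's variance by its degrees of freedom:

```r
# Hypothetical scores for two groups
x1 = c(3.2, 2.8, 4.0, 3.5, 3.0, 2.5)
x2 = c(4.2, 3.8, 4.5, 3.6, 4.8, 4.1)
n1 = length(x1); n2 = length(x2)
# Pooled SD: each group's variance weighted by its degrees of freedom
sd_pooled = sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
# Cohen's d: difference in means in units of the pooled SD
d = (mean(x2) - mean(x1)) / sd_pooled
d
```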
First, it is unlikely that several different tests of inhibition would be totally uncorrelated across the population of your subjects. Because the tests are correlated, their z-scores are not completely independent. So if Evans (1996) says you would need to know the correlations to get a meaningful composite z-score, that is correct.
Second, as far as I can see, the link is assuming that the four z-scores are completely independent. Suppose we use that independence assumption to get a combined score that weights each of the tests equally.
Then we have four independent random variables $Z_1, Z_2, Z_3, Z_4$ each with $E(Z_i) = 0,$ and $Var(Z_i) = SD(Z_i) = 1.$ Let $A = \frac 14\sum_i Z_i.$
Then
$$E(A) = E\left(\frac 14 \sum Z_i\right) = \frac 14 E\left(\sum Z_i\right)\\ = \frac 14 \sum E(Z_i) = \frac 14(0+0+0+0) = 0.$$
And
$$V(A) = V\left(\frac 14 \sum Z_i\right) = \frac{1}{16} V\left(\sum Z_i\right)\\ = \frac{1}{16}\sum V(Z_i) = \frac{1}{16}(1+1+1+1) = \frac{4}{16} = \frac 14.$$
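So under independence, $SD(A) = \sqrt{1/4} = 1/2.$ If the $Z_i$ are correlated, the covariance terms no longer vanish, which is exactly why the correlations are needed for a meaningful composite:
$$V(A) = \frac{1}{16}\left(\sum_i V(Z_i) + 2\sum_{i<j} \operatorname{Cov}(Z_i, Z_j)\right) = \frac{1}{16}\left(4 + 2\sum_{i<j}\rho_{ij}\right),$$
since $\operatorname{Cov}(Z_i, Z_j) = \rho_{ij}$ when each $Z_i$ has unit variance.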
Addendum: Here are data simulated in R for four positively correlated tests
administered to 50 subjects.
set.seed(2019); n = 50
v1 = rnorm(n,50,3); v2 = rnorm(n,60,4)
v3 = rnorm(n,40,2); v4 = rnorm(n,50,2)
w = rnorm(n,0,3)
x1 = v1+w; x2 = v2+w; x3 = v3+w; x4 = v4+w
MAT = cbind(x1,x2,x3,x4)
cor(MAT)
x1 x2 x3 x4
x1 1.0000000 0.5622012 0.6479422 0.6513025
x2 0.5622012 1.0000000 0.5262790 0.6410636
x3 0.6479422 0.5262790 1.0000000 0.6738916
x4 0.6513025 0.6410636 0.6738916 1.0000000
Means and standard deviations of the scores for the 50 subjects are computed below, and from them the z-scores of each subject's average relative to the rest of the group of 50.
a = rowMeans(MAT); s = apply(MAT, 1, sd)  # each subject's mean and SD across the four tests
z = (a - mean(a))/sd(a)                   # z-score of each subject's average within the group
Subject #27 had the lowest such z-score (-2.06), which (not surprisingly) puts that subject at about the 2nd percentile.
Also, #27's scores on the four tests are shown below, followed by the
corresponding individual z-scores relative to the group of 50, and percentages in a normal population below these z-scores.
z.27 = (min(a)-mean(a))/sd(a); z.27; pnorm(z.27)
[1] -2.06086
[1] 0.01965821
MAT[27,]
x1 x2 x3 x4
39.49311 51.58849 33.49084 44.05014
(MAT[27,]-mean(a))/sd(a)
x1 x2 x3 x4
-2.8288067 0.6598278 -4.5600233 -1.5144369
round(pnorm((MAT[27,]-mean(a))/sd(a)),4)
x1 x2 x3 x4
0.0023 0.7453 0.0000 0.0650
Thus, in effect, one way to derive the z-score $-2.06$ as a 'combination' of the z-scores $-2.83, 0.66, -4.56,$ and $-1.51$ is to use this subject's individual exam scores in the context of the other 49 subjects.
I interpret your approach as fitting a simple model to existing data. Subsequently you apply the fitted model to predict parameters for new products, which is perfectly OK.
Your problem goes under the name inverse estimation or inverse prediction. Only a few threads here on CV deal with this topic, possibly because methods for inverse estimation are less commonly used (and taught) than ordinary regression methods. Typically the topic concerns normally distributed data with no replicates for $i$ and no experimental design like yours, as in this post with an answer by @kjetilbhalvorsen. You can find an introduction to the topic in Greenwell and Kabban (2014), "investr: An R Package for Inverse Estimation" (The R Journal, Vol. 6/1).
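To make the idea concrete, here is a minimal R sketch with made-up calibration data (the investr package described in that paper automates this, including interval estimates):

```r
# Made-up calibration data: parameter x of existing products
# versus the testers' response y
x = c(1, 2, 3, 4, 5, 6)
y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
fit = lm(y ~ x)
# Inverse prediction: given a new product's observed response y0,
# invert the fitted line to estimate the parameter that produced it
y0 = 9
b  = coef(fit)
x0 = unname((y0 - b[1]) / b[2])
x0
```

The point estimate is just the fitted line solved for $x$; the harder part, which investr handles, is attaching a sensible confidence interval to $x_0$.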
Sorry, I accidentally posted a comment as an answer originally (which kind of forced me to write an answer :-). Here is the first version (needed to understand the first comments on this answer): "Welcome to CV, Citanaf! Can you give us a rough idea about the dimension of the data you have? How many products have testers already tested, typically?"