I also wonder what exact questions you have. I am guessing that you are wondering several things:
- Does your questionnaire (& associated scoring scheme) seem reasonable?
- Can you treat the score derived from the questionnaire as interval/ratio data?
- Is Spearman's correlation an appropriate test / measure of effect size?
If these are not your real questions, leave a comment or edit your question, and I will try to update this answer so that it is more helpful. I will address each possible question in turn.
Unfortunately, I have no substantive knowledge of your subject area, and not too much about psychometrics, either. One thing to think about when making a questionnaire is whether all questions are actually getting at the same underlying issue. I'm guessing your goal is to tap different aspects of whether therapists are aware of, and employ, correct procedures for working with bilingual clients. To the extent that all items (i.e., questions) are measuring that, combining them is quite reasonable. This can be checked with factor analysis, although I don't know if that's within your skill set. Another possibility is to make several different 2-way cross-tabulation tables and see if most of the counts lie on or near the diagonal; this would mean that therapists who give a higher rating in response to one question also give a higher rating in response to another. Spearman's correlations could serve this purpose as well; again, you are looking for strong, positive correlations. Finding that the items all correlate with each other suggests that treating them as one construct (i.e., combining them) is reasonable. Others on CV have more expertise with psychometrics; perhaps they will weigh in with better advice.
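To show what the correlation check looks like in practice, here is a minimal R sketch. The data are simulated stand-ins for your item responses (four items scored 1-5, built to share a common "latent" rating), not your actual questionnaire:

```r
# Simulated stand-in data: 30 respondents, 4 Likert items scored 1-5,
# all driven by a shared underlying rating plus small per-item wobble
set.seed(1)
base  = sample(1:5, 30, replace = TRUE)
noise = function() sample(-1:1, 30, replace = TRUE)
items = data.frame(q1 = base,
                   q2 = pmin(pmax(base + noise(), 1), 5),
                   q3 = pmin(pmax(base + noise(), 1), 5),
                   q4 = pmin(pmax(base + noise(), 1), 5))
# Inter-item Spearman correlations; strong positive values suggest
# the items tap one construct and can reasonably be combined
round(cor(items, method = "spearman"), 2)
```

With real data you would replace the simulated columns with one column per question and look for a matrix of clearly positive correlations.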
Regarding the scoring scheme, I think it makes perfect sense to combine the last three questions first, so that that issue doesn't outweigh the others. One suggestion I would make is to take the average of the scores. In other words, divide the final sum by 3 (giving a final score of 4 for your example; n.b., this means averaging twice or you would lose the downweighting of the last 3 items). This shouldn't have any substantive effect, but it does put the score back in the original scale, which can make it clearer and more interpretable.
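To make the double averaging concrete, here is a small R sketch with made-up responses (five standalone items plus three related items; the numbers are invented, not from your questionnaire):

```r
# Made-up responses: five standalone items and three related items
standalone = c(4, 5, 3, 4, 5)
related    = c(2, 3, 4)
# First average the three related items so they count as a single item...
combined = c(standalone, mean(related))
# ...then average again to put the total back on the original 1-to-5 scale
score = mean(combined)
score  # 4
```

Averaging the related items first keeps that one issue from counting three times; averaging again returns the final score to the original response scale.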
Whether or not Likert items (i.e., questions) and Likert scales (i.e., questionnaires) can be treated as interval data has long been a controversial issue (they are most certainly not ratio). As I understand it, there are a few principles: (a) items with more response levels are likely to be closer to equal-interval; (b) combining multiple items into a scale makes it likelier to be equal-interval; (c) using numbers instead of just words primes respondents to think of the intervals as equal and respond accordingly; and (d) whether or not the values are interval is mostly a matter of theoretical belief, not something that can be tested and determined in practice. My guess would be that your scale may be interval enough for your purposes. Ultimately, it's still the case that using ordinal logistic regression would be ideal, but I don't know if you have that in your toolbox. At least two different questions on CV have treated issues pertaining to Likert scales and may be worth your time.
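If ordinal logistic regression is within reach, it is only a few lines in R. Here is a sketch with invented scores (the group labels, difference, and scale are all made up for illustration), using `MASS::polr`, which fits a proportional-odds model:

```r
library(MASS)  # ships with R; polr() fits a proportional-odds model
set.seed(42)
group = factor(rep(c("bilingual", "monolingual"), each = 25))
# Made-up ordinal questionnaire scores on a 1-5 scale,
# slightly higher on average for one group
raw   = round(rnorm(50, mean = ifelse(group == "bilingual", 3.8, 3.0), sd = 1))
score = ordered(pmin(pmax(raw, 1), 5))
# Hess = TRUE stores the Hessian so summary() can give standard errors
fit = polr(score ~ group, Hess = TRUE)
summary(fit)  # the group coefficient is on the log-odds scale
```

The single `group` coefficient is the (proportional) log odds of giving a higher response in one group versus the other.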
I'm not sure that I would use Spearman's correlation for your main analysis. I suspect you have two groups: monolingual therapists and bilingual therapists. (If you have some therapists who speak more than two languages, you would have to decide what to do with them, but if you just have one or two, I would just put them in a '>1 language' group.) This suggests that something like a t-test, or a Mann-Whitney U-test if possible (which I believe is equivalent to ordinal logistic regression when there are only two groups), makes the most sense. (If you'll be using SPSS, this tutorial may help with the U-test.)
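If R is an option instead of SPSS, the U-test is one line; here is a sketch with invented scores for the two groups (replace them with your computed questionnaire scores):

```r
# Hypothetical questionnaire scores for the two groups of therapists
mono = c(3.2, 2.8, 4.0, 3.5, 3.0, 2.5)
bi   = c(4.2, 3.8, 4.5, 3.6, 4.8, 4.1)
# Mann-Whitney U-test (R calls it the Wilcoxon rank-sum test)
wilcox.test(bi, mono)
```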
To assess the magnitude of the effect, I would suggest you use a standardized mean difference. This is often referred to as 'Cohen's d'. It is simply the difference between the two means, divided by the pooled standard deviation:
$$
d=\frac{\bar{x}_2-\bar{x}_1}{SD_{pooled}}
$$
$d$ gives you a measure of how far apart the two means are, in units of the population's standard deviation.
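A sketch of the computation with invented scores for two groups (swap in your own data); the pooled SD weights each group's variance by its degrees of freedom:

```r
# Hypothetical scores for two groups
x1 = c(3.2, 2.8, 4.0, 3.5, 3.0, 2.5)
x2 = c(4.2, 3.8, 4.5, 3.6, 4.8, 4.1)
n1 = length(x1); n2 = length(x2)
# Pooled SD: each group's variance weighted by its degrees of freedom
sd_pooled = sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
# Cohen's d: difference in means in units of the pooled SD
d = (mean(x2) - mean(x1)) / sd_pooled
d
```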
First, it is unlikely that several different tests of inhibition would be totally uncorrelated across the population of your subjects. Because the tests are correlated, their z-scores are not completely independent. So if Evans (1996) says you would need to know the correlations to get a meaningful composite z-score, that is correct.
Second, as far as I can see, the link is assuming that the four z-scores are completely independent. Suppose we use that independence assumption to get a combined score that weights each of the tests equally.
Then we have four independent random variables $Z_1, Z_2, Z_3, Z_4$ each with $E(Z_i) = 0,$ and $Var(Z_i) = SD(Z_i) = 1.$ Let $A = \frac 14\sum_i Z_i.$
Then
$$E(A) = E\left(\frac 14 \sum Z_i\right) = \frac 14 E\left(\sum Z_i\right)\\ = \frac 14 \sum E(Z_i) = \frac 14(0+0+0+0) = 0.$$
And
$$V(A) = V\left(\frac 14 \sum Z_i\right) = \frac{1}{16} V\left(\sum Z_i\right)\\ = \frac{1}{16}\sum V(Z_i) = \frac{1}{16}(1+1+1+1) = \frac{4}{16} = \frac 14.$$
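So under independence, $SD(A) = \sqrt{1/4} = 1/2.$ If the $Z_i$ are correlated, the covariance terms no longer vanish, which is exactly why the correlations are needed for a meaningful composite:
$$V(A) = \frac{1}{16}\left(\sum_i V(Z_i) + 2\sum_{i<j} \operatorname{Cov}(Z_i, Z_j)\right) = \frac{1}{16}\left(4 + 2\sum_{i<j}\rho_{ij}\right),$$
since $\operatorname{Cov}(Z_i, Z_j) = \rho_{ij}$ when each $Z_i$ has unit variance.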
Addendum: Here are data simulated in R for four positively correlated tests
administered to 50 subjects.
set.seed(2019); n = 50
v1 = rnorm(n,50,3); v2 = rnorm(n,60,4)
v3 = rnorm(n,40,2); v4 = rnorm(n,50,2)
w = rnorm(n,0,3)
x1 = v1+w; x2 = v2+w; x3 = v3+w; x4 = v4+w
MAT = cbind(x1,x2,x3,x4)
cor(MAT)
x1 x2 x3 x4
x1 1.0000000 0.5622012 0.6479422 0.6513025
x2 0.5622012 1.0000000 0.5262790 0.6410636
x3 0.6479422 0.5262790 1.0000000 0.6738916
x4 0.6513025 0.6410636 0.6738916 1.0000000
Means and standard deviations of the scores for the 50 subjects are computed below, and from them the z-scores of each subject's average relative to the rest of the group of 50.
a = rowMeans(MAT); s = apply(MAT, 1, sd)  # each subject's mean and SD across the four tests
z = (a - mean(a))/sd(a)                   # z-score of each subject's average within the group
Subject #27 had the lowest such z-score (-2.06), which (not surprisingly) puts that subject at about the 2nd percentile.
Also, #27's scores on the four tests are shown below, followed by the
corresponding individual z-scores relative to the group of 50, and percentages in a normal population below these z-scores.
z.27 = (min(a)-mean(a))/sd(a); z.27; pnorm(z.27)
[1] -2.06086
[1] 0.01965821
MAT[27,]
x1 x2 x3 x4
39.49311 51.58849 33.49084 44.05014
(MAT[27,]-mean(a))/sd(a)
x1 x2 x3 x4
-2.8288067 0.6598278 -4.5600233 -1.5144369
round(pnorm((MAT[27,]-mean(a))/sd(a)),4)
x1 x2 x3 x4
0.0023 0.7453 0.0000 0.0650
Thus, in effect, one way to derive the z-score $-2.06$ as a 'combination' of the z-scores $-2.83, 0.66, -4.56,$ and $-1.51$ is to use this subject's individual exam scores in the context of the other 49 subjects.
I interpret your approach as fitting a simple model to existing data. Subsequently you apply the fitted model to predict parameters for new products, which is perfectly OK.
Your problem goes under the name inverse estimation or inverse prediction. Only a few threads here on CV deal with this topic, possibly because methods for inverse estimation are less commonly used (and taught) than ordinary regression methods. Typically the topic concerns normally distributed data with no replicates for $i$ and no experimental design like yours, as in this post with an answer by @kjetilbhalvorsen. You can find an introduction to the topic in Greenwell and Kabban (2014), "investr: An R Package for Inverse Estimation" (The R Journal, Vol. 6/1).
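To make the idea concrete, here is a minimal R sketch with made-up calibration data (the investr package described in that paper automates this, including interval estimates):

```r
# Made-up calibration data: parameter x of existing products
# versus the testers' response y
x = c(1, 2, 3, 4, 5, 6)
y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
fit = lm(y ~ x)
# Inverse prediction: given a new product's observed response y0,
# invert the fitted line to estimate the parameter that produced it
y0 = 9
b  = coef(fit)
x0 = unname((y0 - b[1]) / b[2])
x0
```

The point estimate is just the fitted line solved for $x$; the harder part, which investr handles, is attaching a sensible confidence interval to $x_0$.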
Sorry, I accidentally posted a comment as an answer originally (which kind of forced me to write an answer :-). Here is the first version (needed to understand the first comments on this answer): "Welcome to CV, Citanaf! Can you give us a rough idea about the dimension of the data you have? How many products have testers already tested, typically?"