Solved – w-score vs. z-score with covariates

Tags: modeling, regression-strategies, z-score

I want to predict GM volume in a group of patients based on their degree of cognitive impairment, corrected for age and sex. To have a more ‘disease specific’ measure of cognition, I use cognitive performance scores in a large group (N=500) of healthy controls (HC) as a reference.

My supervisor and I discussed two methods for doing this (the w-score vs. the z-score method):

1. w-score method:

a. calculate the effect of age and sex on cognitive score in the HC group (cognition = a + (b * age) + (c * sex))

b. predict cognitive score in the patient group based on the regression coefficients we found in the HC group

c. for each patient, subtract this predicted score from his actual score, and divide by the SD of the HC’s residuals
(w-score = (cognition.obs – cognition.pred)/SDres)

d. perform a regression in which w-score predicts GM volume
(GM volume = a + (b * w-score))
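Steps a to d of the w-score method can be sketched in R as follows. This is only an illustration on simulated data: the data frames `hc` and `pat`, the column names, and every effect size are made up for the example and do not come from the study described here. It takes "SD of the HC's residuals" literally as `sd(residuals(...))`; using the residual standard error from `sigma()` instead would be a slightly different, equally defensible choice.

```r
set.seed(1)

# Simulated data; all variable names and effect sizes are hypothetical.
n_hc <- 500
hc <- data.frame(age = runif(n_hc, 40, 80), sex = rbinom(n_hc, 1, 0.5))
hc$cognition <- 30 - 0.10 * hc$age + 1.0 * hc$sex + rnorm(n_hc, sd = 2)

n_pat <- 100
pat <- data.frame(age = runif(n_pat, 40, 80), sex = rbinom(n_pat, 1, 0.5))
pat$cognition <- 26 - 0.15 * pat$age + 0.5 * pat$sex + rnorm(n_pat, sd = 3)
pat$gm_volume <- 600 - 2 * pat$age + 10 * pat$sex +
  5 * pat$cognition + rnorm(n_pat, sd = 20)

# Step a: age and sex effects on cognition, estimated in healthy controls only
hc_fit <- lm(cognition ~ age + sex, data = hc)

# Step b: expected cognition for each patient under the HC model
pat$cognition_pred <- predict(hc_fit, newdata = pat)

# Step c: w-score = (observed - predicted) / SD of the HC residuals
sd_res <- sd(residuals(hc_fit))
pat$w_score <- (pat$cognition - pat$cognition_pred) / sd_res

# Step d: w-score as the only predictor of GM volume
w_fit <- lm(gm_volume ~ w_score, data = pat)
coef(w_fit)
```

Note that age and sex enter only through the HC model for cognition; the final regression in step d contains no age or sex term for GM volume itself.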

2. z-score method:

a. calculate the mean and SD of cognitive score in the HC group

b. for each patient, subtract the HC’s mean from his actual cognitive score, and divide by the HC’s SD (z-score = (cognition.obs – cognition.mean)/SD)

c. perform a regression in which z-score predicts GM volume, using age and sex as covariates (GM volume = a + (b * z-score) + (c * age) + (d * sex))
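A minimal R sketch of the z-score method, again on simulated data with made-up names (`hc_cognition`, `pat`) and made-up effect sizes. It also fits the same model on the raw cognitive score, to show one property that follows from the formula in step b: because the z-score is a linear rescaling of the raw score by constants taken from the HC group, the z-score slope equals the raw slope multiplied by the HC SD, and the age and sex coefficients do not change.

```r
set.seed(2)

# Simulated data; all variable names and effect sizes are hypothetical.
hc_cognition <- rnorm(500, mean = 24, sd = 3)

n_pat <- 100
pat <- data.frame(age = runif(n_pat, 40, 80), sex = rbinom(n_pat, 1, 0.5))
pat$cognition <- 26 - 0.15 * pat$age + 0.5 * pat$sex + rnorm(n_pat, sd = 3)
pat$gm_volume <- 600 - 2 * pat$age + 10 * pat$sex +
  5 * pat$cognition + rnorm(n_pat, sd = 20)

# Steps a and b: standardize patient scores against the HC mean and SD
hc_mean <- mean(hc_cognition)
hc_sd <- sd(hc_cognition)
pat$z_score <- (pat$cognition - hc_mean) / hc_sd

# Step c: z-score predicts GM volume, with age and sex as covariates
z_fit <- lm(gm_volume ~ z_score + age + sex, data = pat)

# Same model on the raw score, for comparison
raw_fit <- lm(gm_volume ~ cognition + age + sex, data = pat)

# z-score slope = raw slope * HC SD; age and sex slopes are identical
all.equal(unname(coef(z_fit)["z_score"]),
          unname(coef(raw_fit)["cognition"]) * hc_sd)
all.equal(coef(z_fit)[c("age", "sex")], coef(raw_fit)[c("age", "sex")])
```

Both `all.equal()` calls should return `TRUE`. In this model the HC reference therefore changes only the units of the cognition coefficient (and the intercept), not the age and sex adjustment.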

My supervisor wants to use the w-score method (because it is similar to the use of ‘norm tables’ which are based on a HC group and have corrections for age/sex). I actually prefer the z-score method, because the effect of age/sex on cognition in my patient group is different from the age/sex effect in the HC group.

If the logic behind correcting for age and sex is that they are covariates/confounders in my regression (i.e. they directly relate to GM volume and might not be evenly distributed over cognitive scores), wouldn’t it make more sense to use the z-score method? In that way, you correct for the actual effect of age/sex that exists in the patient group (instead of a different effect that only exists in the HC group).

I’m very curious about your opinions, thank you in advance.

Anita

Best Answer

IMHO this is not based on statistical principles, and such manipulations cause observations to be correlated even if they started out independent. You are also making the strong assumption that the standard deviation is an appropriate normalizing statistic and that you have estimated the SDs very tightly. SD is useful for smooth symmetric distributions with non-heavy tails. This may not apply to your data.

The best approach to statistical modeling is to spend a lot of time formulating a comprehensive model that takes into account all known sources of variability that you can measure. This model uses the raw data and leads to comparisons of real interest.

Related Solutions

Solved – the correct method to calculate the Z score for the mean of a variable between two groups

Actually, there is no difference between the three methods you mentioned, or between their results. Compute the z-scores with any of them and compare the results in a scatter plot: you will get the same values.

Solved – treat the mean of a set of z-scores as a z-score

Maybe someone else can explain the math behind it, but consider this quick demonstration: I generate five vectors, each 100 numbers long. Each of these vectors is on a different scale, so I standardize them (i.e., create z-scored variables). That is, the mean is zero and the standard deviation is 1 for each of these five latent construct variables:

set.seed(1839)

## create five different z-score variables that represent latent constructs
data <- data.frame(
  latent_construct_1 = scale(rnorm(100, 10, 4)),
  latent_construct_2 = scale(rnorm(100, 3, 18)),
  latent_construct_3 = scale(rnorm(100, -5, 7)),
  latent_construct_4 = scale(rnorm(100, 0, 8)),
  latent_construct_5 = scale(rnorm(100, 20, 20))
)

Let's check to make sure they are actually z-scores:

> sapply(data, mean)
latent_construct_1 latent_construct_2 latent_construct_3 latent_construct_4 latent_construct_5 
     -2.203951e-16       1.634435e-17       1.400464e-17      -1.449145e-17       7.852226e-17 
> 
> sapply(data, sd)
latent_construct_1 latent_construct_2 latent_construct_3 latent_construct_4 latent_construct_5 
                 1                  1                  1                  1                  1

So, now let's say we average all five of these together:

## make a mean of all of these latent constructs
data$mean_latent_construct <- rowMeans(data)

Is this new variable a z-score? We can check to see if the mean is zero and standard deviation is one:

> ## is the mean zero?
> mean(data$mean_latent_construct)
[1] -2.436148e-17
> 
> ## is the standard deviation one?
> sd(data$mean_latent_construct)
[1] 0.4599126

The variable is not a z-score, because the standard deviation is not one. However, we could now z-score this mean variable. Let's do that and compare the distributions:

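## z-score the mean latent construct
## (as.numeric() drops the matrix attributes that scale() returns)
data$mean_latent_construct_z <- as.numeric(scale(data$mean_latent_construct))

## compare distributions
library(tidyverse)
## reshape to long format in a new object, so `data` is not overwritten
plot_data <- data %>% 
  select(mean_latent_construct, mean_latent_construct_z) %>% 
  pivot_longer(everything(), names_to = "variable", values_to = "value")

ggplot(plot_data, aes(x = value, fill = variable)) +
  geom_density(alpha = .7) +
  theme_light()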

The z-scored mean of the z-scores is spread out much more widely than the unscaled mean of the z-scores: its standard deviation is 1, versus about 0.46 for the unscaled mean.

In short: No, a mean of z-scored variables is not a z-score itself.
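As for the math behind the demonstration: the variance of the mean of k standardized variables is the sum of all entries of their correlation matrix divided by k^2, which equals (1 + (k - 1) * rbar) / k, where rbar is the average correlation between distinct variables. For uncorrelated variables this gives an SD of 1 / sqrt(k), about 0.447 for k = 5, which is close to the 0.46 found above. The SD of the mean only reaches 1 if all the variables are perfectly correlated. The sketch below checks the identity on freshly simulated data; the object names are illustrative, and because the random draws differ from the ones above, the numbers will not match exactly.

```r
set.seed(1839)

k <- 5
n <- 100

# Five independent normal variables, z-scored column-wise
z <- scale(matrix(rnorm(n * k, mean = 10, sd = 4), nrow = n, ncol = k))

# SD of the row means, computed directly
sd_direct <- sd(rowMeans(z))

# Same quantity from the correlation matrix: sqrt(sum of all entries) / k
sd_from_cor <- sqrt(sum(cor(z))) / k

all.equal(sd_direct, sd_from_cor)

# Value expected if the k variables were exactly uncorrelated
1 / sqrt(k)
```

The small gap between the observed SD and 1 / sqrt(k) comes from the sample correlations not being exactly zero.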
