Solved – regression with z-scores as composite variables

Tags: mathematical-statistics, multiple regression, regression, spss, z-score

So I have five IVs (a, b, c, d, e) and one DV.

a is fine as is.
b & c measure the same concept and since their ranges are the same, I averaged the scores to create a composite variable.
d & e measure the same concept but their ranges are very different, so I averaged their z-scores to create a composite variable.
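A minimal sketch of the two composites described above, in Python with NumPy. The data, the 1-7 scale for b and c, and the helper name `zscore` are illustrative assumptions, not part of the original question.

```python
import numpy as np

def zscore(x):
    """Standardize to mean 0 and SD 1 (sample SD, ddof=1). Hypothetical helper."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

# Made-up data, only to show the mechanics.
rng = np.random.default_rng(0)
n = 200
b = rng.integers(1, 8, size=n)      # assumed 1-7 scale
c = rng.integers(1, 8, size=n)      # same range as b
d = rng.normal(50, 10, size=n)      # very different range from e
e = rng.normal(0.5, 0.1, size=n)

bc = (b + c) / 2                    # same range, so average the raw scores
de = (zscore(d) + zscore(e)) / 2    # different ranges, so average the z-scores

# The averaged z-score has mean 0, but its SD is generally not 1:
# its variance is (1 + r) / 2, where r is the correlation between d and e.
r = np.corrcoef(d, e)[0, 1]
print(de.mean(), de.var(ddof=1), (1 + r) / 2)
```

Note that `de` is not itself a z-score unless it is standardized again after averaging, which matters when reading its coefficient as "per standard deviation".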

My questions are: 1. Is this an appropriate way of creating composite variables for regression? 2. If I use z-scores for one composite variable, should I do the same for the b & c composite, and also use the z-score of a? 3. Does using the averaged z-score affect the interpretation of my regression results?

Best Answer

  1. Yes, this looks fine. You are welcome to pre-process your data any way you see fit. Your selections look uncontroversial, but even if they were controversial, there is no formal reason why you can't do it. Linear transformations of raw data are ubiquitous (z-score, etc.); non-linear transformations of raw data are common ($\log(x), \sqrt{x}, x^2$); wildly non-linear transformations of raw data are acts of desperation and will usually prove to be useless, but they still aren't "wrong." Anyway, you aren't anywhere near that domain with your suggestions. :)

  2. No, you do not need to transform $(b+c)/2$ to a z-score, but nothing forbids you from doing so, either. The transformation is linear, so it creates a shift and a scaling along that axis, and your regression coefficients $\beta_0$ and $\beta_{(b+c)/2}$ will respond to this transformation, but the fit itself ($R^2$, fitted values, residuals) will not change. To emphasize this point, I might add that you could replace $(b+c)/2$ with $\log((b+c)/2)$. This is clearly a different model, so I would expect a different $R^2$ -- maybe better! maybe worse! -- but again, there is no one stopping you from doing that, either. See point 1.

  3. The only change in interpretation would be the words you use to describe the regression coefficients. "A one-unit change in the [ value | z-score ] of $(b+c)/2$ would change the response variable by $[\beta_{(b+c)/2} ~|~ \beta_{Z_{(b+c)/2}}]$, all other variables being held constant." A one-unit change in a z-score is a change of one standard deviation in the underlying variable.
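Points 2 and 3 can be checked numerically. The following is a rough sketch on simulated data, assuming NumPy and a hypothetical helper `fit_ols`. The variable names and the coefficients used to generate `y` are made up for illustration. It fits the same model twice, once with the raw $(b+c)/2$ composite and once with its z-score, then compares $R^2$ and the slopes.

```python
import numpy as np

def fit_ols(X, y):
    """Least squares with an intercept; returns (coefficients, R^2). Hypothetical helper."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    centered = y - y.mean()
    r2 = 1 - (resid @ resid) / (centered @ centered)
    return beta, r2

# Simulated predictors and response (all values are assumptions).
rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)
bc = rng.uniform(1, 7, size=n)   # stands in for the (b+c)/2 composite
de = rng.normal(size=n)          # stands in for the averaged z-score composite
y = 2.0 + 0.5 * a + 0.8 * bc - 0.3 * de + rng.normal(size=n)

# Linear transformation of one predictor: z-score the bc composite.
sd_bc = bc.std(ddof=1)
bc_z = (bc - bc.mean()) / sd_bc

beta_raw, r2_raw = fit_ols(np.column_stack([a, bc, de]), y)
beta_z, r2_z = fit_ols(np.column_stack([a, bc_z, de]), y)

# Same fit quality; only the intercept and the bc coefficient change.
print("R^2 equal:", np.isclose(r2_raw, r2_z))
print("slope rescaled by SD:", np.isclose(beta_z[2], beta_raw[2] * sd_bc))
print("other slopes unchanged:", np.allclose(beta_z[[1, 3]], beta_raw[[1, 3]]))
```

The z-scored slope equals the raw slope multiplied by the standard deviation of the composite, which is exactly the "per unit" versus "per standard deviation" change in wording described in point 3.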