Solved – backwards stepwise regression, collinearity and regression to the mean

multicollinearity, regression-to-the-mean, stepwise-regression

My research paper was recently rejected, and some of the feedback I received related to the statistical tests I did or did not run. I would like help clarifying what I could do differently, as the feedback was not very informative.

I am trying to see which baseline characteristics (my independent variables) predict who will improve the most in my dependent variable after an intervention. As the work is not published yet I won’t give too many details, but a similar example would be trying to decide whether any baseline characteristics in humans (such as muscle mass, age, gender, alcohol use, pulse rate, etc.) can predict improvement in 100 m foot race times after a strength exercise program. I have a cohort of about 100 individuals, all undergoing the same intervention.

To test this, I collected data on all my baseline values and measured participants’ 100 m times before and after the strength exercise program. I then built a multiple regression model in which I included previously known confounders and my baseline characteristics of interest, and used backward stepwise removal of non-significant regressors to end up with a model of 3 independent variables significantly associated with improvement in 100 m race times. For the sake of argument, let’s make up the following: gender, thigh muscle mass and smoking status (yes/no).

I was asked about/critiqued on the following (again, the examples are made up):

1. Type of sports shoe is a well-known determinant of 100 m race times. Was improvement in race times still associated with baseline thigh muscle mass after adjusting for choice of sports shoe?

- Type of sports shoe was one of the independent variables included in my multiple regression model. However, it was not significant when included with the other independent variables, so it was removed in the backward stepwise removal process. Is there a more appropriate statistical test to run?

2. Could collinearity explain the results, as several of the independent variables are likely to be similar?

- I ran collinearity diagnostics in SPSS and did not get any VIF values over 4 (one independent variable had a VIF of 4; the rest were under 3).

3. Discuss regression to the mean as an explanation for my results.

- I concede that regression to the mean likely plays a part in which individuals improved the most/least, but I don’t see how it affects the baseline characteristics in any significant way, other than that these individuals are given greater weight in the results since they show the biggest change. I divided my cohort into tertiles based on improvement in race times and found that the tertiles did not differ in baseline values for any of my independent variables of interest.
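On point 3, a small simulation may help show what the reviewer is likely worried about. This is only a sketch under made-up assumptions (a true 100 m time per person, independent measurement noise on each occasion, and no intervention effect at all; all numbers are hypothetical). Even so, the measured baseline time correlates with the measured improvement, purely because the same noisy baseline appears in both quantities.

```python
import random

def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

random.seed(0)
n = 10_000  # large, hypothetical sample so the pattern is easy to see

# Hypothetical "true" 100 m time per person (seconds); it never changes,
# i.e. the simulated intervention does nothing.
true_time = [random.gauss(14.0, 1.0) for _ in range(n)]

# Each measurement adds independent noise (day-to-day variation, timing error).
before = [t + random.gauss(0.0, 0.5) for t in true_time]
after = [t + random.gauss(0.0, 0.5) for t in true_time]

# Positive improvement means a faster time after the program.
improvement = [b - a for b, a in zip(before, after)]

r_baseline = pearson(before, improvement)   # driven by shared baseline noise
r_true = pearson(true_time, improvement)    # the noise-free ability itself

print(f"corr(measured baseline, improvement) = {r_baseline:.2f}")
print(f"corr(true time, improvement)         = {r_true:.2f}")
```

In this setup only the measured baseline of the outcome is affected; a characteristic that is unrelated to the measurement noise is not. One commonly suggested remedy, offered here as a suggestion rather than as what the reviewer necessarily had in mind, is to model the post-intervention time with the baseline time as a covariate instead of modelling the raw change.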

Any help with any of the above much appreciated!
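On point 2, it may be worth spelling out what a VIF measures: for predictor $j$ it is $1/(1-R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the other predictors. SPSS reports this directly; the sketch below only illustrates the definition with pure Python on invented data (the variable names `thigh_mass`, `leg_mass` and `pulse` are hypothetical, and the cut-offs people quote, such as 5 or 10, are rules of thumb rather than formal tests).

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(y, xs):
    """R^2 from an OLS fit of y on an intercept plus the columns in xs."""
    n = len(y)
    cols = [[1.0] * n] + xs
    k = len(cols)
    xtx = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(cols[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[i] * cols[i][t] for i in range(k)) for t in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y)
    return 1.0 - ss_res / ss_tot

def vif(columns):
    """VIF for each column: 1 / (1 - R^2 of that column on all the others)."""
    return [1.0 / (1.0 - r_squared(columns[j], columns[:j] + columns[j + 1:]))
            for j in range(len(columns))]

random.seed(0)
n = 100  # roughly the cohort size mentioned in the question

# Invented, standardised predictors: leg_mass is built to overlap with thigh_mass.
thigh_mass = [random.gauss(0.0, 1.0) for _ in range(n)]
leg_mass = [t + random.gauss(0.0, 0.5) for t in thigh_mass]
pulse = [random.gauss(0.0, 1.0) for _ in range(n)]

vifs = vif([thigh_mass, leg_mass, pulse])
for name, v in zip(["thigh_mass", "leg_mass", "pulse"], vifs):
    print(f"{name}: VIF = {v:.2f}")
```

The two overlapping predictors get inflated VIFs while the unrelated one stays close to 1, which is the pattern the diagnostics are meant to flag.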

Best Answer

I only address one aspect of your question. Let's see if the community agrees with me, or at least whether I understood the question correctly.

The variables that you include in your model must be driven by your research question, not by any sort of automatic, significance-driven selection algorithm. Why? An oversimplified example:

Let's say that you are interested in studying the number of birds in all the parks of the country. For the $n$ parks in your sample, you know the number of seeds, $\#^{seeds}$, and the number of dogs, $\#^{dogs}$. Unfortunately, your sample happens to include only parks in which all the dogs are old: you know the number of dogs, but you don't know how old they are.

Let's say that, originally, your research question is

What are the determinants of the number of birds in all the parks of the country?

and your equation, for $i=1,\dots,n$, is

$\#^{birds}_i = \beta_0 + \beta_1\#^{seeds}_i + \beta_2\#^{dogs}_i + \varepsilon_i$

Let's say that, because you do not know that you actually sampled only parks with old dogs, you fail to reject the null hypothesis $\beta_2 = 0$, i.e. $\widehat{\beta}_2$ is not significant. Then let's say that you, stepwising manually, drop the number of dogs from the equation. If you do so, don't you think you have changed your research question? You have! Without even knowing it, your research question has become

What are the determinants of the number of birds in all the parks of the country that are visited only by old dogs?

So what should you do with a non-significant $\#^{dogs}$? Keep it in the model, because this non-significance is itself informative (e.g. about your sampling process). If you drop it stepwise, you bias your estimation, because you tailor the estimation procedure to the very peculiar case you happen to have, a case that is strongly tied to your (unnoticed) selection bias (toward old dogs). Relative to your research question, your estimated coefficients will be biased.
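One concrete way a dropped regressor can bias the remaining coefficients is the textbook omitted-variable mechanism, which is related to, but not the same as, the sampling story about old dogs. The sketch below uses entirely made-up numbers (true coefficients 1.0 for seeds and -0.5 for dogs, dogs built as 0.6 times seeds plus noise, standardised units, a large hypothetical sample) and compares the seed coefficient with and without dogs in the model.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def slopes_two_predictors(y, x1, x2):
    """OLS slopes of y on x1 and x2 (with intercept), via the 2x2 normal equations."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2

def slope_one_predictor(y, x):
    """OLS slope of y on x alone (with intercept)."""
    return cov(x, y) / cov(x, x)

random.seed(0)
n = 5_000  # hypothetical; large so sampling noise does not hide the shift

# Invented data-generating process in standardised units.
seeds = [random.gauss(0.0, 1.0) for _ in range(n)]
dogs = [0.6 * s + random.gauss(0.0, 0.8) for s in seeds]  # correlated with seeds
birds = [1.0 * s - 0.5 * d + random.gauss(0.0, 1.0) for s, d in zip(seeds, dogs)]

full_seeds, full_dogs = slopes_two_predictors(birds, seeds, dogs)
reduced_seeds = slope_one_predictor(birds, seeds)

# Theory: the full model recovers about 1.0 for seeds; dropping dogs shifts the
# seed slope toward 1.0 + (-0.5) * 0.6 = 0.7, because seeds now also carries
# the part of the dog effect that moves together with seeds.
print(f"seeds coefficient, dogs kept:    {full_seeds:.2f}")
print(f"seeds coefficient, dogs dropped: {reduced_seeds:.2f}")
```

Whether dogs would test as significant in a given sample has no bearing on this shift; it depends only on how strongly the dropped variable relates to the outcome and to the variables that stay.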

To expand on my comment, in which I said that models are multidimensional random objects, just as variables are random: the term multidimensional here refers to the fact that each exogenous variable is a dimension of your model, and of your research question. If one of them is outlying in your sample, it does not mean that it always is.

Finally, something else you should ask yourself is: do you do statistics to improve your programming skills or your statistics skills? If it is the latter, don't use stepwise.
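A further, widely cited reason to avoid stepwise selection is that it tends to keep "significant" predictors even when none of them matters. The following is a rough sketch, not a reproduction of any SPSS procedure: it runs backward elimination on pure noise (10 invented predictors, 100 observations, 30 simulated datasets) using a hand-rolled OLS and an approximate two-sided 5% cut-off of |t| > 1.98, which is an assumption that is only roughly right for these degrees of freedom.

```python
import random

def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

def t_stats(y, cols):
    """OLS of y on an intercept plus cols; returns t statistics for cols only."""
    n = len(y)
    design = [[1.0] * n] + cols
    k = len(design)
    xtx = [[sum(p * q for p, q in zip(design[i], design[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(design[i], y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [y[t] - sum(beta[i] * design[i][t] for i in range(k)) for t in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)
    return [beta[i] / (sigma2 * inv[i][i]) ** 0.5 for i in range(1, k)]

def backward_eliminate(y, cols, t_crit=1.98):
    """Drop the weakest predictor until every remaining |t| reaches t_crit."""
    kept = list(range(len(cols)))
    while kept:
        ts = t_stats(y, [cols[j] for j in kept])
        weakest = min(range(len(kept)), key=lambda i: abs(ts[i]))
        if abs(ts[weakest]) >= t_crit:
            break
        kept.pop(weakest)
    return kept

random.seed(1)
n, p, n_sims = 100, 10, 30  # hypothetical sizes
spurious = 0
for _ in range(n_sims):
    # Outcome and predictors are independent noise: no true associations exist.
    y = [random.gauss(0.0, 1.0) for _ in range(n)]
    cols = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    if backward_eliminate(y, cols):
        spurious += 1

print(f"{spurious}/{n_sims} pure-noise datasets kept at least one 'significant' predictor")
```

Any predictor that survives here is a false positive by construction, so the printed fraction gives a feel for how often a final stepwise model can look convincing with nothing behind it.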