Take any multiple regression model as an example: the p-value for a predictor can change from significant to non-significant (or the reverse) after one or several covariates are added to the model. My statistics teacher said the reason is potential collinearity, which is discussed quite often. But there could be other causes, such as negative confounding. Above all, I would like a good summary of the causes of changes in significance tests for individual predictors in multiple regression models, preferably illustrated with an example performed in R, for instance by simulating data for each reason. Thanks in advance.
Solved – Possible reasons for a p-value changing significance after adjusting for other covariates in a multiple regression
confounding, multicollinearity, p-value, r, regression
Related Solutions
To start with, I would advise you to fit separate models for the different endpoints (Leaves, CorL, CorD, FilL, AntL, AntW, StaL, StiW, HeiP). Why would you need a multiple-endpoint model in the first place? Among the independent variables there is quite some multicollinearity, as you rightly said. Is this something of concern? As you indicated, the VIF (variance inflation factor) will give us that information. Use the following R code to screen the predictors by their VIF before the regression; the variables it retains are the ones to use in the regression. Choose an appropriate threshold for the VIF. I have used 5.
vif_func <- function(in_frame, thresh = 10, trace = TRUE){
  require(fmsb)
  if(class(in_frame) != "data.frame") in_frame <- data.frame(in_frame)
  # get initial vif value for all comparisons of variables
  vif_init <- NULL
  for(val in names(in_frame)){
    form_in <- formula(paste(val, " ~ ."))
    vif_init <- rbind(vif_init, c(val, VIF(lm(form_in, data = in_frame))))
  }
  vif_max <- max(as.numeric(vif_init[, 2]))
  if(vif_max < thresh){
    if(trace){ # print output of each iteration
      prmatrix(vif_init, collab = c("var", "vif"), rowlab = rep("", nrow(vif_init)), quote = FALSE)
      cat("\n")
      cat(paste("All variables have VIF < ", thresh, ", max VIF ", round(vif_max, 2), sep = ""), "\n\n")
    }
    return(names(in_frame))
  } else {
    in_dat <- in_frame
    # backwards selection of explanatory variables, stops when all VIF values are below "thresh"
    while(vif_max >= thresh){
      vif_vals <- NULL
      for(val in names(in_dat)){
        form_in <- formula(paste(val, " ~ ."))
        vif_add <- VIF(lm(form_in, data = in_dat))
        vif_vals <- rbind(vif_vals, c(val, vif_add))
      }
      max_row <- which(vif_vals[, 2] == max(as.numeric(vif_vals[, 2])))[1]
      vif_max <- as.numeric(vif_vals[max_row, 2])
      if(vif_max < thresh) break
      if(trace){ # print output of each iteration
        prmatrix(vif_vals, collab = c("var", "vif"), rowlab = rep("", nrow(vif_vals)), quote = FALSE)
        cat("\n")
        cat("removed: ", vif_vals[max_row, 1], vif_max, "\n\n")
        flush.console()
      }
      in_dat <- in_dat[, !names(in_dat) %in% vif_vals[max_row, 1]]
    }
    return(names(in_dat))
  }
}
I called your data set dep. You can run the function as follows:

keep.dat <- vif_func(in_frame = dep, thresh = 5, trace = TRUE)

Here thresh is the VIF threshold; you could also use 10, depending on how strict you want to be.
Here are the results: the variables judged suitable for regression, regardless of which dependent variable you want to use.
var vif
Elev 2.34467892681204
pH 4.82456111736694
OM 9.60381685354609
P 6.09927871325235
K 6.55185481475336
SoilTemp 9.76265101226991
AirTemp 10.5786139945657
removed: AirTemp 10.57861
var vif
Elev 2.10463008453203
pH 3.12149022393123
OM 5.09603402410369
P 5.56333186256743
K 6.54812601407721
SoilTemp 1.11028184053362
removed: K 6.548126
This should be a good place to start.
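Once vif_func has returned the retained predictor names (keep.dat above), you can fit a separate regression for each endpoint using only those variables. A minimal sketch, assuming your predictors sit in dep and the endpoint columns (e.g. CorL) sit in a hypothetical data frame I call endpoints here; that name is an assumption, not something from your post:

# 'endpoints' is an assumed data frame holding the response columns (CorL, CorD, ...)
model_dat <- cbind(CorL = endpoints$CorL, dep[, keep.dat])

# build the formula CorL ~ Elev + pH + ... from the names retained by vif_func
fit <- lm(reformulate(keep.dat, response = "CorL"), data = model_dat)
summary(fit)

You would repeat this with each endpoint in turn rather than forcing everything into one multiple-endpoint model.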
For the multiple testing problem it might be good to take a look at the question "Family-wise error boundary: Does re-using data sets on different studies of independent questions lead to multiple testing problems?".
In your example above, if you estimate a regression on one sample, then a t-test only lets you decide on the significance of an individual coefficient. So yes, there is a multiple testing problem if you draw conclusions about multiple coefficients based on multiple t-tests.
Let us call the coefficients $\beta_i, i = 1, 2, \dots, 5$. You can test $H_0^{(1)}: \beta_1 = 0$ versus $H_1^{(1)}: \beta_1 \ne 0$ with a t-test and conclude that $\beta_1$ is significant. Note that if you cannot reject $H_0^{(1)}$, you cannot conclude that $\beta_1$ is zero (see "What follows if we fail to reject the null hypothesis?").
So if you want to find 'statistical evidence' that $\beta_1$ is not zero, then your $H_1^{(1)}$ must be the statement you want to 'prove', i.e. $H_1^{(1)}: \beta_1 \ne 0$, and $H_0^{(1)}$ is its opposite, i.e. $\beta_1 = 0$. As you assume $H_0^{(1)}$ to be true (in order to derive a statistical contradiction), you have a fixed value for the parameter, $\beta_1 = 0$. From this it follows that you know the distribution of the estimator $\hat{\beta}_1$ (see the theory of linear regression) and you can compute p-values.
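To make these per-coefficient t-tests concrete, here is a minimal simulated sketch in R (all variable names and coefficient values are invented for illustration). summary() reports, for each $\hat{\beta}_i$, the estimate, its standard error, the t-statistic under $H_0^{(i)}: \beta_i = 0$, and the corresponding p-value:

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x3 + rnorm(n)   # beta_2 is truly zero

fit <- lm(y ~ x1 + x2 + x3)

# each row is a separate t-test of H0: beta_i = 0 against H1: beta_i != 0
summary(fit)$coefficients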
Let us now take the case where you want to show that $(\beta_1 \ne 0 \text{ and } \beta_2 \ne 0)$. This must be your $H_1^{(1,2)}$, and the opposite $H_0^{(1,2)}$ is that $(\beta_1 = 0 \text{ or } \beta_2 = 0)$. Because there is an 'or' in there, you cannot fix all the parameters of the joint distribution of $(\hat{\beta}_1, \hat{\beta}_2)$!
Can you apply multiple testing procedures? Most of them assume that the individual p-values are independent, and in this example $\hat{\beta}_1$ and $\hat{\beta}_2$ cannot be shown to be independent!
But in an advanced book on econometrics (e.g. W. H. Greene, "Econometric Analysis") you will find applicable tests for $J$ (simultaneous) linear restrictions ($\beta_i = 0, i = 1, 2, 3, 4, 5$ is a special case of 5 linear restrictions) that avoid the multiple testing problem.
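In R such a joint test of $J$ linear restrictions can be carried out as a single F-test comparing the restricted and unrestricted models. A hedged sketch on simulated data (variable names invented), testing two restrictions at once instead of running two separate t-tests:

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x3 + rnorm(n)

full       <- lm(y ~ x1 + x2 + x3)
restricted <- lm(y ~ x3)            # imposes beta_1 = 0 and beta_2 = 0 jointly

# one F-test of the J = 2 linear restrictions, so no multiple testing problem
anova(restricted, full)

# equivalent formulation with the car package, if installed:
# car::linearHypothesis(full, c("x1 = 0", "x2 = 0"))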
Best Answer
The significance of the overall model is not affected by collinearity, except in the trivial sense that adding redundant variables increases the numerator df (and decreases the denominator df) without appreciably increasing the numerator SSQ or decreasing the denominator SSQ. For individual predictors, collinearity means confounded variance, and when variance is confounded it is difficult to reach strong conclusions about individual effects. Mathematically, this is reflected in the standard errors of the regression weights, which will be large under collinearity.
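To illustrate this with the kind of simulation the question asks for, here is a hedged R sketch (data and names are invented): x1 genuinely affects y, and x2 is a nearly collinear copy of x1. Adding x2 inflates the standard error of x1's coefficient and can flip its t-test from significant to non-significant, while the model as a whole typically remains significant.

set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)      # almost collinear with x1
y  <- 0.3 * x1 + rnorm(n)

# x1 alone: small standard error, clearly significant
summary(lm(y ~ x1))$coefficients

# x1 plus its near-duplicate: the shared variance is confounded between the
# two predictors, the standard error of x1's weight blows up, and its
# individual p-value can cross 0.05, even though the fit barely improves
summary(lm(y ~ x1 + x2))$coefficients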