Take any multiple regression model as an example: the p-value for a predictor can change from significant to non-significant (or the reverse) after one or several covariates are added to the model. My statistics teacher said the reason is potential collinearity, which is discussed quite often. But there could be other causes, such as negative confounding. Above all, I would like a good summary of the causes of changes in significance tests for individual predictors in multiple regression models, preferably illustrated with an example performed in R, for instance by simulating data for each reason. Thanks in advance.
Solved – Possible reasons for a p-value changing significance after adjusting for other covariates in a multiple regression
confounding, multicollinearity, p-value, r, regression
Related Solutions
To start with, I would advise you to fit separate models for the different endpoints (Leaves, CorL, CorD, FilL, AntL, AntW, StaL, StiW, HeiP). Why would you need a multiple-endpoint model in the first place? Among the independent variables there is quite some multicollinearity, as you rightly said. Is this something of concern? As you indicated, the VIF (variance inflation factor) will give us that information. Use the following R code to screen the predictors by their VIF before the regression; the variables it retains are the ones to use in the regression. Choose an appropriate threshold for the VIF. I have used 5.
vif_func <- function(in_frame, thresh = 10, trace = TRUE){
  require(fmsb)
  if(class(in_frame) != "data.frame") in_frame <- data.frame(in_frame)
  # get initial vif value for all comparisons of variables
  vif_init <- NULL
  for(val in names(in_frame)){
    form_in <- formula(paste(val, " ~ ."))
    vif_init <- rbind(vif_init, c(val, VIF(lm(form_in, data = in_frame))))
  }
  vif_max <- max(as.numeric(vif_init[, 2]))
  if(vif_max < thresh){
    if(trace){ # print output of each iteration
      prmatrix(vif_init, collab = c("var", "vif"), rowlab = rep("", nrow(vif_init)), quote = FALSE)
      cat("\n")
      cat(paste("All variables have VIF < ", thresh, ", max VIF ", round(vif_max, 2), sep = ""), "\n\n")
    }
    return(names(in_frame))
  } else {
    in_dat <- in_frame
    # backwards selection of explanatory variables, stops when all VIF values are below "thresh"
    while(vif_max >= thresh){
      vif_vals <- NULL
      for(val in names(in_dat)){
        form_in <- formula(paste(val, " ~ ."))
        vif_add <- VIF(lm(form_in, data = in_dat))
        vif_vals <- rbind(vif_vals, c(val, vif_add))
      }
      max_row <- which(vif_vals[, 2] == max(as.numeric(vif_vals[, 2])))[1]
      vif_max <- as.numeric(vif_vals[max_row, 2])
      if(vif_max < thresh) break
      if(trace){ # print output of each iteration
        prmatrix(vif_vals, collab = c("var", "vif"), rowlab = rep("", nrow(vif_vals)), quote = FALSE)
        cat("\n")
        cat("removed: ", vif_vals[max_row, 1], vif_max, "\n\n")
        flush.console()
      }
      in_dat <- in_dat[, !names(in_dat) %in% vif_vals[max_row, 1]]
    }
    return(names(in_dat))
  }
}
I called your data set dep. You can run the function as follows:

keep.dat <- vif_func(in_frame = dep, thresh = 5, trace = TRUE)

Here thresh is the VIF threshold; you could also use 10, depending on how strict you want to be.
Here are the results: the variables judged suitable for regression, regardless of which dependent variable you want to use.
var vif
Elev 2.34467892681204
pH 4.82456111736694
OM 9.60381685354609
P 6.09927871325235
K 6.55185481475336
SoilTemp 9.76265101226991
AirTemp 10.5786139945657
removed: AirTemp 10.57861
var vif
Elev 2.10463008453203
pH 3.12149022393123
OM 5.09603402410369
P 5.56333186256743
K 6.54812601407721
SoilTemp 1.11028184053362
removed: K 6.548126
This should be a good place to start.
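Once vif_func has returned the retained predictor names (keep.dat above), you can fit a separate regression for each endpoint using only those variables. A minimal sketch, assuming your predictors sit in dep and the endpoint columns (e.g. CorL) sit in a hypothetical data frame I call endpoints here; that name is an assumption, not something from your post:

# 'endpoints' is an assumed data frame holding the response columns (CorL, CorD, ...)
model_dat <- cbind(CorL = endpoints$CorL, dep[, keep.dat])

# build the formula CorL ~ Elev + pH + ... from the names retained by vif_func
fit <- lm(reformulate(keep.dat, response = "CorL"), data = model_dat)
summary(fit)

You would repeat this with each endpoint in turn rather than forcing everything into one multiple-endpoint model.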
For the multiple testing problem it might be good to take a look at the question "Family-wise error boundary: Does re-using data sets on different studies of independent questions lead to multiple testing problems?".
In your example above, if you estimate a regression on one sample, then a t-test only lets you decide on the significance of an individual coefficient. So yes, there is a multiple testing problem if you draw conclusions about multiple coefficients based on multiple t-tests.
Let us call the coefficients $\beta_i, i = 1, 2, \dots, 5$. You can test $H_0^{(1)}: \beta_1 = 0$ versus $H_1^{(1)}: \beta_1 \ne 0$ with a t-test and conclude that $\beta_1$ is significant. Note that if you cannot reject $H_0^{(1)}$, you cannot conclude that $\beta_1$ is zero (see "What follows if we fail to reject the null hypothesis?").
So if you want to find 'statistical evidence' that $\beta_1$ is not zero, then your $H_1^{(1)}$ must be the statement you want to 'prove', i.e. $H_1^{(1)}: \beta_1 \ne 0$, and $H_0^{(1)}$ is its opposite, i.e. $\beta_1 = 0$. As you assume $H_0^{(1)}$ to be true (in order to derive a statistical contradiction), you have a fixed value for the parameter, $\beta_1 = 0$. From this it follows that you know the distribution of the estimator $\hat{\beta}_1$ (see the theory of linear regression) and you can compute p-values.
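To make these per-coefficient t-tests concrete, here is a minimal simulated sketch in R (all variable names and coefficient values are invented for illustration). summary() reports, for each $\hat{\beta}_i$, the estimate, its standard error, the t-statistic under $H_0^{(i)}: \beta_i = 0$, and the corresponding p-value:

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x3 + rnorm(n)   # beta_2 is truly zero

fit <- lm(y ~ x1 + x2 + x3)

# each row is a separate t-test of H0: beta_i = 0 against H1: beta_i != 0
summary(fit)$coefficients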
Let us now take the case where you want to show that $(\beta_1 \ne 0 \text{ and } \beta_2 \ne 0)$. This must be your $H_1^{(1,2)}$, and the opposite $H_0^{(1,2)}$ is that $(\beta_1 = 0 \text{ or } \beta_2 = 0)$. Because there is an 'or' in there, you cannot fix all the parameters of the joint distribution of $(\hat{\beta}_1, \hat{\beta}_2)$!
Can you apply multiple testing procedures? Most of them assume that the individual p-values are independent, and in this example $\hat{\beta}_1$ and $\hat{\beta}_2$ cannot be shown to be independent!
But in an advanced book on econometrics (e.g. W. H. Greene, "Econometric Analysis") you will find applicable tests for $J$ (simultaneous) linear restrictions ($\beta_i = 0, i = 1, 2, 3, 4, 5$ is a special case of 5 linear restrictions) that avoid the multiple testing problem.
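In R such a joint test of $J$ linear restrictions can be carried out as a single F-test comparing the restricted and unrestricted models. A hedged sketch on simulated data (variable names invented), testing two restrictions at once instead of running two separate t-tests:

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x3 + rnorm(n)

full       <- lm(y ~ x1 + x2 + x3)
restricted <- lm(y ~ x3)            # imposes beta_1 = 0 and beta_2 = 0 jointly

# one F-test of the J = 2 linear restrictions, so no multiple testing problem
anova(restricted, full)

# equivalent formulation with the car package, if installed:
# car::linearHypothesis(full, c("x1 = 0", "x2 = 0"))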
Best Answer
The significance of the overall model is not affected by collinearity, except in the trivial sense that adding redundant variables increases the numerator df (and decreases the denominator df) without appreciably increasing the numerator SSQ or decreasing the denominator SSQ. For individual predictors, collinearity means confounded variance, and when variance is confounded it is difficult to reach strong conclusions about individual effects. Mathematically, this is reflected in the standard errors of the regression weights, which will be large under collinearity.
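To illustrate this with the kind of simulation the question asks for, here is a hedged R sketch (data and names are invented): x1 genuinely affects y, and x2 is a nearly collinear copy of x1. Adding x2 inflates the standard error of x1's coefficient and can flip its t-test from significant to non-significant, while the model as a whole typically remains significant.

set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)      # almost collinear with x1
y  <- 0.3 * x1 + rnorm(n)

# x1 alone: small standard error, clearly significant
summary(lm(y ~ x1))$coefficients

# x1 plus its near-duplicate: the shared variance is confounded between the
# two predictors, the standard error of x1's weight blows up, and its
# individual p-value can cross 0.05, even though the fit barely improves
summary(lm(y ~ x1 + x2))$coefficients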