Solved – Finding the best linear model for each response variable in multivariate multiple regression using R

multiple regressionmultivariate regressionrregressionself-study

I don't know if a similar problem has been asked before so if it has been, please provide me a link to the related/duplicate questions. I am sorry if I seem to be asking too much. But I really like to learn this stuff and this seems to be a good place to start asking.

I have been teaching myself statistics through self-study and I found Logan's Biostatistical Design and Analysis Using R very helpful in that it shows how the actual computations are done (in R) and how the results are interpreted. I particularly like the part about multiple regression (Chapter 9). I use R since it is the most accessible (and free) software that I can get my hands into.

Right now, I am trying to learn multivariate multiple regression. But unfortunately, I can't find a good resource. My specific problem is finding the best linear model for each response variable for the following morphometric data of a plant species (that some of my high school biology students are investigating), where Leaves, CorL, CorD, FilL, AntL, AntW, StaL, StiW, and HeiP are the response variables and pH, OM, P, K (nutrient variables), Elev, SoilTemp, and AirTemp (environment variables) are the independent variables.

I don't know if it is okay to proceed as in the case of only one dependent variable, but I went through the steps of Example 9B of Logan anyhow.

lily.csv

Leaves,CorL,CorD,FilL,AntL,AntW,StaL,StiW,HeiP,Elev,pH,OM,P,K,SoilTemp,AirTemp
55,213.4,114.6,170.3,10.6,2.35,210,6.7,0.93,1431,6.37,1,3,170,29,26
44,192.15,95.25,160.6,7.1,2.25,176.4,6.55,0.79,1471,6.02,1,0,180,25,23
38,156.75,95.5,155.2,5.65,1.8,170.9,4.4,0.78,1471,6.02,1,0,180,25,23
29,191.8,88.35,155.2,10,2.5,178.25,5.9,0.75,1464,5.99,1,3,150,25,22
36,200.85,99.4,161.9,6.5,1.55,187.4,6.15,0.8,1464,5.99,1,3,150,25,22
43,210.2,74,147,7,1,170,5,0.8,1464,5.99,1,3,150,25,22
34,183.2,97.3,149.5,6.9,1.85,168.8,5.45,0.71,1464,5.99,1,3,150,25,22
52,233.3,107.7,179.6,9.2,3.05,210,6.45,0.82,1464,5.99,1,3,150,25,22
43,205.7,108.8,164.6,9.4,2,190.9,5.15,0.66,1464,5.99,1,3,150,25,22
28,203.15,119.35,160.6,8.9,2.3,180,6.85,0.77,1503,5.98,3,2,240,29.5,25.5
45,188.85,100.5,150.6,6.4,2.3,174.85,7.7,0.84,1503,5.98,3,2,240,29.5,25.5
49,205.2,126.85,150.8,10.1,2.8,177.5,9,0.84,1487,6.09,4,4,180,26,25
35,187.7,102.35,142.1,5.55,1.85,175.35,5.75,0.56,1485,6.17,3.5,1,220,24,23
23,181.05,94.6,136.6,6.9,1.8,169.3,5.8,0.59,1485,6.17,3.5,1,220,24,23
31,172.5,63.7,113.6,5.2,1.5,151.2,4.7,0.57,1482,6.29,5,2,280,24,23
34,190.5,93.1,151.9,5.65,1.85,172.5,5.25,0.68,1482,6.29,5,2,280,24,23
41,185.85,85.2,148.6,5.9,1.05,169.6,5.9,0.62,1472,6.48,0.5,3,170,25.22,22.89
29,195,159.2,159.3,15,4,185,6.3,0.59,1472,6.48,0.5,3,170,25.22,22.89
31,115.6,108.6,165.8,8.5,3,200.5,7.5,0.83,1454,5.53,5,14,350,25.22,22.89
27,176.65,93.1,128.65,6.65,2.85,180.5,6.65,0.53,1454,5.53,5,14,350,25.22,22.89
33,210,119,148,7,3,193,6,0.62,1454,5.53,5,14,350,25.22,22.89
42,200,93,166,18.3,4.55,177,8,1.12,1454,5.53,5,14,350,25.22,22.89
42,205,101.4,166.8,9,2.5,190,8.2,1.12,1454,5.53,5,14,350,25.22,22.89
25,192.9,94.15,147.8,6.45,2.3,167.65,7.15,0.61,1445,5.59,4,7,260,25.22,22.89
36,187.95,65.05,150.2,6.55,2.7,177.5,6.55,0.52,1445,5.59,4,7,260,25.22,22.89
32,110.4,11.6,168.15,7.6,2,197.95,7.85,0.73,1481,6.29,1.5,1,80,25.22,22.89
29,185,80,143,9,2,179,7.5,0.69,1481,6.29,1.5,1,80,25.22,22.89
29,179.8,70.6,134.8,11.15,3.2,165.65,5.3,0.6,1481,6.29,1.5,1,80,25.22,22.89

Firstly, I tried to investigate for possible collinearity among the variables.

library(car)
lily = read.csv("lily.csv",header=T)
scatterplotMatrix(lily,diag="boxplot")

enter image description here

FilL, AntL, AntW, and HeiP seem to be non-normal so I made log10 transformations. This seems to work fine. (And it is fine for you to educate me at this point if I am doing it wrong. I'd appreciate it very much.)

scatterplotMatrix(~Elev + pH + OM + P + K + SoilTemp + AirTemp +
AirTemp + Leaves + CorL + CorD + log10(FilL) + log10(log10(log10(AntL)+0.1)+0.1) + 
log10(AntW) + StaL + StiW + log10(HeiP),data=lily,diag="boxplot")

enter image description here

I check for multicolinearity among the independent variables.

> cor(lily[,10:16])
                Elev          pH          OM           P           K
Elev      1.00000000  0.48252995 -0.06601928 -0.56726786 -0.28159580
pH        0.48252995  1.00000000 -0.58587694 -0.81673123 -0.70434283
OM       -0.06601928 -0.58587694  1.00000000  0.65931857  0.86478172
P        -0.56726786 -0.81673123  0.65931857  1.00000000  0.79782480
K        -0.28159580 -0.70434283  0.86478172  0.79782480  1.00000000
SoilTemp  0.14558365  0.01543524 -0.10436250 -0.05023853 -0.01041523
AirTemp   0.26450883  0.15711849  0.16862694 -0.09735977  0.11655030
            SoilTemp     AirTemp
Elev      0.14558365  0.26450883
pH        0.01543524  0.15711849
OM       -0.10436250  0.16862694
P        -0.05023853 -0.09735977
K        -0.01041523  0.11655030
SoilTemp  1.00000000  0.83202496
AirTemp   0.83202496  1.00000000

Among the independent variables, pairs P and pH, K and pH, P and K, OM and K, and SoilTemp and AirTemp have strong collinearity.

I also checked for collinearity among the dependent variables although I don't have an idea if this is a alright.

> cor(lily[,1:9])
            Leaves        CorL       CorD      FilL      AntL        AntW
Leaves 1.000000000  0.44495257 0.17903019 0.5222644 0.1495016 0.004680606
CorL   0.444952572  1.00000000 0.51084625 0.1319070 0.2101801 0.097530007
CorD   0.179030187  0.51084625 1.00000000 0.2368117 0.3297344 0.376806953
FilL   0.522264352  0.13190704 0.23681171 1.0000000 0.3932006 0.284738542
AntL   0.149501570  0.21018008 0.32973443 0.3932006 1.0000000 0.796401542
AntW   0.004680606  0.09753001 0.37680695 0.2847385 0.7964015 1.000000000
StaL   0.416083096  0.06574503 0.23272070 0.7762797 0.2701401 0.318744025
StiW   0.194927129 -0.05594094 0.08322138 0.3752195 0.3755628 0.445964273
HeiP   0.577737137  0.17603412 0.13911530 0.6348948 0.4583508 0.254173681
             StaL        StiW      HeiP
Leaves 0.41608310  0.19492713 0.5777371
CorL   0.06574503 -0.05594094 0.1760341
CorD   0.23272070  0.08322138 0.1391153
FilL   0.77627970  0.37521953 0.6348948
AntL   0.27014013  0.37556279 0.4583508
AntW   0.31874403  0.44596427 0.2541737
StaL   1.00000000  0.38306631 0.3794643
StiW   0.38306631  1.00000000 0.5039679
HeiP   0.37946433  0.50396793 1.0000000

From here, I can check for variance inflation and their inverses and possibly investigate interactions but I am really not sure now how to proceed or if it is alright at all to do these things in the multivariate case. And it seems to be a long way still to assessing the best multivariate model. In the case of the one dependent variable case, I can use the MuMIn package to automate the determination of the best fit but it doesn't work in the multiple response case.

How do I proceed from this point? I will also appreciate it very much if you can point me to a good book or online material (preferably with applications in R).

Best Answer

To start with I will advice you fit different models with different endpoints. Leaves CorL CorD FilL AntL AntW StaL StiW HeiP. Why will you need a multiple endpoint model in the first place?. Among independent variables there is quite some multicollinearity like you have rightly said, is this something of concern?.Like you have indicated the VIF will help us with that info. Use the following R code to do VIF(variance inflation) regression before the with the selected variable you can use them for regression. Chose an appriopriate threshold for the VIF. I have used 5.

`

vif_func<-function(in_frame,thresh=10,trace=T){

  require(fmsb)

  if(class(in_frame) != "data.frame") in_frame<-data.frame(in_frame)

  #get initial vif value for all comparisons of variables
  vif_init<-NULL
  for(val in names(in_frame)){
    form_in<-formula(paste(val," ~ ."))
    vif_init<-rbind(vif_init,c(val,VIF(lm(form_in,data=in_frame))))



}
  vif_max<-max(as.numeric(vif_init[,2]))

  if(vif_max < thresh){
    if(trace==T){ #print output of each iteration
      prmatrix(vif_init,collab=c("var","vif"),rowlab=rep("",nrow(vif_init)),quote=F)
      cat("\n")
      cat(paste("All variables have VIF < ", thresh,", max VIF ",round(vif_max,2), sep=""),"\n\n")
    }
    return(names(in_frame))
  }

  else{

in_dat<-in_frame

#backwards selection of explanatory variables, stops when all VIF values are below "thresh"
while(vif_max >= thresh){

  vif_vals<-NULL

  for(val in names(in_dat)){
    form_in<-formula(paste(val," ~ ."))
    vif_add<-VIF(lm(form_in,data=in_dat))
    vif_vals<-rbind(vif_vals,c(val,vif_add))
  }
  max_row<-which(vif_vals[,2] == max(as.numeric(vif_vals[,2])))[1]

  vif_max<-as.numeric(vif_vals[max_row,2])

  if(vif_max<thresh) break

  if(trace==T){ #print output of each iteration
    prmatrix(vif_vals,collab=c("var","vif"),rowlab=rep("",nrow(vif_vals)),quote=F)
    cat("\n")
    cat("removed: ",vif_vals[max_row,1],vif_max,"\n\n")
    flush.console()
  }

  in_dat<-in_dat[,!names(in_dat) %in% vif_vals[max_row,1]]

}

return(names(in_dat))



 }

}`

I called you data set dep you can run it as follows

keep.dat<-vif_func(in_frame=dep,thresh=5,trace=T)

thresh is the threshold for VIF you can use 10 it depends on what you want. Here are the result of the good variables to use for regression independent of dependent variables you want to use.

var      vif             


Elev     2.34467892681204
 pH       4.82456111736694
 OM       9.60381685354609
 P        6.09927871325235
 K        6.55185481475336
 SoilTemp 9.76265101226991
 AirTemp  10.5786139945657

removed:  AirTemp 10.57861 

 var      vif             
 Elev     2.10463008453203
 pH       3.12149022393123
 OM       5.09603402410369
 P        5.56333186256743
 K        6.54812601407721
 SoilTemp 1.11028184053362

removed:  K 6.548126 

This should be a good place to start.

Related Question