I don't know if a similar problem has been asked before so if it has been, please provide me a link to the related/duplicate questions. I am sorry if I seem to be asking too much. But I really like to learn this stuff and this seems to be a good place to start asking.
I have been teaching myself statistics through self-study and I found Logan's Biostatistical Design and Analysis Using R very helpful in that it shows how the actual computations are done (in R) and how the results are interpreted. I particularly like the part about multiple regression (Chapter 9). I use R since it is the most accessible (and free) software that I can get my hands into.
Right now, I am trying to learn multivariate multiple regression. But unfortunately, I can't find a good resource. My specific problem is finding the best linear model for each response variable for the following morphometric data of a plant species (that some of my high school biology students are investigating), where Leaves
, CorL
, CorD
, FilL
, AntL
, AntW
, StaL
, StiW
, and HeiP
are the response variables and pH
, OM
, P
, K
(nutrient variables), Elev
, SoilTemp
, and AirTemp
(environment variables) are the independent variables.
I don't know if it is okay to proceed as in the case of only one dependent variable, but I went through the steps of Example 9B of Logan anyhow.
lily.csv
Leaves,CorL,CorD,FilL,AntL,AntW,StaL,StiW,HeiP,Elev,pH,OM,P,K,SoilTemp,AirTemp
55,213.4,114.6,170.3,10.6,2.35,210,6.7,0.93,1431,6.37,1,3,170,29,26
44,192.15,95.25,160.6,7.1,2.25,176.4,6.55,0.79,1471,6.02,1,0,180,25,23
38,156.75,95.5,155.2,5.65,1.8,170.9,4.4,0.78,1471,6.02,1,0,180,25,23
29,191.8,88.35,155.2,10,2.5,178.25,5.9,0.75,1464,5.99,1,3,150,25,22
36,200.85,99.4,161.9,6.5,1.55,187.4,6.15,0.8,1464,5.99,1,3,150,25,22
43,210.2,74,147,7,1,170,5,0.8,1464,5.99,1,3,150,25,22
34,183.2,97.3,149.5,6.9,1.85,168.8,5.45,0.71,1464,5.99,1,3,150,25,22
52,233.3,107.7,179.6,9.2,3.05,210,6.45,0.82,1464,5.99,1,3,150,25,22
43,205.7,108.8,164.6,9.4,2,190.9,5.15,0.66,1464,5.99,1,3,150,25,22
28,203.15,119.35,160.6,8.9,2.3,180,6.85,0.77,1503,5.98,3,2,240,29.5,25.5
45,188.85,100.5,150.6,6.4,2.3,174.85,7.7,0.84,1503,5.98,3,2,240,29.5,25.5
49,205.2,126.85,150.8,10.1,2.8,177.5,9,0.84,1487,6.09,4,4,180,26,25
35,187.7,102.35,142.1,5.55,1.85,175.35,5.75,0.56,1485,6.17,3.5,1,220,24,23
23,181.05,94.6,136.6,6.9,1.8,169.3,5.8,0.59,1485,6.17,3.5,1,220,24,23
31,172.5,63.7,113.6,5.2,1.5,151.2,4.7,0.57,1482,6.29,5,2,280,24,23
34,190.5,93.1,151.9,5.65,1.85,172.5,5.25,0.68,1482,6.29,5,2,280,24,23
41,185.85,85.2,148.6,5.9,1.05,169.6,5.9,0.62,1472,6.48,0.5,3,170,25.22,22.89
29,195,159.2,159.3,15,4,185,6.3,0.59,1472,6.48,0.5,3,170,25.22,22.89
31,115.6,108.6,165.8,8.5,3,200.5,7.5,0.83,1454,5.53,5,14,350,25.22,22.89
27,176.65,93.1,128.65,6.65,2.85,180.5,6.65,0.53,1454,5.53,5,14,350,25.22,22.89
33,210,119,148,7,3,193,6,0.62,1454,5.53,5,14,350,25.22,22.89
42,200,93,166,18.3,4.55,177,8,1.12,1454,5.53,5,14,350,25.22,22.89
42,205,101.4,166.8,9,2.5,190,8.2,1.12,1454,5.53,5,14,350,25.22,22.89
25,192.9,94.15,147.8,6.45,2.3,167.65,7.15,0.61,1445,5.59,4,7,260,25.22,22.89
36,187.95,65.05,150.2,6.55,2.7,177.5,6.55,0.52,1445,5.59,4,7,260,25.22,22.89
32,110.4,11.6,168.15,7.6,2,197.95,7.85,0.73,1481,6.29,1.5,1,80,25.22,22.89
29,185,80,143,9,2,179,7.5,0.69,1481,6.29,1.5,1,80,25.22,22.89
29,179.8,70.6,134.8,11.15,3.2,165.65,5.3,0.6,1481,6.29,1.5,1,80,25.22,22.89
Firstly, I tried to investigate for possible collinearity among the variables.
library(car)
lily = read.csv("lily.csv",header=T)
scatterplotMatrix(lily,diag="boxplot")
FilL
, AntL
, AntW
, and HeiP
seem to be non-normal so I made log10
transformations. This seems to work fine. (And it is fine for you to educate me at this point if I am doing it wrong. I'd appreciate it very much.)
scatterplotMatrix(~Elev + pH + OM + P + K + SoilTemp + AirTemp +
AirTemp + Leaves + CorL + CorD + log10(FilL) + log10(log10(log10(AntL)+0.1)+0.1) +
log10(AntW) + StaL + StiW + log10(HeiP),data=lily,diag="boxplot")
I check for multicolinearity among the independent variables.
> cor(lily[,10:16])
Elev pH OM P K
Elev 1.00000000 0.48252995 -0.06601928 -0.56726786 -0.28159580
pH 0.48252995 1.00000000 -0.58587694 -0.81673123 -0.70434283
OM -0.06601928 -0.58587694 1.00000000 0.65931857 0.86478172
P -0.56726786 -0.81673123 0.65931857 1.00000000 0.79782480
K -0.28159580 -0.70434283 0.86478172 0.79782480 1.00000000
SoilTemp 0.14558365 0.01543524 -0.10436250 -0.05023853 -0.01041523
AirTemp 0.26450883 0.15711849 0.16862694 -0.09735977 0.11655030
SoilTemp AirTemp
Elev 0.14558365 0.26450883
pH 0.01543524 0.15711849
OM -0.10436250 0.16862694
P -0.05023853 -0.09735977
K -0.01041523 0.11655030
SoilTemp 1.00000000 0.83202496
AirTemp 0.83202496 1.00000000
Among the independent variables, pairs P
and pH
, K
and pH
, P
and K
, OM
and K
, and SoilTemp
and AirTemp
have strong collinearity.
I also checked for collinearity among the dependent variables although I don't have an idea if this is a alright.
> cor(lily[,1:9])
Leaves CorL CorD FilL AntL AntW
Leaves 1.000000000 0.44495257 0.17903019 0.5222644 0.1495016 0.004680606
CorL 0.444952572 1.00000000 0.51084625 0.1319070 0.2101801 0.097530007
CorD 0.179030187 0.51084625 1.00000000 0.2368117 0.3297344 0.376806953
FilL 0.522264352 0.13190704 0.23681171 1.0000000 0.3932006 0.284738542
AntL 0.149501570 0.21018008 0.32973443 0.3932006 1.0000000 0.796401542
AntW 0.004680606 0.09753001 0.37680695 0.2847385 0.7964015 1.000000000
StaL 0.416083096 0.06574503 0.23272070 0.7762797 0.2701401 0.318744025
StiW 0.194927129 -0.05594094 0.08322138 0.3752195 0.3755628 0.445964273
HeiP 0.577737137 0.17603412 0.13911530 0.6348948 0.4583508 0.254173681
StaL StiW HeiP
Leaves 0.41608310 0.19492713 0.5777371
CorL 0.06574503 -0.05594094 0.1760341
CorD 0.23272070 0.08322138 0.1391153
FilL 0.77627970 0.37521953 0.6348948
AntL 0.27014013 0.37556279 0.4583508
AntW 0.31874403 0.44596427 0.2541737
StaL 1.00000000 0.38306631 0.3794643
StiW 0.38306631 1.00000000 0.5039679
HeiP 0.37946433 0.50396793 1.0000000
From here, I can check for variance inflation and their inverses and possibly investigate interactions but I am really not sure now how to proceed or if it is alright at all to do these things in the multivariate case. And it seems to be a long way still to assessing the best multivariate model. In the case of the one dependent variable case, I can use the MuMIn
package to automate the determination of the best fit but it doesn't work in the multiple response case.
How do I proceed from this point? I will also appreciate it very much if you can point me to a good book or online material (preferably with applications in R).
Best Answer
To start with I will advice you fit different models with different endpoints.
Leaves CorL CorD FilL AntL AntW StaL StiW HeiP
. Why will you need a multiple endpoint model in the first place?. Among independent variables there is quite some multicollinearity like you have rightly said, is this something of concern?.Like you have indicated the VIF will help us with that info. Use the following R code to do VIF(variance inflation) regression before the with the selected variable you can use them for regression. Chose an appriopriate threshold for the VIF. I have used 5.`
I called you data set
dep
you can run it as followsthresh
is the threshold for VIF you can use 10 it depends on what you want. Here are the result of the good variables to use for regression independent of dependent variables you want to use.This should be a good place to start.