Partial answer to question 1: I have not really checked, but I think one simply obtains a conjugate family for model $(**)$ by taking normal distributions on the within subspace (the range of the $X$ matrix) as prior distributions (conditionally on $\Sigma$) for the within parameter ${\boldsymbol \beta}$. In practice it should suffice to project the posterior distributions of model $(*)$ onto the within subspace, and the Jeffreys prior of model $(*)$ should then induce the Jeffreys prior of model $(**)$.
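Concretely, here is the reasoning as a sketch: writing $P = {(X'X)}^{-1}X'$ for the map from the mean vector to the within coordinates (this is the $P$ of the code below, not the orthogonal projection $X{(X'X)}^{-1}X'$), a linear image of a normal distribution is normal, so the standard conditional posterior of the multinormal model projects to
\begin{align*}
{\boldsymbol \mu} \mid \Sigma, {\boldsymbol y} \sim {\cal N}_m\left(\bar{\boldsymbol y}, \Sigma/n\right)
\quad \Longrightarrow \quad
P{\boldsymbol \mu} \mid \Sigma, {\boldsymbol y} \sim {\cal N}\left(P\bar{\boldsymbol y},\, P \Sigma P'/n\right).
\end{align*}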
Question 1bis: I am not entirely satisfied with the previous answer. What about the posterior distribution of $\Sigma$? (I will think about this question later because I'm too hungry now.)
Partial answer to question 1bis: Indeed, the partial answer to question 1 seems to be wrong. The Jeffreys posterior for the covariance matrix of the multinormal model is inverse-Wishart, with a mean related to the frequentist estimate $\hat\Sigma$ of $\Sigma$ in model $(*)$. Here we should instead have the variation around the fitted values of each response (see question 2bis), which are not the means of each response when there is a within-design.
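For reference (from memory, and in the same parameterization as the Update below), the Jeffreys posterior of the multinormal model without within design is
\begin{align*}
\Sigma \mid {\boldsymbol y} & \sim {\cal IW}_m\left(n, S^{-1}\right), \qquad S = \sum_{i=1}^n ({\boldsymbol y}_i - \bar{\boldsymbol y}){({\boldsymbol y}_i - \bar{\boldsymbol y})}', \\
{\boldsymbol \mu} \mid \Sigma, {\boldsymbol y} & \sim {\cal N}_m\left(\bar{\boldsymbol y}, \Sigma/n\right),
\end{align*}
so the question is what replaces $S$ and $\bar{\boldsymbol y}$ when there is a within design.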
Partial answer to question 2: Since the within parameters are estimated simply by applying a linear projection to the estimated parameters of model $(*)$, the least-squares theory of model $(**)$ is inherited from the least-squares theory of model $(*)$. No?
And $\hat\Sigma$ in model $(**)$ should be $\frac{1}{n-1}({\boldsymbol y} - {\boldsymbol 1}_n \otimes X\hat{\boldsymbol \beta}){({\boldsymbol y} - {\boldsymbol 1}_n \otimes X\hat{\boldsymbol \beta})}'$, where $X \hat{\boldsymbol \beta} = P \hat{\boldsymbol \mu}^*$, $\hat{\boldsymbol \mu}^*$ is the estimate of ${\boldsymbol \mu}$ in model $(*)$, and $P$ is the projection on the range of $X$. In the example below, gls() returns $\frac{1}{n}({\boldsymbol y} - {\boldsymbol 1}_n \otimes X\hat{\boldsymbol \beta}){({\boldsymbol y} - {\boldsymbol 1}_n \otimes X\hat{\boldsymbol \beta})}'$ as the maximum likelihood estimate of $\Sigma$.
Below is R code for the Bayesian analysis using the bayesm package, with the timepoints example from my OP.
################################
#### SIMULATED DATA ####
################################
library(mvtnorm)
m <- 3 # number of timepoints
n <- 24 # number of individuals
timepoints <- c(1,2,3)
dat <- data.frame(Subject=gl(n,m), timepoint=rep(timepoints,n))
# model parameters
alpha <- 0 # within intercept
beta <- 1 # within slope
Mean <- alpha+beta*timepoints
Sigma <- matrix(c(4,1,1,1,2,1,1,1,2), ncol=3)/2
# simulated responses
set.seed(666)
dat$response <- c(t(rmvnorm(n=n, mean=Mean, sigma=Sigma)))
# data in wide format
wdat <- reshape(dat, direction="wide", v.names="response", idvar="Subject", timevar="timepoint")
################################
## BAYESIAN ANALYSIS ##
## WITH JEFFREYS PRIOR ##
## IGNORING THE WITHIN DESIGN ##
################################
library(bayesm)
##### set data parameters #####
Y <- as.matrix(wdat[c("response.1","response.2","response.3")])
X <- model.matrix(Y~1) # matrix of covariates
p <- dim(X)[2] # number of covariates
##### (noninformative) prior parameters #####
theta0 <- matrix(rep(0,p*m), ncol=m)
A0 <- 0.0001*diag(p)
nu0 <- 0
Omega0 <- matrix(rep(0,m*m),ncol=m)
##### posterior simulations #####
n.sims <- 10000 # number of simulations
Beta.sims <- array(NA, dim=c(p,m,n.sims)) # for storing simulations of beta
Sigma.sims <- array(NA, dim=c(m,m,n.sims)) # for storing simulations of Sigma
for(sim in 1:n.sims){
  reg <- rmultireg(Y=Y, X=X, Bbar=theta0, A=A0, nu=nu0, V=Omega0)
  Beta.sims[,,sim] <- reg$B
  Sigma.sims[,,sim] <- reg$Sigma
}
> # compare expected posterior means with true means:
> rowMeans(Beta.sims, dims = 2)
[,1] [,2] [,3]
[1,] 1.053459 1.956234 2.861015
> Mean
[1] 1 2 3
> # theoretically these are also the frequentist estimates:
> colMeans(Y)
response.1 response.2 response.3
1.045015 1.956583 2.858293
> # compare posterior median Sigma with true Sigma:
> apply(Sigma.sims, c(1, 2), quantile, probs = 0.5)
[,1] [,2] [,3]
[1,] 2.5455707 0.6033634 0.9018785
[2,] 0.6033634 1.4969095 0.6485045
[3,] 0.9018785 0.6485045 1.0620879
> Sigma
[,1] [,2] [,3]
[1,] 2.0 0.5 0.5
[2,] 0.5 1.0 0.5
[3,] 0.5 0.5 1.0
##############################################
## PROJECTIONS OF THE POSTERIOR SIMULATIONS ##
## ON THE WITHIN-DESIGN SUBSPACE ##
##############################################
# map from the mean vector to the within parameters: P = (X'X)^{-1} X'
X <- model.matrix(~1+timepoints)
P <- solve(t(X)%*%X)%*%t(X)
rownames(P) <- c("alpha","beta")
# applying P to the posterior draws of the mean vector gives posterior draws of the within parameters:
sims <- P%*%Beta.sims[1,,]
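# quick sanity check (my addition, not part of the analysis):
# P applied to the true mean vector recovers the true within parameters
P %*% Mean # should return (alpha, beta) = (0, 1)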
> # posterior summaries
> summary(data.frame(alpha=sims["alpha",], beta=sims["beta",]))
alpha beta
Min. :-1.5637 Min. :0.3387
1st Qu.:-0.1171 1st Qu.:0.8134
Median : 0.1472 Median :0.9038
Mean : 0.1493 Mean :0.9038
3rd Qu.: 0.4106 3rd Qu.:0.9953
Max. : 2.0783 Max. :1.5786
> # frequentist estimates
> P%*%colMeans(Y)
[,1]
alpha 0.1400198
beta 0.9066386
# note: maximum likelihood estimates from gls() differ slightly from the "frequentist estimates"
# above, since gls() weights the observations by the estimated covariance structure
library(nlme)
fit <- gls(response ~ timepoint, data=dat, method="ML",
correlation=corSymm(form= ~ timepoint | Subject),
weights=varIdent(form = ~1 | timepoint))
> coef(fit)
(Intercept) timepoint
0.1430734 0.9054123
> ## the ML estimate of Sigma is the mean variation around the fitted values:
> # calculate this variation
> alpha.f <- (P%*%colMeans(Y))["alpha",]
> beta.f <- (P%*%colMeans(Y))["beta",]
> fitted.frequentist <- alpha.f + beta.f*timepoints
> crossprod(Y-rep(1,n)%*%t(fitted.frequentist))
response.1 response.2 response.3
response.1 54.45126 13.18208 19.65210
response.2 13.18208 32.16598 14.20923
response.3 19.65210 14.20923 22.76616
> # compare:
> getVarCov(fit) * n
Marginal variance covariance matrix
[,1] [,2] [,3]
[1,] 54.451 13.182 19.652
[2,] 13.182 32.166 14.209
[3,] 19.652 14.209 22.766
Standard Deviations: 7.3791 5.6715 4.7714
Update
After these thoughts, I finally conjecture that the Jeffreys posterior is given by:
\begin{align*}
\Sigma \mid {\boldsymbol y} & \sim {\cal IW}_m(n, \Omega_n^{-1}) \\
{\boldsymbol \mu} \mid \Sigma, {\boldsymbol y}
& \sim {\cal N}_{n \times m}\left(X\hat{\boldsymbol \beta}, \Sigma \otimes J_n^{-1} \right)
\end{align*}
with
\begin{align*}
J_n & = {(1,1,\ldots,1)}'(1,1,\ldots,1) \\
\Omega_n & = ({\boldsymbol y} -{\boldsymbol 1}_n \otimes X\hat{\boldsymbol \beta}){({\boldsymbol y} - {\boldsymbol 1}_n \otimes X\hat{\boldsymbol \beta})}'
\end{align*}
And then the posterior distributions of the within parameters are obtained by applying the within-subspace projection $P$ to ${\boldsymbol \mu}$.
When there is no within design this is the Jeffreys posterior for the multinormal model.
Now I don't have the courage to check my conjecture.
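For anyone with more courage, here is a minimal sketch of how the conjecture could be checked against the bayesm simulations above. Two hedged assumptions: the ${\cal IW}_m(n, \Omega_n^{-1})$ notation is taken to correspond to MCMCpack::riwish(n, Omega.n), and $\Sigma \otimes J_n^{-1}$ is read as $\Sigma/n$ (i.e. $J_n = n$); both readings are guesses, not verified.

library(MCMCpack) # for riwish(); assumed: IW_m(n, Omega_n^{-1}) corresponds to riwish(n, Omega.n)
Omega.n <- crossprod(Y - rep(1,n)%*%t(fitted.frequentist)) # Omega_n, i.e. n times the ML estimate of Sigma
mu.hat <- fitted.frequentist # X beta_hat, the fitted values
sims.conj <- matrix(NA, n.sims, 2, dimnames=list(NULL, c("alpha","beta")))
for(sim in 1:n.sims){
  Sig <- riwish(n, Omega.n)                     # draw Sigma | y
  mu <- c(rmvnorm(1, mean=mu.hat, sigma=Sig/n)) # draw mu | Sigma, y (reading Sigma (x) Jn^{-1} as Sigma/n)
  sims.conj[sim,] <- P %*% mu                   # project onto the within subspace
}
summary(data.frame(alpha=sims.conj[,"alpha"], beta=sims.conj[,"beta"])) # compare with the summaries above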
To start with, I would advise you to fit different models for the different endpoints: Leaves, CorL, CorD, FilL, AntL, AntW, StaL, StiW, HeiP. Why would you need a multiple-endpoint model in the first place? Among the independent variables there is quite some multicollinearity, as you have rightly said; is this something of concern? As you have indicated, the VIF will give us that information. Use the following R code to compute variance inflation factors (VIF) before the regression; the selected variables can then be used for the regression. Choose an appropriate threshold for the VIF; I have used 5.
vif_func <- function(in_frame, thresh=10, trace=T){
  require(fmsb)
  if(class(in_frame) != "data.frame") in_frame <- data.frame(in_frame)
  # get initial VIF value for all comparisons of variables
  vif_init <- NULL
  for(val in names(in_frame)){
    form_in <- formula(paste(val, " ~ ."))
    vif_init <- rbind(vif_init, c(val, VIF(lm(form_in, data=in_frame))))
  }
  vif_max <- max(as.numeric(vif_init[,2]))
  if(vif_max < thresh){
    if(trace==T){ # print output of each iteration
      prmatrix(vif_init, collab=c("var","vif"), rowlab=rep("", nrow(vif_init)), quote=F)
      cat("\n")
      cat(paste("All variables have VIF < ", thresh, ", max VIF ", round(vif_max, 2), sep=""), "\n\n")
    }
    return(names(in_frame))
  } else {
    in_dat <- in_frame
    # backwards selection of explanatory variables; stops when all VIF values are below thresh
    while(vif_max >= thresh){
      vif_vals <- NULL
      for(val in names(in_dat)){
        form_in <- formula(paste(val, " ~ ."))
        vif_add <- VIF(lm(form_in, data=in_dat))
        vif_vals <- rbind(vif_vals, c(val, vif_add))
      }
      max_row <- which(vif_vals[,2] == max(as.numeric(vif_vals[,2])))[1]
      vif_max <- as.numeric(vif_vals[max_row, 2])
      if(vif_max < thresh) break
      if(trace==T){ # print the variable removed at each iteration
        prmatrix(vif_vals, collab=c("var","vif"), rowlab=rep("", nrow(vif_vals)), quote=F)
        cat("\n")
        cat("removed: ", vif_vals[max_row, 1], vif_max, "\n\n")
        flush.console()
      }
      in_dat <- in_dat[, !names(in_dat) %in% vif_vals[max_row, 1]]
    }
    return(names(in_dat))
  }
}
I called your data set dep; you can run it as follows:

keep.dat <- vif_func(in_frame=dep, thresh=5, trace=T)

thresh is the threshold for the VIF; you can use 10, it depends on what you want.
Here are the results: the variables that are good to use for regression, independently of which dependent variables you want to use.
var vif
Elev 2.34467892681204
pH 4.82456111736694
OM 9.60381685354609
P 6.09927871325235
K 6.55185481475336
SoilTemp 9.76265101226991
AirTemp 10.5786139945657
removed: AirTemp 10.57861
var vif
Elev 2.10463008453203
pH 3.12149022393123
OM 5.09603402410369
P 5.56333186256743
K 6.54812601407721
SoilTemp 1.11028184053362
removed: K 6.548126
This should be a good place to start.
Best Answer
You can try PLS regression, which is suited to datasets with multiple dependent variables.
Q1: For a multivariate linear model, combine it with the clusterwise approach.
Q2: PLS is built to work with many independent variables, even with more variables than observations.
There is an R package available.
We're working on another open-source implementation which will improve on it; it will be available soon.
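As an illustration only (the answer does not name the package, so this sketch assumes the pls package from CRAN, and borrows the variable names and the dep data frame from the VIF answer above):

library(pls)
# multi-response PLS: several dependent variables on the left-hand side
fit.pls <- plsr(cbind(CorL, CorD, FilL, AntL) ~ Elev + pH + OM + P + SoilTemp,
                data=dep, validation="CV") # add the other endpoints similarly
summary(fit.pls) # explained variance per number of components
RMSEP(fit.pls)   # cross-validated prediction error for each response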