Solved – partial least squares with two dependent variables, one continuous and the other binary

binary data · continuous data · partial least squares

I can use PLS2 to predict multiple dependent variables from a matrix of predictor variables at once, as follows:

library(pls)
sensory0 <- cbind(green = oliveoil$sensory[,'green'], syrup = oliveoil$sensory[,'syrup'])
oliveoil0 <- list(chemical = oliveoil$chemical, sensory = sensory0)
oopls0 <- pls::mvr(sensory ~ chemical, data = oliveoil0, method = 'simpls', validation = 'LOO')
summary(oopls0)
   Data:  X dimension: 16 5
          Y dimension: 16 2
   Fit method: simpls
   Number of components considered: 5

   VALIDATION: RMSEP
   Cross-validated using 16 leave-one-out segments.

   Response: green 
          (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
   CV           24.26    23.87    20.51    21.35    23.96    28.38
   adjCV        24.26    23.79    20.41    21.20    23.70    27.98

   Response: syrup 
          (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
   CV           3.166    2.134    2.319    2.476    2.937    3.066
   adjCV        3.166    2.128    2.304    2.456    2.899    3.021

   TRAINING: % variance explained
          1 comps  2 comps  3 comps  4 comps  5 comps
   X        99.59    99.87   100.00   100.00   100.00
   green    11.69    43.64    45.39    48.52    48.66
   syrup    57.65    58.80    58.80    58.81    62.39

The cross-validation suggests using two latent variables, with which we predict about 44% of the variance in green and about 59% of the variance in syrup.

What if, though, one of the two y variables is binary?

medsplit <- function(x){as.numeric(x > median(x))}
sensory2 <- cbind(green = oliveoil$sensory[,'green'], syrup = medsplit(oliveoil$sensory[,'syrup']))
oliveoil2 <- list(chemical = oliveoil$chemical, sensory = sensory2)
oopls2 <- pls::mvr(sensory ~ chemical, data = oliveoil2, method = 'simpls', validation = 'LOO')
summary(oopls2)
   Data:  X dimension: 16 5
          Y dimension: 16 2
   Fit method: simpls
   Number of components considered: 5

   VALIDATION: RMSEP
   Cross-validated using 16 leave-one-out segments.

   Response: green 
          (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
   CV           24.26    23.86    20.51    21.35    23.96    28.38
   adjCV        24.26    23.79    20.41    21.20    23.70    27.98

   Response: syrup 
          (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
   CV          0.5333   0.4299   0.4251   0.4315   0.5002   0.5267
   adjCV       0.5333   0.4288   0.4233   0.4293   0.4952   0.5208

   TRAINING: % variance explained
          1 comps  2 comps  3 comps  4 comps  5 comps
   X        99.59    99.87   100.00   100.00   100.00
   green    11.70    43.64    45.39    48.52    48.66
   syrup    37.79    45.74    47.38    47.46    47.78

In this case I get a similar answer: we predict about the same variance in "green" and slightly less than before for the now-binary "syrup". Is this a reasonable approach? Or do I need to do something more explicit to handle the binary y variable?

I see the plsRglm package does have an option for logistic PLS, but it seems to handle only one dependent variable rather than several. I suppose I could run two models, one to predict the binary variable and one to predict the continuous one. I prefer the PLS2 approach for interpretability, though, since it shows how the y variables together are predicted by the x variables together, although maybe that's not a sufficient reason.
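If I did go the two-model route, a minimal sketch might look like the following. This assumes plsRglm is installed; the argument names (`dataY`, `dataX`, `nt`, `modele`) reflect my reading of its documentation, so check `?plsRglm::plsRglm` before relying on them:

```r
library(pls)
data(oliveoil)

# Binary version of syrup, as in the median split above
syrupBin <- as.numeric(oliveoil$sensory[, "syrup"] >
                       median(oliveoil$sensory[, "syrup"]))

# Model 1: ordinary PLS for the continuous response 'green'
fit_green <- pls::plsr(oliveoil$sensory[, "green"] ~ oliveoil$chemical,
                       ncomp = 2, method = "simpls", validation = "LOO")

# Model 2: logistic PLS for the binary response
# (argument names assumed from the plsRglm docs)
fit_syrup <- plsRglm::plsRglm(dataY = syrupBin, dataX = oliveoil$chemical,
                              nt = 2, modele = "pls-glm-logistic")
```

The obvious cost of this split is exactly the interpretability concern above: the two fits no longer share a common set of latent components.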

Thoughts? I guess my central question here is: What is the best way to use pls to approach a mix of binary and continuous variables?

Best Answer

Just throwing out some ideas:

Let's say your categorical variable, Cat, has 4 levels: A, B, C, D, and you have one continuous variable, Cont.

I would create a response matrix for PLS2 such that, for 3 samples (first line is headers):

A B C D Cont
- - - - ----
1 0 0 0 4.2
0 1 0 0 3.3
0 0 0 1 1.9

So the first sample has A as its categorical response and 4.2 as its continuous response, and so on. You can apply PLS2 to this response matrix.

For prediction, you can assign the categorical variable to whichever of the A, B, C, D predictions is largest.

Edit: I forgot that your discrete variable is binary. In that case you have only A and B, and the same logic applies.
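The encoding and decoding above can be sketched in base R. The variable names and toy data here are made up for illustration; only the indicator construction and the pick-the-maximum decoding come from the answer:

```r
# Toy data (hypothetical): 3 samples of a 4-level factor plus a continuous y
cat_var  <- factor(c("A", "B", "D"), levels = c("A", "B", "C", "D"))
cont_var <- c(4.2, 3.3, 1.9)

# One 0/1 indicator column per level (drop the intercept), plus the continuous column
Y <- cbind(model.matrix(~ cat_var - 1), Cont = cont_var)
colnames(Y)[1:4] <- levels(cat_var)  # rename cat_varA .. cat_varD to A .. D
print(Y)

# Decoding fitted values: pick the level whose indicator prediction is largest.
# pred_scores stands in for the categorical columns of a predict() result.
pred_scores <- matrix(c(0.9, 0.1, 0.0, 0.0,
                        0.2, 0.7, 0.1, 0.0),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(NULL, levels(cat_var)))
pred_class <- levels(cat_var)[max.col(pred_scores)]
print(pred_class)  # "A" "B"
```

Note that `max.col` breaks ties randomly by default; with real predictions you may want `ties.method = "first"` for reproducibility.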
