I can use PLS2 to predict multiple dependent variables from a matrix of predictor variables at once as follows:
library(pls)
sensory0 <- cbind(green = oliveoil$sensory[,'green'], syrup = oliveoil$sensory[,'syrup'])
oliveoil0 <- list(chemical = oliveoil$chemical, sensory = sensory0)
oopls0 <- pls::mvr(sensory ~ chemical, data = oliveoil0, method = 'simpls', validation = 'LOO')
summary(oopls0)
Data:   X dimension: 16 5
        Y dimension: 16 2
Fit method: simpls
Number of components considered: 5

VALIDATION: RMSEP
Cross-validated using 16 leave-one-out segments.
Response: green
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV           24.26    23.87    20.51    21.35    23.96    28.38
adjCV        24.26    23.79    20.41    21.20    23.70    27.98

Response: syrup
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV           3.166    2.134    2.319    2.476    2.937    3.066
adjCV        3.166    2.128    2.304    2.456    2.899    3.021

TRAINING: % variance explained
       1 comps  2 comps  3 comps  4 comps  5 comps
X        99.59    99.87   100.00   100.00   100.00
green    11.69    43.64    45.39    48.52    48.66
syrup    57.65    58.80    58.80    58.81    62.39
This suggests using two latent variables, with which we predict about 44% of the variability in green and about 59% of the variability in syrup.
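Rather than reading the RMSEP table by eye, the number of components can also be chosen with the pls package's own helpers. A hedged sketch, assuming the pls package is installed and refitting the model from above (`RMSEP()` and `selectNcomp()` are from pls; the "one-sigma" heuristic picks the fewest components within one standard error of the CV minimum):

```r
library(pls)  # assumes the pls package is installed
data(oliveoil)

# Refit the PLS2 model from above
sensory0  <- cbind(green = oliveoil$sensory[, "green"], syrup = oliveoil$sensory[, "syrup"])
oliveoil0 <- list(chemical = oliveoil$chemical, sensory = sensory0)
oopls0    <- mvr(sensory ~ chemical, data = oliveoil0, method = "simpls", validation = "LOO")

# Cross-validated RMSEP per response, per number of components
plot(RMSEP(oopls0), legendpos = "topright")

# Formal choice of ncomp via the one-sigma heuristic
ncomp_sel <- selectNcomp(oopls0, method = "onesigma")
```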
What if though, one of the two y variables is a binary variable?
medsplit <- function(x){as.numeric(x > median(x))}
sensory2 <- cbind(green = oliveoil$sensory[,'green'], syrup = medsplit(oliveoil$sensory[,'syrup']))
oliveoil2 <- list(chemical = oliveoil$chemical, sensory = sensory2)
oopls2 <- pls::mvr(sensory ~ chemical, data = oliveoil2, method = 'simpls', validation = 'LOO')
summary(oopls2)
Data:   X dimension: 16 5
        Y dimension: 16 2
Fit method: simpls
Number of components considered: 5

VALIDATION: RMSEP
Cross-validated using 16 leave-one-out segments.
Response: green
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV           24.26    23.86    20.51    21.35    23.96    28.38
adjCV        24.26    23.79    20.41    21.20    23.70    27.98

Response: syrup
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV          0.5333   0.4299   0.4251   0.4315   0.5002   0.5267
adjCV       0.5333   0.4288   0.4233   0.4293   0.4952   0.5208

TRAINING: % variance explained
       1 comps  2 comps  3 comps  4 comps  5 comps
X        99.59    99.87   100.00   100.00   100.00
green    11.70    43.64    45.39    48.52    48.66
syrup    37.79    45.74    47.38    47.46    47.78
In this case, it looks like I get a similar answer: we predict about the same variance in "green", and slightly less than before in "syrup" now that it is binary. Is this a reasonable approach? Or do I need to do something to handle the binary y variable more explicitly?
I see the plsRglm package does have an option for logistic PLS, but it seems to handle only one dependent variable rather than several. I suppose I could run two models, one to predict the binary variable and one to predict the continuous one. I prefer this PLS2 approach for interpretability, though, since it shows how the y variables together are predicted by the x variables together, although maybe that's not a sufficient reason.
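For reference, the two-model alternative might look roughly like this. This is only a sketch: it assumes the `oliveoil2` list built above, that the plsRglm package is installed, and that `plsRglm()` accepts `modele = "pls-glm-logistic"` for a logistic PLS fit as its documentation describes; I have not verified the exact call.

```r
library(pls)
library(plsRglm)  # assumed installed
data(oliveoil)

# Rebuild the mixed continuous/binary responses from above
medsplit  <- function(x){as.numeric(x > median(x))}
sensory2  <- cbind(green = oliveoil$sensory[,'green'], syrup = medsplit(oliveoil$sensory[,'syrup']))
oliveoil2 <- list(chemical = oliveoil$chemical, sensory = sensory2)

# Continuous response: ordinary PLS1
fit_green <- mvr(sensory[, "green"] ~ chemical, data = oliveoil2,
                 method = "simpls", validation = "LOO")

# Binary response: logistic PLS via plsRglm
fit_syrup <- plsRglm(dataY = oliveoil2$sensory[, "syrup"],
                     dataX = oliveoil2$chemical,
                     nt = 2, modele = "pls-glm-logistic")
```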
Thoughts? I guess my central question here is: What is the best way to use pls to approach a mix of binary and continuous variables?
Best Answer
Just throwing out some ideas:
Let's say your categorical variable, named Cat, has 4 levels: A, B, C, and D, and you have one continuous variable named Cont.
I would create a response matrix to be used in PLS2 with one 0/1 indicator column per level plus the continuous column, i.e. columns A, B, C, D, Cont. So the first sample, which has A as its categorical response and 4.2 as its continuous response, becomes the row (1, 0, 0, 0, 4.2), and so on. You can apply PLS2 to this response matrix.
For prediction, you can assign the categorical variable to whichever of A, B, C, D has the maximum predicted value.
Edit: I forgot that your discrete variable is binary. Then you have only A and B, and the same logic applies to your case too.
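The recipe above can be sketched in base R. The variable names Cat and Cont come from the answer; the three example samples are made up purely for illustration (only the first, with category A and Cont = 4.2, is from the answer):

```r
# Hypothetical data: 3 samples of a 4-level factor plus a continuous response
Cat  <- factor(c("A", "C", "B"), levels = c("A", "B", "C", "D"))
Cont <- c(4.2, 1.1, 3.0)

# One 0/1 indicator column per level, plus the continuous column.
# model.matrix(~ Cat - 1) drops the intercept so every level gets a column.
Y <- cbind(model.matrix(~ Cat - 1), Cont)
colnames(Y) <- c(levels(Cat), "Cont")
# Y is the response matrix to hand to PLS2 (e.g. as the Y block of pls::mvr)

# Decoding a prediction: pick the level whose predicted indicator is largest.
# Here we decode Y itself just to show the rule; in practice you would
# decode the matrix of fitted/predicted values.
pred_level <- levels(Cat)[max.col(Y[, levels(Cat)])]
```

For the binary case in the edit, this reduces to two indicator columns (or a single 0/1 column thresholded at 0.5), and the same argmax rule applies.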