Is the PLS-DA approach for categorical variables the same as that used for PLS regression?

classification, machine learning, partial least squares, regression

I understand the approach used in partial least squares regression (PLS regression), where the PLS components are chosen so that the covariance between the scores of the independent variables and the scores of the dependent variables is maximized.

That is the case where the dependent variables are continuous. I have read that when the dependent variable is categorical, the approach is called partial least squares discriminant analysis (PLS-DA). Is that true?

Is it the same thought process as in PLS regression, except that in PLS-DA the dependent variable would have just two values (for binary classification) and we still go ahead and maximize the covariances across two sets of PLS components?

Best Answer

Yes. PLS-DA is basically PLS regression where Y encodes a categorical variable, typically as 0/1 indicator columns with one column per group. Here is an example of a Y matrix with 3 groups, each consisting of 2 samples (the first row holds the headers and is not involved in the calculations).

[Figure: example of a Y matrix for PLS-DA]
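The original figure is not reproduced here, so the following is only a minimal sketch of what such a Y matrix might look like, assuming the usual indicator (dummy) coding. The group names "A", "B", "C" and the helper `dummy_matrix` are hypothetical and chosen for illustration.

```python
def dummy_matrix(labels):
    """Return (groups, Y), where Y has one 0/1 column per group.

    Assumption: standard indicator coding, one row per sample,
    with a 1 in the column of the group the sample belongs to.
    """
    groups = sorted(set(labels))
    Y = [[1 if label == g else 0 for g in groups] for label in labels]
    return groups, Y


# Hypothetical labels: 3 groups, 2 samples each, as in the description above.
labels = ["A", "A", "B", "B", "C", "C"]
groups, Y = dummy_matrix(labels)

print(groups)          # header row, not used in the calculations
for row in Y:
    print(row)
```

Each row of this matrix contains exactly one 1, so every row sums to 1.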

After fitting PLS-DA you obtain a BETA matrix (if you are using the SIMPLS algorithm, for example) with one column per group when there are two or more groups. There are, however, a few differences between PLS-DA and logistic regression or some other classification methods. The predicted Y values can fall outside the 0 to 1 range, so you may see values such as 1.1 or -0.3.

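Another property of PLS-DA is that the predicted Y values in each row always sum to 1, provided Y is coded with one indicator column per group and the model includes the intercept (mean-centering).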
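A short derivation may make the row-sum property less mysterious. It is a sketch under stated assumptions, not taken from the answer: Y has K indicator columns with each row summing to 1, X and Y are mean-centered (Y is not otherwise rescaled), and predictions take the form of the column means plus the scores T times the Y loadings Q, where Q comes from regressing the centered Y on T. The symbols n, K, T, Q, Y_c and the mean vector are notation introduced here.

```latex
% Assumptions: Y is n x K indicator-coded, \bar{y} holds the column means of Y,
% Y_c = Y - \mathbf{1}_n \bar{y}^{\top}, and \hat{Y} = \mathbf{1}_n \bar{y}^{\top} + T Q^{\top}.
\begin{align*}
Y \mathbf{1}_K &= \mathbf{1}_n
  && \text{(each sample belongs to exactly one group)} \\
\bar{y}^{\top} \mathbf{1}_K &= \tfrac{1}{n}\, \mathbf{1}_n^{\top} Y \mathbf{1}_K = 1 \\
Y_c \mathbf{1}_K &= \mathbf{1}_n - \mathbf{1}_n \left( \bar{y}^{\top} \mathbf{1}_K \right) = \mathbf{0} \\
Q^{\top} \mathbf{1}_K &= (T^{\top} T)^{-1} T^{\top} Y_c \mathbf{1}_K = \mathbf{0}
  && \text{(loadings are regressions of } Y_c \text{ on } T\text{)} \\
\hat{Y} \mathbf{1}_K &= \mathbf{1}_n \left( \bar{y}^{\top} \mathbf{1}_K \right) + T Q^{\top} \mathbf{1}_K = \mathbf{1}_n
\end{align*}
```

The same argument applies to a new sample, because its prediction is again the column means plus its score vector times the same loadings. Nothing in the derivation constrains individual entries to lie between 0 and 1, which is consistent with predictions such as 1.1 and -0.3.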

Predicted samples can be assigned to a group in several ways. The most common is to assign each sample to the group with the highest predicted value. An alternative (Bayesian) approach is to fit normal distributions to the training-set predictions and find the threshold that minimizes the classification error. A sample is then assigned to the group whose threshold its predicted value exceeds.
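The two assignment rules above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the answer's own implementation: the function names are hypothetical, the threshold is found by a simple grid search between the two fitted means, class priors are taken from the training-set class frequencies, and in-class predictions are assumed to have the larger mean.

```python
from statistics import NormalDist, mean, stdev


def assign_by_max(pred_row, groups):
    """Assign a sample to the group with the largest predicted Y value."""
    best = max(range(len(groups)), key=lambda k: pred_row[k])
    return groups[best]


def normal_threshold(in_class, out_class, steps=1000):
    """Cut-off for one Y column that minimizes the estimated error rate.

    in_class / out_class: training-set predictions in this column for
    samples that do / do not belong to the group. Both are modelled as
    normal distributions (assumption), and the cut-off is grid-searched
    between the two fitted means.
    """
    d_in = NormalDist(mean(in_class), stdev(in_class))
    d_out = NormalDist(mean(out_class), stdev(out_class))
    p_in = len(in_class) / (len(in_class) + len(out_class))  # assumed prior
    lo, hi = d_out.mean, d_in.mean
    best_t, best_err = lo, float("inf")
    for i in range(steps + 1):
        t = lo + (hi - lo) * i / steps
        # Missed members (below t) plus false members (above t).
        err = p_in * d_in.cdf(t) + (1 - p_in) * (1 - d_out.cdf(t))
        if err < best_err:
            best_t, best_err = t, err
    return best_t


# Hypothetical predictions for one sample and for one training column.
label = assign_by_max([1.1, -0.3, 0.2], ["A", "B", "C"])
threshold = normal_threshold(
    in_class=[0.8, 0.9, 1.0, 1.1, 1.2],
    out_class=[-0.2, -0.1, 0.0, 0.1, 0.2],
)
print(label, round(threshold, 2))
```

With equal class sizes and equal spreads, as in this toy data, the cut-off lands midway between the two means. Unequal class sizes or spreads shift it away from the midpoint, which is where this rule starts to differ from a fixed 0.5 cut-off.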