PLS regression relies on iterative algorithms (e.g., NIPALS, SIMPLS). Your description of the main ideas is correct: we seek one (PLS1: one response variable, multiple predictors) or two (PLS2, with different modes: multiple response variables, multiple predictors) vector(s) of weights, say $u$ (and $v$), to form linear combination(s) of the original variables such that the covariance between $Xu$ and $Y$ ($Yv$, for PLS2) is maximal. Let us focus on extracting the first pair of weights, associated with the first component. Formally, the criterion to optimize reads
$$\max\text{cov}(Xu, Yv).\qquad (1)$$
In your case, $Y$ is univariate, so this amounts to maximizing
$$\text{cov}(Xu, y)\equiv \text{Var}(Xu)^{1/2}\times\text{cor}(Xu, y)\times\text{Var}(y)^{1/2},\quad \text{s.t. } \|u\|=1.$$
Since $\text{Var}(y)$ does not depend on $u$, we have to maximize $\text{Var}(Xu)^{1/2}\times\text{cor}(Xu, y)$. Let us consider $X=[x_1; x_2]$, where the data are individually standardized (I initially made the mistake of scaling your linear combination instead of $x_1$ and $x_2$ separately!), so that $\text{Var}(x_1)=\text{Var}(x_2)=1$; however, $\text{Var}(Xu)\neq 1$ and depends on $u$. In conclusion, maximizing the correlation between the latent component and the response variable will not yield the same results.
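To see this concretely, here is a small simulation of my own (made-up data and variable names, not from the question): with two correlated standardized predictors, scanning all unit weight vectors shows that the covariance and correlation criteria select different directions.

```r
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + 0.6 * rnorm(n)          # x2 correlated with x1
X <- apply(cbind(x1, x2), 2, scale)      # standardize each predictor
y <- drop(scale(x1 + 0.5 * rnorm(n)))    # y driven by x1 only
## scan all unit weight vectors u = (cos t, sin t)
theta <- seq(0, 2 * pi, length.out = 4000)
U <- cbind(cos(theta), sin(theta))
covs <- apply(U, 1, function(u) cov(X %*% u, y))
cors <- apply(U, 1, function(u) cor(X %*% u, y))
u_cov <- U[which.max(covs), ]            # maximizes cov(Xu, y)
u_cor <- U[which.max(cors), ]            # maximizes cor(Xu, y)
## u_cov points along X'y, whereas u_cor points along the OLS direction
```

Because $\text{Var}(Xu)$ varies with $u$, the covariance-maximizing weights are pulled toward the high-variance (correlated) direction, while the correlation-maximizing weights coincide with the OLS direction.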
I should thank Arthur Tenenhaus who pointed me in the right direction.
Using unit weight vectors is not restrictive, and some packages (`pls.regression` in plsgenomics, based on code from Wehrens's earlier package pls.pcr) will return unstandardized weight vectors (but with latent components still of norm 1) if requested. However, most PLS packages will return a standardized $u$, including the one you used, notably those implementing the SIMPLS or NIPALS algorithm; I found a good overview of both approaches in Barry M. Wise's presentation, Properties of Partial Least Squares (PLS) Regression, and differences between Algorithms, but the chemometrics vignette offers a good discussion too (pp. 26-29). Of particular importance as well is the fact that most PLS routines (at least those I know of in R) assume that you provide unstandardized variables, because centering and/or scaling are handled internally (this is particularly important when doing cross-validation, for example).
Given the constraint $u'u=1$, the vector $u$ is found to be $$u=\frac{X'y}{\|X'y\|}.$$
Using a little simulation, it can be obtained as follows:
set.seed(101)
X <- replicate(2, rnorm(100))
y <- 0.6*X[,1] + 0.7*X[,2] + rnorm(100)
X <- apply(X, 2, scale)
y <- scale(y)
# NIPALS (PLS1)
u <- crossprod(X, y)
u <- u/drop(sqrt(crossprod(u))) # X weights
t <- X%*%u
p <- crossprod(X, t)/drop(crossprod(t)) # X loadings
You can compare the above results (u = [0.5792043; 0.8151824], in particular) with what R packages would give. E.g., using NIPALS from the chemometrics package (another implementation that I know of is available in the mixOmics package), we would obtain:
library(chemometrics)
pls1_nipals(X, y, 1)$W # X weights [0.5792043;0.8151824]
pls1_nipals(X, y, 1)$P # X loadings
Similar results would be obtained with `plsr` and its default kernel PLS algorithm:
> library(pls)
> as.numeric(loading.weights(plsr(y ~ X, ncomp=1)))
[1] 0.5792043 0.8151824
In all cases, we can check that $u$ is of length 1.
Provided you change your function to optimize to one that reads
f <- function(u) cov(y, X%*%(u/sqrt(crossprod(u))))
and normalize `u` afterwards (`u <- u/sqrt(crossprod(u))`), you should be closer to the above solution.
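As a sanity check (a sketch using the same simulated data as above, with `optim` standing in for whatever optimizer you used), numerically maximizing this function should recover the closed-form weights up to sign:

```r
set.seed(101)
X <- replicate(2, rnorm(100))
y <- 0.6 * X[, 1] + 0.7 * X[, 2] + rnorm(100)
X <- apply(X, 2, scale)
y <- scale(y)
## the criterion, with u normalized inside the function
f <- function(u) drop(cov(y, X %*% (u / sqrt(drop(crossprod(u))))))
opt <- optim(c(1, 1), function(u) -f(u))   # optim minimizes, hence the flip
u_hat <- opt$par / sqrt(sum(opt$par^2))    # normalize afterwards
## closed-form solution for comparison
u_ref <- drop(crossprod(X, y))
u_ref <- u_ref / sqrt(sum(u_ref^2))
```

The two vectors `u_hat` and `u_ref` should agree to several decimal places.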
Sidenote: As criterion (1) is equivalent to
$$\max u'X'Yv,$$
$u$ can be found as the left singular vector from the SVD of $X'Y$ corresponding to the largest singular value:
svd(crossprod(X, y))$u
In the more general case (PLS2), a way to summarize the above is to say that the first PLS canonical vectors are the best approximation of the covariance matrix of X and Y in both directions.
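As a small illustration of the PLS2 case (simulated data and variable names of my own choosing), the first pair of weight vectors can be read off the SVD of $X'Y$, and the resulting pair of components attains a covariance equal to the largest singular value (up to the $1/(n-1)$ factor):

```r
set.seed(1)
n <- 100
X <- apply(replicate(4, rnorm(n)), 2, scale)   # 4 predictors
Y <- apply(replicate(2, rnorm(n)), 2, scale)   # 2 responses
s <- svd(crossprod(X, Y))                      # SVD of X'Y
u <- s$u[, 1]                                  # X weights (left singular vector)
v <- s$v[, 1]                                  # Y weights (right singular vector)
## cov(Xu, Yv) equals the largest singular value of X'Y, scaled by 1/(n-1)
drop(cov(X %*% u, Y %*% v))
s$d[1] / (n - 1)
```

Keeping only this first pair corresponds to the best rank-one approximation of $X'Y$, which is the sense in which the first PLS canonical vectors best approximate the covariance structure in both directions.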
References
- Tenenhaus, M (1999). L'approche PLS. Revue de Statistique Appliquée, 47(2), 5-40.
- ter Braak, CJF and de Jong, S (1998). The objective function of partial least squares regression. Journal of Chemometrics, 12, 41-54.
- Abdi, H (2010). Partial least squares regression and projection on latent structure regression (PLS Regression). Wiley Interdisciplinary Reviews: Computational Statistics, 2, 97-106.
- Boulesteix, A-L and Strimmer, K (2007). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8(1), 32-44.
Yes. PLS-DA is basically PLS regression where Y consists of dummy (indicator) variables coding for class membership. Here is an example of a Y matrix with 3 groups, each consisting of 2 samples (the first row contains the headers and is not involved in the calculations).
![Example of Y matrix for PLS-DA](https://i.stack.imgur.com/ra6Zy.png)
After applying PLS-DA you obtain a BETA matrix (if you are using the SIMPLS algorithm, for example) whose number of columns equals the number of groups (for 2+ groups). There are, however, a few differences between PLS-DA and logistic regression and some other classification methods. The predicted Y values can fall outside the 0-1 range, so you may find values such as 1.1 or -0.3.
Another property of PLS-DA is that the sum of the predicted Y values in each row is always equal to 1.
Assigning predicted samples to a group can be done in several ways. The most common one is to assign the sample to the group with the highest predicted value. An alternative (Bayesian) approach is to fit a normal distribution to the predictions on the training set and find the threshold that minimizes the classification error; samples are then assigned to the group whose threshold value is exceeded.
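A minimal sketch of the max rule (the `Yhat` values below are made up for illustration, not the output of any fitted model; note that they can lie outside [0, 1] while each row still sums to 1):

```r
## hypothetical predicted-Y matrix for 4 samples and 3 groups
Yhat <- rbind(c( 1.10, 0.20, -0.30),
              c( 0.05, 0.85,  0.10),
              c(-0.10, 0.40,  0.70),
              c( 0.60, 0.25,  0.15))
groups <- c("A", "B", "C")
## max rule: assign each sample to the group with the largest predicted value
pred <- groups[max.col(Yhat)]
pred
```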
Best Answer
Section 3.5.2 in The Elements of Statistical Learning is useful because it puts PLS regression in the right context (of other regularization methods), but is indeed very brief, and leaves some important statements as exercises. In addition, it only considers a case of a univariate dependent variable $\mathbf y$.
The literature on PLS is vast, but can be quite confusing because there are many different "flavours" of PLS: univariate versions with a single DV $\mathbf y$ (PLS1) and multivariate versions with several DVs $\mathbf Y$ (PLS2), symmetric versions treating $\mathbf X$ and $\mathbf Y$ equally and asymmetric versions ("PLS regression") treating $\mathbf X$ as independent and $\mathbf Y$ as dependent variables, versions that allow a global solution via SVD and versions that require iterative deflations to produce every next pair of PLS directions, etc. etc.
All of this has been developed in the field of chemometrics and stays somewhat disconnected from the "mainstream" statistical or machine learning literature.
The overview paper that I find most useful (and that contains many further references) is:
For a more theoretical discussion I can further recommend:
A short primer on PLS regression with univariate $y$ (aka PLS1, aka SIMPLS)
The goal of regression is to estimate $\beta$ in a linear model $\mathbf y=\mathbf X\beta + \epsilon$. The OLS solution $\beta=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y$ enjoys many optimality properties but can suffer from overfitting. Indeed, OLS looks for the $\beta$ that yields the highest possible correlation of $\mathbf X \beta$ with $\mathbf y$. If there are many predictors, then it is always possible to find some linear combination that happens to have a high correlation with $\mathbf y$. This will be a spurious correlation, and such a $\beta$ will usually point in a direction explaining very little variance in $\mathbf X$. Directions explaining very little variance are often very "noisy" directions. If so, then even though the OLS solution performs great on training data, it will perform much worse on test data.
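A quick simulation illustrates the point (a sketch with made-up dimensions; the response is pure noise, so any in-sample correlation is spurious):

```r
set.seed(7)
n <- 50; p <- 40
X_tr <- matrix(rnorm(n * p), n, p); y_tr <- rnorm(n)  # y is pure noise
X_te <- matrix(rnorm(n * p), n, p); y_te <- rnorm(n)
## OLS with no intercept on the training data
beta <- coef(lm(y_tr ~ X_tr - 1))
cor(X_tr %*% beta, y_tr)  # large in-sample correlation despite no true signal
cor(X_te %*% beta, y_te)  # close to zero on new data
```

With $p$ close to $n$, the in-sample $R^2$ is roughly $p/n$ even for a noise response, which is exactly the overfitting that shrinkage methods are designed to combat.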
In order to prevent overfitting, one uses regularization methods that essentially force $\beta$ to point into directions of high variance in $\mathbf X$ (this is also called "shrinkage" of $\beta$; see Why does shrinkage work?). One such method is principal component regression (PCR) that simply discards all low-variance directions. Another (better) method is ridge regression that smoothly penalizes low-variance directions. Yet another method is PLS1.
PLS1 replaces the OLS goal of finding $\beta$ that maximizes correlation $\operatorname{corr}(\mathbf X \beta, \mathbf y)$ with an alternative goal of finding $\beta$ with length $\|\beta\|=1$ maximizing covariance $$\operatorname{cov}(\mathbf X \beta, \mathbf y)\sim\operatorname{corr}(\mathbf X \beta, \mathbf y)\cdot\sqrt{\operatorname{var}(\mathbf X \beta)},$$ which again effectively penalizes directions of low variance.
Finding such $\beta$ (let's call it $\beta_1$) yields the first PLS component $\mathbf z_1 = \mathbf X \beta_1$. One can further look for the second (and then third, etc.) PLS component that has the highest possible covariance with $\mathbf y$ under the constraint of being uncorrelated with all the previous components. This has to be solved iteratively, as there is no closed-form solution for all components (the direction of the first component $\beta_1$ is simply given by $\mathbf X^\top \mathbf y$ normalized to unit length). When the desired number of components has been extracted, PLS regression discards the original predictors and uses the PLS components as new predictors; this yields a linear combination $\beta_z$ of the components that can be combined with all the $\beta_i$ to form the final $\beta_\mathrm{PLS}$.
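The iterative procedure can be sketched as follows (a NIPALS-style PLS1 loop on simulated data; variable names are mine, not from any package). Each new component is extracted from the deflated $\mathbf X$, which enforces the uncorrelatedness constraint:

```r
set.seed(101)
X <- apply(replicate(5, rnorm(100)), 2, scale)
y <- drop(scale(X %*% rnorm(5) + rnorm(100)))
ncomp <- 2
E <- X                                    # working copy of X, deflated each step
W <- S <- NULL                            # weight vectors and component scores
for (k in seq_len(ncomp)) {
  w <- crossprod(E, y)
  w <- w / sqrt(drop(crossprod(w)))       # unit-norm weights (X'y direction)
  t_k <- E %*% w                          # scores of the k-th component
  p_k <- crossprod(E, t_k) / drop(crossprod(t_k))  # loadings
  E <- E - tcrossprod(t_k, p_k)           # deflation: remove the explained part
  W <- cbind(W, w); S <- cbind(S, t_k)
}
## successive component scores are exactly uncorrelated by construction
```

The final regression of $\mathbf y$ on the columns of the score matrix then yields the coefficients that are mapped back to the original predictors to form $\beta_\mathrm{PLS}$.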
Note that:
All that being said, I am not aware of any practical advantages of PLS1 regression over ridge regression (while the latter does have lots of advantages: it is continuous rather than discrete, has an analytical solution, is much more standard, allows kernel extensions and analytical formulas for leave-one-out cross-validation errors, etc.).
Quoting from Frank & Friedman:
They also conduct an extensive simulation study and conclude (emphasis mine):
Update: In the comments @cbeleites (who works in chemometrics) suggests two possible advantages of PLS over RR:
An analyst can have an a priori guess as to how many latent components should be present in the data; this effectively allows one to set the regularization strength without doing cross-validation (and there might not be enough data to do a reliable CV). Such an a priori choice of $\lambda$ might be more problematic in RR.
RR yields one single linear combination $\beta_\mathrm{RR}$ as an optimal solution. In contrast, PLS with e.g. five components yields five linear combinations $\beta_i$ that are then combined to predict $y$. Original variables that are strongly inter-correlated are likely to be combined into a single PLS component (because combining them increases the explained-variance term). So it might be possible to interpret the individual PLS components as real latent factors driving $y$. The claim is that it is easier to interpret $\beta_1, \beta_2,$ etc., than the joint $\beta_\mathrm{PLS}$. Compare this with PCR, where individual principal components can potentially also be interpreted and assigned some qualitative meaning, which one can likewise see as an advantage.