When we say that the standard OLS regression has some assumptions, we mean that these assumptions are needed to derive certain desirable properties of the OLS estimator, such as its being the best linear unbiased estimator -- see the Gauss-Markov theorem and an excellent answer by @mpiktas in What is a complete list of the usual assumptions for linear regression? No assumptions are needed in order to simply regress $y$ on $X$. Assumptions only appear in the context of optimality statements.
More generally, "assumptions" is something that only a theoretical result (theorem) can have.
Similarly for PLS regression. It is always possible to use PLS regression to regress $y$ on $X$. So when you ask what the assumptions of PLS regression are, which optimality statements do you have in mind? In fact, I am not aware of any. PLS regression is one form of shrinkage regularization; see my answer in Theory behind partial least squares regression for some context and overview. Regularized estimators are biased, so no set of assumptions will, e.g., prove unbiasedness.
Moreover, the actual outcome of PLS regression depends on how many PLS components are included in the model, and this number acts as a regularization parameter. Talking about any assumptions only makes sense if the procedure for selecting this parameter is completely specified (and it usually isn't). So I don't think there are any optimality results for PLS at all, which means that PLS regression has no assumptions. I think the same is true for other penalized regression methods such as principal component regression or ridge regression.
Update: I have expanded this argument in my answer to What are the assumptions of ridge regression and how to test them?
Of course, there can still be rules of thumb that say when PLS regression is likely to be useful and when not. Please see my answer linked above for some discussion; experienced practitioners of PLSR (I am not one of them) could certainly say more to that.
Section 3.5.2 in The Elements of Statistical Learning is useful because it puts PLS regression in the right context (among other regularization methods), but it is indeed very brief and leaves some important statements as exercises. In addition, it only considers the case of a univariate dependent variable $\mathbf y$.
The literature on PLS is vast but can be quite confusing, because there are many different "flavours" of PLS:
- univariate versions with a single DV $\mathbf y$ (PLS1) and multivariate versions with several DVs $\mathbf Y$ (PLS2);
- symmetric versions treating $\mathbf X$ and $\mathbf Y$ equally and asymmetric versions ("PLS regression") treating $\mathbf X$ as independent and $\mathbf Y$ as dependent variables;
- versions that allow a global solution via SVD and versions that require iterative deflations to produce each successive pair of PLS directions;
- and so on.
All of this has been developed in the field of chemometrics and stays somewhat disconnected from the "mainstream" statistical or machine learning literature.
The overview paper that I find most useful (and that contains many further references) is:
For a more theoretical discussion I can further recommend:
A short primer on PLS regression with univariate $y$ (aka PLS1, aka SIMPLS)
The goal of regression is to estimate $\beta$ in a linear model $y=X\beta + \epsilon$. The OLS solution $\beta=(\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf y$ enjoys many optimality properties but can suffer from overfitting. Indeed, OLS looks for the $\beta$ that yields the highest possible correlation of $\mathbf X \beta$ with $\mathbf y$. If there are many predictors, then it is always possible to find some linear combination that happens to have a high correlation with $\mathbf y$. This will be a spurious correlation, and such a $\beta$ will usually point in a direction explaining very little variance in $\mathbf X$. Directions explaining very little variance are often very "noisy" directions. If so, then even though the OLS solution performs great on the training data, it will perform much worse on test data.
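To see this concretely, here is a minimal simulation sketch (plain numpy; the sizes and seed are arbitrary choices of mine): $y$ is pure noise, yet OLS finds a $\beta$ whose high in-sample correlation with $y$ vanishes on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 40                        # many predictors relative to sample size
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)           # y is pure noise, unrelated to X

beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS fit

# In-sample correlation of the fitted values with y is spuriously high...
r_train = np.corrcoef(X @ beta, y)[0, 1]
# ...but the same beta is useless on fresh data from the same distribution.
X_new, y_new = rng.standard_normal((n, p)), rng.standard_normal(n)
r_test = np.corrcoef(X_new @ beta, y_new)[0, 1]
print(round(r_train, 2), round(r_test, 2))    # high vs. near zero
```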
In order to prevent overfitting, one uses regularization methods that essentially force $\beta$ to point into directions of high variance in $\mathbf X$ (this is also called "shrinkage" of $\beta$; see Why does shrinkage work?). One such method is principal component regression (PCR) that simply discards all low-variance directions. Another (better) method is ridge regression that smoothly penalizes low-variance directions. Yet another method is PLS1.
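To make "penalizing low-variance directions" concrete, let $\mathbf X=\mathbf U\mathbf D\mathbf V^\top$ be the SVD of the (centered) predictor matrix with singular values $d_j$. The standard decomposition of the fitted values (see e.g. Section 3.4.1 of The Elements of Statistical Learning) is
$$\mathbf X\beta_\mathrm{ridge}=\sum_j \mathbf u_j\,\frac{d_j^2}{d_j^2+\lambda}\,\mathbf u_j^\top\mathbf y, \qquad \mathbf X\beta_\mathrm{PCR}=\sum_{j\le k} \mathbf u_j\,\mathbf u_j^\top\mathbf y,$$
so ridge shrinks the contribution of each principal direction by the smooth factor $d_j^2/(d_j^2+\lambda)$ (strong shrinkage where the variance $d_j^2$ is small), whereas PCR applies an all-or-nothing factor of $1$ or $0$.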
PLS1 replaces the OLS goal of finding $\beta$ that maximizes correlation $\operatorname{corr}(\mathbf X \beta, \mathbf y)$ with an alternative goal of finding $\beta$ with length $\|\beta\|=1$ maximizing covariance $$\operatorname{cov}(\mathbf X \beta, \mathbf y)\sim\operatorname{corr}(\mathbf X \beta, \mathbf y)\cdot\sqrt{\operatorname{var}(\mathbf X \beta)},$$ which again effectively penalizes directions of low variance.
Finding such $\beta$ (let's call it $\beta_1$) yields the first PLS component $\mathbf z_1 = \mathbf X \beta_1$. One can further look for the second (and then third, etc.) PLS component that has the highest possible covariance with $\mathbf y$ under the constraint of being uncorrelated with all the previous components. This has to be solved iteratively, as there is no closed-form solution for all components (the direction of the first component $\beta_1$ is simply given by $\mathbf X^\top \mathbf y$ normalized to unit length). When the desired number of components has been extracted, PLS regression discards the original predictors and regresses $\mathbf y$ on the PLS components; this yields a coefficient vector $\beta_z$ for the components, which can be combined with the weight vectors $\beta_i$ to obtain the final $\beta_\mathrm{PLS}$ expressed in terms of the original predictors.
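For concreteness, here is a bare-bones sketch of this deflation scheme in Python (the function and variable names are my own; $\mathbf X$ and $\mathbf y$ are assumed centered). The last line uses the standard identity $\beta_\mathrm{PLS}=\mathbf W(\mathbf P^\top\mathbf W)^{-1}\mathbf q$ to map weights computed on deflated copies of $\mathbf X$ back to the original predictors.

```python
import numpy as np

def pls1(X, y, n_components):
    """Bare-bones PLS1 via iterative deflation; X and y assumed centered."""
    X_k = X.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = X_k.T @ y                    # direction of max covariance with y;
        w /= np.linalg.norm(w)           # on the 1st pass this is X^T y, normalized
        t = X_k @ w                      # score = PLS component
        p = X_k.T @ t / (t @ t)          # X-loading
        q.append(t @ y / (t @ t))        # regression of y on this component
        X_k = X_k - np.outer(t, p)       # deflate X before the next component
        W.append(w)
        P.append(p)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    # express the coefficients in terms of the original predictors
    return W @ np.linalg.solve(P.T @ W, q)
```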
Note that:
- If all PLS1 components are used, then PLS will be equivalent to OLS (see the numerical check after this list). So the number of components serves as a regularization parameter: the lower the number, the stronger the regularization.
- If the predictors $\mathbf X$ are uncorrelated and all have the same variance (i.e. $\mathbf X$ has been whitened), then there is only one PLS1 component and it is equivalent to OLS.
- Weight vectors $\beta_i$ and $\beta_j$ for $i\ne j$ are not going to be orthogonal, but will yield uncorrelated components $\mathbf z_i=\mathbf X \beta_i$ and $\mathbf z_j=\mathbf X \beta_j$.
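A quick numerical check of the first bullet point, reusing the pls1 sketch from above (simulated data; inputs are centered as that sketch assumes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)
Xc, yc = X - X.mean(axis=0), y - y.mean()     # center, as pls1 above assumes

beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]
beta_pls = pls1(Xc, yc, n_components=p)       # use all p components
print(np.allclose(beta_ols, beta_pls))        # True: full PLS1 reproduces OLS
```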
All that being said, I am not aware of any practical advantages of PLS1 regression over ridge regression (while the latter does have lots of advantages: it is continuous rather than discrete, has an analytical solution, is much more standard, allows kernel extensions and analytical formulas for leave-one-out cross-validation errors, etc.).
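For example, the leave-one-out errors of ridge regression can be computed exactly from a single fit via the usual hat-matrix shortcut (a minimal sketch assuming centered $\mathbf X$ and $\mathbf y$ with no intercept; the function name is mine):

```python
import numpy as np

def ridge_loocv_rmse(X, y, lam):
    """Exact leave-one-out RMSE for ridge regression (X, y centered, no intercept)."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)  # hat matrix
    loo_residuals = (y - H @ y) / (1 - np.diag(H))                    # LOO shortcut
    return np.sqrt(np.mean(loo_residuals ** 2))
```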
Quoting from Frank & Friedman:
> RR, PCR, and PLS are seen in Section 3 to operate in a similar fashion. Their principal goal is to shrink the solution coefficient vector away from the OLS solution toward directions in the predictor-variable space of larger sample spread. PCR and PLS are seen to shrink more heavily away from the low spread directions than RR, which provides the optimal shrinkage (among linear estimators) for an equidirection prior. Thus PCR and PLS make the assumption that the truth is likely to have particular preferential alignments with the high spread directions of the predictor-variable (sample) distribution. A somewhat surprising result is that PLS (in addition) places increased probability mass on the true coefficient vector aligning with the $K$th principal component direction, where $K$ is the number of PLS components used, in fact expanding the OLS solution in that direction.
They also conduct an extensive simulation study and conclude (emphasis mine):
> For the situations covered by this simulation study, one can conclude that all of the biased methods (RR, PCR, PLS, and VSS) provide substantial improvement over OLS. [...] In all situations, *RR dominated all of the other methods studied*. PLS usually did almost as well as RR and usually outperformed PCR, but not by very much.
Update: In the comments @cbeleites (who works in chemometrics) suggests two possible advantages of PLS over RR:
An analyst can have an a priori guess as to how many latent components should be present in the data; this effectively allows setting the regularization strength without doing cross-validation (and there might not be enough data to do a reliable CV). Such an a priori choice of $\lambda$ might be more problematic in RR.
RR yields one single linear combination $\beta_\mathrm{RR}$ as an optimal solution. In contrast, PLS with e.g. five components yields five linear combinations $\beta_i$ that are then combined to predict $y$. Original variables that are strongly inter-correlated are likely to be combined into a single PLS component (because combining them increases the explained-variance term). So it might be possible to interpret the individual PLS components as real latent factors driving $y$. The claim is that it is easier to interpret $\beta_1, \beta_2,$ etc., than the joint $\beta_\mathrm{PLS}$. Compare this with PCR, where it can similarly be seen as an advantage that individual principal components can potentially be interpreted and assigned some qualitative meaning.
Best Answer
I think this is the recipe for overfitting. If you are after a predictive model and apply this methodology, you will end up with the variable(s) that explain your training set very well; however, this set of variables is NOT guaranteed to perform well on other data, such as an independent test set that was not used for training. Also, if you try to select variables based on their performance on both the training and the validation set, you will end up overfitting to both sets.
If you scale your data (0 mean and 1 std for each variable) and apply PLS to it, the obtained beta vector (the vector/matrix of coefficients for each variable) reflects the contribution of the corresponding variable. I think this logic can be applied to all regression models. PLS also has the advantage of avoiding overfitting, provided you choose the correct number of components (also called latent variables), for example by using the RMSEP values obtained with LOOCV for each number of components. You can also compare VIP scores for each variable. Here is an article about it:
https://doi.org/10.1016/j.chemolab.2012.07.010
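As an illustration of this component-selection procedure, here is a sketch using scikit-learn's PLSRegression (which centers and unit-scales the variables by default); the helper function name is mine, and this is not the exact workflow of the linked article.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def rmsep_per_ncomp(X, y, max_components):
    """LOOCV RMSEP for each number of PLS components (latent variables)."""
    rmsep = []
    for k in range(1, max_components + 1):
        pls = PLSRegression(n_components=k)          # scale=True by default
        y_pred = cross_val_predict(pls, X, y, cv=LeaveOneOut())
        rmsep.append(np.sqrt(np.mean((np.ravel(y) - np.ravel(y_pred)) ** 2)))
    return rmsep   # choose the number of components with the smallest RMSEP
```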
There are alternatives, namely ridge regression and the LASSO: the former mainly aims to avoid overfitting, whereas the latter is used for variable selection. PLS can be specifically useful when some combination of the correlated variables may carry a meaning, which is usually reflected in the number of components. All in all, I would avoid stepwise regression and stick with one of these methods.