Model assumptions of partial least squares (PLS) regression

Tags: assumptions, partial least squares, regression

I am trying to find information regarding the assumptions of PLS regression (single $y$). I am especially interested in a comparison of the assumptions of PLS with those of OLS regression.

I have read or skimmed a great deal of the literature on PLS (papers by Wold, both Svante and Herman, by Abdi, and by many others), but I haven't found a satisfactory source.

Wold et al. (2001), *PLS-regression: a basic tool of chemometrics*, does discuss assumptions of PLS, but mentions only that

  1. the $X$ variables need not be independent,
  2. the system is a function of a few underlying latent variables,
  3. the system should exhibit homogeneity throughout the analytical process, and
  4. measurement error in $X$ is acceptable.

There is no mention of any requirements on the observed data or on the model residuals. Does anyone know of a source that addresses this? Since the underlying math is analogous to PCA (with the goal of maximizing the covariance between $y$ and $X$), is multivariate normality of $(y, X)$ an assumption? Do the model residuals need to exhibit homogeneity of variance?
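For concreteness, my understanding of the single-$y$ PLS criterion (assuming centered data) is that the first weight vector solves

$$\mathbf{w}_1 = \operatorname*{arg\,max}_{\|\mathbf{w}\| = 1} \widehat{\operatorname{Cov}}(X\mathbf{w},\, y), \qquad \text{which gives } \mathbf{w}_1 \propto X^\top y,$$

and subsequent components repeat this on deflated data. Nothing in this construction seems to require any distributional assumption, which is what prompts my question.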

I also believe I read somewhere that the observations need not be independent; what does this mean for repeated-measures studies?

Best Answer

When we say that standard OLS regression has some assumptions, we mean that these assumptions are needed to derive certain desirable properties of the OLS estimator, for example that it is the best linear unbiased estimator (see the Gauss-Markov theorem and an excellent answer by @mpiktas in What is a complete list of the usual assumptions for linear regression?). No assumptions are needed in order to simply regress $y$ on $X$. Assumptions only appear in the context of optimality statements.
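For reference, the Gauss-Markov conditions are statements about the error term, not prerequisites for running the regression: for the model

$$y = X\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \operatorname{Var}(\varepsilon) = \sigma^2 I,$$

with $X$ of full column rank, the OLS estimator $\hat\beta = (X^\top X)^{-1} X^\top y$ is the best linear unbiased estimator. If these conditions fail, $\hat\beta$ can still be computed; it only loses this optimality guarantee.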

More generally, "assumptions" are something that only a theoretical result (a theorem) can have.

Similarly for PLS regression: it is always possible to use PLS to regress $y$ on $X$. So when you ask what the assumptions of PLS regression are, which optimality statements do you have in mind? In fact, I am not aware of any. PLS regression is a form of shrinkage regularization; see my answer in Theory behind partial least squares regression for some context and an overview. Regularized estimators are biased, so no amount of assumptions will, e.g., prove unbiasedness.
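To make this concrete, here is a minimal sketch (assuming scikit-learn's `PLSRegression`; the data and all names are illustrative) showing that the fit always goes through, even on strongly collinear, non-Gaussian data, because no assumption is ever checked:

```python
# Minimal sketch: PLS regression fits without checking any assumptions.
# Assumes scikit-learn; the data are deliberately collinear and
# non-Gaussian to make the point.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n = 50
z = rng.exponential(size=n)  # skewed, non-Gaussian latent variable
# ten near-collinear predictors built from the same latent variable
X = np.column_stack([z + 0.01 * rng.standard_normal(n) for _ in range(10)])
y = 2 * z + rng.standard_normal(n)

pls = PLSRegression(n_components=2).fit(X, y)  # no assumption is verified here
print(pls.score(X, y))  # in-sample R^2
```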

Moreover, the actual outcome of PLS regression depends on how many PLS components are included in the model; this number acts as a regularization parameter. Talking about assumptions only makes sense once the procedure for selecting this parameter is completely specified (and it usually isn't). So I don't think there are any optimality results for PLS at all, which means that PLS regression has no assumptions. I think the same is true for other regularized regression methods such as principal component regression or ridge regression. A typical data-driven choice of the component count is sketched below.
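For illustration, the component count is usually picked by cross-validation, which is a rule of thumb rather than an optimality theorem (again a sketch assuming scikit-learn, with made-up data):

```python
# Sketch: the number of PLS components is a regularization parameter,
# typically chosen by cross-validation rather than by any theorem.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
y = X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(50)

search = GridSearchCV(
    PLSRegression(),
    param_grid={"n_components": range(1, 6)},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)  # the selected count depends on the data and the CV split
```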

Update: I have expanded this argument in my answer to What are the assumptions of ridge regression and how to test them?

Of course, there can still be rules of thumb that say when PLS regression is likely to be useful and when not. Please see my answer linked above for some discussion; experienced practitioners of PLSR (I am not one of them) could certainly say more about that.
