In partial least squares regression, what is the difference between the regression coefficients and the loadings for each independent variable in each component? Specifically, I understand that in every component, each of the independent variables has a corresponding loading. Does each variable also have a regression coefficient? What is the relationship between the loading vector and the coefficients?
Solved – Partial Least Squares regression – coefficients vs loadings
partial least squares
Related Solutions
I would like to answer this question largely from a historical perspective, which is quite interesting. Herman Wold, who invented the partial least squares (PLS) approach, did not start using the term PLS (or even mention the term "partial") right away. During the initial period (1966-1969), he referred to this approach as NILES, an abbreviation of the title of his initial paper on the topic, Nonlinear Estimation by Iterative Least Squares Procedures, published in 1966.
As we can see, the procedures that would later be called partial were first referred to as iterative, emphasizing the iterative nature of the procedure for estimating weights and latent variables (LVs). The "least squares" term comes from using ordinary least squares (OLS) regression to estimate other unknown parameters of a model (Wold, 1980). It seems that the term "partial" has its roots in the NILES procedures, which implemented "the idea of split the parameters of a model into subsets so they can be estimated in parts" (Sanchez, 2013, p. 216; emphasis mine).
The term PLS first appeared in the paper Nonlinear iterative partial least squares (NIPALS) estimation procedures, whose publication marks the next period of PLS history: the NIPALS modeling period. The 1970s and 1980s became the soft modeling period, when, influenced by Karl Joreskog's LISREL approach to SEM, Wold transformed the NIPALS approach into soft modeling, which essentially formed the core of the modern PLS approach (the term PLS became mainstream at the end of the 1970s). The 1990s, the next period in PLS history, which Sanchez (2013) calls the "gap" period, were marked largely by a decrease in its use. Fortunately, starting from the 2000s (the consolidation period), PLS has enjoyed a return as a very popular approach to SEM analysis, especially in the social sciences.
UPDATE (in response to amoeba's comment):
- Perhaps, Sanchez's wording is not ideal in the phrase that I've cited. I think that "estimated in parts" applies to latent blocks of variables. Wold (1980) describes the concept in detail.
- You're right that NIPALS was originally developed for PCA. The confusion stems from the fact that there exist both linear PLS and nonlinear PLS approaches. I think that Rosipal (2011) explains the differences very well (at least, this is the best explanation that I've seen so far).
UPDATE 2 (further clarification):
In response to the concerns expressed in amoeba's answer, I'd like to clarify some things. It seems to me that we need to distinguish between the use of the word "partial" in NIPALS and in PLS. That creates two separate questions: 1) the meaning of "partial" in NIPALS and 2) the meaning of "partial" in PLS (that's the original question by Phil2014). While I'm not sure about the former, I can offer further clarification about the latter.
According to Wold, Sjöström and Eriksson (2001),
The "partial" in PLS indicates that this is a partial regression, since ...
In other words, "partial" stems from the fact that the data decomposition produced by the NIPALS algorithm for PLS may not include all components, hence "partial". I suspect that the same reason applies to NIPALS in general, if it's possible to use the algorithm on "partial" data. That would explain the "P" in NIPALS.
As for the word "nonlinear" in the NIPALS definition (not to be confused with nonlinear PLS, which is a nonlinear variant of the PLS approach!), I think that it refers not to the algorithm itself but to the nonlinear models that can be analyzed using the linear regression-based NIPALS.
UPDATE 3 (Herman Wold's explanation):
While Herman Wold's 1969 paper seems to be the earliest paper on NIPALS, I have managed to find another one of the earliest papers on this topic. That is a paper by Wold (1974), where the "father" of PLS presents his rationale for using the word "partial" in NIPALS definition (p. 71):
3.1.4. NIPALS estimation: Iterative OLS. If one or more variables of the model are latent, the predictor relations involve not only unknown parameters, but also unknown variables, with the result that the estimation problem becomes nonlinear. As indicated in 3.1 (iii), NIPALS solves this problem by an iterative procedure, say with steps s = 1, 2, ... Each step s involves a finite number of OLS regressions, one for each predictor relation of the model. Each such regression gives proxy estimates for a sub-set of the unknown parameters and latent variables (hence the name partial least squares), and these proxy estimates are used in the next step of the procedure to calculate new proxy estimates.
References
Rosipal, R. (2011). Nonlinear partial least squares: An overview. In Lodhi H. and Yamanishi Y. (Eds.), Chemoinformatics and Advanced Machine Learning Perspectives: Complex Computational Methods and Collaborative Techniques, pp. 169-189. ACCM, IGI Global. Retrieved from http://aiolos.um.savba.sk/~roman/Papers/npls_book11.pdf
Sanchez, G. (2013). PLS path modeling with R. Berkeley, CA: Trowchez Editions. Retrieved from http://gastonsanchez.com/PLS_Path_Modeling_with_R.pdf
Wold, H. (1974). Causal flows with latent variables: Partings of the ways in the light of NIPALS modelling. European Economic Review, 5, 67-86. North Holland Publishing.
Wold, H. (1980). Model construction and evaluation when theoretical knowledge is scarce: Theory and applications of partial least squares. In J. Kmenta and J. B. Ramsey (Eds.), Evaluation of econometric models, pp. 47-74. New York: Academic Press. Retrieved from http://www.nber.org/chapters/c11693
Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58, 109-130. doi:10.1016/S0169-7439(01)00155-1 Retrieved from http://www.libpls.net/publication/PLS_basic_2001.pdf
They are different methods, regardless of the number of response variables. Both methods combine PCA with ordinary multiple regression, but they do so in crucially different ways. For a matrix of predictor variables X and a matrix of dependent variables Y, principal component regression performs a PCA on the predictor matrix X and then uses those principal components as regressors for Y. This technique removes multicollinearity but does not reduce the number of predictors down to the “best” subset. Manually picking out the most informative components won't work, because these components were built from the variables in X alone and are therefore informative only about X, not Y.
On the other hand, PLS finds components which explain the covariance between X and Y (and calls them “latent vectors”). Hence, with PLS it's safer to assume that more informative components correspond to more relevant predictors.
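To make the contrast concrete, here is a minimal NumPy sketch. The data are made up for illustration (a high-variance predictor unrelated to the response and a low-variance predictor that drives it), and the first PLS direction is taken as the normalized $X'y$, which is the usual choice for a single response. It is not a full PCR or PLS implementation, just the first component of each.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5000

# Hypothetical data: x1 has large variance but is unrelated to y,
# x2 has small variance and drives y.
x1 = 3.0 * rng.normal(size=m)
x2 = rng.normal(size=m)
X = np.column_stack([x1, x2])
y = x2 + 0.1 * rng.normal(size=m)

# Centre both blocks, as PCA and PLS both assume.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# PCR: first principal direction of X (ignores y entirely).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

# PLS (single response): first weight vector is proportional to X'y,
# i.e. the direction of maximal covariance with y.
w1 = Xc.T @ yc
w1 = w1 / np.linalg.norm(w1)


def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])


corr_pcr = abs_corr(Xc @ pc1, yc)  # first PCR score vs y
corr_pls = abs_corr(Xc @ w1, yc)   # first PLS score vs y

print("first PCR component vs y:", round(corr_pcr, 3))
print("first PLS component vs y:", round(corr_pls, 3))
```

In this setup the first principal component follows the high-variance predictor and is nearly uncorrelated with the response, while the first PLS component lines up with the predictor that actually matters. Note that PLS is still sensitive to the scaling of X, so predictors are often standardized first.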
Best Answer
Assume your independent-variable matrix $X$ is $m\times n$, i.e. you have $m$ observations and $n$ variables.
For each PLS component (AKA latent variable), you get a loading vector ($n \times 1$), so for $h$ components the size of the loading matrix ($P$) is $n \times h$. These loadings are calculated for both interpretation and algorithmic purposes, but they are not used directly in the prediction formulas below.
On the other hand, the SIMPLS algorithm (I believe the most popular PLS flavor) also involves calculating a weight matrix ($W$), which has the same size as the loading matrix. In SIMPLS the weights apply directly to the (centered) $X$, and it is the resulting scores, rather than the weight vectors themselves, that are mutually orthogonal. $W$ is used to calculate the $X$ scores ($T$):
$T = X\cdot W$
which is then multiplied by $Y$ loadings ($Q$) for prediction:
$\hat{Y} = T \cdot Q'$
Therefore, substituting $T = X\cdot W$ into the prediction gives $\hat{Y} = X\cdot W\cdot Q'$, so the regression coefficients ($\hat{B}$, which is $n\times1$ for a single dependent variable) that can be used to predict $Y$ directly from $X$ are:
$\hat{B} = W \cdot Q'$
All in all, one obtains one loading vector per component, whereas the regression coefficients form a single matrix whose size ($n\times1$ for one dependent variable) does not depend on the number of components, although its values change as components are added.
As far as I know, a similar logic applies to other PLS algorithms too.
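The relationships above can be checked numerically. The following is a minimal NumPy sketch of SIMPLS as I understand it from de Jong's description, not a reference implementation. The function name `simpls`, the helper matrix `V` (an orthonormal basis of the loadings used for deflation), and the toy data are my own choices for illustration. Symbols follow the notation above: `W` weights, `P` X loadings, `Q` Y loadings, `T` scores, `B` coefficients.

```python
import numpy as np


def simpls(X, Y, h):
    """Sketch of SIMPLS with h components; X is m x n, Y is m x k."""
    X = X - X.mean(axis=0)  # centre predictors
    Y = Y - Y.mean(axis=0)  # centre responses
    n, k = X.shape[1], Y.shape[1]
    W = np.zeros((n, h))  # weights
    P = np.zeros((n, h))  # X loadings
    Q = np.zeros((k, h))  # Y loadings
    V = np.zeros((n, h))  # orthonormal basis of the loadings (deflation helper)
    S = X.T @ Y           # cross-product matrix, deflated at each step
    for a in range(h):
        # Weight vector: dominant left singular vector of S.
        U, _, _ = np.linalg.svd(S, full_matrices=False)
        w = U[:, 0]
        t = X @ w
        norm_t = np.linalg.norm(t)
        t = t / norm_t      # scores scaled to unit length
        w = w / norm_t      # rescale weights so that t = X w still holds
        p = X.T @ t         # X loading for this component
        q = Y.T @ t         # Y loading for this component
        # Orthogonalise p against earlier loadings, then deflate S.
        v = p - V[:, :a] @ (V[:, :a].T @ p)
        v = v / np.linalg.norm(v)
        S = S - np.outer(v, v @ S)
        W[:, a], P[:, a], Q[:, a], V[:, a] = w, p, q, v
    T = X @ W      # scores: T = X W
    B = W @ Q.T    # coefficients: B = W Q'
    return W, P, Q, T, B


# Toy data (made up): 30 observations, 4 variables, one response.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([[1.0], [-2.0], [0.5], [0.0]]) + 0.1 * rng.normal(size=(30, 1))

for h in (1, 2, 4):
    W, P, Q, T, B = simpls(X, y, h)
    print(h, "components -> P", P.shape, " B", B.shape, B.ravel().round(3))
```

The loading matrix `P` gains one column per component, while `B` stays $n\times1$ and only its values change. With as many components as variables (and a full-rank $X$), `B` should coincide with the ordinary least squares coefficients on the centered data.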