PLS regression relies on iterative algorithms (e.g., NIPALS, SIMPLS). Your description of the main ideas is correct: we seek one (PLS1: one response variable, multiple predictors) or two (PLS2, with different modes: multiple response variables, multiple predictors) vector(s) of weights, say $u$ (and $v$), to form linear combination(s) of the original variables such that the covariance between $Xu$ and $y$ ($Yv$, for PLS2) is maximal. Let us focus on extracting the first pair of weights, associated with the first component. Formally, the criterion to optimize reads
$$\max\text{cov}(Xu, Yv).\qquad (1)$$
In your case, $Y$ is univariate, so it amounts to maximizing
$$\text{cov}(Xu, y)\equiv \text{Var}(Xu)^{1/2}\times\text{cor}(Xu, y)\times\text{Var}(y)^{1/2},\quad \text{s.t. } \|u\|=1.$$
Since $\text{Var}(y)$ does not depend on $u$, we have to maximize $\text{Var}(Xu)^{1/2}\times\text{cor}(Xu, y)$. Let us consider $X=[x_1;x_2]$, where the data are individually standardized (I initially made the mistake of scaling your linear combination instead of $x_1$ and $x_2$ separately!), so that $\text{Var}(x_1)=\text{Var}(x_2)=1$; however, $\text{Var}(Xu)\neq 1$ in general and depends on $u$. In conclusion, maximizing the correlation between the latent component and the response variable will not yield the same result as maximizing the covariance.
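To see this concretely, here is a small base-R sketch (the variable names are mine, for illustration) showing that even when each column of $X$ is standardized, $\text{Var}(Xu)$ still depends on the unit-norm weight vector $u$:

```r
# Even with standardized columns, Var(Xu) depends on u.
set.seed(42)
X <- apply(replicate(2, rnorm(100)), 2, scale)  # Var(x1) = Var(x2) = 1
u1 <- c(1, 0)                                   # two unit-norm weight vectors
u2 <- c(1, 1)/sqrt(2)
drop(var(X %*% u1))  # exactly 1: the component is x1 itself
drop(var(X %*% u2))  # equals 1 + cor(x1, x2), generally != 1
```

The second value differs from 1 whenever $x_1$ and $x_2$ are (empirically) correlated, which is why the covariance and correlation criteria part ways.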
I should thank Arthur Tenenhaus who pointed me in the right direction.
Using unit weight vectors is not restrictive, and some packages (pls.regression in plsgenomics, based on code from Wehrens's earlier package pls.pcr) will return unstandardized weight vectors (but with latent components still of norm 1) if requested. However, most PLS packages will return a standardized $u$, including the one you used, notably those implementing the SIMPLS or NIPALS algorithm; I found a good overview of both approaches in Barry M. Wise's presentation, Properties of Partial Least Squares (PLS) Regression, and differences between Algorithms, and the chemometrics vignette offers a good discussion too (pp. 26-29). Of particular importance as well is the fact that most PLS routines (at least those I know of in R) expect unstandardized variables, because centering and/or scaling is handled internally (this is particularly important when doing cross-validation, for example).
Given the constraint $u'u=1$, the vector $u$ is found to be $$u=\frac{X'y}{\|X'y\|}.$$
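A short sketch of why (for mean-centered data, so that $\text{cov}(Xu,y)\propto u'X'y$): maximizing $u'X'y$ subject to $u'u=1$ with a Lagrange multiplier gives
$$\frac{\partial}{\partial u}\left[u'X'y-\lambda(u'u-1)\right]=X'y-2\lambda u=0\quad\Rightarrow\quad u\propto X'y,$$
and the unit-norm constraint then fixes the scale, $u=X'y/\|X'y\|$.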
Using a little simulation, it can be obtained as follows:
set.seed(101)
X <- replicate(2, rnorm(100))
y <- 0.6*X[,1] + 0.7*X[,2] + rnorm(100)
X <- apply(X, 2, scale)
y <- scale(y)
# NIPALS (PLS1)
u <- crossprod(X, y)
u <- u/drop(sqrt(crossprod(u))) # X weights
t <- X%*%u # latent component (scores)
p <- crossprod(X, t)/drop(crossprod(t)) # X loadings
You can compare the above results (u = [0.5792043; 0.8151824], in particular) with what R packages would give. E.g., using NIPALS from the chemometrics package (another implementation that I know of is available in the mixOmics package), we would obtain:
library(chemometrics)
pls1_nipals(X, y, 1)$W # X weights [0.5792043;0.8151824]
pls1_nipals(X, y, 1)$P # X loadings
Similar results would be obtained with plsr and its default kernel PLS algorithm:
> library(pls)
> as.numeric(loading.weights(plsr(y ~ X, ncomp=1)))
[1] 0.5792043 0.8151824
In all cases, we can check that $u$ has unit norm.
Provided you change the function you optimize to one that reads
f <- function(u) cov(y, X%*%(u/sqrt(drop(crossprod(u)))))
and normalize u afterwards (u <- u/sqrt(drop(crossprod(u)))), you should be closer to the above solution. (The drop() matters: crossprod() returns a 1×1 matrix, and elementwise division of a longer vector by a 1×1 matrix raises an error in R.)
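As a check, here is a self-contained sketch (the data-generation step only loosely mirrors the simulation above) showing that numerically maximizing this normalized covariance with optim() recovers the closed-form weights $X'y/\|X'y\|$ up to sign:

```r
# Numerically maximize cov(y, Xu/||u||) and compare with the closed form.
set.seed(101)
X <- apply(replicate(2, rnorm(100)), 2, scale)
y <- scale(0.6*X[,1] + 0.7*X[,2] + rnorm(100))
f <- function(u) drop(cov(y, X %*% (u/sqrt(drop(crossprod(u))))))
opt <- optim(c(1, 1), function(u) -f(u), method = "BFGS")
u_hat <- opt$par/sqrt(drop(crossprod(opt$par)))  # numerical solution, unit norm
u_cf  <- drop(crossprod(X, y))
u_cf  <- u_cf/sqrt(drop(crossprod(u_cf)))        # closed form X'y/||X'y||
```

Because the objective is scale-invariant in u, optim() can stop anywhere along the optimal ray; normalizing afterwards, as above, makes the two solutions comparable.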
Sidenote: As criterion (1) is equivalent to
$$\max u'X'Yv,$$
$u$ can be found as the left singular vector from the SVD of $X'Y$ corresponding to the largest singular value:
svd(crossprod(X, y))$u
In the more general case (PLS2), a way to summarize the above is to say that the first pair of PLS weight vectors provides the best rank-one approximation of the cross-covariance matrix of X and Y.
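As an illustration of that statement (simulated data, arbitrary dimensions chosen by me), the first weight pair $(u, v)$ can be read off the leading singular triplet of the cross-product matrix $X'Y$:

```r
# PLS2 sketch: first weight pair from the SVD of X'Y.
set.seed(101)
X <- apply(replicate(3, rnorm(100)), 2, scale)
Y <- apply(replicate(2, rnorm(100)), 2, scale)
s <- svd(crossprod(X, Y))
u <- s$u[, 1]                 # X weights (unit norm)
v <- s$v[, 1]                 # Y weights (unit norm)
# For centered data, cov(Xu, Yv) = u'X'Yv/(n-1) = d1/(n-1), where d1 is the
# largest singular value -- no other unit-norm pair can do better.
drop(cov(X %*% u, Y %*% v))   # equals s$d[1]/(100 - 1)
```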
References
- Tenenhaus, M (1999). L'approche PLS. Revue de Statistique Appliquée, 47(2), 5-40.
- ter Braak, CJF and de Jong, S (1998). The objective function of partial least squares regression. Journal of Chemometrics, 12, 41-54.
- Abdi, H (2010). Partial least squares regression and projection on latent structure regression (PLS Regression). Wiley Interdisciplinary Reviews: Computational Statistics, 2, 97-106.
- Boulesteix, A-L and Strimmer, K (2007). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics, 8(1), 32-44.
Best Answer
No, the pls package does not maximize the correlation between scores and response values in its default settings. I could not find whether the package offers that functionality, although the manual mentions it in a sentence.
And you are right: you need to work with standardized matrices to do PLS for correlation maximization.