I am trying to choose how many components to retain in my PLSR. My total variance explained for the response variable is only about 30%, and the first 2 components explain 99% of this. Intuitively I know I should just run the model with the first 2 components, but is there a more rigorous method I can apply? Does the Kaiser-Guttman or scree test apply here since the PLSR computes singular values rather than eigenvalues?
Solved – Choosing the number of PLSR components
model selection, partial least squares, regression
Related Solutions
The problem that I see with your question is as follows:
31 is not a VERY large number of variables, at least not so large that you could not cluster similar variables by hand into 4 or 5 latent variables using sum scores, as you aim to do. This should give very approximately similar results to the factor analysis. If it doesn't, I would trust the by-hand scores more. The benefits of doing this are:
- Scoring is done by nature of the research question, not the structure of the collected data.
- The usual assumptions and very large "p" of data mining hardly apply here, so the data-driven structure is dubious to begin with. I am not confident that a number of "orthogonal" components would summarize anything that school board educators would be interested in.
- Zero reproducibility error: the results are very easy to replicate and understand, and could potentially be benchmarked and compared between districts.
- People reviewing such an analysis will agree that, while the measure may not be perfect, it provides a sound basis for conducting a confirmatory factor analysis.
None of this is to say that you shouldn't inspect, say, a hierarchical clustering and/or heatmap or use other analyses to show the interdependence of variables, or that you shouldn't try, say, running a univariate factor analysis and creating latent variable scores to use as independent predictors in a regression model (note that the standard errors there aren't correct because they don't account for uncertainty in the scores). These types of analyses can help to better understand the confirmatory analysis above.
Yes, you did re-invent the wheel.
What proportion of the overall variance of $Y$ can be explained by $X$?
This question is answered by the $R^2$ of linear regression; it's just that in this case you have to deal with multivariate regression. Still, you can regress your whole matrix $\mathbf Y$ on $\mathbf X$ (both centered) by finding the matrix $\mathbf B$ minimizing $\|\mathbf Y-\mathbf X\mathbf B\|$, and then the proportion of explained variance is given by $$R^2 = 1-\frac{\|\mathbf Y-\mathbf X\mathbf B\|^2}{\|\mathbf Y\|^2}.$$ It is probably a one-liner in R using lm.
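As a concrete illustration (a sketch in Python/NumPy rather than R, with toy data and variable names of my own), the multivariate $R^2$ falls directly out of a least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 samples, 4 predictors, 3 responses, low noise
X = rng.standard_normal((50, 4))
B_true = rng.standard_normal((4, 3))
Y = X @ B_true + 0.1 * rng.standard_normal((50, 3))
X -= X.mean(axis=0)  # center both matrices
Y -= Y.mean(axis=0)

# Least-squares fit of the whole matrix Y on X
B, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Proportion of variance of Y explained by X (Frobenius norms)
R2 = 1 - np.linalg.norm(Y - X @ B) ** 2 / np.linalg.norm(Y) ** 2
print(R2)  # close to 1 here, since the noise level is low
```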
What you did is a complicated way to compute the same quantity. You can regress each variable in $\mathbf Y$ on all variables in $\mathbf X$, take $R^2$ multiplied by the fraction of variance of $\mathbf Y$ that this variable captures, and sum over all variables. You can replace $\mathbf X$ by its principal components. You can replace $\mathbf Y$ by its principal components too. In each case you will get the same number.
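To see the equivalence numerically (again a sketch with hypothetical NumPy data of my own), the variance-weighted sum of per-column $R^2$ values reproduces the overall $R^2$ exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))
Y = X @ rng.standard_normal((5, 3)) + rng.standard_normal((40, 3))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

# Overall R^2 from one multivariate fit
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
R2_overall = 1 - np.linalg.norm(Y - X @ B) ** 2 / np.linalg.norm(Y) ** 2

# Per-column R^2, each weighted by that column's share of total variance
total = np.linalg.norm(Y) ** 2
R2_weighted = 0.0
for j in range(Y.shape[1]):
    y = Y[:, j]
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2_j = 1 - np.linalg.norm(y - X @ b) ** 2 / np.linalg.norm(y) ** 2
    R2_weighted += r2_j * np.linalg.norm(y) ** 2 / total

print(np.isclose(R2_overall, R2_weighted))  # True: the two agree
```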
Finally, regarding the situation when there are fewer data points than dimensions in $\mathbf X$: in this case $R^2 = 1$, because (as long as $\mathbf X$ has full row rank) any $\mathbf Y$ can be explained by $\mathbf X$: you can find $\mathbf B$ such that $\mathbf Y=\mathbf X\mathbf B$ holds exactly.
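A quick sketch of this degenerate case (synthetic data, names of my own):

```python
import numpy as np

rng = np.random.default_rng(2)

# Fewer samples (5) than predictors (20): the fit is exact
X = rng.standard_normal((5, 20))
Y = rng.standard_normal((5, 2))  # arbitrary responses

B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # minimum-norm solution
R2 = 1 - np.linalg.norm(Y - X @ B) ** 2 / np.linalg.norm(Y) ** 2
print(R2)  # essentially 1: any Y is explained perfectly
```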
Best Answer
Use cross-validation.
A quick example: for each number of components (by the way, they are called latent variables in PLS, unlike principal components in PCA), remove the first sample from your data, and with the remaining data construct a full PLS model. Use that model to predict the removed sample. Then put the removed sample back, remove the next one, predict it, and so on. This type of cross-validation is called leave-one-out cross-validation.
In the end you will have a prediction for each sample alongside its real value. From these, calculate an error, which is basically the difference between predicted and real value, or more usefully the sum of the squared differences.
Repeat this for each number of components. You can then draw a graph with the number of components on the X axis and the corresponding errors on the Y axis.
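The procedure above can be sketched as follows. This is a minimal illustration under my own assumptions, not production code: it uses a bare-bones PLS1 (NIPALS) implementation and synthetic data, where a real analysis would use an established PLS package:

```python
import numpy as np

def pls1_coef(X, y, n_comp):
    """Bare-bones PLS1 (NIPALS): regression coefficients for centered X, y."""
    Xk, yk = X.copy(), y.copy()
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)      # weight vector
        t = Xk @ w                  # score vector (latent variable)
        tt = t @ t
        p = Xk.T @ t / tt           # X loading
        q = (yk @ t) / tt           # y loading
        Xk = Xk - np.outer(t, p)    # deflate X and y
        yk = yk - q * t
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    return W @ np.linalg.solve(P.T @ W, Q)   # b = W (P'W)^-1 q

def loo_press(X, y, n_comp):
    """Leave-one-out PRESS for a PLS1 model with n_comp latent variables."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False                       # hold out sample i
        Xtr, ytr = X[mask], y[mask]
        xm, ym = Xtr.mean(axis=0), ytr.mean()
        b = pls1_coef(Xtr - xm, ytr - ym, n_comp)
        pred = (X[i] - xm) @ b + ym           # predict the held-out sample
        press += (pred - y[i]) ** 2
    return press

# Synthetic data whose signal involves only 3 of the 6 predictors
rng = np.random.default_rng(3)
X = rng.standard_normal((25, 6))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0])

errors = [loo_press(X, y, a) for a in range(1, 6)]
# Plot range(1, 6) against errors and look for the elbow
```

With the error curve in hand, apply the rule of thumb described next: take the smallest number of latent variables after which the curve stops dropping appreciably.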
As a rule of thumb, you should NOT simply select the number of components (latent variables) yielding the least error, because that may overfit. Instead, choose the point where the drop in error is no longer significant. Another common case is that the error values start to increase at some point; choose the point right before that increase.
Reference for graph: Brereton, Richard G. Applied chemometrics for scientists. John Wiley & Sons, 2007.