Solved – Clarification regarding the PRESS statistic and the hat matrix

cross-validation, python, regression, scikit-learn

I'm conducting a PLS regression with the Python scikit-learn package. Unfortunately, as far as I can tell, it has no implementation of the PRESS statistic, which I need in order to validate my model against another one.

In the aforementioned source, from what I understand, it says that the PRESS statistic can easily be computed with the hat matrix like so:

$$ e_{(j)} = \frac{e_j}{1-H_{jj}}, \qquad PRESS = \sum_j e_{(j)}^2 $$

Where $e_j$ is the $j$-th residual, $H_{jj}$ is the $j$-th diagonal element of the hat matrix, and $e_{(j)}$ is the leave-one-out (PRESS) residual of case $j$.

I found many sources (here, for example) on how to compute the hat matrix:

$$ \hat{Y} = X\hat{\beta} = X(X^TX)^{-1}X^TY = HY$$
$$ H = X(X^TX)^{-1}X^T $$
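
To check my understanding, this is how I would translate it into numpy for the plain OLS case (the arrays X and Y here are just toy training data of my own; Y has one column per target):

```python
import numpy as np

# toy training data: X (n_samples x n_features), Y (n_samples x n_targets)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Y = X @ rng.normal(size=(4, 2)) + 0.1 * rng.normal(size=(30, 2))

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix from the training X
Y_hat = H @ Y                                  # fitted values
e = Y - Y_hat                                  # residuals
loo_residuals = e / (1 - np.diag(H))[:, None]  # leave-one-out (PRESS) residuals
press = np.sum(loo_residuals ** 2, axis=0)     # PRESS per target column
```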

First question: is $\hat{Y}$ the matrix of predicted Y values, $X$ the training X matrix, and $Y$ the training Y matrix?

Second question: with this approach, can we compute the PRESS for a multi-target regression? It seems odd, because then $e_j = Y_{true} - Y_{pred}$ is a residual vector rather than a scalar.

Lastly: if someone has a source that explains how to implement the PRESS statistic from the output of fitting a PLS model in scikit-learn, I would be interested.

Best Answer

  1. Yes: typically $\mathbf X$ holds the input variates = your independent variables (here: the training data), and $\mathbf Y$ the dependent variables = the prediction targets. $\hat{\mathbf Y}$ are the estimates of $\mathbf Y$, i.e. the output of your prediction.

  2. For multi-target (multi-analyte) regression, the sensible approach is typically to calculate PRESS for each analyte separately. Combining these is only sensible in particular cases where your application suggests a combined measure for the different targets.

  3. If you spell out PRESS = PRediction Error Sum of Squares or PRediction Error Squared Sum, that pretty much tells you how to calculate it (see the sketch after this list):

    1. calculate the prediction errors $\hat{\mathbf Y} - \mathbf Y$
    2. square this error matrix element-wise
    3. sum over all test cases (separately for each target/analyte)
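
As a minimal sketch of these three steps (my own illustration: press_per_target is a hypothetical helper, and Y_true/Y_pred are assumed to be 2-D arrays of held-out values and predictions with one column per target):

```python
import numpy as np

def press_per_target(Y_true, Y_pred):
    """PRESS per target: squared prediction errors summed over the held-out cases."""
    errors = np.asarray(Y_pred) - np.asarray(Y_true)  # step 1: prediction errors
    return np.sum(errors ** 2, axis=0)                # steps 2 and 3: square element-wise, then sum per target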

Important: you can use the shortcut of calculating via the hat matrix only if each row in your training data matrix represents a truly independent case, i.e. no repeated measurements, no dilutions of calibration samples from the same stock solution, etc.
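
For plain OLS with independent rows, here is a small self-contained check (my own illustration, not taken from the references below) that this hat-matrix shortcut reproduces explicit leave-one-row-out PRESS:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

# shortcut: PRESS from the hat matrix of the full OLS fit
H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
press_shortcut = np.sum((e / (1 - np.diag(H))) ** 2)

# explicit leave-one-row-out refitting
press_explicit = sum(
    (y[j] - X[j] @ np.linalg.lstsq(np.delete(X, j, axis=0),
                                   np.delete(y, j), rcond=None)[0]) ** 2
    for j in range(len(y))
)

print(np.isclose(press_shortcut, press_explicit))  # True
```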

The hat-matrix expression you use is for MLR, i.e. non-regularized multiple least squares. For PLS, please see:

  • A. C. Olivieri et al.: Uncertainty Estimation and Figures of Merit for Multivariate Calibration, Pure Appl. Chem., 2006, 78(3), 633-661.
  • A. C. Olivieri: Analytical Figures of Merit: From Univariate to Multiway Calibration, Chemical Reviews, 2014, 114, 5358-5378.

Update regarding the 2nd question in the comments

According to your comment, I think the situation is a bit more complicated.

  • On the one hand, to compare your model with the one fitted in JMP, you'll want to reproduce as closely as possible the figure of merit calculated by JMP.
    If that is leave-one-[row]-out, that's what you want to do for the comparison, and for that calculation you can take the hat-matrix shortcut for OLS aka CLS (ordinary or classical least squares) models.
    Expressions for hat matrices exist for other bilinear models such as partial least squares or principal component regression, but they are not the same as the hat matrix for OLS (and you'd need to look closely into the assumptions they make, e.g. no preprocessing that spans multiple cases, such as centering).

  • OTOH, I've very rarely met real-world data where such an expression could have been used for estimating generalization error.

    • For real-world data, the rows are often not independent (so-called clustered data). If they are not, leave-one-row-out is not a good approximation of generalization error, and leave-cluster-out should be used instead.
    • In my field, data is typically preprocessed before PLS, and that may break said independence between the left-out row and the surrogate training data unless the preprocessing is recalculated as well. PLS models typically include such a preprocessing step: the centering of $\mathbf X$ and $\mathbf Y$.

    For estimating generalization error, achieving independence in the split (and not letting any kind of training information leak into the test data) is of utmost importance. This guards against/properly detects overfitting, and there is often no way around re-calculating; a scikit-learn sketch follows below. (It doesn't need to be LOO, though; in fact, other resampling schemes are often better, possibly even at lower computational effort.)
    Overfitting would not be so much of a concern if you anyway had far more independent cases in your data than variates. But then, the regularization provided by PLS would not be needed either.
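
As a rough sketch of that re-calculating approach in scikit-learn (entirely my own example: the toy data, the cluster labels in groups, and the choice of GroupKFold/cross_val_predict are illustrative assumptions), each split refits the PLS model, including its internal centering, on the surrogate training set, and PRESS is then computed per target from the held-out predictions:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

# toy data: 60 cases, 10 variates, 3 targets, 12 clusters of 5 replicate rows each
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
Y = X[:, :2] @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(60, 3))
groups = np.repeat(np.arange(12), 5)       # hypothetical cluster labels (e.g. repeated measurements)

pls = PLSRegression(n_components=2)        # refit per split, so centering is recalculated each time
Y_pred = cross_val_predict(pls, X, Y, cv=GroupKFold(n_splits=6), groups=groups)

press = np.sum((Y_pred - Y) ** 2, axis=0)  # PRESS per target over the held-out cases
print(press)
```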