As I discuss here and (especially) here, the usual definition of $R^2$ can be viewed as comparing your model's square loss to that of a baseline model that always predicts the conditional mean to be the marginal/pooled mean.
$$
R^2 = 1 - \dfrac{\sum_{i=1}^{N} \left( y_i - \hat y_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar y \right)^2}
$$
The numerator is the sum of squared residuals for your model, and the denominator is the sum of squared residuals for a model that always predicts $\bar y$. Using more words, we might write $R^2$ as:
$$
R^2=1-\dfrac{\text{
Square loss of the model of interest
}}{\text{
Square loss of the baseline model
}}
$$
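As a quick sanity check, this ratio form reproduces the usual $R^2$ exactly. A minimal R sketch, using simulated data and a plain lm() fit purely for illustration:
# R^2 as a loss ratio: model square loss vs. always-predict-the-mean baseline
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

sse_model    <- sum((y - fitted(fit))^2)  # square loss of the model of interest
sse_baseline <- sum((y - mean(y))^2)      # square loss of the baseline model
1 - sse_model / sse_baseline              # matches summary(fit)$r.squared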
In a quantile regression, the typical metric of interest is the pinball loss. After all, this is the canonical loss function for quantile regression. Therefore, let's change square loss to pinball loss in the $R^2$ formula to give what sklearn refers to as $D^2$.
$$
D^2=1-\dfrac{\text{
Pinball loss of the model of interest
}}{\text{
Pinball loss of the baseline model
}}
$$
The pinball loss for a single observation at quantile level $\tau$ is defined as follows.
$$
\ell_{\tau}(y_i, \hat y_i) = \begin{cases}
\tau\vert y_i - \hat y_i\vert, & y_i - \hat y_i \ge 0 \\
(1 - \tau)\vert y_i - \hat y_i\vert, & y_i - \hat y_i < 0
\end{cases}
$$
Then we add up each individual $\ell$ to get a pinball loss $L$ for the whole model.
$$
L_{\tau}(y, \hat y) = \sum_{i=1}^{N} \ell_{\tau}(y_i, \hat y_i)
$$
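For instance, with $\tau = 0.9$, $y_i = 3$, and $\hat y_i = 5$, the residual $y_i - \hat y_i = -2$ is negative, so $\ell_{0.9} = (1 - 0.9)\cdot 2 = 0.2$; an under-prediction of the same size would instead cost $0.9 \cdot 2 = 1.8$, which is what pushes the fitted values toward the upper quantile.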
Thus, the pinball-loss version of $D^2$ would be defined as:
$$
D^2 = 1 - \dfrac{\sum_{i=1}^{N} \ell_{\tau}(y_i, \hat y_i)}{\text{Pinball loss of the baseline model}}
$$
In this case, the $\hat y_i$ are the quantile regression outputs.
As far as the denominator goes, the logic for $R^2$ is that the reasonable naïve estimate of the conditional mean is the marginal/pooled mean. Translating this logic to quantile regression at quantile $\tau$, the reasonable naïve estimate of the conditional $\tau$ quantile is the marginal/pooled $\tau$ quantile; denote this quantity as $y_{\tau}$. Thus, $D^2$ is calculated as:
$$
D^2 = 1 - \dfrac{\sum_{i=1}^{N} \ell_{\tau}(y_i, \hat y_i)}{\sum_{i=1}^{N} \ell_{\tau}(y_i, y_{\tau})}
$$
So far, none of this has mentioned linearity; it is just a comparison of model performance to that of a baseline, analogous to the usual $R^2$. Consequently, I see this as totally viable for nonlinear quantile regressions.
NEXT is the issue of the paper discussed in the OP that uses an alternative loss function to estimate the quantile models. While I suppose you could stick that alternative loss function into this framework of comparing model loss to the loss incurred by a naïve model that makes the same prediction every time, the paper seems to be using the alternative loss function in a way analogous to how linear regression sometimes uses ridge or LASSO loss. In such a case, the interest is still in square loss, but using an alternative optimization criterion has advantages when it comes to how well the model generalizes. Consequently, whether the models are estimated with this more exotic technique or more classically, it seems like the interest is in pinball loss.
NEXT is the issue of an analogue of adjusted $R^2$. The usual adjusted $R^2$ accounts for the degrees of freedom. In a complicated model, especially one that you might tune using cross-validation, it is not clear how to calculate the degrees of freedom, so I am not sold on the tractability of an adjusted $R^2$ analogue for quantile GAMs. However, to get a sense of overfitting-wary performance, it is possible to do an out-of-sample assessment. For an out-of-sample $D^2$, the numerator is calculated by comparing the out-of-sample predictions to the true out-of-sample values (which you know but withhold from the model training). The denominator is calculated using the $\tau$ quantile of the $y$ values, and this is where sklearn and I disagree: the package uses the out-of-sample quantile, while I would advocate for the in-sample (training) quantile. Hawinkel, Waegeman & Maere (2023) advocate for this for $R^2$, and it seems philosophically correct here as well: the baseline comparison should not look at out-of-sample data to fit a model you could not have fit at training time.
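As a concrete illustration of that split, here is a minimal sketch of an out-of-sample $D^2$ in R. It assumes the quantreg package for the quantile fit (an assumption on my part, any quantile model would work the same way) and uses the pinball_loss helper written out below; the in-sample (training) quantile supplies the baseline, per the argument above.
library(quantreg)

set.seed(2023)
dat   <- data.frame(x = rnorm(300))
dat$y <- 1 + 2 * dat$x + rnorm(300)
train <- dat[1:200, ]   # used to fit the model and the baseline quantile
test  <- dat[201:300, ] # held out from training
tau   <- 0.75

fit      <- rq(y ~ x, tau = tau, data = train)   # quantile regression at level tau
yhat_oos <- predict(fit, newdata = test)         # out-of-sample predictions
q_train  <- quantile(train$y, probs = tau)       # in-sample baseline quantile

1 - pinball_loss(test$y, yhat_oos, tau = tau) /
  pinball_loss(test$y, rep(q_train, nrow(test)), tau = tau)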
FINALLY is a software implementation. I did not understand the qgam::pinLoss documentation. However, the greybox package appears to calculate pinball loss. If all else fails, it is not so difficult to write your own pinball loss function.
pinball_loss <- function(y, yhat, tau = 0.5){
  # Pinball (quantile) loss at level tau, summed over all observations
  if (length(y) != length(yhat)){
    stop("y and yhat must have the same length")
  }
  N <- length(y)
  ell <- rep(NA_real_, N)
  for (i in 1:N){
    if (y[i] - yhat[i] >= 0){
      ell[i] <- tau * abs(y[i] - yhat[i])        # under-prediction: weight tau
    } else {
      ell[i] <- (1 - tau) * abs(y[i] - yhat[i])  # over-prediction: weight 1 - tau
    }
  }
  return(sum(ell))
}
# Check how this compares with `greybox::pinball`
set.seed(2023)
N <- 100
R <- 1000
diffs <- rep(NA, R)
for (i in 1:R){
  y    <- rnorm(N)
  yhat <- rnorm(N)
  tau  <- runif(1, 0, 1)
  diffs[i] <- pinball_loss(y, yhat, tau = tau) -
    greybox::pinball(y, yhat, level = tau)
}
summary(diffs) # All differences are on the order of 10^-14,
               # so basically perfect agreement
This agrees with greybox::pinball and can probably be written better, but it will give you the pinball loss.
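With a pinball loss function in hand, the in-sample $D^2$ from above is just a ratio of two such losses. A minimal sketch, again assuming the quantreg package for the quantile fit (any quantile model's fitted values would slot in the same way):
library(quantreg)

set.seed(2023)
x   <- rnorm(200)
y   <- 1 + 2 * x + rnorm(200)
tau <- 0.75

fit   <- rq(y ~ x, tau = tau)         # quantile regression at level tau
yhat  <- fitted(fit)                  # conditional tau-quantile estimates
y_tau <- quantile(y, probs = tau)     # marginal/pooled tau-quantile (baseline)

1 - pinball_loss(y, yhat, tau = tau) /
  pinball_loss(y, rep(y_tau, length(y)), tau = tau)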
REFERENCE
Stijn Hawinkel, Willem Waegeman & Steven Maere (2023) Out-of-sample $R^2$: estimation and inference, The American Statistician, DOI: 10.1080/00031305.2023.2216252
Best Answer
The methodology QCSIS comes from "Robust model-free feature screening via quantile correlation". It uses the quantile correlation (QC) proposed by Li et al., given by $$ \text{qcor}_\tau(Y,Z) = \frac{\text{qcov}_\tau(Y,Z)}{\sqrt{\text{var}\{\psi_\tau(Y-Q_{\tau,Y})\}\,\text{var}(Z)}}=\frac{\text{E}\{\psi_\tau(Y-Q_{\tau,Y})(Z-\text{E}(Z))\}}{\sqrt{(\tau-\tau^2)\text{var}(Z)}}, $$ where $\psi_\tau(u) = \tau - I(u < 0)$. It's not Pearson's correlation.
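For completeness, a minimal R sketch of a plug-in estimate of this quantile correlation, using the sample quantile for $Q_{\tau,Y}$ and sample moments throughout (an illustration of the formula above, not the QCSIS implementation):
qcor_tau <- function(y, z, tau = 0.5){
  # psi_tau(u) = tau - I(u < 0), evaluated at u = y - Q_{tau, y}
  psi  <- tau - as.numeric(y - quantile(y, probs = tau) < 0)
  qcov <- mean(psi * (z - mean(z)))                  # sample quantile covariance
  qcov / sqrt((tau - tau^2) * mean((z - mean(z))^2)) # normalize as in the formula
}

set.seed(2023)
z <- rnorm(500)
y <- 1 + 2 * z + rnorm(500)
qcor_tau(y, z, tau = 0.25)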