Does anyone know how to compute a prediction interval for ridge regression? And how does it relate to the prediction interval for OLS regression?
Solved – Calculate prediction interval for ridge regression
Tags: prediction interval, ridge regression
Related Solutions
The rather anticlimactic answer to "Does anyone know why this is?" is that simply nobody cares enough to implement a non-negative ridge regression routine. One of the main reasons is that people have already started implementing non-negative elastic net routines (for example here and here). Elastic net includes ridge regression as a special case (one essentially sets the LASSO penalty's weight to zero). This work is relatively new, so it has not yet been incorporated into scikit-learn or a similar general-use package. You might want to ask the authors of these papers for code.
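As an aside (my addition, not part of the original answer): if a general-purpose routine is acceptable in the meantime, the glmnet package already exposes box constraints through its lower.limits argument, and setting alpha = 0 selects the pure ridge penalty. A minimal sketch with made-up data:

library(glmnet)
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)            # made-up standard-normal predictors
y <- 2 * x[, 1] - x[, 2] + rnorm(200, sd = 0.2)
# alpha = 0 selects the ridge penalty; lower.limits = 0 enforces beta >= 0
fit <- glmnet(x, y, alpha = 0, lower.limits = 0, intercept = FALSE)
coef(fit, s = 0.1)  # coefficients at lambda = 0.1; the second should be pinned at 0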
EDIT:
As @amoeba and I discussed in the comments, the actual implementation of this is relatively simple. Say one has the following regression problem:
$y = 2 x_1 - x_2 + \epsilon, \qquad \epsilon \sim N(0,0.2^2)$
where $x_1$ and $x_2$ are both standard normal: $x_p \sim N(0,1)$. Notice I use standardised predictor variables so I do not have to normalise afterwards. For simplicity I do not include an intercept either. We can immediately solve this regression problem using standard linear regression; in R it is something like this:
rm(list = ls())
library(MASS)
set.seed(123)

# Simulate the regression problem described above
N <- 1e6
x1 <- rnorm(N)
x2 <- rnorm(N)
y  <- 2 * x1 - 1 * x2 + rnorm(N, sd = 0.2)

# Fit by OLS (no intercept) and check against a direct QR solve
simpleLR <- lm(y ~ -1 + x1 + x2)
matrixX  <- model.matrix(simpleLR)  # this is close to standardised
vectorY  <- y
all.equal(coef(simpleLR), qr.solve(matrixX, vectorY), tolerance = 1e-7)  # TRUE
Notice the last line. Almost all linear regression routines use the QR decomposition to estimate $\beta$. We would like to use the same approach for our ridge regression problem. At this point, read this post by @whuber; we will be implementing exactly this procedure. In short, we augment our original design matrix $X$ with the diagonal matrix $\sqrt{\lambda}I_p$ and our response vector $y$ with $p$ zeros. In that way we can re-express the original ridge regression estimator $(X^TX + \lambda I)^{-1} X^Ty$ as $(\bar{X}^T\bar{X})^{-1} \bar{X}^T\bar{y}$, where the bar symbolises the augmented version. Check slides 18-19 of these notes too for completeness; I found them quite straightforward. So in R we would do something like the following:
myLambda <- 100
simpleRR <- lm.ridge(y ~ -1 + x1 + x2, lambda = myLambda)

# Augment: append p zeros to y and sqrt(lambda) * I_p to X
newVecY <- c(vectorY, rep(0, 2))
newMatX <- rbind(matrixX, sqrt(myLambda) * diag(2))
all.equal(coef(simpleRR), qr.solve(newMatX, newVecY), tolerance = 1e-7)  # TRUE
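As an additional sanity check (my addition, not in the original answer), the closed-form ridge solution from the normal equations should agree with the augmented QR solution, since $\bar{X}^T\bar{X} = X^TX + \lambda I$ and $\bar{X}^T\bar{y} = X^Ty$:

all.equal(drop(solve(crossprod(matrixX) + myLambda * diag(2),
                     crossprod(matrixX, vectorY))),
          qr.solve(newMatX, newVecY), tolerance = 1e-7)  # TRUE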
and both checks work. OK, so we have the ridge regression part. We could solve it another way, though: formulate it as an optimisation problem with the residual sum of squares as the cost function, and optimise against that directly, i.e. $\displaystyle \min_{\beta} \| \bar{y} - \bar{X}\beta\|_2^2$. Sure enough we can do that:
# Residual sum of squares for coefficient vector b
myRSS <- function(b, X, y) sum((y - X %*% b)^2)

bfgsOptim <- optim(par = c(1, 1), fn = myRSS, X = newMatX, y = newVecY,
                   method = "L-BFGS-B")
all.equal(coef(simpleRR), bfgsOptim$par, check.attributes = FALSE,
          tolerance = 1e-7)  # TRUE
which, as expected, works again. So now we just want $\displaystyle \min_{\beta} \| \bar{y} - \bar{X}\beta\|_2^2$ subject to $\beta \geq 0$, which is simply the same optimisation problem constrained so that the solution is non-negative:
bfgsOptimConst <- optim(par = c(1, 1), fn = myRSS, X = newMatX, y = newVecY,
                        method = "L-BFGS-B", lower = c(0, 0))
all(bfgsOptimConst$par >= 0)  # TRUE
bfgsOptimConst$par            # 2.000504 0.000000
which shows that the original non-negative ridge regression task can be solved by reformulating it as a simple constrained optimisation problem. Some caveats:
- I used (practically) normalised predictor variables; you will need to account for the normalisation yourself.
- The same goes for handling the intercept, which I did not include here.
- I used optim's L-BFGS-B method. It is the most vanilla R solver that accepts bounds; I am sure you will find dozens of better solvers.
- In general, constrained linear least-squares problems are posed as quadratic optimisation tasks. That is overkill for this post, but keep in mind that you can get better speed that way if needed (see the quadprog sketch after this list).
- As mentioned in the comments, you could skip the ridge-regression-as-augmented-linear-regression part and directly encode the ridge cost function as an optimisation problem. That would be a lot simpler and would make this post significantly shorter. For the sake of argument I append this second solution too.
- I am not fully conversant in Python, but essentially you can replicate this work using NumPy's linalg.solve and SciPy's optimize functions.
- To pick the hyperparameter $\lambda$, you just do the usual CV step you would do in any case; nothing changes (a minimal CV sketch is appended after the code below).
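On the quadratic-optimisation point above, here is a minimal sketch using the quadprog package (assuming it is installed); it solves the same non-negative augmented least-squares problem:

library(quadprog)
Dmat <- crossprod(newMatX)                 # X'X of the augmented design (positive definite)
dvec <- drop(crossprod(newMatX, newVecY))  # X'y of the augmented problem
Amat <- diag(2)                            # constraint matrix encoding b >= 0
bvec <- c(0, 0)
qpFit <- solve.QP(Dmat, dvec, Amat, bvec)
qpFit$solution  # should agree with bfgsOptimConst$par above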
Code for point 5:
# Ridge cost: residual sum of squares plus the L2 penalty
myRidgeRSS <- function(b, X, y, lambda) sum((y - X %*% b)^2) + lambda * sum(b^2)

bfgsOptimConst2 <- optim(par = c(1, 1), fn = myRidgeRSS, X = matrixX, y = vectorY,
                         method = "L-BFGS-B", lower = c(0, 0), lambda = myLambda)
all(bfgsOptimConst2$par >= 0)  # TRUE
bfgsOptimConst2$par            # 2.000504 0.000000
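And for the final caveat, a minimal sketch of plain K-fold cross-validation over a $\lambda$ grid, using the closed-form ridge solution (the helpers ridgeCoef and cvError and the grid below are my own illustrative choices, not from the original post):

ridgeCoef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}
cvError <- function(lambda, X, y, K = 5) {
  folds <- sample(rep(1:K, length.out = nrow(X)))  # random fold assignment
  mean(sapply(1:K, function(k) {
    b <- ridgeCoef(X[folds != k, , drop = FALSE], y[folds != k], lambda)
    mean((y[folds == k] - X[folds == k, , drop = FALSE] %*% b)^2)
  }))
}
lambdaGrid <- 10^seq(-2, 3, length.out = 20)  # illustrative grid
cvErrs <- sapply(lambdaGrid, cvError, X = matrixX, y = vectorY)
lambdaGrid[which.min(cvErrs)]  # CV-chosen lambda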
The method laid out below is the one described in Section 6.3.3 of Davison and Hinkley (1997), Bootstrap Methods and Their Application. Thanks to Glen_b and his comment here. Given that there have been several questions on Cross Validated on this topic, I thought it was worth writing up.
The linear regression model is: \begin{align} Y_i &= X_i\beta+\epsilon_i \end{align}
We have data, $i=1,2,\ldots,N$, which we use to estimate the $\beta$ as: \begin{align} \hat{\beta}_{\text{OLS}} &= \left( X'X \right)^{-1}X'Y \end{align}
Now, we want to predict what $Y$ will be for a new data point, given that we know $X$ for it. This is the prediction problem. Let's call the new $X$ (which we know) $X_{N+1}$ and the new $Y$ (which we would like to predict), $Y_{N+1}$. The usual prediction (if we are assuming that the $\epsilon_i$ are iid and uncorrelated with $X$) is: \begin{align} Y^p_{N+1} &= X_{N+1}\hat{\beta}_{\text{OLS}} \end{align}
The forecast error made by this prediction is: \begin{align} e^p_{N+1} &= Y_{N+1}-Y^p_{N+1} \end{align}
We can re-write this equation like: \begin{align} Y_{N+1} &= Y^p_{N+1} + e^p_{N+1} \end{align}
Now, $Y^p_{N+1}$ we have already calculated. So, if we want to bound $Y_{N+1}$ in an interval, say, 90% of the time, all we need to do is consistently estimate the $5^{th}$ and $95^{th}$ percentiles/quantiles of $e^p_{N+1}$, call them $e^5,e^{95}$, and the prediction interval will be $\left[Y^p_{N+1}+e^5,\,Y^p_{N+1}+e^{95} \right]$.
How to estimate the quantiles/percentiles of $e^p_{N+1}$? Well, we can write: \begin{align} e^p_{N+1} &= Y_{N+1}-Y^p_{N+1}\\ &= X_{N+1}\beta + \epsilon_{N+1} - X_{N+1}\hat{\beta}_{\text{OLS}}\\ &= X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right) + \epsilon_{N+1} \end{align}
The strategy will be to sample (in a bootstrap kind of way) many times from $e^p_{N+1}$ and then calculate percentiles in the usual way. So, maybe we will sample 10,000 times from $e^p_{N+1}$, and then estimate the $5^{th}$ and $95^{th}$ percentiles as the $500^{th}$ and $9,500^{th}$ smallest members of the sample.
To draw on $X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right)$, we can bootstrap the errors (bootstrapping cases would be fine too, but we are assuming iid errors anyway). So, on each bootstrap replication, draw $N$ times with replacement from the variance-adjusted residuals (see the next paragraph) to get $\epsilon^*_i$, then form new $Y^*_i=X_i\hat{\beta}_{\text{OLS}}+\epsilon^*_i$, then run OLS on the new dataset $\left(Y^*,X \right)$ to get this replication's $\beta^*_r$. Finally, this replication's draw on $X_{N+1}\left( \beta-\hat{\beta}_{\text{OLS}} \right)$ is $X_{N+1}\left( \hat{\beta}_{\text{OLS}}-\beta^*_r \right)$.
Given we are assuming iid $\epsilon$, the natural way to sample from the $\epsilon_{N+1}$ part of the equation is to use the residuals from the regression, $\left\{ e_1,e_2,\ldots,e_N \right\}$. Raw residuals have unequal and generally too-small variances, so we instead sample from $\left\{ s_1-\overline{s},s_2-\overline{s},\ldots,s_N-\overline{s} \right\}$, the variance-adjusted (and centred) residuals, where $s_i=e_i/\sqrt{1-h_i}$ and $h_i$ is the leverage of observation $i$.
And, finally, the algorithm for making a 90% prediction interval for $Y_{N+1}$, given $X_{N+1}$, is:
- Make the prediction $Y^p_{N+1}=X_{N+1}\hat{\beta}_{\text{OLS}}$.
- Make the variance-adjusted residuals, $\left\{ s_1-\overline{s},s_2-\overline{s},\ldots,s_N-\overline{s}\right\}$, where $s_i=e_i/\sqrt{1-h_i}$.
- For replications $r=1,2,\ldots,R$:
- Draw $N$ times on the adjusted residuals to make bootstrap residuals $\left\{\epsilon^*_1,\epsilon^*_2,\ldots,\epsilon^*_N \right\}$
- Generate bootstrap $Y^*=X\hat{\beta}_{\text{OLS}}+\epsilon^*$
- Calculate bootstrap OLS estimator for this replication, $\beta^*_r=\left( X'X \right)^{-1}X'Y^*$
- Obtain bootstrap residuals from this replication, $e^*_r=Y^*-X\beta^*_r$
- Calculate the bootstrap variance-adjusted residuals from this replication, $s^*_i-\overline{s^*}$, computing the $s^*_i$ from $e^*_r$ and the leverages as in step 2
- Draw one of the bootstrap variance-adjusted residuals from this replication, $\epsilon^*_{N+1,r}$
- Calculate this replication's draw on $e^p_{N+1}$, $e^{p*}_r=X_{N+1}\left( \hat{\beta}_{\text{OLS}}-\beta^*_r \right)+\epsilon^*_{N+1,r}$
- Find the $5^{th}$ and $95^{th}$ percentiles of the draws $\left\{e^{p*}_r\right\}$, call them $e^5,e^{95}$
- 90% prediction interval for $Y_{N+1}$ is $\left[Y^p_{N+1}+e^5,Y^p_{N+1}+e^{95} \right]$.
Here is R code:
# This script gives an example of the procedure to construct a prediction interval
# for a linear regression model using a bootstrap method. The method is the one
# described in Section 6.3.3 of Davison and Hinkley (1997),
# _Bootstrap Methods and Their Application_.
#rm(list=ls())
set.seed(12344321)
library(MASS)
library(Hmisc)
# Generate bivariate regression data
x <- runif(n=100,min=0,max=100)
y <- 1 + x + (rexp(n=100,rate=0.25)-4)
my.reg <- lm(y~x)
summary(my.reg)
# Predict y for x=78:
y.p <- coef(my.reg)["(Intercept)"] + coef(my.reg)["x"]*78
y.p
# Create adjusted residuals
leverage <- influence(my.reg)$hat
my.s.resid <- residuals(my.reg)/sqrt(1-leverage)
my.s.resid <- my.s.resid - mean(my.s.resid)
the.replication <- function(reg, s, x_Np1 = 0){
  # Make bootstrap residuals
  ep.star <- sample(s, size = length(reg$residuals), replace = TRUE)

  # Make bootstrap Y
  y.star <- fitted(reg) + ep.star

  # Do bootstrap regression
  x <- model.frame(reg)[,2]
  bs.reg <- lm(y.star ~ x)

  # Create bootstrapped adjusted residuals
  bs.lev <- influence(bs.reg)$hat
  bs.s   <- residuals(bs.reg)/sqrt(1 - bs.lev)
  bs.s   <- bs.s - mean(bs.s)

  # Calculate draw on prediction error: X_{N+1}(beta.hat - beta.star) + eps*
  xb.xb <- coef(reg)["(Intercept)"] - coef(bs.reg)["(Intercept)"]
  xb.xb <- xb.xb + (coef(reg)["x"] - coef(bs.reg)["x"]) * x_Np1
  return(unname(xb.xb + sample(bs.s, size = 1)))
}
# Do bootstrap with 10,000 replications
ep.draws <- replicate(n=10000,the.replication(reg=my.reg,s=my.s.resid,x_Np1=78))
# Create prediction interval
y.p+quantile(ep.draws,probs=c(0.05,0.95))
# prediction interval using normal assumption
predict(my.reg,newdata=data.frame(x=78),interval="prediction",level=0.90)
# Quick and dirty Monte Carlo to see which prediction interval is better
# That is, what are the 5th and 95th percentiles of Y_{N+1}
#
# To do it properly, I guess we would want to do the whole procedure above
# 10,000 times and then see what percentage of the time each prediction
# interval covered Y_{N+1}
y.np1 <- 1 + 78 + (rexp(n=10000,rate=0.25)-4)
quantile(y.np1,probs=c(0.05,0.95))
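For what it's worth, here is a rough sketch of that fuller coverage check for the normal-theory interval (the simulation count is kept small purely for speed; checking the bootstrap interval the same way would just wrap the replication loop above, at much greater cost):

n.sims <- 200  # illustrative; 10,000 would be more faithful but slow
covered <- replicate(n.sims, {
  x.s <- runif(n = 100, min = 0, max = 100)
  y.s <- 1 + x.s + (rexp(n = 100, rate = 0.25) - 4)
  fit <- lm(y.s ~ x.s)
  ci  <- predict(fit, newdata = data.frame(x.s = 78),
                 interval = "prediction", level = 0.90)
  y.new <- 1 + 78 + (rexp(n = 1, rate = 0.25) - 4)
  ci[1, "lwr"] <= y.new && y.new <= ci[1, "upr"]
})
mean(covered)  # proportion of intervals that covered Y_{N+1}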
Best Answer
This has been partly discussed on this related thread. The problem is that ridge regression introduces bias while trying to decrease the variance of the parameter estimates, which works well in situations where multicollinearity exists. However, the nice distributional properties of the OLS estimator are lost, and one has to resort to approximations in order to compute confidence or prediction intervals. I think the bootstrap might offer a good solution to this.
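As a rough illustration (my own adaptation of the residual-bootstrap recipe from the answer above to a ridge fit, not a method taken from the references below; note that, unlike OLS, the ridge estimator is biased, and this simple interval ignores that bias):

set.seed(1)
n <- 100; lambda <- 1
x <- runif(n, 0, 100)
X <- cbind(1, x)  # for simplicity this sketch penalises the intercept too
y <- 1 + x + rnorm(n, sd = 4)
ridge <- function(X, y) drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
b.hat <- ridge(X, y)
res   <- drop(y - X %*% b.hat)
res   <- res - mean(res)  # centred ridge residuals (no leverage correction here)
x.new <- c(1, 78)
draws <- replicate(2000, {
  y.star <- drop(X %*% b.hat) + sample(res, n, replace = TRUE)
  b.star <- ridge(X, y.star)
  sum(x.new * (b.hat - b.star)) + sample(res, 1)
})
sum(x.new * b.hat) + quantile(draws, probs = c(0.05, 0.95))  # 90% prediction interval

Beyond that, here are two references that might be useful: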