Solved – Understanding Quantile Regression with Scikit-Learn

machine learning, python, quantile regression, regression, scikit-learn

I want to predict a time value in minutes, which is a regression problem.

I also want to predict an upper bound and a lower bound for that value.

I can do this in two ways:

Train three models: one for the main prediction, one for a higher prediction, and one for a lower prediction.
Use quantile regression, which gives a lower and an upper bound.

However, I do not understand how quantile regression works.

Here is the code:

import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingRegressor

np.random.seed(1)


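#----------------------------------------------------------------------
# The function to predict. Its definition was missing from the snippet as
# posted; x * sin(x) is assumed here, as in the scikit-learn example on
# prediction intervals for gradient boosting.
def f(x):
    return x * np.sin(x)

# Draw 100 training inputs uniformly from [0, 10]
X = np.atleast_2d(np.random.uniform(0, 10.0, size=100)).T
X = X.astype(np.float32)

# Observations
y = f(X).ravel()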

dy = 1.5 + 1.0 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise
y = y.astype(np.float32)

# Mesh the input space for evaluations of the real function, the prediction and
# its MSE
xx = np.atleast_2d(np.linspace(0, 10, 1000)).T
xx = xx.astype(np.float32)

alpha = 0.95

clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
                                n_estimators=250, max_depth=3,
                                learning_rate=.1, min_samples_leaf=9,
                                min_samples_split=9)

clf.fit(X, y)

# Make the prediction on the meshed x-axis
y_upper = clf.predict(xx)

clf.set_params(alpha=1.0 - alpha)
clf.fit(X, y)

# Make the prediction on the meshed x-axis
y_lower = clf.predict(xx)

clf.set_params(loss='squared_error')  # called 'ls' in scikit-learn < 1.0
clf.fit(X, y)

# Make the prediction on the meshed x-axis
y_pred = clf.predict(xx)

# Plot the observations, the prediction and the 90% prediction interval
# bounded by the 5th and 95th percentile predictions
fig = plt.figure()
plt.plot(X, y, 'b.', markersize=10, label=u'Observations')
plt.plot(xx, y_pred, 'r-', label=u'Prediction')  # least-squares fit
plt.plot(xx, y_upper, 'k-')  # predicted 95th percentile
plt.plot(xx, y_lower, 'k-')  # predicted 5th percentile
plt.fill(np.concatenate([xx, xx[::-1]]),
         np.concatenate([y_upper, y_lower[::-1]]),
         alpha=.5, fc='b', ec='None', label='90% prediction interval')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.ylim(-10, 20)
plt.legend(loc='upper left')
plt.show()

My questions are:

How does quantile regression work here i.e. how is the model trained?
How do I use a quantile regression model at prediction time? Does it give 3 predictions? What are y_lower and y_upper?

Best Answer

To answer your questions:

How does quantile regression work here i.e. how is the model trained?

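When creating the model, you passed loss='quantile' along with alpha=0.95, so you are optimizing the quantile loss for the 95th percentile. The quantile loss charges alpha per unit when the prediction is too low and 1 - alpha per unit when it is too high, so minimizing it pushes the prediction toward the value that 95% of the observations fall below.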
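To make the loss concrete, here is a minimal sketch of the quantile (pinball) loss in NumPy. The function name `pinball_loss` and the toy data are illustrative assumptions, not part of the question's code. The sketch searches for the constant prediction with the lowest loss at alpha=0.95 and compares it with the empirical 95th percentile of the sample.

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile (pinball) loss for quantile level alpha."""
    diff = y_true - y_pred
    # Under-predictions (diff > 0) cost alpha per unit,
    # over-predictions (diff < 0) cost (1 - alpha) per unit.
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=2.0, size=2000)  # toy target values (assumed)

# Search over constant predictions for the one with the lowest loss.
candidates = np.linspace(y.min(), y.max(), 2001)
losses = [pinball_loss(y, c, alpha=0.95) for c in candidates]
best = candidates[int(np.argmin(losses))]

# The two printed values should be close to each other.
print(best, np.quantile(y, 0.95))
```

A gradient boosting model trained with loss='quantile' minimizes the same kind of loss, except that the prediction depends on the input features instead of being a single constant.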

How do I use a quantile regression model at prediction time? Does it give 3 predictions? What are y_lower and y_upper?

In your code, you have created one model and reused it. You first fit and predict with alpha=0.95, which gives y_upper (the predicted 95th percentile). Then, using clf.set_params(), you refit the same object with alpha=0.05 and predict again, which gives y_lower (the predicted 5th percentile). Each fit yields one prediction per input, not three.

For real predictions, you'll fit 3 (or more) separate models, one for each quantile required, to get 3 (or more) predictions.
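The "one model per quantile" approach might look like the sketch below. The toy data, the quantile levels, and the names `quantiles`, `models` and `preds` are assumptions for illustration. It also uses the median (alpha=0.5) as the central prediction, which is a substitution for the least-squares fit in the question's code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10.0, size=(200, 1))
# Assumed toy target: x * sin(x) plus Gaussian noise
y = (X * np.sin(X)).ravel() + rng.normal(0, 2.0, size=200)

# One independent model per quantile level; each is trained separately.
quantiles = {"lower": 0.05, "median": 0.5, "upper": 0.95}
models = {
    name: GradientBoostingRegressor(
        loss="quantile", alpha=q, n_estimators=100, max_depth=3
    ).fit(X, y)
    for name, q in quantiles.items()
}

# At prediction time, each model returns one value per input row.
X_new = np.array([[2.5], [7.0]])
preds = {name: m.predict(X_new) for name, m in models.items()}
for name, values in preds.items():
    print(name, values)

# Fraction of training targets inside the [5%, 95%] band (nominally about 0.9).
inside = (y >= models["lower"].predict(X)) & (y <= models["upper"].predict(X))
print("coverage on training data:", inside.mean())
```

Because the models are trained independently, nothing forces the lower prediction to stay below the upper one for every input, so it is worth checking the band on held-out data.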

Related Solutions

Solved – Having trouble understanding cross-validation results from scikit-learn

From section 7.10.2 of The Elements of Statistical Learning (free online, and it's great):

Consider a classification problem with a large number of predictors, as may arise, for example, in genomic or proteomic applications. A typical strategy for analysis might be as follows:

1. Screen the predictors: find a subset of “good” predictors that show fairly strong (univariate) correlation with the class labels.
2. Using just this subset of predictors, build a multivariate classifier.
3. Use cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model.

Is this a correct application of cross-validation? Consider a scenario with N = 50 samples in two equal-sized classes, and p = 5000 quantitative predictors (standard Gaussian) that are independent of the class labels. The true (test) error rate of any classifier is 50%. We carried out the above recipe, choosing in step (1) the 100 predictors having highest correlation with the class labels, and then using a 1-nearest neighbor classifier, based on just these 100 predictors, in step (2). Over 50 simulations from this setting, the average CV error rate was 3%. This is far lower than the true error rate of 50%.

What has happened? The problem is that the predictors have an unfair advantage, as they were chosen in step (1) on the basis of all of the samples. Leaving samples out after the variables have been selected does not correctly mimic the application of the classifier to a completely independent test set, since these predictors “have already seen” the left out samples.

We selected the 100 predictors having largest correlation with the class labels over all 50 samples. Then we chose a random set of 10 samples, as we would do in five-fold cross-validation, and computed the correlations of the pre-selected 100 predictors with the class labels over just these 10 samples (top panel). We see that the correlations average about 0.28, rather than 0, as one might expect.
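The scenario quoted above can be reproduced roughly with scikit-learn. This is a sketch of a single simulation rather than the book's 50, and it uses `SelectKBest` with `f_classif` as a stand-in for ranking predictors by correlation with the labels; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
N, p, k = 50, 5000, 100
X = rng.standard_normal((N, p))  # predictors independent of the labels
y = np.repeat([0, 1], N // 2)    # two equal-sized classes

knn = KNeighborsClassifier(n_neighbors=1)

# Wrong: screen the predictors on ALL samples, then cross-validate.
X_screened = SelectKBest(f_classif, k=k).fit_transform(X, y)
wrong_err = 1 - cross_val_score(knn, X_screened, y, cv=5).mean()

# Right: screening sits inside the pipeline, so it is redone on the
# training part of every fold and never sees the held-out samples.
pipe = make_pipeline(SelectKBest(f_classif, k=k), knn)
right_err = 1 - cross_val_score(pipe, X, y, cv=5).mean()

print("CV error, screening outside CV:", wrong_err)
print("CV error, screening inside CV: ", right_err)
```

The first estimate should come out far too optimistic, while the second should land much closer to the true 50% error rate, up to the noise of a single small simulation.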

Solved – regression with scikit-learn with multiple outputs, svr or gbm possible

Why not make a wrapper that fits m regressors (where m is the dimensionality of each y), like this?

import numpy as np
import sklearn.base

class VectorRegression(sklearn.base.BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        n, m = y.shape
        # Fit a separate regressor for each column of y
        self.estimators_ = [sklearn.base.clone(self.estimator).fit(X, y[:, i])
                            for i in range(m)]
        return self

    def predict(self, X):
        # Join the regressors' predictions column by column
        res = [est.predict(X)[:, np.newaxis] for est in self.estimators_]
        return np.hstack(res)

Note: I haven't tested this code, but you get the idea.
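scikit-learn also ships a wrapper built on the same idea, `sklearn.multioutput.MultiOutputRegressor`, which fits one clone of the base estimator per output column. A minimal usage sketch with `SVR` follows; the toy data and variable names are assumed for illustration.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 3))
# Two target columns derived from the inputs (toy data)
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] ** 2])

# One SVR is cloned and fitted per column of Y.
model = MultiOutputRegressor(SVR(kernel="rbf")).fit(X, Y)

Y_pred = model.predict(X[:5])
print(Y_pred.shape)  # (5, 2)
```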
