Regression – How to Resolve Unexpected Statistical Prediction Intervals in Statsmodels Python

multiple-regression, prediction-interval, python, regression, statsmodels

I am using the statsmodels Python package to perform multiple linear regression. I want to produce 80% prediction interval bands as part of my results.

The statsmodels package can produce prediction intervals for a given alpha and new predictor(s). Fortunately my residuals are normally distributed, so the conventional prediction interval for normally distributed residuals is valid. My understanding is that statsmodels uses the conventional prediction interval calculation described here: https://online.stat.psu.edu/stat501/lesson/3/3.3
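
For reference, the formula on that page, written for the multiple-regression case with a new predictor vector $\mathbf{x}_h$, $n$ training observations and $p$ fitted parameters, is

$$
\hat{y}_h \pm t_{(1-\alpha/2,\ n-p)} \sqrt{\mathrm{MSE}\left(1 + \mathbf{x}_h^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_h\right)},
$$

which, as far as I can tell, is what the obs_ci_lower / obs_ci_upper columns in the code below are computed from.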

I have attached a plot of a sample of my actual target variable and its predicted equivalent. The shaded blue region represents my 80% prediction interval (I have clipped it at zero, because negative values are not possible). The model was trained on data of exactly the same shape, i.e. starting at zero, rising to some peak, and then returning to zero again.

My question is why doesn't the prediction interval vary as much as I'd expect it to along the x-axis? I'd expect it to drop to zero in the first and final third of the x-axis, given that this happens for all of the sample data I used to train the model.

It is worth mentioning that the prediction interval I have is not constant – it is dependent on the predictor values.

I know I can use bootstrapping to generate a more accurate prediction interval, but I'm curious why this conventional statistical PI isn't working for my use case.
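
For concreteness, a minimal sketch of the kind of bootstrap I have in mind (a pairs bootstrap with resampled residuals; the variable names match the code excerpt below, and the inputs are assumed to be pandas objects):

    import numpy as np
    from statsmodels.regression.linear_model import OLS

    rng = np.random.default_rng(0)
    n = len(y_train)
    fitted = OLS(y_train, X_train).fit()
    resid = np.asarray(fitted.resid)

    boot_preds = []
    for _ in range(2000):
        # resample rows with replacement and refit to capture parameter uncertainty
        idx = rng.integers(0, n, n)
        boot_fit = OLS(y_train.iloc[idx], X_train.iloc[idx]).fit()
        # add a resampled residual to each prediction to capture observation noise
        noise = rng.choice(resid, size=len(X_test), replace=True)
        boot_preds.append(np.asarray(boot_fit.predict(X_test)) + noise)

    boot_preds = np.vstack(boot_preds)
    pi_lower, pi_upper = np.percentile(boot_preds, [10, 90], axis=0)  # 80% interval

Note that this variant still resamples the pooled residuals as if they were exchangeable, so on its own it would not make the band collapse towards zero at the start and end either.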

[Figure: actual vs. predicted target values over the test range, with the shaded blue band showing the 80% prediction interval, clipped at zero]

EDIT:
I cannot post my training data, but here is an example code excerpt to show the method I was using (ordinary least-squares):

    from statsmodels.regression.linear_model import OLS
    import matplotlib.pyplot as plt

    """
    X_train is a 5-feature (5-column) training set
    y_train is a single column of target variables

    X_test is a 5-feature testing set
    y_test is a single column of unseen target variables

    prediction_summary_frame is the dataframe of predictions produced by statsmodels.
    It has one row of predictions per sample in X_test, with columns:

    mean
    mean_se
    obs_ci_lower
    obs_ci_upper
    mean_ci_lower
    mean_ci_upper

    where mean_ci_lower and mean_ci_upper represent the confidence interval, and
    obs_ci_lower and obs_ci_upper represent the prediction interval.
    """

    # OLS takes the target (endog) first, then the design matrix (exog)
    model = OLS(y_train, X_train)
    ols_results = model.fit()

    # alpha=0.2 gives the 80% prediction interval described above
    prediction_summary_frame = ols_results.get_prediction(X_test).summary_frame(
        alpha=0.2
    )

    fig, ax = plt.subplots()
    ax.plot(
        X_test.index,
        prediction_summary_frame["mean"],
        label="predicted",
    )
    ax.fill_between(
        X_test.index,
        prediction_summary_frame["obs_ci_lower"],
        prediction_summary_frame["obs_ci_upper"],
        alpha=0.4,
        label="80% prediction interval",
    )
    ax.plot(
        X_test.index,
        y_test,
        label="actual",
    )
    ax.legend()
    plt.show()


Best Answer

A prediction interval for OLS contains two components: uncertainty about the predicted mean plus the uncertainty of a new residual.

In OLS, the assumption is that the residual variance is constant, so the width of the prediction interval coming from the second component will be the same for all values of x.
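
In symbols, with constant residual variance $\sigma^2$, the prediction interval is built on

$$
\operatorname{Var}\left(y_{\text{new}} - \hat{y}_{\text{new}}\right) = \sigma^2\, \mathbf{x}_{\text{new}}^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_{\text{new}} + \sigma^2 ,
$$

and only the first (typically smaller) term depends on $\mathbf{x}_{\text{new}}$; the constant $\sigma^2$ term is what keeps the band at roughly the same width everywhere.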

The model that underlies OLS is linear and does not impose any nonnegativity constraint, so in the first and last parts the prediction interval will be centered around a predicted mean near zero, but with approximately the same width as in the other parts because of the homoscedasticity.

A more appropriate model would be a Poisson likelihood or quasi-likelihood model. The Poisson is inherently heteroscedastic: its variance equals its mean. With a log link (an exponential mean function), the prediction interval collapses to the set {0}, or {0, 1}, as the Poisson rate (mean) goes to zero.

(However, statsmodels currently provides prediction intervals only for OLS, not for nonlinear models such as Poisson and other non-Gaussian GLMs. Prediction intervals that ignore parameter uncertainty can be computed from the properties of the Poisson distribution, but combining them with parameter uncertainty requires more work, e.g. computing tolerance intervals as in https://www.statsmodels.org/dev/generated/statsmodels.stats.rates.tolerance_int_poisson.html .)
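
As a minimal sketch of that last point, assuming the same X_train / y_train / X_test as in the question and a count-valued (or at least nonnegative) target, a Poisson interval that uses only the distribution itself and ignores parameter uncertainty looks like this:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    # Poisson GLM with the (default) log link
    poisson_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
    mu = np.asarray(poisson_results.predict(X_test))  # predicted rate for each test row

    # Naive 80% prediction interval taken straight from the Poisson distribution,
    # ignoring parameter uncertainty as noted above
    pi_lower = stats.poisson.ppf(0.10, mu)
    pi_upper = stats.poisson.ppf(0.90, mu)
    # As mu -> 0 both bounds collapse to 0, unlike the constant-width OLS band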
