Regression – How to Resolve Unexpected Statistical Prediction Intervals in Statsmodels Python

multiple-regression, prediction-interval, python, regression, statsmodels

I am using the statsmodels Python package to perform multiple linear regression. I want to produce 80% prediction interval bands as part of my results.

The statsmodels package can produce prediction intervals for a given alpha and new predictor(s). Fortunately my residuals are normally distributed, so the conventional prediction interval for normally distributed residuals is valid. My understanding is that statsmodels uses the conventional prediction interval calculation described here: https://online.stat.psu.edu/stat501/lesson/3/3.3
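
For reference, the formula on that page, written for the multiple-regression case with a new predictor vector $\mathbf{x}_h$, $n$ training observations and $p$ fitted parameters, is

$$
\hat{y}_h \pm t_{(1-\alpha/2,\ n-p)} \sqrt{\mathrm{MSE}\left(1 + \mathbf{x}_h^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_h\right)},
$$

which, as far as I can tell, is what the obs_ci_lower / obs_ci_upper columns in the code below are computed from.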

I have attached a plot of a sample of my actual target variable and its predicted equivalent. The shaded blue region represents my 80% prediction interval (I have clipped it at zero, because negative values are not possible). The model was trained on data of exactly the same shape, i.e. starting at zero, rising to some peak, and then returning to zero again.

My question is why doesn't the prediction interval vary as much as I'd expect it to along the x-axis? I'd expect it to drop to zero in the first and final third of the x-axis, given that this happens for all of the sample data I used to train the model.

It is worth mentioning that the prediction interval I have is not constant – it is dependent on the predictor values.

I know I can use bootstrapping to generate a more accurate prediction interval, but I'm curious why this conventional statistical PI isn't working for my use case.
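
For concreteness, a minimal sketch of the kind of bootstrap I have in mind (a pairs bootstrap with resampled residuals; the variable names match the code excerpt below, and the inputs are assumed to be pandas objects):

    import numpy as np
    from statsmodels.regression.linear_model import OLS

    rng = np.random.default_rng(0)
    n = len(y_train)
    fitted = OLS(y_train, X_train).fit()
    resid = np.asarray(fitted.resid)

    boot_preds = []
    for _ in range(2000):
        # resample rows with replacement and refit to capture parameter uncertainty
        idx = rng.integers(0, n, n)
        boot_fit = OLS(y_train.iloc[idx], X_train.iloc[idx]).fit()
        # add a resampled residual to each prediction to capture observation noise
        noise = rng.choice(resid, size=len(X_test), replace=True)
        boot_preds.append(np.asarray(boot_fit.predict(X_test)) + noise)

    boot_preds = np.vstack(boot_preds)
    pi_lower, pi_upper = np.percentile(boot_preds, [10, 90], axis=0)  # 80% interval

Note that this variant still resamples the pooled residuals as if they were exchangeable, so on its own it would not make the band collapse towards zero at the start and end either.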

[Figure: actual vs. predicted target values over the test range, with the shaded blue band showing the 80% prediction interval, clipped at zero]

EDIT:
I cannot post my training data, but here is an example code excerpt to show the method I was using (ordinary least-squares):

    from statsmodels.regression.linear_model import OLS
    import matplotlib.pyplot as plt

    """
    X_train is a 5-feature (5-column) training set
    y_train is a single column of target variables

    X_test is a 5-feature testing set
    y_test is a single column of unseen target variables

    prediction_summary_frame is the dataframe of predictions produced by statsmodels.
    It has one row of predictions per sample in X_test, with columns:

    mean
    mean_se
    obs_ci_lower
    obs_ci_upper
    mean_ci_lower
    mean_ci_upper

    where mean_ci_lower and mean_ci_upper represent the confidence interval, and
    obs_ci_lower and obs_ci_upper represent the prediction interval.
    """

    # OLS takes the target (endog) first, then the design matrix (exog)
    model = OLS(y_train, X_train)
    ols_results = model.fit()

    # alpha=0.2 gives the 80% prediction interval described above
    prediction_summary_frame = ols_results.get_prediction(X_test).summary_frame(
        alpha=0.2
    )

    fig, ax = plt.subplots()
    ax.plot(
        X_test.index,
        prediction_summary_frame["mean"],
        label="predicted",
    )
    ax.fill_between(
        X_test.index,
        prediction_summary_frame["obs_ci_lower"],
        prediction_summary_frame["obs_ci_upper"],
        alpha=0.4,
        label="80% prediction interval",
    )
    ax.plot(
        X_test.index,
        y_test,
        label="actual",
    )
    ax.legend()
    plt.show()


Best Answer

A prediction interval for OLS contains two components: uncertainty about the predicted mean plus the uncertainty of a new residual.

In OLS, the assumption is that the residual variance is constant, so the width of the prediction interval coming from the second component will be the same for all values of x.
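
In symbols, with constant residual variance $\sigma^2$, the prediction interval is built on

$$
\operatorname{Var}\left(y_{\text{new}} - \hat{y}_{\text{new}}\right) = \sigma^2\, \mathbf{x}_{\text{new}}^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_{\text{new}} + \sigma^2 ,
$$

and only the first (typically smaller) term depends on $\mathbf{x}_{\text{new}}$; the constant $\sigma^2$ term is what keeps the band at roughly the same width everywhere.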

The model that underlies OLS is linear and does not impose any nonnegativity constraint, so in the first and last parts the prediction interval will be centered around a predicted mean near zero, but with approximately the same width as in the other parts because of the homoscedasticity.

A more appropriate model would be a Poisson likelihood or quasi-likelihood model. The Poisson is inherently heteroscedastic: its variance equals its mean. With a log link (an exponential mean function), the prediction interval collapses to the set {0}, or {0, 1}, as the Poisson rate (mean) goes to zero.

(However, statsmodels currently provides prediction intervals only for OLS, not for nonlinear models such as Poisson and other non-Gaussian GLMs. Prediction intervals that ignore parameter uncertainty can be computed from the properties of the Poisson distribution, but combining them with parameter uncertainty requires more work, e.g. computing tolerance intervals as in https://www.statsmodels.org/dev/generated/statsmodels.stats.rates.tolerance_int_poisson.html .)
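
As a minimal sketch of that last point, assuming the same X_train / y_train / X_test as in the question and a count-valued (or at least nonnegative) target, a Poisson interval that uses only the distribution itself and ignores parameter uncertainty looks like this:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    # Poisson GLM with the (default) log link
    poisson_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
    mu = np.asarray(poisson_results.predict(X_test))  # predicted rate for each test row

    # Naive 80% prediction interval taken straight from the Poisson distribution,
    # ignoring parameter uncertainty as noted above
    pi_lower = stats.poisson.ppf(0.10, mu)
    pi_upper = stats.poisson.ppf(0.90, mu)
    # As mu -> 0 both bounds collapse to 0, unlike the constant-width OLS band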
