Solved – Implementing Bayesian Linear Regression using PyMC3

bayesian, generalized linear model, prediction interval, python

I am learning a Bayesian approach to linear regression.

The motivation is that a Bayesian approach gives you a range of predictions, which can be useful when investing money in capital markets or in medical research.

So far, what I've understood is that, given a linear equation, we try to estimate its parameters using Bayes' theorem, as suggested in this link.

According to Bayes' Theorem
$$ posterior \propto likelihood \times prior $$

What is the mathematical proof that, in the case of linear regression,
if the likelihood is $$ Y \mid X, \theta \sim N(\alpha + \beta x,\; \epsilon^2) $$

and the prior on the mean is $$ \mu \sim N(\mu_0, \sigma_0^2) $$ and the prior on the variance is $$ \epsilon^2 \sim IG(a, b), $$
then the posterior distribution is a Normal distribution?

Using this link, I've implemented a basic linear regression example in Python, for which the code is:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import pymc3 as pm 
from scipy import optimize

alpha, sigma = 1, 1
beta = [1, 2.5]

# Size of dataset
size = 100

# Predictor variable
X1 = np.linspace(0, 1, size)
X2 = np.linspace(0,.2, size)

Y = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma

# plt.plot(Y)
# plt.show()

basic_model = pm.Model()

with basic_model:
    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    sigma = pm.HalfNormal('sigma', sd=1)

    # Expected value of outcome
    mu = alpha + beta[0]*X1 + beta[1]*X2

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)

    # obtain starting values via MAP
    start = pm.find_MAP(fmin=optimize.fmin_powell)

    # instantiate sampler
    step = pm.NUTS(scaling=start)

    trace = pm.sample(2000, step, start=start, cores=4)

pm.traceplot(trace) 
plt.show()


summary_df = pm.summary(trace)
print(summary_df)

# Point predictions from the posterior means, plus attempted interval
# bounds built from the HPD columns of the summary table
predictions = summary_df.loc['alpha','mean'] + summary_df.loc['beta__0','mean']*X1 + summary_df.loc['beta__1','mean']*X2 + np.random.randn(size)*summary_df.loc['sigma','mean']
upper_limit = summary_df.loc['alpha','hpd_97.5'] + summary_df.loc['beta__0','hpd_97.5']*X1 + summary_df.loc['beta__1','hpd_97.5']*X2 + np.random.randn(size)*summary_df.loc['sigma','hpd_97.5']
lower_limit = summary_df.loc['alpha','hpd_2.5'] + summary_df.loc['beta__0','hpd_2.5']*X1 + summary_df.loc['beta__1','hpd_2.5']*X2 + np.random.randn(size)*summary_df.loc['sigma','hpd_2.5']

plt.plot(predictions, label='Predictions')
plt.plot(upper_limit, label='Upper Limit')
plt.plot(lower_limit, label='Lower Limit')
plt.plot(Y, label='Actual')
plt.legend()
plt.show()

After analyzing the trace summary, I've observed that the estimates for beta__0 and beta__1 are not good. The results are below.

             mean        sd  mc_error    hpd_2.5   hpd_97.5        n_eff      Rhat
alpha    0.992383  0.196083  0.002652   0.614840   1.381226  4978.643737  0.999964
beta__0  1.609108  1.973816  0.064427  -2.174173   5.570459   905.298746  1.001097
beta__1  0.099368  9.739603  0.321035 -19.832449  18.345334   889.614045  1.001005
sigma    0.989427  0.071813  0.000799   0.858429   1.137455  7452.629272  1.000030

The questions I need answered are as follows:

  1. A proof that the above-mentioned prior and likelihood give a posterior that follows a normal distribution.
  2. I expected beta__0 to be approximately 1 and beta__1 to be approximately 2.5. Is there any reason that explains these poor estimates? For alpha and sigma, the posterior mean is approximately 1, which is quite close to the actual value of 1 used when generating the dummy data.
  3. How do we decide on the likelihood and prior distributions when estimating model parameters in a Bayesian way? Do I need to change sigma = pm.HalfNormal('sigma', sd=1) in the code to an Inverse-Gamma distribution?
  4. Is the implemented prediction code correct? I've used hpd_2.5 and hpd_97.5 as the lower and upper bounds, respectively, to generate a range of predictions. Is that correct? If so, how can the upper-limit values be smaller than the lower-limit values?

Edit: Updated the code with a predictions-vs-actual plot. I'm not sure if the implementation is correct.

Best Answer

You are getting bad results because your X1 and X2 are perfectly correlated, so there are many combinations of beta__0 and beta__1 that produce the same fit.

Use randomly selected X1 and X2 (instead of linspace), for example, to remove this problem; a sketch is shown below.
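For illustration, here is a minimal sketch of that change, replacing the data-generation part of your script (it assumes the same alpha, beta, sigma and size as above; np.random.rand is just one way to break the collinearity):

# Draw the predictors independently instead of on a shared grid,
# so X1 and X2 are no longer perfectly correlated
np.random.seed(123)
X1 = np.random.rand(size)
X2 = np.random.rand(size) * 0.2

# Sanity check: this correlation should now be far from +/-1
print(np.corrcoef(X1, X2)[0, 1])

Y = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma

With this data the model code can stay exactly as it is, and the posterior means for beta__0 and beta__1 should land much closer to 1 and 2.5.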

To get credible intervals for the predictions, use the trace to calculate Y for each set of coefficients in the trace. That gives you samples from the posterior distribution of Y, and you then compute the HPD interval from those samples (PyMC3 has a function for computing the HPD of a posterior sample). A sketch of this is below.
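Here is a minimal sketch of that idea, reusing the trace, X1, and X2 from the script above. np.percentile is used as a simple stand-in for the interval; depending on your PyMC3 version, pm.hpd (or arviz.hdi in newer releases) can give a proper HPD interval instead.

# One simulated Y curve per posterior draw: shape (n_draws, size)
alpha_s = trace['alpha'][:, None]   # (n_draws, 1)
beta_s = trace['beta']              # (n_draws, 2)
sigma_s = trace['sigma'][:, None]   # (n_draws, 1)

mu_samples = alpha_s + beta_s[:, [0]]*X1 + beta_s[:, [1]]*X2
# Add observation noise to turn the mean samples into posterior predictive samples
y_samples = mu_samples + np.random.randn(*mu_samples.shape)*sigma_s

predictions = y_samples.mean(axis=0)
lower_limit, upper_limit = np.percentile(y_samples, [2.5, 97.5], axis=0)

Because both bounds come from the same set of simulated Y values at each point, the upper limit can never fall below the lower limit, which also resolves your question 4.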