Bayesian Linear Regression – Understanding Overconfidence in Models

Tags: bayesian, linear, MATLAB, regression

I have encountered a problem with Bayesian linear regression models, which I describe below. I hope someone can give me a better understanding of Bayesian models or suggest a possible fix.

Let's say we have two different functions (for simplicity, a quadratic and a linear one):

$f_1(x) = a_1 + a_2x + a_3x^2$

$f_2(x) = b_1 + b_2x$

Now, we want to infer the parameters ($b_1, b_2$) of the second function from data generated by the first one, using Bayesian linear regression. A small noise term is known and fixed for this example. This is my result (my MATLAB code is attached below):

[Figure: predictive distribution with data points, plus prior/posterior densities for $b_1$ and $b_2$]

As can be seen, the posterior distributions of both parameters are very narrow and confident. However, the predictive distribution does not fit the data points very well. I would have expected the parameter uncertainty to be much larger, to compensate for the model mismatch (namely the missing quadratic term). Does someone know why my intuition does not match the results, and is there a possible fix/workaround? Thank you.
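For reference, the posterior that the code below computes is the standard conjugate Gaussian update: with prior $\mathcal{N}(M_0, S_0)$, design matrix $Z$ (the basis functions evaluated at the inputs), and known noise variance $\sigma_n^2$,

$S = \left(S_0^{-1} + \tfrac{1}{\sigma_n^2} Z^\top Z\right)^{-1}, \qquad M = S\left(S_0^{-1} M_0 + \tfrac{1}{\sigma_n^2} Z^\top y\right).$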

% - 2022-02-04
%% - code
% - cleanup
clear;
close all;
clc;

% - fix random number generator
seed = 29057;
rng(seed);

% - first function (ground truth, unknown)
fcn1.p = [1, 2, 3]';
fcn1.basefcn = @(x) [ones(size(x, 1), 1), x, x.^2];
fcn1.np = length(fcn1.p);

% - second function
fcn2.basefcn = @(x) [ones(size(x, 1), 1), x];
fcn2.np = 2;

% - data points
sn2 = 1;
nd = 100;
X = sort(randn(nd, 1));
y = fcn1.basefcn(X)*fcn1.p + sqrt(sn2)*randn(nd, 1);

% - prior
M0 = zeros(fcn2.np, 1);
S0 = eye(fcn2.np);
iS = inv(S0);

% - posterior
Z = fcn2.basefcn(X);
S = inv(iS +  Z'*Z/sn2);
M = S*(iS*M0 + Z'*y/sn2); %#ok<*MINV>

% - grid
grid0.n = 1000;
grid0.x = linspace(min(X), max(X), grid0.n)';
grid0.y = fcn1.basefcn(grid0.x)*fcn1.p;
grid0.ym = fcn2.basefcn(grid0.x)*M;
grid0.ys = sqrt(diag(sn2 + fcn2.basefcn(grid0.x)*S*fcn2.basefcn(grid0.x)'));

grid1.n = 1000;
grid1.min = min(M0(1)-5*sqrt(S0(1, 1)), M(1)-5*sqrt(S(1, 1)));
grid1.max = max(M0(1)+5*sqrt(S0(1, 1)), M(1)+5*sqrt(S(1, 1)));
grid1.p1 = linspace(grid1.min, grid1.max, grid1.n)';
grid1.prior = normpdf(grid1.p1, M0(1), sqrt(S0(1, 1)));
grid1.posterior = normpdf(grid1.p1, M(1), sqrt(S(1, 1)));

grid2.n = 1000;
grid2.min = min(M0(2)-5*sqrt(S0(2, 2)), M(2)-5*sqrt(S(2, 2)));
grid2.max = max(M0(2)+5*sqrt(S0(2, 2)), M(2)+5*sqrt(S(2, 2)));
grid2.p2 = linspace(grid2.min, grid2.max, grid2.n)';
grid2.prior = normpdf(grid2.p2, M0(2), sqrt(S0(2, 2)));
grid2.posterior = normpdf(grid2.p2, M(2), sqrt(S(2, 2)));

% - plot
figure(1);
subplot(2, 2, 1:2);
plot(grid0.x, grid0.y, 'g-');
hold on;
grid on;
plot(X, y, 'ko');
plot(grid0.x, grid0.ym + 2*grid0.ys, 'r-');
plot(grid0.x, grid0.ym - 2*grid0.ys, 'r-');
title('Predictive distribution');
xlabel('x');
ylabel('y');
legend('Ground truth', 'Data points', 'Predictive distribution (\mu\pm2\sigma)', 'Location', 'northwest');

subplot(2, 2, 3);
scatter(fcn1.p(1), 0, 'g', 'filled');
hold on;
grid on;
plot(grid1.p1, grid1.prior/max(grid1.prior), 'b-');
plot(grid1.p1, grid1.posterior/max(grid1.posterior), 'r-');
title('First parameter');
xlabel('b_1');
ylabel('p(b_1)');
legend('Ground truth', 'Prior', 'Posterior', 'Location', 'northwest');

subplot(2, 2, 4);
scatter(fcn1.p(2), 0, 'g', 'filled');
hold on;
grid on;
plot(grid2.p2, grid2.prior/max(grid2.prior), 'b-');
plot(grid2.p2, grid2.posterior/max(grid2.posterior), 'r-');
title('Second parameter');
xlabel('b_2');
ylabel('p(b_2)');

Edit: A follow-up post can be found here.

Best Answer

You should remember that in Bayesian statistics, the outcome (the posterior distribution) always depends on your prior assumptions (expressed as a prior distribution). In your case, the assumption is that the response is linear. The small uncertainties mean that, under such an assumption, the parameters can be determined with confidence.

Another way to put it: if you were absolutely 100% sure that the true response is linear, those are exactly the uncertainties you would assign to the fit parameters.
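One way to see this concretely: with a fixed noise variance, the posterior covariance $S = (S_0^{-1} + Z^\top Z/\sigma_n^2)^{-1}$ depends only on the inputs $Z$, never on the responses $y$ — so no amount of model mismatch in $y$ can widen it. A minimal NumPy sketch of the same update (not the OP's MATLAB; seed and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.standard_normal(100))
Z = np.column_stack([np.ones_like(x), x])   # linear basis [1, x]
sn2 = 1.0                                   # known noise variance
S0 = np.eye(2)                              # prior covariance

# Posterior covariance of the conjugate Gaussian update.
# Note that y never appears in this formula.
S = np.linalg.inv(np.linalg.inv(S0) + Z.T @ Z / sn2)

# Two very different response vectors: one linear, one quadratic.
y_lin = 1 + 2 * x + np.sqrt(sn2) * rng.standard_normal(100)
y_quad = 1 + 2 * x + 3 * x**2 + np.sqrt(sn2) * rng.standard_normal(100)

# S is identical in both cases; only the posterior mean changes.
print(np.diag(S))  # small variances, shrinking roughly like 1/n
```

The posterior mean $M = S(S_0^{-1}M_0 + Z^\top y/\sigma_n^2)$ does depend on $y$, but the width of the posterior is determined entirely by where you sampled, not by how well the model fits.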

But if you are not 100% sure, that's a different story: then you have to account for your prior lack of knowledge by choosing a different prior over the model(s). You cannot expect the Bayesian calculation by itself to tell you whether your choice of prior is correct.

More simply: is there a reason you don't want to include the quadratic term in the fit?
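If there isn't, adding the $x^2$ column to the basis lets the same conjugate update recover all three coefficients. A NumPy sketch (illustrative values, not the OP's MATLAB):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.standard_normal(200))
y = 1 + 2 * x + 3 * x**2 + rng.standard_normal(200)  # true quadratic, sn2 = 1

Z = np.column_stack([np.ones_like(x), x, x**2])      # basis now includes x^2
S0inv = np.eye(3)                                    # N(0, I) prior
S = np.linalg.inv(S0inv + Z.T @ Z)                   # sn2 = 1
m = S @ (Z.T @ y)                                    # posterior mean (prior mean 0)

print(m)  # close to the true coefficients [1, 2, 3]
```

With the model class able to represent the truth, the narrow posterior is now justified and the predictive band covers the data.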