Maximizing likelihood versus MCMC sampling: Comparing Parameters and Deviance

markov-chain-monte-carlo, maximum-likelihood, rstan

I am working in R. I use lm() for maximizing the likelihood in the first analysis, and STAN to sample from the posterior in a second analysis.

require(rstan)

I have fabricated some data.

set.seed(123)

N      <- 1000
data   <- as.data.frame(sapply(1:3,function(x)rnorm(N)))
data$y <- 2 + data$V1 + 2*data$V2 + 4*data$V3 + rnorm(N,0,4)

I fit a linear model to the data. I'll present results later.

lm1 <- lm(y ~ V1 + V2 + V3,data = data)

Next, I use STAN to sample from the posterior distribution of a very similar model (the same model, except that STAN implicitly places flat, improper uniform priors on any parameter that is not given an explicit prior). Below is the STAN model, which is saved in a file called LM.stan.

data{
    int N;
    real y[N];
    matrix[N,3] X;
}
parameters{
    real a;
    vector[3] b;
    real<lower=0> sigma;  // residual SD must be positive
}
model{
   y ~ normal( a + X*b, sigma );
}
generated quantities{
    real logLik;
    vector[N] lm;
    lm <- a + X*b;
    logLik <- 0;  // initialize before accumulating; otherwise logLik starts undefined
    for( i in 1:N ) {
        logLik <- logLik + normal_log( y[i], lm[i], sigma );
    }
}

Below is the R code that creates a data object that STAN can cope with, and compiles and executes the sampler.

standata <- list(X = model.matrix(~-1+V1+V2+V3,data=data),
                 N = nrow(data),
                 y = data$y
            )
lm2 <- stan("LM.stan",verbose=F,iter=1000,data=standata,refresh=100)

We can examine the means and standard deviations of the marginal distributions of each parameter, and the mean log likelihood.

samples    <- do.call(cbind,extract(lm2,c("a","b","sigma","logLik")))
colnames(samples) <- c('a','b1','b2','b3','sig','logLik')
fitTab     <- data.frame(t(apply(samples,2,function(x) c(mean=mean(x),sd=sd(x)))))
fitTab
#                mean        sd
#a          1.9671178 0.1272709
#b1         0.9951333 0.1290366
#b2         1.9647771 0.1263366
#b3         4.2091996 0.1339271
#sig        3.9764835 0.0842643
#logLik -2798.2644784 1.6468066

And for comparison, here are the maximum likelihood parameter estimates and the maximum log likelihood from lm().

summary(lm1)

#Call:
#lm(formula = y ~ V1 + V2 + V3, data = data)
#
#Residuals:
#     Min       1Q   Median       3Q      Max 
#-12.2247  -2.5338   0.0019   2.7010  11.5690 
#
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)    
#(Intercept)   1.9690     0.1257  15.664  < 2e-16 ***
#V1            0.9948     0.1272   7.823 1.31e-14 ***
#V2            1.9675     0.1249  15.750  < 2e-16 ***
#V3            4.2059     0.1285  32.740  < 2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 3.97 on 996 degrees of freedom
#Multiple R-squared:  0.5884,   Adjusted R-squared:  0.5872 
#F-statistic: 474.7 on 3 and 996 DF,  p-value: < 2.2e-16

logLik(lm1)
#'log Lik.' -2795.739 (df=5)
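
As a side note (an illustrative check, not part of the original comparison): logLik.lm evaluates the Gaussian density at the MLE of sigma, sqrt(RSS/n), rather than at the residual standard error, which divides by n - p. The reported value can be reproduced by hand:

sigmaMLE <- sqrt(mean(residuals(lm1)^2))              # MLE of sigma, not the residual SE
sum(dnorm(data$y, fitted(lm1), sigmaMLE, log = TRUE)) # matches logLik(lm1) up to rounding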

I notice that the parameter estimates and standard errors are very similar to the means and SDs of the posterior samples. However, the log likelihoods differ. I recognize that the likelihood reported for the posterior samples is a mean, so I check the largest log-likelihood value among the samples and the parameter values associated with it. Sure enough, the largest sampled value is very close to the maximum log likelihood from the OLS fit.

t(samples[samples[,'logLik'] == max(samples[,'logLik']), , drop=F])
#                [,1]
#a          1.9935327
#b1         0.9750932
#b2         1.9817911
#b3         4.2507953
#sig        3.9632800
#logLik -2795.8380670

However, the parameter values that resulted in the largest likelihood from the samples differ quite a bit from the OLS estimates.

To summarize, the two strange and most likely related facts I observe are that 1) the marginal parameter means look a lot like the OLS estimates, but the mean of the sampled log likelihoods does not look like the maximum log likelihood from OLS, and 2) the parameters associated with the largest-likelihood sample from STAN, whose log likelihood is very close to the OLS value, do not match the maximum likelihood estimates from OLS.

Why should I not be surprised by this (particularly fact 2)?

Best Answer

Imagine that your posterior is roughly Gaussian. The expected squared Euclidean distance from the mode of the best of $N$ draws depends on the dimension $d$ and is asymptotically $O(1/N^{2/d})$, while that of the average of the draws remains $O(1/N)$. As soon as you have more than 2 dimensions, the average converges faster. Here $d = 5$ (intercept, three slopes, and $\sigma$), so the average of the points around the mode will be a much better estimate of the mode than the single point closest to the mode.
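
You can see this directly on the example above (an illustrative check, not from the original post): evaluate the log likelihood at the posterior means and compare it with the single best draw; the posterior means should land much closer to the lm() value.

# Illustrative check: log likelihood at the posterior means vs. the best single draw
postMean <- colMeans(samples[, c('a','b1','b2','b3','sig')])
mu       <- postMean['a'] + standata$X %*% postMean[c('b1','b2','b3')]
sum(dnorm(standata$y, mu, postMean['sig'], log = TRUE))  # should be very close to logLik(lm1)
max(samples[,'logLik'])                                  # noticeably lower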

If that's not immediately intuitive, imagine you're drawing from a standard multivariate normal with identity covariance and $d=1000$. The vector closest to $0$ is still going to be far from it on average; in high dimensions, most of the mass of a Gaussian sits away from the center. The average, by contrast, will be much closer, since values from different draws cancel out in each dimension.

Try it in Python:

import numpy as np

# True when the squared norm of the mean of 1000 draws is smaller than the
# smallest squared norm among the individual draws
def draw(d):
    x = np.random.randn(1000 * d).reshape((1000, d))
    return np.sum(np.mean(x, axis=0) ** 2) < np.min(np.sum(x ** 2, axis=1))

for d in range(1, 4):
    print(d, np.mean([draw(d) for i in range(0, 1000)]))

Edit: I got nerd-sniped into computing the multiplicative factor in the expectation of the minimum squared Euclidean norm over $n$ draws from a standard multivariate normal in dimension $d$. It is asymptotically equivalent to

$$2\Gamma\left(1+\frac{d}{2}\right)^{\frac{2}{d}}\Gamma\left(1+\frac{2}{d}\right) n^{-2/d}$$

while the expected squared norm of the average of the $n$ draws is of course $d\,n^{-1}$.
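
A quick Monte Carlo check of that constant (an illustrative sketch, not from the original answer); for moderate $d$ and $n = 1000$ the simulated mean and the asymptotic expression should be in the same ballpark:

# Simulated E[min squared norm] vs. the asymptotic formula above
minSqNorm <- function(d, n) {
  x <- matrix(rnorm(n * d), nrow = n)
  min(rowSums(x^2))              # smallest squared norm among the n draws
}
d <- 3; n <- 1000
c(simulated  = mean(replicate(2000, minSqNorm(d, n))),
  asymptotic = 2 * gamma(1 + d/2)^(2/d) * gamma(1 + 2/d) * n^(-2/d))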
