The M in MAP stands for "Maximum", so it should be no surprise that optimization is involved. To obtain estimates of the parameters, you compute
$$ \hat{\theta} = \underset{\theta \in \mathbb{R}^2}{\text{argmax}} \left\{ \text{Log Prior} + \text{Log Likelihood} \right\}$$
You can't do MAP with samples from the posterior (and I'm not sure why you would need to).
Let's take a look at an example:
Let's say my prior for my slope is a normal distribution with mean 0 and standard deviation 2
$$ \beta_1 \sim \mathcal{N}(0, 2^2) $$
and my prior for my intercept is a normal distribution with mean 1 and standard deviation 1
$$ \beta_0 \sim \mathcal{N}(1, 1^2) $$
I'm going to generate some data and write out some functions to compute log prior, log likelihood, and log posterior.
Let's start with the log prior. If the priors for the parameters are independent (meaning we don't place a joint prior on the slope and intercept), then the log prior is the sum of the log priors for each parameter.
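Concretely, for the two priors above, this is
$$ \log p(\beta_0, \beta_1) = \log \mathcal{N}(\beta_0 \mid 1, 1^2) + \log \mathcal{N}(\beta_1 \mid 0, 2^2) $$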
In R...
log_prior = function(theta){
  intercept_log_prior = dnorm(theta[1], mean = 1, sd = 1, log = T)
  slope_log_prior = dnorm(theta[2], mean = 0, sd = 2, log = T)
  intercept_log_prior + slope_log_prior
}
Here, theta is a two-element vector: the first element is the intercept, the second is the slope. Now we need the log likelihood. For linear regression, that is typically Gaussian
$$ y \sim \mathcal{N}(\beta_0 + \beta_1 x, \sigma^2) $$
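Since the observations are independent, the log likelihood is just the sum of the pointwise log densities,
$$ \log p(y \mid \beta_0, \beta_1) = \sum_{i=1}^{n} \log \mathcal{N}\!\left(y_i \mid \beta_0 + \beta_1 x_i, \sigma^2\right) $$
which is exactly what sum(dnorm(..., log = T)) computes in the code below.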
Usually, we have to estimate $\sigma$ by placing a prior on it too, but I'm going to pretend we already know it; here I'll take $\sigma = 1$, which matches the noise used to simulate the data below. Here is the log likelihood in R
log_lik = function(theta, x, y){
  # conditional mean is theta[1] + theta[2]*x; sigma is treated as known (sd = 1)
  sum(dnorm(y, mean = theta[1] + theta[2]*x, sd = 1, log = T))
}
Note how I use x and theta to compute the conditional mean of $y$ when evaluating the log likelihood.
Now it's just a straightforward optimization problem. Let's generate some data, define our objective function (the negative log posterior), and optimize.
# data
set.seed(0)
x = rnorm(100)
y = 2*x + 1 + rnorm(100)
# Objective function (note the negative signs: optim minimizes by default, so we minimize the negative log posterior)
log_posterior = function(theta, x, y){
  -log_prior(theta) - log_lik(theta, x, y)
}
# Optimize!
theta_est = optim(c(0,0), log_posterior, x=x, y=y)$par
Here is the result from the MAP estimation procedure I've outlined
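As a quick sanity check (a minimal sketch using the data generated above; with priors this weak and 100 observations, the MAP estimate should land close to the ordinary least-squares fit), you can compare against lm:
# MAP estimate from optim (intercept, slope)
theta_est
# Ordinary least-squares fit for comparison
coef(lm(y ~ x))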
Best Answer
I have read the first linked earlier question, especially whuber's answer and the comments on it.
The answer is yes: you can use the density from a KDE of a numeric variable as the conditional probability $P(X=x \mid C=c)$ in Bayes' theorem, $P(C=c \mid X=x) = P(C=c) \, P(X=x \mid C=c) / P(X=x)$.
Since the marginal density of height, $P(X=x)$, is the same for every class, it normalizes out when the theorem is applied, i.e. when $P(C=c)\,P(X=x \mid C=c)$ is divided by $P(X=x)$.
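Here is a minimal sketch of the idea in R (my own illustration with made-up data: a single numeric feature height, two classes a and b, equal class priors, with densities estimated by density() and evaluated at a new point via approx()):
# Hypothetical heights for two classes
set.seed(1)
height_a = rnorm(200, mean = 165, sd = 7)
height_b = rnorm(200, mean = 178, sd = 7)
# KDEs playing the role of P(X = x | C = c)
kde_a = density(height_a)
kde_b = density(height_b)
# Evaluate each KDE at a new observation by interpolating on its grid
x_new = 172
lik_a = approx(kde_a$x, kde_a$y, xout = x_new)$y
lik_b = approx(kde_b$x, kde_b$y, xout = x_new)$y
# Class priors P(C = c)
prior_a = 0.5
prior_b = 0.5
# Bayes' theorem: P(X = x) is the same for both classes, so it cancels when we normalize
post_a = prior_a * lik_a / (prior_a * lik_a + prior_b * lik_b)
c(a = post_a, b = 1 - post_a)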
This paper could be interesting for you: Estimating Continuous Distributions in Bayesian Classifiers