The M in MAP stands for "Maximum", so it should be no surprise that optimization is involved. To obtain estimates of the parameters, you compute
$$ \hat{\theta} = \underset{\theta \in \mathbb{R}^2}{\text{argmax}} \left\{ \text{Log Prior} + \text{Log Likelihood} \right\}$$
You can't do MAP with samples from the posterior (and I'm not sure why you would need to).
Let's take a look at an example:
Let's say my prior for my slope is a normal distribution with mean 0 and standard deviation 2
$$ \beta_1 \sim \mathcal{N}(0, 2^2) $$
and my prior for my intercept is a normal distribution with mean 1 and standard deviation 1
$$ \beta_0 \sim \mathcal{N}(1, 1^2) $$
I'm going to generate some data and write out some functions to compute log prior, log likelihood, and log posterior.
Let's start with the log prior. If the priors for the parameters are independent (meaning we don't place a joint prior on the slope and intercept), then the log prior is the sum of the log priors for each parameter.
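Concretely, for the two priors above, this is
$$ \log p(\beta_0, \beta_1) = \log \mathcal{N}(\beta_0 \mid 1, 1^2) + \log \mathcal{N}(\beta_1 \mid 0, 2^2) $$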
In R...
log_prior = function(theta){
  intercept_log_prior = dnorm(theta[1], mean = 1, sd = 1, log = T)
  slope_log_prior = dnorm(theta[2], mean = 0, sd = 2, log = T)
  intercept_log_prior + slope_log_prior
}
Here, theta is a two-element vector: the first element is the intercept, the second is the slope. Now we need the log likelihood. For linear regression, that is typically Gaussian
$$ y \sim \mathcal{N}(\beta_0 + \beta_1 x, \sigma^2) $$
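Since the observations are independent, the log likelihood is just the sum of the pointwise log densities,
$$ \log p(y \mid \beta_0, \beta_1) = \sum_{i=1}^{n} \log \mathcal{N}\!\left(y_i \mid \beta_0 + \beta_1 x_i, \sigma^2\right) $$
which is exactly what sum(dnorm(..., log = T)) computes in the code below.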
Usually, we have to estimate $\sigma$ by placing a prior on it too, but I'm going to pretend we already know it; here I'll take $\sigma = 1$, which matches the noise used to simulate the data below. Here is the log likelihood in R
log_lik = function(theta, x, y){
  # conditional mean is theta[1] + theta[2]*x; sigma is treated as known (sd = 1)
  sum(dnorm(y, mean = theta[1] + theta[2]*x, sd = 1, log = T))
}
Note how I use x and theta to compute the conditional mean of $y$ when evaluating the log likelihood.
Now it's just a straightforward optimization problem. Let's generate some data, define our objective function (the negative log posterior), and optimize.
# data
set.seed(0)
x = rnorm(100)
y = 2*x + 1 + rnorm(100)
# Objective function (note the negative signs: optim minimizes by default, so we minimize the negative log posterior)
log_posterior = function(theta, x, y){
  -log_prior(theta) - log_lik(theta, x, y)
}
# Optimize!
theta_est = optim(c(0,0), log_posterior, x=x, y=y)$par
Here is the result from the MAP estimation procedure I've outlined
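As a quick sanity check (a minimal sketch using the data generated above; with priors this weak and 100 observations, the MAP estimate should land close to the ordinary least-squares fit), you can compare against lm:
# MAP estimate from optim (intercept, slope)
theta_est
# Ordinary least-squares fit for comparison
coef(lm(y ~ x))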
Best Answer
I have read the first linked earlier question, especially whuber's answer and the comments on it.
The answer is yes: you can use the density from a KDE of a numeric variable as the conditional probability $P(X=x \mid C=c)$ in Bayes' theorem, $P(C=c \mid X=x) = P(C=c) \, P(X=x \mid C=c) / P(X=x)$.
Since the marginal density of height, $P(X=x)$, is the same for every class, it normalizes out when the theorem is applied, i.e. when $P(C=c)\,P(X=x \mid C=c)$ is divided by $P(X=x)$.
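Here is a minimal sketch of the idea in R (my own illustration with made-up data: a single numeric feature height, two classes a and b, equal class priors, with densities estimated by density() and evaluated at a new point via approx()):
# Hypothetical heights for two classes
set.seed(1)
height_a = rnorm(200, mean = 165, sd = 7)
height_b = rnorm(200, mean = 178, sd = 7)
# KDEs playing the role of P(X = x | C = c)
kde_a = density(height_a)
kde_b = density(height_b)
# Evaluate each KDE at a new observation by interpolating on its grid
x_new = 172
lik_a = approx(kde_a$x, kde_a$y, xout = x_new)$y
lik_b = approx(kde_b$x, kde_b$y, xout = x_new)$y
# Class priors P(C = c)
prior_a = 0.5
prior_b = 0.5
# Bayes' theorem: P(X = x) is the same for both classes, so it cancels when we normalize
post_a = prior_a * lik_a / (prior_a * lik_a + prior_b * lik_b)
c(a = post_a, b = 1 - post_a)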
This paper could be interesting for you: Estimating Continuous Distributions in Bayesian Classifiers