Incorporating the effect of sample size in maximum likelihood estimation

Tags: confidence interval, maximum likelihood, naive bayes, sample-size, standard error

I have two sample sets, each consisting of independent trials of a binary variable X with values in {X0, X1}. For the remainder, I denote the probabilities as p0 = p(X=X0) and p1 = p(X=X1), the sample sizes as N1 and N2, and the binomial coefficients as C1 and C2.

1st Sample set: N1= 10 and the count of X0 observations is 2.

Hence, the maximum likelihood estimate (MLE) of p0 is 2/10. The likelihood function for this sample is: Lik1 = C1 * (p0^2) * (1-p0)^8, which is plotted as:

[Figure: likelihood Lik1 as a function of p0, for N1 = 10]

2nd Sample set: N2= 100 and the count of X0 observations is 20.

Hence, the maximum likelihood estimate (MLE) of p0 is 20/100. The likelihood function for this sample is: Lik2 = C2 * (p0^20) * (1-p0)^80, which is plotted as:

[Figure: likelihood Lik2 as a function of p0, for N2 = 100]

I know this much: as N increases, lik(p; x) becomes more sharply peaked around the MLE (0.2), which means the width of the confidence interval decreases. I also understand that a confidence interval is not a probability statement about the true p0, since the true parameter is a fixed value and not a random variable.
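The sharpening can be checked numerically. The following is only a rough sketch: the helper names rel_lik and width_half are made up for illustration, and "width at half the maximum" is just one convenient way to measure how peaked each likelihood is, not a confidence interval.

```r
# Grid of candidate values for p0
p <- seq(0.001, 0.999, by = 0.001)

# Binomial likelihood for k observations of X0 in n trials,
# rescaled so that its peak equals 1 (hypothetical helper)
rel_lik <- function(k, n) {
  l <- dbinom(k, n, p)
  l / max(l)
}

r1 <- rel_lik(2, 10)     # first sample set
r2 <- rel_lik(20, 100)   # second sample set

# Width of the range of p0 whose relative likelihood is at least one half
# (hypothetical helper, used only to compare the two curves)
width_half <- function(r) diff(range(p[r >= 0.5]))

width_half(r1)
width_half(r2)

# Both curves peak at 0.2, but the N = 100 curve is much narrower
plot(p, r1, type = "l", xlab = "p0", ylab = "relative likelihood")
lines(p, r2, lty = 2)
legend("topright", legend = c("N = 10", "N = 100"), lty = c(1, 2))
```

Both rescaled curves peak at the same MLE, so any difference in width_half comes from the sample size alone.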

What I do not know: I am using an algorithm (namely Naive Bayes) that requires multiplying the MLEs of several variables similar to X. For instance: P(Class|X1,X2) = D * P(X1|Class)*P(X2|Class)*P(Class), where D is the normaliser and I estimate every P(Xn|Class) via MLE.

Is there a statistically valid method, whereby I can incorporate a confidence weight (derived from the sample size) for each independent variable's MLE? In other words, I want to be able to attach a sample-size-related confidence weight to each MLE while multiplying them. I hope the question is clear enough.

Reading around a bit, I found that the approximate standard error of the MLE can be calculated as 1/sqrt(-l''(p;x)), where l''(p;x) is the second derivative of the log-likelihood with respect to p0, evaluated at the MLE; the quantity -l''(p;x) is called the observed information. But I guess multiplying each MLE by its standard error does not make much sense.
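For the two samples above this can be worked out directly. The sketch below assumes the binomial log-likelihood k*log(p) + (n-k)*log(1-p), whose second derivative has a simple closed form; the function names obs_info and se_mle are made up for illustration.

```r
# Observed information for k observations of X0 in n trials:
# minus the second derivative of k*log(p) + (n-k)*log(1-p) with respect to p
obs_info <- function(p, k, n) k / p^2 + (n - k) / (1 - p)^2

# Approximate standard error of the MLE, evaluated at p_hat = k/n
se_mle <- function(k, n) {
  p_hat <- k / n
  1 / sqrt(obs_info(p_hat, k, n))
}

se_mle(2, 10)     # about 0.126
se_mle(20, 100)   # 0.04
```

At the MLE this reduces to the familiar sqrt(p_hat*(1-p_hat)/n), so the tenfold larger sample shrinks the standard error by a factor of sqrt(10).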

Best Answer

The most practical approach I have come up with is to sample from the likelihoods. I can't say that this is statistically valid, but it does seem to make sense intuitively and take account of the information that the likelihoods provide, giving narrower intervals where the likelihoods are narrower. The motivation behind what I've done is to perturb the inputs to understand the stability of an estimate. And the likelihoods give information about how much to perturb them.

A rough and ready implementation in R is as follows:

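set.seed(99)

# Evaluate each likelihood (binomial coefficient dropped) on a grid of acc points in (0, 1]
acc <- 1000
lik1 <- function(p) { p^2 * (1-p)^8 }
lik2 <- function(p) { p^20 * (1-p)^80 }
x <- (1:acc)/acc
# Running sums over the grid act as unnormalized "cdf"s
clik1 <- cumsum(lik1(x))
clik2 <- cumsum(lik2(x))

# Inverse-cdf sampling: draw uniforms up to the total mass, then look up the grid cell
nrand <- 1000000
samplelik1 <- findInterval(runif(nrand, max=clik1[length(clik1)]), clik1) / acc
samplelik2 <- findInterval(runif(nrand, max=clik2[length(clik2)]), clik2) / acc
# 95% interval for the product of the two sampled probabilities
quantile(samplelik1*samplelik2, c(.025,.975))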
    2.5%    97.5% 
    0.011319 0.114240

Here, I've normalized the likelihood and treated it as a pdf for the probability (which isn't valid for several reasons but might serve your purpose). So clik1 is the "cdf", and the probability integral transform is used in the standard way to go from a uniform random variable, using runif, to sample the desired random variable via the inverse cdf, using findInterval.
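As a cross-check on the grid approach: normalizing p^k * (1-p)^(n-k) over [0, 1] gives exactly the Beta(k+1, n-k+1) density, so the same kind of samples can be drawn with rbeta. This is only a sketch under the same assumption as above (treating the normalized likelihood as a pdf, which amounts to a uniform prior on p); the helper name draw_lik is made up for illustration.

```r
set.seed(99)
nrand <- 100000

# Draw from the normalized likelihood of k observations of X0 in n trials,
# i.e. from Beta(k + 1, n - k + 1) (hypothetical helper)
draw_lik <- function(nrand, k, n) rbeta(nrand, k + 1, n - k + 1)

s1 <- draw_lik(nrand, 2, 10)     # plays the role of samplelik1
s2 <- draw_lik(nrand, 20, 100)   # plays the role of samplelik2

# 95% interval for the product of the two sampled probabilities
ci <- quantile(s1 * s2, c(.025, .975))
ci
```

The interval should land close to the grid-based one, with small differences from Monte Carlo noise and the grid resolution.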

As a test, replacing the first likelihood lik1 with a narrower one, lik3 (200 X0 observations out of 1000 trials), gives a narrower interval.

lik3 <- function(p) { p^200 * (1-p)^800 }
clik3 <- cumsum(lik3(x))
samplelik3 <- findInterval(runif(nrand, max=clik3[length(clik3)]), clik3) / acc
quantile(samplelik3*samplelik2, c(.025,.975))
    2.5%      97.5% 
    0.02594400 0.05863703 

This can be visualized in a hacky way:

# Two stacked panels: density of each product, with the 2.5% and 97.5% quantiles in red
par(mfrow=c(2,1))
plot(density(samplelik1*samplelik2), xlim=c(0,0.2))
abline(v=quantile(samplelik1*samplelik2, c(.025,.975)), col="red")
plot(density(samplelik3*samplelik2), xlim=c(0,0.2))
abline(v=quantile(samplelik3*samplelik2, c(.025,.975)), col="red")

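[Figure: density of samplelik1*samplelik2 (top) and samplelik3*samplelik2 (bottom), with the 2.5% and 97.5% quantiles marked in red]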
