Solved – Generalizing Add-one/Laplacian Smoothing

Tags: bayesian, probability, smoothing

Let us assume we are estimating a proportion or rate of "hits". If we have $h$ hits and $m$ misses, the obvious estimator is

$\dfrac{h}{h + m}$

To avoid unreasonable estimates of $0$ or $1$ when our sample size is small, we can apply some (add-one/Laplacian) smoothing:

$\dfrac{h+1}{h+m+2}$
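For concreteness, here is a minimal Python sketch of the two estimators (the function names are my own, purely illustrative):

```python
def raw_estimate(hits, misses):
    # The obvious estimator h / (h + m); can return exactly 0 or 1
    # (and is undefined when there are no observations at all).
    return hits / (hits + misses)

def add_one_estimate(hits, misses):
    # Add-one (Laplacian) smoothing: (h + 1) / (h + m + 2).
    return (hits + 1) / (hits + misses + 2)

# With 3 hits and 0 misses the raw estimate is 1.0,
# while the smoothed estimate is 4/5 = 0.8.
print(raw_estimate(3, 0))      # 1.0
print(add_one_estimate(3, 0))  # 0.8
```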

I have read that this has a Bayesian interpretation of having a 50/50 prior over the hit-rate. A couple ideas about generalizing this spring to mind, but I'm uncertain as to the theory.

  1. Level of confidence

    If I'm very confident that the hit rates are 50/50, I could add $2$ instead of $1$ to the hits and misses. Or if I'm less confident, I could add $1/2$. What doesn't immediately make sense to me though is what the Bayesian interpretation (if any) is. Isn't the prior just $p = 0.5$, and that's that? Or is there a natural way to represent concentration? If so, what level of concentration does the Laplacian add-one smoothing correspond to? If not, why doesn't this variable-confidence scheme make sense?

  2. Different prior probabilities

    Instead of a uniform prior, we could have some other prior over the hit rate. To accomplish this, we could add some number to the hits and some number to the misses so that the proportion works out. However, I don't immediately know how to parameterize it. For instance, if I have a prior of $1/4$, should I add $0.5$ and $1.5$ to the hits and misses, or should I add $1$ and $3$? This ties into the previous question about level of confidence. I'd like to parameterize this so I can change the prior probability without altering the confidence (if such a concept makes sense; see the sketch after this list).
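To make the parameterization question concrete, here is a small Python sketch (names and defaults are my own, anticipating the pseudo-count view discussed in the answer below): fix a prior mean and a total pseudo-count, then split the pseudo-count between hits and misses in that proportion.

```python
def smoothed_estimate(hits, misses, prior_mean=0.5, pseudo_count=2.0):
    """Shrink the raw proportion toward `prior_mean`.

    `pseudo_count` is the total weight given to the prior;
    prior_mean=0.5, pseudo_count=2 reproduces add-one smoothing.
    """
    a = prior_mean * pseudo_count        # amount added to the hits
    b = (1 - prior_mean) * pseudo_count  # amount added to the misses
    return (hits + a) / (hits + misses + a + b)

# Prior mean 1/4: adding (0.5, 1.5) versus (1, 3) gives the same prior
# mean but a different strength of shrinkage toward 1/4.
print(smoothed_estimate(10, 2, prior_mean=0.25, pseudo_count=2))  # 10.5/14 = 0.75
print(smoothed_estimate(10, 2, prior_mean=0.25, pseudo_count=4))  # 11/16   = 0.6875
```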

Best Answer

The "uniform" in uniform prior doesn't just mean that hits and misses are equally likely. It means that you assume that you have a probability measure on the rates $[0,1]$ and this measure is the uniform measure. For example, it means the chance the true rate is between $0.9$ and $1.0$ is $0.1$. There are other measures on $[0,1]$ which have mean value $1/2$.

For example, if you start with the uniform prior and then observe $1$ hit and $1$ miss, the updated distribution is more concentrated around a rate of $1/2$ than the uniform prior. Instead of a probability of $\Delta r $ for the rate to be between $r$ and $r+\Delta r$, you would estimate the probability to be about $6 r (1-r) \Delta r$. (The factor of $6$ makes the total measure $1$.) This new distribution would also predict that the probability of getting a hit next time is $1/2$. If you observe an additional $h$ hits and $m$ misses, the expected probability of a hit is

$$\frac{h+2}{h+m+4}.$$

This is closer to $1/2$ than $\frac{h+1}{h+m+2}$. This agrees with the fact that you started with a distribution which was more concentrated near $1/2$. Of course, if you include that first hit and miss then there have been $h+1$ hits and $m+1$ misses in total.
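A quick numerical check of these claims, assuming SciPy's `scipy.stats.beta` is available (the specific numbers are just examples):

```python
from scipy.stats import beta

# After 1 hit and 1 miss, the uniform prior Beta(1, 1) updates to Beta(2, 2),
# whose density is 6 r (1 - r).
r = 0.3
print(beta.pdf(r, 2, 2))   # 1.26
print(6 * r * (1 - r))     # 1.26

# After h further hits and m further misses, the posterior is Beta(h+2, m+2),
# whose mean is (h + 2) / (h + m + 4).
h, m = 7, 3
print(beta.mean(h + 2, m + 2))  # 0.6428...
print((h + 2) / (h + m + 4))    # 0.6428...
```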

The distribution you get from the uniform prior by observing some number of hits and misses is called a beta distribution. The uniform distribution on $[0,1]$ is $\text{Beta}(1,1)$. Beta distributions have the nice property that if you update a beta distribution with an observation, the result is still in that family. From one observation, the $\text{Beta}(a,b)$ distribution is updated either to $\text{Beta}(a+1,b)$ or $\text{Beta}(a,b+1)$. The mean of a $\text{Beta}(a,b)$ distribution is $\frac{a}{a+b}$. It is perfectly reasonable to have a prior distribution which is not a beta distribution, but the answers might not come out as nicely.
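As a minimal sketch of this conjugate update (the function names are my own, not from any particular library):

```python
def update_beta(a, b, hit):
    # Conjugate update: one observation turns Beta(a, b) into
    # Beta(a+1, b) on a hit or Beta(a, b+1) on a miss.
    return (a + 1, b) if hit else (a, b + 1)

def beta_mean(a, b):
    # Mean of a Beta(a, b) distribution.
    return a / (a + b)

# Start from the uniform prior Beta(1, 1) and observe hit, hit, miss:
a, b = 1, 1
for hit in [True, True, False]:
    a, b = update_beta(a, b, hit)
print((a, b), beta_mean(a, b))  # (3, 2) 0.6, i.e. (h+1)/(h+m+2) with h=2, m=1
```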

    If I'm very confident that the hit rates are 50/50, I could add $2$ instead of $1$ to the hits and misses. Or if I'm less confident, I could add $1/2$.

That might or might not correspond to confidence. Adding larger pseudo-counts expresses a belief that the rate is likely to be close to $1/2$, not confidence in general. In some situations, you may believe the rate should be close to $0$ or close to $1$, and that the first trial should be very informative. You might believe that half of your students know how to solve a type of problem, and that if you give a random student $5$ such problems, the student is very likely to solve all $5$ or to solve none. This might be approximated by a $\text{Beta}(\epsilon,\epsilon)$ distribution, so that after $h$ correct and $m$ incorrect, the posterior mean is

$$\frac{h+\epsilon}{h+m+2\epsilon},$$

which, for small $\epsilon$, is already close to $0$ or $1$ after a single observation.
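A small numerical illustration of this behaviour (the value of $\epsilon$ here is arbitrary):

```python
def eps_posterior_mean(h, m, eps):
    # Posterior mean of the hit rate under a Beta(eps, eps) prior.
    return (h + eps) / (h + m + 2 * eps)

eps = 0.01
print(eps_posterior_mean(1, 0, eps))  # ~0.990: one hit pulls the mean near 1
print(eps_posterior_mean(0, 1, eps))  # ~0.010: one miss pulls it near 0
print(eps_posterior_mean(4, 1, eps))  # ~0.799: close to the raw proportion 4/5
```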
