[Math] Chebyshev’s inequality, variance and mean

calculus, statistics

I am trying to implement, as working code, the method described in paragraph 4.1 of this paper.

The problem:

Suppose we have words with, for instance, the following lengths:
$l_1 = 1$, $l_2 = 2$, $l_3 = 3$, $l_4 = 8$ and $l_5 = 7$.

These words will be part of the white-list.

We calculate the sample mean and the variance of the lengths of these words.

$\mu = \frac{1}{N}\sum_{i = 1}^N X_i$

So, $\mu = 4.2$ in our case.

Next step is to calculate the variance.

$\sigma^2 = \frac{1}{N}\sum_{i = 1}^N (X_i - \mu)^2$

So, $\sigma^2 = 7.76$
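
As a quick check of the two values above, here is a minimal Python sketch using the standard `statistics` module. `pvariance` divides by $N$, which matches the formula for $\sigma^2$; the variable names are only illustrative.

```python
from statistics import fmean, pvariance

# Word lengths from the white-list example above.
lengths = [1, 2, 3, 8, 7]

mu = fmean(lengths)              # sample mean, (1/N) * sum of the lengths
sigma2 = pvariance(lengths, mu)  # population variance, divides by N (not N - 1)

print(f"mu = {mu:.2f}, sigma^2 = {sigma2:.2f}")
```

Using `statistics.variance` instead would divide by $N - 1$ and give a different value from the one stated here.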

After these calculations are done, we get another list of words, and the goal of the algorithm is to assess how anomalous a string of length $l$ is by calculating the ''distance'' of $l$ from the mean $\mu$ of the length distribution.

This distance is expressed with the help of the Chebyshev inequality.

$p(\mid x-\mu \mid > t) < \frac{\mu^2}{t^2}$

When l is far away from $\mu$, considering the variance of the length distribution, then the probability of any (legitimate) string x having a greater length than l should be small.
Thus, to obtain a quantitative measure of the distance between a string of length l and the mean $\mu$ of the length distribution, we substitute t with the difference between $\mu$ and l.

$p(\mid x-\mu \mid > \mid l-\mu \mid) < p(l)=\frac{\sigma^2}{(l-\mu)^2}$

Given the above, if I evaluate the formula for the lengths 1, 5 and 10, I get these values:

p(1) = 0.757

p(5) = 12.125

p(10) = 0.230
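
The three values can be reproduced with a short Python sketch, assuming $\mu = 4.2$ and $\sigma^2 = 7.76$ from the white-list lengths. The function name `chebyshev_bound` is hypothetical and not taken from the paper, and the last printed digit may differ slightly from the figures listed here depending on rounding.

```python
def chebyshev_bound(length, mu, sigma2):
    """Return sigma^2 / (length - mu)^2, the right-hand side of the inequality."""
    t = abs(length - mu)
    if t == 0:
        # The bound is infinite (and so says nothing) when the length equals the mean.
        return float("inf")
    return sigma2 / t ** 2


# Mean and variance of the white-list lengths 1, 2, 3, 8, 7.
mu, sigma2 = 4.2, 7.76

bounds = {length: chebyshev_bound(length, mu, sigma2) for length in (1, 5, 10)}
for length, bound in bounds.items():
    print(f"p({length}) = {bound:.3f}")
```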

I don't understand why some of the values I get are bigger than 1; probabilities are not supposed to be bigger than 1. I am trying to understand whether the formulas described above are correct or whether I am using them wrongly.

Thank you.

Best Answer

You have stated the Chebyshev inequality incorrectly in one place, but corrected it later. It should be $p(\mid x-\mu \mid > t) < \frac{\sigma^2}{t^2}$, where $t$ has the same units as $x$.

$p(l)=\frac{\sigma^2}{(l-\mu)^2}$ is not the probability of length $l$, or even of length $l$ or greater. It is only an upper bound on the fraction of items that can lie that far from the mean. So yes, the Chebyshev inequality says that $p(5)<12.125$, which is a correct statement. If the mean were exactly $5$, the RHS would be infinite and $p(5)$ would still be less than that. The Chebyshev inequality is simply useless for items within $1$ standard deviation of the mean.

Your calculation for $p(10)$ is violated: 1 item in 3 is that far away, which is greater than $0.230$. Computing the mean and variance from the values $1, 5, 10$ themselves, I get $\sigma^2=\frac{122}{9}, \mu=\frac{16}{3}$, which leads to $p(10)=\frac{122}{196}\approx 0.622$.
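
The figures in this answer can be checked with exact rational arithmetic. The sketch below treats $1, 5, 10$ as the whole distribution, as the answer does. Capping the bound at $1$ is an added assumption, not something the answer prescribes. It is safe because no probability exceeds $1$, and it keeps the score in $[0, 1]$. The function name `capped_bound` is hypothetical.

```python
from fractions import Fraction


def capped_bound(length, mu, sigma2):
    """Chebyshev bound sigma^2 / (length - mu)^2, capped at 1 (cap is an assumption)."""
    t2 = (length - mu) ** 2
    if t2 == 0:
        return Fraction(1)  # raw bound is infinite, so only the trivial bound 1 remains
    return min(Fraction(1), sigma2 / t2)


data = [1, 5, 10]
n = len(data)
mu = Fraction(sum(data), n)                     # 16/3
sigma2 = sum((x - mu) ** 2 for x in data) / n   # 122/9, population variance

raw_10 = sigma2 / (10 - mu) ** 2                # 122/196, reduced to 61/98

# Fraction of items at least as far from the mean as 10 is.
observed = Fraction(sum(abs(x - mu) >= abs(10 - mu) for x in data), n)

print(mu, sigma2, raw_10, float(raw_10), observed)
print(capped_bound(5, mu, sigma2))              # raw bound would be 122
```

With these inputs the observed fraction $\frac{1}{3}$ stays below the bound $\frac{122}{196}$, as the inequality requires, whereas it exceeds the $0.230$ obtained from the white-list statistics.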
