Solved – Lower Bound on the Total Variation Distance between two Binomials

approximation, binomial distribution, distance, normal-approximation

Let $X \sim B(n,1/2)$ and $Y \sim B(n,1/2 + \delta)$, for a small $\delta >0$,
be two Binomial random variables.

Question 1.

I am looking for a lower bound on the Total Variation Distance
between the two Binomials $X,Y$.

My attempt at deriving a lower bound is the following:
Since $X, Y$ have large variance, we can approximate each of them
very well with a discretized Normal and then lower bound the total variation distance between the two Normals. My problem here is that I
am not sure how to go from discretized Normals to continuous Normals.

Question 2.

Given two discretized Normals, as defined in this paper, that are
at Total Variation distance $\epsilon$, is it true that the continuous
Normals with the same means and variances are also at total variation distance at most $\epsilon$?

Best Answer

The approach that seems most straightforward is to rewrite this problem in terms of an expectation and then bound the expectation. Start with $||X -Y||_{TV}=\sum_{x=0}^n |P_X(x)-P_Y(x)|$ (using the convention without the factor $1/2$) and rewrite the r.h.s.

$$\sum_{x=0}^n |{n\choose x}2^{-n}-{n\choose x}(1/2+\delta)^x (1/2-\delta)^{n-x}| = \sum_{x=0}^n {n\choose x}2^{-n}|1-2^{n}(1/2+\delta)^x (1/2-\delta)^{n-x}|$$ and then rewrite again by multiplying through the $2^n$ factor inside the absolute value,

$$\sum_{x=0}^n {n\choose x}2^{-n}|1-2^{n}(1/2+\delta)^x (1/2-\delta)^{n-x}|=\sum_{x=0}^n {n\choose x}2^{-n}\left|1-(1-2\delta)^{n}\left[\frac{(1+2\delta)}{(1-2\delta)}\right]^x\right|.$$

Now rewrite the last expression (the r.h.s.) as an expectation over $X\sim \mathrm{Binom}(n, 1/2)$ via:

$$\sum_{x=0}^n {n\choose x}2^{-n}\left|1-(1-2\delta)^{n}\left[\frac{(1+2\delta)}{(1-2\delta)}\right]^x\right|=\mathbb{E}_X\left|1-(1-2\delta)^{n}\exp[Xk]\right|,$$ where $k=\log\left(\frac{(1+2\delta)}{(1-2\delta)} \right)>0.$
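As a quick numerical sanity check of the rewriting above, the following sketch evaluates both sides for one choice of parameters. The function names and the values $n=50$, $\delta=0.05$ are illustrative assumptions, not part of the original answer.

```python
from math import comb, exp, log


def tv_sum_direct(n, delta):
    """Sum over x of |P_X(x) - P_Y(x)| for X ~ Bin(n, 1/2), Y ~ Bin(n, 1/2 + delta).

    This is the convention without the factor 1/2.
    """
    p = 0.5 + delta
    return sum(
        abs(comb(n, x) * 0.5**n - comb(n, x) * p**x * (1 - p) ** (n - x))
        for x in range(n + 1)
    )


def tv_sum_expectation(n, delta):
    """The same quantity written as E_X |1 - (1 - 2*delta)^n * exp(X*k)|, X ~ Bin(n, 1/2)."""
    k = log((1 + 2 * delta) / (1 - 2 * delta))
    return sum(
        comb(n, x) * 0.5**n * abs(1 - (1 - 2 * delta) ** n * exp(x * k))
        for x in range(n + 1)
    )


if __name__ == "__main__":
    n, delta = 50, 0.05  # illustrative values only
    print(tv_sum_direct(n, delta), tv_sum_expectation(n, delta))
```

The two printed values should agree up to floating-point rounding, since the second form is only an algebraic rearrangement of the first.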

Now use whichever bound you prefer. For example, the argument behind Markov's inequality yields, as one step, $\Pr(g(X) \geq g(a))\,g(a) \leq \mathbb{E}(g(X))$ for any function with $g(x)\geq 0$ for all $x$ and any $a \geq 0$. Here $g(x)=\left|1-(1-2\delta)^{n}\exp[xk]\right|$, which is nonnegative, so $\Pr(g(X) \geq 0)=1$. A conservative choice is $a=\lfloor np\rfloor$ with $p=1/2$. For even $n$ this gives $(1-2\delta)^{n}\exp[nk/2]=(1-4\delta^2)^{n/2}$, hence $g(n/2)=1-(1-4\delta^2)^{n/2}$. Throughout, $n \geq 1$ is an integer. Provided $0<\delta < \frac{1}{2}$, $g(0)=1-(1-2\delta)^{n}$ goes to 1 as $n\to \infty$.
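The Markov-type step above can be checked numerically. The sketch below computes $g(a)\Pr(g(X)\geq g(a))$ exactly by enumeration for $a=\lfloor n/2\rfloor$ and compares it with $\mathbb{E}\,g(X)$. The helper names and the parameter values are assumptions chosen for illustration. Enumerating the probability is only meant to show the inequality at work; it is not a closed-form bound.

```python
from math import comb, exp, log


def g(x, n, delta):
    """g(x) = |1 - (1 - 2*delta)^n * exp(x*k)| with k = log((1+2*delta)/(1-2*delta))."""
    k = log((1 + 2 * delta) / (1 - 2 * delta))
    return abs(1 - (1 - 2 * delta) ** n * exp(x * k))


def expectation_g(n, delta):
    """E g(X) for X ~ Bin(n, 1/2), i.e. the sum of |P_X(x) - P_Y(x)| over x."""
    return sum(comb(n, x) * 0.5**n * g(x, n, delta) for x in range(n + 1))


def markov_lower_bound(n, delta, a=None):
    """g(a) * Pr(g(X) >= g(a)), which is at most E g(X) by the Markov argument."""
    if a is None:
        a = n // 2  # floor(n * p) with p = 1/2
    threshold = g(a, n, delta)
    prob = sum(
        comb(n, x) * 0.5**n for x in range(n + 1) if g(x, n, delta) >= threshold
    )
    return threshold * prob


if __name__ == "__main__":
    n, delta = 50, 0.05  # illustrative values only
    print(markov_lower_bound(n, delta), expectation_g(n, delta))
```

The first printed number should never exceed the second. How tight the bound is depends on the choice of $a$.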

Related Solutions

Solved – Distribution of the Levenshtein distance between two random strings

I guess it is not a direct answer, but you can try to simulate the scenario and check the empirical distribution to have a rough idea (R code below).

So it seems that for strings longer than about 100 characters the distribution is symmetric and quite narrow around 0.53 times the length of the string.

# dependencies

library(ggplot2); theme_set(theme_classic())
library(parallel)
library(RecordLinkage)

# settings

alphabet <- c("A", "C", "G", "T")
Nsim <- 1e3
read_lengths <- seq(60, 500, 20)


# function to create a random string of length "n" using letters of the alphabet "alph"

random_read <- function(n, alph=alphabet) paste(sample(alph, size=n, replace=TRUE), collapse="")

# simulate

res <- mclapply(read_lengths,
                function(N) replicate(Nsim, levenshteinDist(random_read(N), random_read(N))),
                mc.cores=6)

# arrange results as data.frame

res_df <- data.frame(dist=unlist(res),
                     length=rep(read_lengths, sapply(res, length)))

# plot densities

ggplot(res_df,
       aes(x=dist / length, col=length, group=length)) +
  geom_density() +
  ggtitle("Distribution of Levenshtein distance / length")


ggplot(res_df,
       aes(x=length, y=dist / length, col=length)) +
  geom_violin(aes(group=length)) +
  geom_smooth(col="black", lwd=1) +
  ggtitle("Distribution of Levenshtein distance / length")

Solved – Quantify Difference/Distance between Lognormal distributions

For what purpose do you need the distance? For some purposes, like hypothesis testing or discrimination, the Kullback-Leibler divergence is useful, as it gives the expected value of the likelihood ratio statistic; see "Intuition on the Kullback-Leibler (KL) Divergence".

An expression for that distance in the lognormal case (and many others) can be found at http://www.mast.queensu.ca/~linder/pdf/GiAlLi13.pdf

I give the expression for the lognormal case below (from above paper): $$ D(f_i||f_j)= \frac1{2\sigma_j^2}\left[(\mu_i-\mu_j)^2+\sigma_i^2-\sigma_j^2\right] + \ln \frac{\sigma_j}{\sigma_i} $$
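A minimal sketch of the expression above, assuming $f_i$ and $f_j$ are lognormal densities with log-scale parameters $(\mu_i,\sigma_i)$ and $(\mu_j,\sigma_j)$. The function names are hypothetical. The numerical check integrates on the log scale $y=\ln x$, where the two densities become Normal. This relies on the Kullback-Leibler divergence being invariant under an invertible change of variable.

```python
from math import exp, log, pi, sqrt


def kl_lognormal(mu_i, sigma_i, mu_j, sigma_j):
    """Closed-form D(f_i || f_j) for lognormal densities, as in the expression above."""
    return ((mu_i - mu_j) ** 2 + sigma_i**2 - sigma_j**2) / (2 * sigma_j**2) + log(
        sigma_j / sigma_i
    )


def kl_numeric(mu_i, sigma_i, mu_j, sigma_j, steps=20000, width=12.0):
    """Midpoint-rule approximation of the same divergence on the log scale y = ln x."""
    lo = mu_i - width * sigma_i
    hi = mu_i + width * sigma_i
    h = (hi - lo) / steps
    total = 0.0
    for s in range(steps):
        y = lo + (s + 0.5) * h
        # Log-densities of Normal(mu, sigma^2) evaluated at y
        log_fi = -0.5 * ((y - mu_i) / sigma_i) ** 2 - log(sigma_i * sqrt(2 * pi))
        log_fj = -0.5 * ((y - mu_j) / sigma_j) ** 2 - log(sigma_j * sqrt(2 * pi))
        total += exp(log_fi) * (log_fi - log_fj) * h
    return total


if __name__ == "__main__":
    # Illustrative parameter values only
    print(kl_lognormal(0.0, 1.0, 1.0, 2.0), kl_numeric(0.0, 1.0, 1.0, 2.0))
```

Note that the divergence is not symmetric: swapping the roles of $i$ and $j$ generally gives a different value. This matters if a true distance is needed.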
