Solved – equivalent of t-test for Binomial/Poisson variables

p-value, poisson distribution, skellam-distribution, t-test

I need to estimate and explain conversion rates that can be extremely low, using a limited dataset.

Because I have very few observations, a normal approximation would give me a poor estimate: the population size times the conversion rate is too small for my binomial distributions to be close to normal.

So I was wondering what kind of test I could apply to compare these rates.

==> The question I need to answer: how confident are we that the conversion rate of A is higher than that of B?

I'm wary of using a t-statistic because I don't know how close we are to the normal regime. A typical example would be:

sample A = 100 000 tries, 20 successes
sample B = 100 000 tries, 15 successes

We assume Success(A) and Success(B) are independent binomial variables with parameters 100 000 and lambda(A) (resp. lambda(B)).

I thought of several variants:

I was thinking of setting H0 = {lambda(A) = lambda(B) = average conversion rate of both}
and testing with p-value = P(Success(A) - Success(B) > observed value), approximating A and B as Poisson.
In my example, under H0, lambda(A) = lambda(B) = 0.000175, and Success(A) - Success(B) follows a Skellam distribution. However, is there a way to compute its cumulative distribution function? Is my hypothesis on the average conversion rate too strong an assumption?

-> I guess I could also look for the lambda that maximizes the p-value, but that is even more complicated to solve theoretically.

-> I also wondered whether I should use a one-sided or a two-sided confidence interval.

Basically, I'm having trouble adapting the t-statistic method to a heteroskedastic, discrete variable, so I'm asking myself fundamental questions about p-values.

Any source on this (i.e. what happens before the central limit theorem comes into play) would also be welcome.

First post here; don't hesitate to tell me if another exchange is more suitable for my question.

Best Answer

In statistical terms, you observe two independent Binomial random variables $X_1 \sim \text{Bin}(n_1,p_1)$ and $X_2 \sim \text{Bin}(n_2,p_2)$ and want to test the null hypothesis $H_0 : p_1=p_2$. Fisher's exact test is appropriate here. In your example you have $n_1=n_2=100000$ and observe $X_1=20$ and $X_2=15$. The P-value can be computed in R as follows:

fisher.test(matrix(c(20,15,100000-20,100000-15),2,2))

giving $P=0.4995$ in your example. Since the number of trials (100000) in each case is large compared to the number of successes, the related test for Poisson random variables gives practically the same result:

poisson.test(c(20,15))

giving $P=0.4996$.
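One way to see why the Poisson version is so close: conditional on the total number of successes (35 here), the count from the first sample is binomial, and with equal numbers of trials the success probability under the null is 1/2. As far as I understand R's implementation, the two-sample poisson.test is this conditional binomial test, so the sketch below should give the same P-value with binom.test:

```r
# Two-sample Poisson comparison as reported by poisson.test
p_pois <- poisson.test(c(20, 15))$p.value

# Same comparison written as a conditional binomial test:
# given 20 + 15 = 35 successes in total, the first count is
# Binomial(35, 1/2) under H0 because both samples have equal exposure.
p_cond <- binom.test(20, 20 + 15, p = 0.5)$p.value

print(c(poisson = p_pois, conditional_binomial = p_cond))
```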

Edit: These computations are based on a two-sided alternative but can easily be adapted if a one-sided test is desired.
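A sketch of the one-sided versions, for the alternative that A converts better than B. Both functions accept an `alternative` argument. The last part evaluates the tail probability from the question directly: under the pooled rate each count is treated as Poisson with mean 17.5, and the Skellam tail P(A - B >= 5) is obtained by summing over the values of B. The pooled mean and the truncation point `b_max` are assumptions of this sketch.

```r
# One-sided Fisher exact test: H1 is that the odds of success are higher in A.
tab <- matrix(c(20, 15, 100000 - 20, 100000 - 15), 2, 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# One-sided Poisson comparison: H1 is rate(A) / rate(B) > 1.
p_pois <- poisson.test(c(20, 15), alternative = "greater")$p.value

# Skellam tail under the pooled rate (assumption: both counts ~ Poisson(17.5)).
# P(A - B >= d) = sum over b of P(B = b) * P(A >= b + d), truncated at b_max.
mu    <- 100000 * 0.000175
d     <- 20 - 15
b_max <- 200
b     <- 0:b_max
p_skellam <- sum(dpois(b, mu) * ppois(b + d - 1, mu, lower.tail = FALSE))

print(c(fisher = p_fisher, poisson = p_pois, skellam = p_skellam))
```

Note that the Skellam calculation fixes the common rate at its pooled estimate, whereas the Fisher and Poisson tests condition on the total number of successes, so the three numbers need not coincide exactly.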

Related Solutions

Negative Binomial Distribution – Distribution Describing the Difference Between Negative Binomial Distributed Variables

I don't know the name of this distribution, but you can derive it from the law of total probability. Suppose $X, Y$ are independent and have negative binomial distributions with parameters $(r_{1}, p_{1})$ and $(r_{2}, p_{2})$, respectively. I'm using the parameterization where $X, Y$ represent the number of successes before the $r_{1}$'th and $r_{2}$'th failures, respectively. Then,

$$ P(X - Y = k) = E_{Y} \Big( P(X-Y = k \mid Y) \Big) = E_{Y} \Big( P(X = k+Y \mid Y) \Big) = \sum_{y=0}^{\infty} P(Y=y)P(X = k+y) $$

We know

$$ P(X = k + y) = {k+y+r_{1}-1 \choose k+y} (1-p_{1})^{r_{1}} p_{1}^{k+y} $$

and

$$ P(Y = y) = {y+r_{2}-1 \choose y} (1-p_{2})^{r_{2}} p_{2}^{y} $$

so

$$ P(X-Y=k) = \sum_{y=0}^{\infty} {y+r_{2}-1 \choose y} (1-p_{2})^{r_{2}} p_{2}^{y} \cdot {k+y+r_{1}-1 \choose k+y} (1-p_{1})^{r_{1}} p_{1}^{k+y} $$

That's not pretty (yikes!). The only simplification I see right off is

$$ p_{1}^{k} (1-p_{1})^{r_{1}} (1-p_{2})^{r_{2}} \sum_{y=0}^{\infty} (p_{1}p_{2})^{y} {y+r_{2}-1 \choose y} {k+y+r_{1}-1 \choose k+y} $$

which is still pretty ugly. I'm not sure if this is helpful but this can also be re-written as

$$ \frac{ p_{1}^{k} (1-p_{1})^{r_{1}} (1-p_{2})^{r_{2}} }{ (r_{1}-1)! (r_{2}-1)! } \sum_{y=0}^{\infty} (p_{1}p_{2})^{y} \frac{ (y+r_{2}-1)! (k+y+r_{1}-1)! }{y! (k+y)! } $$

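I'm not sure if there is a simplified expression for this sum, but it can be approximated numerically if you only need it to calculate $p$-values. (The factorial form assumes integer $r_{1}, r_{2}$ and $k \ge 0$; for $k < 0$, $P(X = k+y) = 0$ whenever $y < -k$, so the sum effectively starts at $y = -k$.)

I verified by simulation that the above calculation is correct. Here is a crude R function to calculate this mass function, followed by a few simulations: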

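  # P(X - Y = k) from the factorial form of the sum (integer r1, r2; k >= 0).
  # p1, p2 follow the parameterization of the formulas; UB truncates the sum.
  f <- function(k, r1, r2, p1, p2, UB)
  {
    # Factor that does not depend on y
    const <- (p1^k) * ((1 - p1)^r1) * ((1 - p2)^r2)
    const <- const / (factorial(r1 - 1) * factorial(r2 - 1))

    # Truncated sum over y = 0, ..., UB
    S <- 0
    for (y in 0:UB)
    {
      iy <- ((p1 * p2)^y) * factorial(y + r2 - 1) * factorial(k + y + r1 - 1)
      iy <- iy / (factorial(y) * factorial(y + k))
      S <- S + iy
    }

    return(S * const)
  }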

  ### Sims (printed results shown as comments)
  r1 <- 6; r2 <- 4
  p1 <- .7; p2 <- .53
  X <- rnbinom(1e5, r1, p1)
  Y <- rnbinom(1e5, r2, p2)
  mean((X - Y) == 2)
  # [1] 0.08508
  f(2, r1, r2, 1 - p1, 1 - p2, 20)
  # [1] 0.08509068
  mean((X - Y) == 1)
  # [1] 0.11581
  f(1, r1, r2, 1 - p1, 1 - p2, 20)
  # [1] 0.1162279
  mean((X - Y) == 0)
  # [1] 0.13888
  f(0, r1, r2, 1 - p1, 1 - p2, 20)
  # [1] 0.1363209

I've found that the sum converges very quickly for all of the values I tried, so setting UB higher than 10 or so is not necessary. Note that R's built-in rnbinom function parameterizes the negative binomial in terms of the number of failures before the $r$'th success, so for compatibility you need to replace every $p_{1}, p_{2}$ in the above formulas with $1-p_{1}, 1-p_{2}$, as in the calls to f above.
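As a cross-check, the same mass function can be sketched with R's built-in dnbinom, which avoids large factorials and also copes with negative $k$ and non-integer $r$. The names `nb_diff_pmf` and `UB` are hypothetical, and here p1, p2 are in R's own parameterization (the same values passed to rnbinom), so no $1-p$ substitution is needed:

```r
# P(X - Y = k) for independent X ~ NB(r1, p1) and Y ~ NB(r2, p2),
# using R's parameterization (as in rnbinom / dnbinom).
nb_diff_pmf <- function(k, r1, r2, p1, p2, UB = 200) {
  # X = k + y must be non-negative, so the sum starts at max(0, -k)
  # and runs over UB + 1 terms.
  y <- max(0, -k) + 0:UB
  sum(dnbinom(y, size = r2, prob = p2) * dnbinom(k + y, size = r1, prob = p1))
}

# Same settings as the simulation: k = 0, 1, 2
print(sapply(0:2, nb_diff_pmf, r1 = 6, r2 = 4, p1 = 0.7, p2 = 0.53))
```

Up to truncation error, these values should agree with f(k, r1, r2, 1 - p1, 1 - p2, 20) for the same k.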

Solved – logloss equivalent for poisson regression

When fitting a GLM, the deviance is something you'd like to see as low as possible. I believe that for a binomial GLM the binomial deviance is essentially the log loss (for binary responses it is twice the summed log loss). If you run a Poisson GLM, the Poisson deviance should be the number you're looking for. It is printed by default by glm() in R.

Check out what it actually is here.
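For reference, the Poisson deviance is $D = 2\sum_i \big[ y_i \log(y_i/\hat\mu_i) - (y_i - \hat\mu_i) \big]$, with the convention that $y_i \log(y_i/\hat\mu_i) = 0$ when $y_i = 0$. A small sketch on simulated data (the data and variable names are made up for illustration) that computes it by hand and compares it with what glm() reports:

```r
set.seed(1)
x <- runif(200)
y <- rpois(200, lambda = exp(0.5 + x))

fit <- glm(y ~ x, family = poisson)
mu  <- fitted(fit)

# Term y * log(y / mu), defined as 0 when y == 0
ylogy <- ifelse(y == 0, 0, y * log(y / mu))
dev_manual <- 2 * sum(ylogy - (y - mu))

print(c(manual = dev_manual, glm = deviance(fit)))
```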

A bonus of using deviance is that you can use it to compare nested models via a $\chi^2$ test.
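A sketch of such a comparison for two nested Poisson models, again on simulated data invented for the example. anova() with test = "Chisq" refers the drop in deviance to a $\chi^2$ distribution whose degrees of freedom equal the number of added parameters; the pchisq line repeats the same calculation by hand:

```r
set.seed(2)
x1 <- runif(300)
x2 <- runif(300)
y  <- rpois(300, lambda = exp(0.2 + 0.8 * x1))   # x2 has no real effect

fit_small <- glm(y ~ x1,      family = poisson)
fit_big   <- glm(y ~ x1 + x2, family = poisson)

# Analysis-of-deviance table for the nested pair
tab <- anova(fit_small, fit_big, test = "Chisq")
print(tab)

# Same test by hand: drop in deviance against chi-square on 1 df
drop_dev <- deviance(fit_small) - deviance(fit_big)
p_manual <- pchisq(drop_dev, df = 1, lower.tail = FALSE)
print(p_manual)
```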
