Hypothesis Testing – Finding Distribution of the Statistic and Critical Region of the Generalized Test at Level $\alpha$ for Two Sample Test

hypothesis testingself-study

Let $X_i$ ($i=1,\dots, n$) be a random sample from $X\sim \exp(\lambda_1)$ and $Y_j$ ($j=1,\dots, m$) be a random sample from $Y\sim \exp(\lambda_2)$, and $X$ and $Y$ be independent. I try to find the generalized test of $H_0: \lambda_1=\lambda_2$ v.s. $H_1: \lambda_1\neq \lambda_2$. Find the distribution of the statistic and the critical region of the generalized test at level $\alpha$.


My work: The likelihood function is that for $\theta=(\lambda_1, \lambda_2)$
$$
L(\theta)=\lambda_1^n\lambda_2^m \exp(-n\lambda_1\bar{X}-m\lambda_2\bar{Y})
$$

where $\bar{X}$ and $\bar{Y}$ are sample mean.

Then I want to get the likelihood ratio statistic:
\begin{align}
\Lambda(x) &= \frac{\sup_{\theta=\theta_0}L(\theta\mid X)}{\sup_{\theta\neq\theta_0}L(\theta\mid X)}
\end{align}

The global MLE are $\hat{\lambda}_1=\frac{1}{\bar{X}}$ and $\hat{\lambda}_2=\frac{1}{\bar{Y}}$. The restricted MLE for $\lambda_1=\lambda_2$ is $$\lambda_0=\frac{m+n}{n\bar{X}+m\bar{X}}$$

So we have
$$
\Lambda=\frac{(m+n)^{m+n}}{n^nm^m}[\frac{n\bar{X}}{n\bar{X}+m\bar{Y}}]^n[1-\frac{n\bar{X}}{n\bar{X}+m\bar{Y}}]^m
$$

So we take the statistic $$T=\frac{n\bar{X}}{n\bar{X}+m\bar{Y}}$$

So to find the critical region, we need $$\Lambda=CT^n(1-T)^m\le \lambda_0$$

From here, I am not sure how to solve that. It seems that $CT^n(1-T)^n$ is decreasing if $T\le n/(n+m)$ and increasing is $T\ge n/(m+n)$. (since $g'(T)=T^{n-1}(1-T)^{m-1}[n-(m+n)T]$)

So $\Lambda\le \lambda_0$ (we will reject $H_0$) is equivalent as $$c_1\le T\le c_2$$ for some constants $0<c_1<c_2$ and it satisfy $$c_1^n(1-c_1)^m=c_2^n(1-c_2)^m$$

For the test level $\alpha$, we also need
$$
P(c_1\le T\le c_2)=\alpha
$$

(the probability that we reject $H_0$).

The distribution of $T$: under $H_0$ we have $\sum X_i\sim Gamma(n,1/\lambda)$ and $\sum X_i+\sum Y_j\sim Gamma(n+m,1/\lambda)$. Then
$$
\frac{m+n}{n}T\sim \frac{\chi^2(2n)/(2n)}{\chi^2(2(m+n))/(2(m+n))}\sim F(2n, 2(m+n)).
$$

But what is the critical region?

Best Answer

Your problem is related to the F-test for equality of variances as the sum of exponential distributions that you have are like variances of normal distributed variables.

We can instead use

$$F = \frac{\bar{Y}}{\bar{X}} = \frac{n}{m} (T^{-1} -1) \sim F(2m,2n)$$

This is the distribution of $F$ if the null hypothesis is correct.

If the hypothesis $\lambda_1 = \lambda_2$ is wrong then the distribution will be like a scaled F-distribution. Or more easily we use Fisher's z- distribution for the statistic $Z = 0.5 \log F$, where the alternative hypothesis is a shift of the distribution.

$$f_Z(z;d_1=2m,d_2=2n) = \frac{2 d_1^{d_1/2}d_2^{d_2/2}}{B(d_1/2,d_2/2)} \frac{e^{d_1 z}}{(d_1e^{2z}+d_2)^{(d_1+d_2)/2}}$$

where $B$ is the beta function.

Why do I suggest to use the statistic $Z$ that follows Fisher's z-distribution?

  • Because the alternative hypothesis $\lambda_1 \neq \lambda_2$ relates to a shift of the distribution and the likelihood ratio is equal to $$\Lambda(z) = \frac{f_Z(z;2m,2n)}{f_Z(0;2m,2n)}$$ $f_Z(z;2m,2n)$ is the likelihood when we use $\lambda_0$ and $f_Z(0;2m,2n)$ (the peak of the z-distribution) is the likelihood when we use independent $\lambda_1 = 1/\bar{X}$ and $\lambda_2 = 1/\bar{Y}$.

  • The effect is that the critical region for the statistic $Z$ can be found by using the highest density region for the distribution $f_Z$

Demonstration with code:

Say that $m=1$ and $n=5$, then the boundaries are $$\begin{array}{rcccl} &&Z & \in& [-1.628 , 0.971] \\ e^{2Z}&=& F& \in& [0.03857, 6.98385] \\ (1+\frac{m}{n}F)^{-1}&=& T& \in& [0.4172 , 0.9923] \end{array}$$

example with m=1 and n=5

m = 1
n = 5
set.seed(1)


### Fisher's z-distribution density
dz = function(z,d1,d2) {   
  2*d1^(d1/2) * d2^(d2/2) * exp(d1*z) / beta(d1/2,d2/2) / (d1*exp(2*z)+d2)^((d1+d2)/2)
}
dz = Vectorize(dz)

### compute and plot null-distribution of z
z = seq(-4,4,0.0001)
delta = 0.0001
f = dz(z,2*m,2*n)
plot(z,f, type = "l", main = "null-distribution of z \n with 95% highest density boundary", lwd = 2)

### compute 95% highest density region 
### by ranking the densities 
### and select the lowest that sum up to 5%
ord = order(f)
rejectregion = which(cumsum(f[ord]*delta)<=0.05)
zb = range(z[ord][-rejectregion])

lines(c(zb[1],zb[1]),c(0,1), lty = 2, col = 2, lwd = 2)
lines(c(zb[2],zb[2]),c(0,1), lty = 2, col = 2, lwd = 2)

###### computation of boundary values
###### for different statistics z, F and T

### -1.6276  0.9718
zb

### 0.03857311 6.98384762
F = exp(2*zb)
F

### 0.9923444 0.4172283
T = 1/(1+m/n*F)
T

### check with the formula
### both are around 0.007367
T[1]^n*(1-T[1])^m
T[2]^n*(1-T[2])^m


### computational test

sample = function(m,n) {
   X = sum(rexp(n,1))
   Y = sum(rexp(m,1))
   T = X/(X+Y)
   return(T)
}

Tsample = replicate(10^5,sample(m,n))
hist(Tsample, breaks = seq(0,1,0.01), main = "histogram of T with 95% boundary lines")

lines(c(T[1],T[1]),c(0,100000), lty = 2, col = 2, lwd = 2)
lines(c(T[2],T[2]),c(0,100000), lty = 2, col = 2, lwd = 2)


x = seq(0,1,0.01)
lines(x,dbeta(x,n,m)*10^5/100,col = 3)

### check 95% 0.9496603
pbeta(T[1],n,m)-pbeta(T[2],n,m)