Likelihood Ratio Test – Understanding the Likelihood Ratio Test for Specific Distributions in Hypothesis Testing

hypothesis-testing, intuition, likelihood-ratio

I'm just now being introduced to likelihood-ratio tests (LRT), and I am having trouble following the concept and terminology.

For example, I posed a question about determining whether two samples $\{x\}$ and $\{y\}$ came from the same probability distribution, meaning that the distribution's parameter is the same for the two sampled variables, i.e. $\sigma_x = \sigma_y$. The answer to that question refers to "the likelihood-ratio test for the" distribution, and later to "the likelihood ratio calculated here under the [specific distribution] assumption." I can't figure out what exact calculations either of those phrases refers to.

Now, I'm used to thinking in hypothesis terms, so if we say that the null hypothesis $H_0$ is the scenario in which $\sigma_x = \sigma_y$, then its complement $H_1$ is the scenario in which $\sigma_x \neq \sigma_y$. I think I can handle that (setting aside some ambiguity regarding the symmetry of Type I and Type II errors).

So, somehow we can take this question of hypotheses into likelihood-ratio space by comparing the likelihood of $H_0$ given the sample data, written $\mathcal{L}(H_0 \mid \{x,y\})$, to … well, the likelihood of observing the sample data under every parameter pair $(\sigma_x, \sigma_y)$ in which $\sigma_x \neq \sigma_y$?!

I gather that likelihood ratios express something more powerful than a hypothesis and its complement, but I haven't found an explanation I can follow back to anything concrete that I understand.

Best Answer

A likelihood ratio test is just a particular type of hypothesis test where the test statistic is obtained in a specific way.

They arise out of Neyman and Pearson's attempt to find a way to obtain "good" test statistics (in the sense of the resulting tests having high power).

So the point of likelihood ratio tests is that - given a model - it's a fairly convenient method to obtain a test statistic that should be efficient or at least pretty close to efficient in a very broad range of situations.

Indeed, for specific situations I won't detail here (I imagine you have probably already come across the conditions in your reading), they should be uniformly most powerful; see the Neyman-Pearson lemma and the Karlin-Rubin theorem. However, uniformly most powerful tests won't exist for your two-sided alternative. I don't want to get too far into the woods here, so let's for the moment take it as read that we're simply after a good choice of test statistic, and that the likelihood ratio often provides one.

I'll adopt a particular convention (null likelihood on the numerator, which has at least one small advantage) but some people will write the ratio the other way up; the choice makes no difference, mutatis mutandis.

The likelihood in the numerator is evaluated with the free parameters replaced by their MLEs under the null (the MLE of the common $\sigma$ in your Rayleigh case), and similarly the likelihood in the denominator is evaluated with the free parameters replaced by their MLEs under the alternative (the MLEs of the individual-sample $\sigma$ values). This ratio can then be compared with its sampling distribution under $H_0 \,^{[1]}$ (the rejection region is small values of the likelihood ratio, i.e. the left tail of the distribution of $\Lambda$) -- the advantage I mentioned earlier is that the area in the critical region for a left-tail rejection rule is simply the cdf evaluated at the chosen critical value.
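In symbols, writing $\Theta_0$ for the parameter set allowed under the null and $\Theta$ for the unrestricted parameter set (my notation, nothing special to this problem), the statistic under this convention is

$$\Lambda = \frac{\sup_{\theta \in \Theta_0} \mathcal{L}(\theta \mid \text{data})}{\sup_{\theta \in \Theta} \mathcal{L}(\theta \mid \text{data})},$$

which necessarily lies between 0 and 1 when the null is nested in the alternative, since the numerator maximizes over a subset of the parameter values available to the denominator.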

Now why does this make intuitive sense? Note that when $H_0$ is a "good explanation" for the data, its maximized likelihood will be nearly as high as the maximized likelihood under $H_1$ (in the case we're discussing, where $H_1$ has an extra free parameter that is constrained under $H_0$, the maximized likelihood under $H_1$ can never be lower than that under $H_0$). So if $H_0$ is a good explanation, the likelihood ratio should be close to 1. If $H_0$ is a very poor explanation for the data compared to $H_1$, then the ratio will be very small and we will end up in the far left tail of the null distribution of $\Lambda$. So as long as the likelihood does a good job of summarizing the information in a sample about a parameter (see a good book on statistical inference for why that's very often the case), this should be expected to work quite well as a test.

The only difficulty is in evaluating the distribution of said likelihood ratio under $H_0$.

In large samples we don't particularly care, because we can use Wilks' theorem (minus twice the log of the likelihood ratio should be asymptotically chi-squared, with, loosely speaking, degrees of freedom equal to the change in the number of free parameters in moving from the null to the alternative, where the null is nested within the alternative).
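As a minimal sketch of how that large-sample approximation gets applied in practice (assuming a variable logLR holds the value of $\log \Lambda$ computed as in the code further down, and that the alternative has exactly one extra free parameter, as in the two-sample problem here):

    # Wilks' approximation: -2*log(Lambda) is approximately chi-squared under H0,
    # with df = number of extra free parameters under the alternative (here 1)
    W <- -2 * logLR
    p.value <- pchisq(W, df = 1, lower.tail = FALSE)  # reject H0 for large W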

However, what if we have small samples? In "nice" cases the exact distribution of the likelihood ratio can be derived, or the distribution of a monotonic transformation of it might be straightforward (yielding an equivalent test in the one-tailed case, and sometimes in the two-tailed case -- but typically very nearly equivalent otherwise).


Let's address the LRT for the specific situation in question, random samples from two Rayleigh distributed populations (/data-generating processes).

I suggested looking at the squared values, which will then be exponentially distributed. This is not only a little simpler algebraically, it has the advantage of being a case I had already played with. Being a monotonic transformation, it's not going to change anything; to be more precise, some things (like rejection regions) will be equivariant under a monotonic transformation and some things (like Type I error rates) will be invariant. You can convert back and forth between the exponential and the Rayleigh values at will, and the likelihood ratio tests will be equivalent (both will reject, or both will fail to reject, on the same samples).
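For reference, here is the change of variables behind that suggestion, assuming the Rayleigh density is parameterized by the scale $\sigma$ (the exact constant depends on your parameterization). If $X$ has density $f_X(x) = \frac{x}{\sigma^2}\, e^{-x^2/(2\sigma^2)}$ for $x > 0$ and $Y = X^2$, then

$$f_Y(y) = f_X(\sqrt{y})\,\frac{1}{2\sqrt{y}} = \frac{1}{2\sigma^2}\, e^{-y/(2\sigma^2)}, \qquad y > 0,$$

so $Y$ is exponential with mean $2\sigma^2$, and equality of the two Rayleigh $\sigma$'s is equivalent to equality of the two exponential means.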

Now under $H_0$ both samples come from distributions that have the same parameter, so the likelihood of the two samples under $H_0$ is obtained by treating them as a single sample, estimating the parameter via maximum likelihood and evaluating the likelihood of the sample at that parameter value. (In what follows $x$s are values from sample 1 and $y$s are values from sample 2; keep in mind these are squares of the original Rayleigh-distributed values.)

That is, the likelihood in the numerator is $\prod_i f(x_i;\hat{\theta}_0) \cdot \prod_j f(y_j;\hat{\theta}_0)$, where $f$ is the density for the distributional model (e.g. an exponential density), and $\hat{\theta}_0$ is the MLE of the parameter for the combined sample (since under $H_0$ the parameters are the same). That is, you evaluate the density of each data point with the parameter value set to the MLE for the combined sample and take the product of likelihoods across all data points (it's a product because of independence, and we're evaluating joint densities in the first place because of how likelihood is defined).
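For the exponential case specifically, if we write the density in terms of the mean $\theta$, so $f(t;\theta) = \theta^{-1} e^{-t/\theta}$ (this matches the rate parameterization used in the code below, with rate $= 1/\theta$), the combined-sample MLE that goes into the numerator is just the pooled mean,

$$\hat{\theta}_0 = \frac{\sum_i x_i + \sum_j y_j}{n_x + n_y},$$

so the MLE of the rate is $1/\hat{\theta}_0$, which is the 1/m0 appearing in the code below.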

I did this (or rather, I calculated its log) in my R code on your previous question in the line logL0=sum(dexp(c(x,y),1/m0,log=TRUE)).

Translating: c(x,y) simply combines the two samples into a single data set; m0, calculated previously, is the combined-sample mean, so 1/m0 is the MLE of the rate parameter of an exponential; dexp evaluates the (log of, because of log=TRUE) exponential density at each data value. The joint likelihood is the product of the individual likelihoods (because of independence), so the log-likelihood is the sum of the individual-observation log-likelihoods.

So that's the numerator.

The denominator has each sample treated separately - one parameter estimate for each sample, their joint likelihood across both samples being the product of the two samples' own joint likelihoods (across observations in that sample).

That is, $\prod_i f(x_i;\hat{\theta}_1) \cdot \prod_j f(y_j;\hat{\theta}_2)$, where $\hat{\theta}_1$ and $\hat{\theta}_2$ are the individual-sample MLEs of the parameters. This is because under $H_1$ the parameters are not held to be the same and so will (almost always, for continuous variables) differ.
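In the exponential case, with the same mean parameterization as above, these are just the individual sample means,

$$\hat{\theta}_1 = \bar{x} = \frac{1}{n_x}\sum_i x_i, \qquad \hat{\theta}_2 = \bar{y} = \frac{1}{n_y}\sum_j y_j,$$

with corresponding rate estimates $1/\bar{x}$ and $1/\bar{y}$ (the 1/mx and 1/my in the code below).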

In my R code from before, that's the line:
logL1=sum(dexp(x,1/mx,log=TRUE))+sum(dexp(y,1/my,log=TRUE))
(Again, this evaluates the log of the likelihood, so the products within and across samples, which we take because of independence, become sums.) Note that mx is just the mean of sample 1, and my the mean of sample 2.

So the log of the likelihood ratio is just the difference of the two log-likelihoods (null minus alternative). We have computed the numeric value for the log of $\Lambda$ evaluated at the two samples.

If we want the likelihood ratio itself we could exponentiate to get it, but for several reasons it's usually a good idea to remain on the log-scale.
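Putting the pieces together, here is a small self-contained sketch along the lines of the code quoted above. The sample sizes, the true $\sigma$ values, and the way I simulate the Rayleigh draws (as square roots of exponentials, using the transformation discussed earlier) are illustrative choices of mine, not part of the original question.

    set.seed(1)

    # simulate two Rayleigh(sigma) samples; since X^2 ~ exponential with mean
    # 2*sigma^2, we can simulate X as the square root of such an exponential
    nx <- 30; ny <- 40
    sigx <- 1.0; sigy <- 1.3
    x.ray <- sqrt(rexp(nx, rate = 1/(2*sigx^2)))
    y.ray <- sqrt(rexp(ny, rate = 1/(2*sigy^2)))

    # work with the squared values, which are exponential
    x <- x.ray^2
    y <- y.ray^2

    # MLEs of the exponential mean: pooled under H0, separate under H1
    m0 <- mean(c(x, y))
    mx <- mean(x)
    my <- mean(y)

    # log-likelihoods under H0 (one common mean) and H1 (separate means)
    logL0 <- sum(dexp(c(x, y), 1/m0, log = TRUE))
    logL1 <- sum(dexp(x, 1/mx, log = TRUE)) + sum(dexp(y, 1/my, log = TRUE))

    # log of the likelihood ratio (null over alternative); always <= 0
    logLR <- logL0 - logL1

    # -2*logLR can now be referred to its null distribution, e.g. via the
    # chi-squared approximation shown earlier, or by simulation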

Now in the exponential-distribution case it turns out that this likelihood ratio is a simple function of the ratio of sample means (unsurprisingly, since the exponential distribution is in the exponential family). In my answer to your previous question I linked another answer and also referred to Scortchi's answer to that linked question; they discuss this right near the top of the answer.
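To sketch why, plug the exponential MLEs back into the log-likelihoods above; writing $\bar{x}$, $\bar{y}$ for the two sample means and $\bar{m}_0 = (n_x\bar{x} + n_y\bar{y})/(n_x + n_y)$ for the pooled mean, the constant terms cancel and

$$\log\Lambda = n_x \log\frac{\bar{x}}{\bar{m}_0} + n_y \log\frac{\bar{y}}{\bar{m}_0},$$

which, for given sample sizes, depends on the data only through the ratio $\bar{x}/\bar{y}$.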

So for a one-tailed test, a test based on the ratio of sample means corresponds exactly to the LRT, and we can compute its distribution under $H_0$ (it's an $F$ distribution, as discussed in my answer to the question I linked from your other question).
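As a sketch of that $F$-based calculation on the exponential (squared) values: under the exponential model $2\sum_i x_i/\theta \sim \chi^2_{2n_x}$, so under $H_0$ the ratio of sample means satisfies $\bar{x}/\bar{y} \sim F_{2n_x,\,2n_y}$. Continuing with the variable names from the code above, a one-tailed test against the alternative that sample 1 has the larger mean might look like:

    # one-tailed test: reject H0 for large values of the ratio of sample means
    Fstat <- mx / my
    p.one.tailed <- pf(Fstat, df1 = 2*nx, df2 = 2*ny, lower.tail = FALSE)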

This saves us having to compute critical values for the likelihood ratio itself (though we can use simulation to do so when it's not tractable, it's quite simple to do).

For a two-tailed test, putting $\alpha/2$ in each tail of the $F$ distribution of the ratio of means won't correspond exactly to the likelihood ratio test, as mentioned (and demonstrated by simulation), and I noted that depending on the situation you might sometimes prefer the equal-tail-area version even though it may be slightly less powerful. In large samples none of this matters much, and in equal-sized samples I believe the $F$ test and the LRT are equivalent anyway.

If you wanted to use this test statistic in a permutation test, there's no additional difficulty; you just need to evaluate the likelihood-ratio statistic (or its log) on your samples and then apply the same function to pseudo-samples obtained by randomly re-allocating observations between the two groups (justified by exchangeability under $H_0$).
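Here is a minimal sketch of such a permutation test, reusing the quantities from the code above; the helper-function name and the number of permutations are just illustrative choices.

    # log likelihood-ratio statistic as a function of the two (squared) samples
    logLR.stat <- function(x, y) {
      m0 <- mean(c(x, y)); mx <- mean(x); my <- mean(y)
      sum(dexp(c(x, y), 1/m0, log = TRUE)) -
        (sum(dexp(x, 1/mx, log = TRUE)) + sum(dexp(y, 1/my, log = TRUE)))
    }

    obs    <- logLR.stat(x, y)
    pooled <- c(x, y)
    nperm  <- 10000
    perm <- replicate(nperm, {
      idx <- sample(length(pooled), length(x))   # randomly re-allocate observations
      logLR.stat(pooled[idx], pooled[-idx])
    })

    # small values of the (log) likelihood ratio are evidence against H0
    p.perm <- mean(perm <= obs)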


[1] The distribution of the test statistic is evaluated under $H_0$, as with any typical hypothesis test.