Preliminaries:
As discussed in the G*Power manual, there are several different types of power analyses, depending on what you want to solve for. (That is, $N$, the effect size $ES$, $\alpha$, and power exist in relation to each other; specifying any three of them will let you solve for the fourth.)
- in your description, you want to know the appropriate $N$ to capture the response rates you specified with $\alpha=.05$, and power = 80%. This is a-priori power.
- we can start with post-hoc power (determining power given $N$, the response rates, & alpha), as this is conceptually simpler, and then build up to the a-priori case
In addition to @GregSnow's excellent post, another really great guide to simulation-based power analyses on CV can be found here: Calculating statistical power. To summarize the basic ideas:
- figure out the effect you want to be able to detect
- generate N data from that possible world
- run the analysis you intend to conduct over those faux data
- store whether the results are 'significant' according to your chosen alpha
- repeat many ($B$) times & use the % 'significant' as an estimate of (post-hoc) power at that $N$
- to determine a-priori power, search over possible $N$'s to find the value that yields your desired power
Whether you will find significance on a particular iteration can be understood as the outcome of a Bernoulli trial with probability $p$ (where $p$ is the power). The proportion found over $B$ iterations allows us to approximate the true $p$. To get a better approximation, we can increase $B$, although this will also make the simulation take longer.
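To make this concrete, here is a small sketch (the 'true' power of .60 is invented purely for illustration) showing how the Monte Carlo standard error of the power estimate shrinks as $B$ grows:

```r
set.seed(1)
p = .60                                  # hypothetical true power
for(B in c(100, 1000, 10000)){
  sims  = rbinom(n=B, size=1, prob=p)    # 1 = 'significant' on that iteration
  p.hat = mean(sims)                     # estimated power
  se    = sqrt(p.hat*(1-p.hat)/B)        # Monte Carlo standard error
  cat("B =", B, "  estimate =", p.hat, "  SE =", round(se, 4), "\n")
}
```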
In R, the primary way to generate binary data with a given probability of 'success' is ?rbinom
- E.g., to generate the outcomes of 10 Bernoulli trials, each with success probability p, the code would be
rbinom(n=10, size=1, prob=p)
(you will probably want to assign the result to a variable for storage)
- you can also generate such data less elegantly by using ?runif, e.g.,
ifelse(runif(1)<=p, 1, 0)
- if you believe the results are mediated by a latent Gaussian variable, you could generate the latent variable as a function of your covariates with ?rnorm, convert it into probabilities with pnorm(), and then use those probabilities in your rbinom() code
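That latent-variable approach might be sketched as follows (the intercept and slope are arbitrary, purely for illustration; note that using pnorm() this way corresponds to a probit data-generating process, whereas plogis() would give a logistic one):

```r
set.seed(90)
x      = rnorm(100)                      # a covariate
latent = -1 + .5*x + rnorm(100)          # latent Gaussian variable (arbitrary coefficients)
p      = pnorm(latent)                   # convert to probabilities
y      = rbinom(n=100, size=1, prob=p)   # observed binary responses
```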
You state that you will "include a polynomial term (Var1*Var1) to account for any curvature". There is a confusion here: polynomial terms can help us account for curvature, but this is an interaction term--it will not help us in this way. Nonetheless, your response rates require us to include both squared terms and interaction terms in our model. Specifically, your model will need to include: $var1^2$, $var1*var2$, and $var1^2*var2$, beyond the basic terms.
- Although written in the context of a different question, my answer here: Difference between logit and probit models has a lot of basic information about these types of models.
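In R's formula syntax, those terms can be specified inline with I() for the squares and : for the interactions (the data below are made up solely so the call runs; the simulation further down constructs the same terms manually instead):

```r
set.seed(7)
var1      = runif(200)                 # fake predictor values, for illustration only
var2      = rbinom(200, size=1, prob=.5)
responses = rbinom(200, size=1, prob=.05)
# I() protects arithmetic inside a formula; ':' denotes an interaction
model = glm(responses ~ var1 + var2 + I(var1^2) + var1:var2 + I(var1^2):var2,
            family=binomial(link="logit"))
length(coef(model))   # 6 coefficients: the intercept plus the 5 terms above
```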
Just as there are different kinds of Type I error rates when there are multiple hypotheses (e.g., per-contrast error rate, familywise error rate, & per-family error rate), so there are different kinds of power* (e.g., for a single pre-specified effect, for any effect, & for all effects). You could also seek the power to detect a specific combination of effects, or the power of a simultaneous test of the model as a whole. My guess from your description of your SAS code is that it is looking for the latter. However, from your description of your situation, I am assuming you want to detect the interaction effects at a minimum.
- *reference: Maxwell, S.E. (2004). The persistence of underpowered studies in psychological research: causes, consequences, and remedies. Psychological Methods, 9, 2, pp. 147-163.
- your effects are quite small (not to be confused with the low response rates), so we will find it difficult to achieve good power.
- Note that, although these all sound fairly similar, they are very much not the same (e.g., it is very possible to get a significant model with no significant effects--discussed here: How can a regression be significant yet all predictors be non-significant?, or significant effects but where the model is not significant--discussed here: Significance of coefficients in linear regression: significant t-test vs non-significant F-statistic), which will be illustrated below.
For a different way to think about issues related to power, see my answer here: How to report general precision in estimating correlations within a context of justifying sample size.
Simple post-hoc power for logistic regression in R:
Let's say your posited response rates represent the true situation in the world, and that you had sent out 10,000 letters. What is the power to detect those effects? (Note that I am famous for writing "comically inefficient" code; the following is intended to be easy to follow rather than optimized for efficiency--in fact, it's quite slow.)
set.seed(1)
repetitions = 1000
N = 10000
n = N/8
var1 = c( .03, .03, .03, .03, .06, .06, .09, .09)
var2 = c( 0, 0, 0, 1, 0, 1, 0, 1)
rates = c(0.0025, 0.0025, 0.0025, 0.00395, 0.003, 0.0042, 0.0035, 0.002)
var1 = rep(var1, times=n)
var2 = rep(var2, times=n)
var12 = var1**2
var1x2 = var1 *var2
var12x2 = var12*var2
significant = matrix(nrow=repetitions, ncol=7)
startT = proc.time()[3]
for(i in 1:repetitions){
  responses = rbinom(n=N, size=1, prob=rates)  # rates recycles over the 8 cells
  model = glm(responses~var1+var2+var12+var1x2+var12x2,
              family=binomial(link="logit"))
  significant[i,1:5] = (summary(model)$coefficients[2:6,4]<.05)  # Wald p-values
  significant[i,6]   = sum(significant[i,1:5])       # how many effects significant
  modelDev           = model$null.deviance-model$deviance
  significant[i,7]   = (1-pchisq(modelDev, 5))<.05   # likelihood ratio test of model
}
endT = proc.time()[3]
endT-startT
sum(significant[,1])/repetitions # pre-specified effect power for var1
[1] 0.042
sum(significant[,2])/repetitions # pre-specified effect power for var2
[1] 0.017
sum(significant[,3])/repetitions # pre-specified effect power for var12
[1] 0.035
sum(significant[,4])/repetitions # pre-specified effect power for var1X2
[1] 0.019
sum(significant[,5])/repetitions # pre-specified effect power for var12X2
[1] 0.022
sum(significant[,7])/repetitions # power for likelihood ratio test of model
[1] 0.168
sum(significant[,6]==5)/repetitions # all effects power
[1] 0.001
sum(significant[,6]>0)/repetitions # any effect power
[1] 0.065
sum(significant[,4]&significant[,5])/repetitions # power for interaction terms
[1] 0.017
So we see that 10,000 letters doesn't really achieve 80% power (of any sort) to detect these response rates. (I am not sufficiently sure about what the SAS code is doing to be able to explain the stark discrepancy between these approaches, but this code is conceptually straightforward--if slow--and I have spent some time checking it, and I think these results are reasonable.)
Simulation-based a-priori power for logistic regression:
From here the idea is simply to search over possible $N$'s until we find a value that yields the desired level of the type of power you are interested in. Any search strategy that you can code up to work with this would be fine (in theory). Given the $N$'s that are going to be required to capture such small effects, it is worth thinking about how to do this more efficiently. My typical approach is simply brute force, i.e. to assess each $N$ that I might reasonably consider. (Note however, that I would typically only consider a small range, and I'm typically working with very small $N$'s--at least compared to this.)
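For illustration, a brute-force grid search might look like the following sketch. To keep the run time down I substitute a simple two-proportion test comparing just two of the cells (.0025 vs. .0042) for the full logistic-regression simulation; in practice you would wrap the full simulation above inside est.power() instead:

```r
set.seed(1)
est.power = function(N, B=200){            # N letters per cell, B iterations
  sig = replicate(B, {
    y1 = rbinom(1, size=N, prob=.0025)     # successes in the low-rate cell
    y2 = rbinom(1, size=N, prob=.0042)     # successes in the high-rate cell
    prop.test(c(y1, y2), c(N, N))$p.value < .05
  })
  mean(sig)                                # proportion significant = estimated power
}
Ns     = seq(10000, 50000, by=10000)
powers = sapply(Ns, est.power)
Ns[which(powers >= .80)[1]]                # smallest N on the grid reaching 80% power
```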
Instead, my strategy here was to bracket possible $N$'s to get a sense of what the range of powers would be. Thus, I picked an $N$ of 500,000 and re-ran the code (initiating the same seed; n.b., this took an hour and a half to run). Here are the results:
sum(significant[,1])/repetitions # pre-specified effect power for var1
[1] 0.115
sum(significant[,2])/repetitions # pre-specified effect power for var2
[1] 0.091
sum(significant[,3])/repetitions # pre-specified effect power for var12
[1] 0.059
sum(significant[,4])/repetitions # pre-specified effect power for var1X2
[1] 0.606
sum(significant[,5])/repetitions # pre-specified effect power for var12X2
[1] 0.913
sum(significant[,7])/repetitions # power for likelihood ratio test of model
[1] 1
sum(significant[,6]==5)/repetitions # all effects power
[1] 0.005
sum(significant[,6]>0)/repetitions # any effect power
[1] 0.96
sum(significant[,4]&significant[,5])/repetitions # power for interaction terms
[1] 0.606
We can see from this that the magnitude of your effects varies considerably, and thus your ability to detect them varies. For example, the effect of $var1^2$ is particularly difficult to detect, only being significant 6% of the time even with half a million letters. On the other hand, the model as a whole was always significantly better than the null model. The other possibilities are arrayed in between. Although most of the 'data' are thrown away on each iteration, a good bit of exploration is still possible. For example, we could use the significant matrix to assess the correlations between the probabilities of different variables being significant.
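For instance (using a fake stand-in for the significant matrix here so the snippet runs on its own; in practice you would use the first five columns saved from the simulation above):

```r
set.seed(1)
# stand-in: 1000 iterations x 5 significance indicators (0/1)
significant = matrix(rbinom(1000*5, size=1, prob=.1), nrow=1000, ncol=5)
round(cor(significant), 2)   # phi coefficients between the indicators
```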
I should note, in conclusion, that due to the complexity and large $N$ entailed in your situation, this was not as simple as I had suspected/claimed in my initial comment. However, you can certainly get the idea of how this can be done in general, and of the issues involved in power analysis, from what I've put here. HTH.
Both tests implicitly model the age-response relationship, but they do so in different ways. Which one to select depends on how you choose to model that relationship. Your choice ought to depend on an underlying theory, if there is one; on what kind of information you want to extract from the results; and on how the sample is selected. This answer discusses these three aspects in order.
I will describe the t-test and logistic regression using language that supposes you are studying a well-defined population of people and wish to make inferences from the sample to this population.
In order to support any kind of statistical inference we must assume the sample is random.
A t-test assumes the people in the sample responding "no" are a simple random sample of all no-respondents in the population and that the people in the sample responding "yes" are a simple random sample of all yes-respondents in the population.
A t-test makes additional technical assumptions about the distributions of the ages within each of the two groups in the population. Various versions of the t-test exist to handle the likely possibilities.
Logistic regression assumes all people of any given age are a simple random sample of the people of that age in the population. The separate age groups may exhibit different rates of "yes" responses. These rates, when expressed as log odds (rather than as straight proportions), are assumed to be linearly related with age (or with some determined functions of age).
Logistic regression is easily extended to accommodate non-linear relationships between age and response. Such an extension can be used to evaluate the plausibility of the initial linear assumption. It is practicable with large datasets, which afford enough detail to display non-linearities, but is unlikely to be of much use with small datasets. A common rule of thumb--that regression models should have ten times as many observations as parameters--suggests that substantially more than 20 observations are needed to detect nonlinearity (which needs a third parameter in addition to the intercept and slope of a linear function).
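A sketch of that extension (the data are invented, with the log odds truly linear in age, so the quadratic term should usually not be needed):

```r
set.seed(2)
age  = runif(200, 20, 70)
resp = rbinom(200, size=1, prob=plogis(-4 + .08*age))   # log odds linear in age
linear    = glm(resp ~ age,            family=binomial)
quadratic = glm(resp ~ age + I(age^2), family=binomial)
anova(linear, quadratic, test="LRT")   # likelihood ratio test of the curvature term
```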
A t-test detects whether the average ages differ between no- and yes-respondents in the population. A logistic regression estimates how the response rate varies with age. As such it is more flexible and capable of supplying more detailed information than the t-test is. On the other hand, it tends to be less powerful than the t-test for the basic purpose of detecting a difference between the average ages in the groups.
It is possible for the pair of tests to exhibit all four combinations of significance and non-significance. Two of these are problematic:
The t-test is not significant but the logistic regression is. When the assumptions of both tests are plausible, such a result is practically impossible, because the t-test is not trying to detect such a specific relationship as posited by logistic regression. However, when that relationship is sufficiently nonlinear to cause the oldest and youngest subjects to share one opinion and the middle-aged subjects another, then the extension of logistic regression to nonlinear relationships can detect and quantify that situation, which no t-test could detect.
The t-test is significant but the logistic regression is not, as in the question. This often happens, especially when there is a group of younger respondents, a group of older respondents, and few people in between. This may create a great separation between the response rates of no- and yes-responders. It is readily detected by the t-test. However, logistic regression would either have relatively little detailed information about how the response rate actually changes with age or else it would have inconclusive information: the case of "complete separation" where all older people respond one way and all younger people another way--but in that case both tests would usually have very low p-values.
Note that the experimental design can invalidate some of the test assumptions. For instance, if you selected people according to their age in a stratified design, then the t-test's assumption (that each group reflects a simple random sample of ages) becomes questionable. This design would suggest relying on logistic regression. If instead you had two pools, one of no-responders and one of yes-responders, and selected randomly from those to ascertain their age, then the sampling assumptions of logistic regression are doubtful while those of the t-test will hold. That design would suggest using some form of a t-test.
(The second design might seem silly here, but in circumstances where "age" is replaced by some characteristic that is difficult, costly, or time-consuming to measure it can be appealing.)
If I have computed correctly, logistic regression asymptotically has the same power as the t-test. To see this, write down its log likelihood and compute the expectation of its Hessian at its global maximum (its negative estimates the variance-covariance matrix of the ML solution). Don't bother with the usual logistic parameterization: it's simpler just to parameterize it with the two probabilities in question. The details will depend on exactly how you test the significance of a logistic regression coefficient (there are several methods).
That these tests have similar powers should not be too surprising, because the chi-square theory for ML estimates is based on a normal approximation to the log likelihood, and the t-test is based on a normal approximation to the distributions of proportions. The crux of the matter is that both methods make the same estimates of the two proportions and both estimates have the same standard errors.
An actual analysis might be more convincing. Let's adopt some general terminology for the values in a given group (A or B): the group's $N$ observations are collected in $m$ sets ("draws") of $n$ observations each, so that $N = mn$; $k_i$ is the number of successes in draw $i$; $k = \sum_i k_i$ is the group total; and $p$ is the group's true success probability.
Logistic regression is essentially the ML estimator of $p$. Its logarithm is given by
$$\log(\mathbb{L}) = k \log(p) + (N-k) \log(1-p).$$
Its derivatives with respect to the parameter $p$ are
$$\frac{\partial \log(\mathbb{L})}{ \partial p} = \frac{k}{p} - \frac{N-k}{1-p} \text{ and}$$
$$-\frac{\partial^2 \log(\mathbb{L})}{\partial p^2} = \frac{k}{p^2} + \frac{N-k}{(1-p)^2}.$$
Setting the first to zero yields the ML estimate ${\hat{p} = k/N}$ and plugging that into the reciprocal of the second expression yields the variance $\hat{p}(1 - \hat{p})/N$, which is the square of the standard error.
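A quick numerical check of this algebra (the values of k and N are arbitrary):

```r
k = 37;  N = 100                          # arbitrary example values
p.hat = k/N                               # ML estimate
info  = k/p.hat^2 + (N-k)/(1-p.hat)^2     # observed information at the maximum
all.equal(1/info, p.hat*(1-p.hat)/N)      # TRUE: its reciprocal is the variance
```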
The t statistic will be obtained from estimators based on the data grouped by sets of draws; namely, as the difference of the means (one from group A and the other from group B) divided by the standard error of that difference, which is obtained from the standard deviations of the means. Let's look at the mean and standard deviation for a given group, then. The mean equals $k/N$, which is identical to the ML estimator $\hat{p}$. The standard deviation in question is the standard deviation of the draw means; that is, it is the standard deviation of the set of $k_i/n$. Here is the crux of the matter, so let's explore some possibilities.
Suppose the data aren't grouped into draws at all: that is, $n = 1$ and $m = N$. The $k_{i}$ are the draw means. Their sample variance equals $N/(N-1)$ times $\hat{p}(1 - \hat{p})$. From this it follows that the standard error is identical to the ML standard error apart from a factor of $\sqrt{N/(N-1)}$, which is essentially $1$ when $N = 900$. Therefore--apart from this tiny difference--any tests based on logistic regression will be the same as a t-test and we will achieve essentially the same power.
When the data are grouped, the (true) variance of the $k_i/n$ equals $p(1-p)/n$ because the statistics $k_i$ represent the sum of $n$ Bernoulli($p$) variables, each with variance $p(1-p)$. Therefore the expected standard error of the mean of $m$ of these values is the square root of $p(1-p)/n/m = p(1-p)/N$, just as before.
The second, grouped-data result indicates that the power of the test should not vary appreciably with how the draws are apportioned (that is, with how $m$ and $n$ are varied subject to $mn = N$), apart perhaps from a fairly small effect from the adjustment in the sample variance (unless you were so foolish as to use extremely few sets of draws within each group).
Limited simulations to compare $p = 0.70$ to $p = 0.74$ (with 10,000 iterations apiece) involving $m = 900, n = 1$ (essentially logistic regression); $m = n = 30$; and $m = 2, n = 450$ (maximizing the sample variance adjustment) bear this out: the power (at $\alpha = 0.05$, one-sided) in the first two cases is 0.59 whereas in the third, where the adjustment factor makes a material change (there are now just two degrees of freedom instead of 1798 or 58), it drops to 0.36. Another test comparing $p = 0.50$ to $p = 0.52$ gives powers of 0.22, 0.21, and 0.15, respectively: again, we observe only a slight drop from no grouping into draws (=logistic regression) to grouping into 30 groups and a substantial drop down to just two groups.
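A simulation in that spirit could be sketched as follows (my own reconstruction, not the exact code behind the figures above; pA and pB are the two group probabilities, and the t-test is run on the draw means):

```r
set.seed(17)
power.sim = function(m, n, pA=.70, pB=.74, B=2000, alpha=.05){
  sig = replicate(B, {
    a = rbinom(m, size=n, prob=pA)/n     # draw means, group A
    b = rbinom(m, size=n, prob=pB)/n     # draw means, group B
    tryCatch(t.test(b, a, alternative="greater")$p.value < alpha,
             error=function(e) NA)       # m=2 can occasionally yield constant data
  })
  mean(sig, na.rm=TRUE)
}
power.sim(m=900, n=1)    # no grouping: essentially logistic regression
power.sim(m=30,  n=30)   # moderate grouping
power.sim(m=2,   n=450)  # extreme grouping: power drops
```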
The morals of this analysis are: