Logistic regression or T test

group-differences, logistic, regression, t-test

A group of persons answers one question. The answer can be "yes" or "no". The researcher wants to know whether age is associated with the type of answer.

The association was assessed by doing a logistic regression where age is the explanatory variable and type of answer (yes, no) is the dependent variable. It was separately addressed by calculating the mean age of the groups that answered "yes" and "no", respectively, and by conducting a T test to compare means.

Both tests were performed following the advice of different persons, and neither of them is sure which is the right way to go. In view of the research question, which would be the better test?

For hypothesis testing the p values were not significant (regression) and significant (T test). The sample is less than 20 cases.

Best Answer

Both tests implicitly model the age-response relationship, but they do so in different ways. Which one to select depends on how you choose to model that relationship. Your choice ought to depend on an underlying theory, if there is one; on what kind of information you want to extract from the results; and on how the sample is selected. This answer discusses these three aspects in order.


I will describe the t-test and logistic regression using language that supposes you are studying a well-defined population of people and wish to make inferences from the sample to this population.

In order to support any kind of statistical inference we must assume the sample is random.

  • A t-test assumes the people in the sample responding "no" are a simple random sample of all no-respondents in the population and that the people in the sample responding "yes" are a simple random sample of all yes-respondents in the population.

    A t-test makes additional technical assumptions about the distributions of the ages within each of the two groups in the population. Various versions of the t-test exist to handle the likely possibilities.

  • Logistic regression assumes all people of any given age are a simple random sample of the people of that age in the population. The separate age groups may exhibit different rates of "yes" responses. These rates, when expressed as log odds (rather than as straight proportions), are assumed to be linearly related with age (or with some determined functions of age).

    Logistic regression is easily extended to accommodate non-linear relationships between age and response. Such an extension can be used to evaluate the plausibility of the initial linear assumption. It is practicable with large datasets, which afford enough detail to display non-linearities, but is unlikely to be of much use with small datasets. A common rule of thumb--that regression models should have ten times as many observations as parameters--suggests that substantially more than 20 observations are needed to detect nonlinearity (which needs a third parameter in addition to the intercept and slope of a linear function).
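As a rough illustration of the extension described above, the sketch below fits a linear and a quadratic logistic model and compares them with a likelihood ratio test. Everything in it is an assumption made for illustration: the data are simulated (500 hypothetical people with a U-shaped age effect), `fit_logistic` is a small hand-written Newton-Raphson helper rather than a library routine, and a quadratic term is only one of many ways to model non-linearity.

```python
import numpy as np
from scipy import stats

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic fit by Newton-Raphson.
    X must already include a column of ones. Returns (beta, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return beta, loglik

# Simulated (hypothetical) large sample: youngest and oldest tend to say "yes".
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(20, 70, size=n)
z = (age - age.mean()) / age.std()        # standardized age
true_logit = -1.0 + 1.5 * z**2            # assumed "true" relationship
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

ones = np.ones(n)
X_lin = np.column_stack([ones, z])        # intercept + slope
X_quad = np.column_stack([ones, z, z**2]) # adds the third parameter

beta_lin, ll_lin = fit_logistic(X_lin, y)
beta_quad, ll_quad = fit_logistic(X_quad, y)

# Likelihood ratio test of the quadratic term (1 extra parameter).
lr_stat = 2.0 * (ll_quad - ll_lin)
p_nonlinear = stats.chi2.sf(lr_stat, df=1)
print("quadratic coefficient:", beta_quad[2])
print("LR statistic:", lr_stat, "p-value:", p_nonlinear)
```

With fewer than 20 observations, the same comparison would have very little power, which is the point of the rule of thumb above.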

A t-test detects whether the average ages differ between no- and yes-respondents in the population. A logistic regression estimates how the response rate varies by age. As such it is more flexible and capable of supplying more detailed information than the t-test is. On the other hand, it tends to be less powerful than the t-test for the basic purpose of detecting a difference between the average ages in the groups.
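To make the two approaches concrete, here is a minimal sketch that runs both on the same small dataset. The 18 ages and answers are invented for illustration and are not the questioner's data. The t-test is Welch's version from SciPy; the logistic regression is fitted with a hand-written Newton-Raphson helper (`fit_logistic`, a hypothetical name), and its slope is tested both with a Wald test and with a likelihood ratio test against an intercept-only model.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 18 people (made-up values).
age = np.array([22, 25, 27, 30, 31, 34, 36, 38, 41,
                43, 45, 48, 52, 55, 58, 61, 64, 67], dtype=float)
yes = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1,
                0, 1, 1, 0, 1, 1, 0, 1, 1], dtype=float)

# Approach 1: compare the mean ages of the two answer groups (Welch t-test).
t_res = stats.ttest_ind(age[yes == 1], age[yes == 0], equal_var=False)

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson logistic fit; X must include a column of ones.
    Returns (beta, log-likelihood, estimated covariance of beta)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hess = X.T @ (X * (p * (1.0 - p))[:, None])
    loglik = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return beta, loglik, np.linalg.inv(hess)

# Approach 2: model the log odds of "yes" as linear in age.
x = (age - age.mean()) / 10.0             # centered, in decades
X = np.column_stack([np.ones_like(x), x])
beta, loglik, cov = fit_logistic(X, yes)

# Wald test of the slope.
z_wald = beta[1] / np.sqrt(cov[1, 1])
p_wald = 2.0 * stats.norm.sf(abs(z_wald))

# Likelihood ratio test against the intercept-only model.
ybar = yes.mean()
ll_null = len(yes) * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))
lr_stat = 2.0 * (loglik - ll_null)
p_lrt = stats.chi2.sf(lr_stat, df=1)

print("Welch t-test p-value:", t_res.pvalue)
print("Logistic slope per decade:", beta[1])
print("Wald p-value:", p_wald, " LRT p-value:", p_lrt)
```

The three p-values generally differ, and with samples this small they can land on opposite sides of a conventional threshold.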

It is possible for the pair of tests to exhibit all four combinations of significance and non-significance. Two of these are problematic:

  • The t-test is not significant but the logistic regression is. When the assumptions of both tests are plausible, such a result is practically impossible, because the t-test is not trying to detect such a specific relationship as posited by logistic regression. However, when that relationship is sufficiently nonlinear to cause the oldest and youngest subjects to share one opinion and the middle-aged subjects another, then the extension of logistic regression to nonlinear relationships can detect and quantify that situation, which no t-test could detect.

  • The t-test is significant but the logistic regression is not, as in the question. This often happens, especially when there is a group of younger respondents, a group of older respondents, and few people in between. This may create a great separation between the ages of no- and yes-respondents. It is readily detected by the t-test. However, logistic regression would either have relatively little detailed information about how the response rate actually changes with age or else it would have inconclusive information: the case of "complete separation" where all older people respond one way and all younger people another way--but in that case both tests would usually have very low p-values.

Note that the experimental design can invalidate some of the test assumptions. For instance, if you selected people according to their age in a stratified design, then the t-test's assumption (that each group reflects a simple random sample of ages) becomes questionable. This design would suggest relying on logistic regression. If instead you had two pools, one of no-responders and one of yes-responders, and selected randomly from those to ascertain their age, then the sampling assumptions of logistic regression are doubtful while those of the t-test will hold. That design would suggest using some form of a t-test.
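The difference between the two designs can be sketched by drawing both kinds of sample from one simulated population. The population, the age strata, and the sample sizes below are all hypothetical choices made only to show what each design fixes in advance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: older people are assumed more likely to say "yes".
N = 100_000
age = rng.integers(20, 70, size=N)                    # ages 20..69
p_yes = 1.0 / (1.0 + np.exp(-(age - 45) / 10.0))
yes = rng.uniform(size=N) < p_yes

# Design 1: select people by age stratum, then record their answers.
strata = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
per_stratum = 4
idx1 = np.concatenate([
    rng.choice(np.flatnonzero((age >= lo) & (age < hi)),
               size=per_stratum, replace=False)
    for lo, hi in strata
])

# Design 2: select from the "yes" pool and the "no" pool, then measure age.
per_pool = 10
idx2 = np.concatenate([
    rng.choice(np.flatnonzero(yes), size=per_pool, replace=False),
    rng.choice(np.flatnonzero(~yes), size=per_pool, replace=False),
])

# Under design 1 the age distribution is set by the sampler.
counts1 = [int(np.sum((age[idx1] >= lo) & (age[idx1] < hi))) for lo, hi in strata]
print("design 1, people per age stratum:", counts1)
print("design 1, observed yes rate:", yes[idx1].mean())

# Under design 2 the share of "yes" answers is set by the sampler.
print("design 2, yes rate (fixed by design):", yes[idx2].mean())
print("design 2, mean age of yes / no:",
      age[idx2][yes[idx2]].mean(), age[idx2][~yes[idx2]].mean())
```

In the first sample the ages within each answer group are not a simple random sample of ages, because the sampler chose the age mix. In the second sample the overall share of "yes" answers says nothing about the population, because the sampler chose it.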

(The second design might seem silly here, but in circumstances where "age" is replaced by some characteristic that is difficult, costly, or time-consuming to measure it can be appealing.)
