This is actually an extremely sophisticated problem and a tough ask from your lecturer!
In terms of how you organise your data, a 1070 x 10 rectangle is fine: one row per country-period combination (107 x 10 = 1070) and one column per variable. For example, in R:
> conflict.data <- data.frame(
+   confl = sample(0:1, 1070, replace = TRUE),
+   country = factor(rep(1:107, 10)),
+   period = factor(rep(1:10, each = 107)),
+   landdeg = sample(c("Type1", "Type2"), 1070, replace = TRUE),
+   popincrease = sample(0:1, 1070, replace = TRUE),
+   liveli = sample(0:1, 1070, replace = TRUE),
+   popden = sample(c("Low", "Med", "High"), 1070, replace = TRUE),
+   NDVI = rnorm(1070, 100, 10),
+   NDVIdecl1 = sample(0:1, 1070, replace = TRUE),
+   NDVIdecl2 = sample(0:1, 1070, replace = TRUE))
> head(conflict.data)
confl country period landdeg popincrease liveli popden NDVI NDVIdecl1 NDVIdecl2
1 1 1 1 Type1 1 0 Low 113.4744 0 1
2 1 2 1 Type2 1 1 High 103.2979 0 0
3 0 3 1 Type2 1 1 Med 109.1200 1 1
4 1 4 1 Type2 0 1 Low 112.1574 1 0
5 0 5 1 Type1 0 0 High 109.9875 0 1
6 1 6 1 Type1 1 0 Low 109.2785 0 0
> summary(conflict.data)
     confl            country        period      landdeg     popincrease
 Min.   :0.0000   1      :  10   1      :107   Type1:535   Min.   :0.0000
 1st Qu.:0.0000   2      :  10   2      :107   Type2:535   1st Qu.:0.0000
 Median :1.0000   3      :  10   3      :107               Median :1.0000
 Mean   :0.5009   4      :  10   4      :107               Mean   :0.5028
 3rd Qu.:1.0000   5      :  10   5      :107               3rd Qu.:1.0000
 Max.   :1.0000   6      :  10   6      :107               Max.   :1.0000
                  (Other):1010   (Other):428
     liveli       popden         NDVI          NDVIdecl1        NDVIdecl2
 Min.   :0.0000   High:361   Min.   : 68.71   Min.   :0.0000   Min.   :0.0000
 1st Qu.:0.0000   Low :340   1st Qu.: 93.25   1st Qu.:0.0000   1st Qu.:0.0000
 Median :1.0000   Med :369   Median : 99.65   Median :1.0000   Median :0.0000
 Mean   :0.5056              Mean   : 99.84   Mean   :0.5121   Mean   :0.4888
 3rd Qu.:1.0000              3rd Qu.:106.99   3rd Qu.:1.0000   3rd Qu.:1.0000
 Max.   :1.0000              Max.   :130.13   Max.   :1.0000   Max.   :1.0000
> dim(conflict.data)
[1] 1070 10
For fitting a model, the glm() function as @gui11aume suggests will do the basics...
mod <- glm(confl ~ ., family = binomial, data = conflict.data)
anova(mod, test = "Chisq")   # sequential likelihood-ratio tests for each term
... but this has the problem that it treats "country" (I'm assuming your 107 units are countries) as a fixed effect, whereas a random effect is more appropriate. It also treats period as a simple factor, with no allowance for autocorrelation.
You can address the first problem with a generalized linear mixed effects model, as in e.g. Bates et al.'s lme4 package in R. There's a nice introduction to some aspects of this here. Something like
library(lme4)
mod2 <- glmer(confl ~ landdeg + popincrease + liveli + popden +
                NDVI + NDVIdecl1 + NDVIdecl2 +
                (1 | country) + (1 | period),
              family = binomial, data = conflict.data)
summary(mod2)
would be a step forward.
Now your last remaining problem is autocorrelation across your 10 periods. Basically, your 10 data points on each country aren't worth as much as if they were 10 randomly chosen, independent and identically distributed points. I'm not aware of a widely available software solution to autocorrelation in the residuals of a multilevel model with a non-Normal response; certainly it isn't implemented in lme4. Others may know more than I do.
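For what it's worth, one avenue to experiment with is the glmmTMB package, which documents an ar1() covariance structure that can be combined with a binomial response. A minimal sketch on the simulated data above - untested on real data of this kind, and using the ar1() syntax from that package's covariance-structure vignette - would be:
library(glmmTMB)
mod3 <- glmmTMB(confl ~ landdeg + popincrease + liveli + popden +
                  NDVI + NDVIdecl1 + NDVIdecl2 +
                  ar1(period + 0 | country),  # AR(1) correlation across periods within country
                family = binomial, data = conflict.data)
summary(mod3)
Whether this behaves well on 107 countries x 10 periods of binary data is something you would want to check carefully.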
Without any further detail, what you are doing is re-inventing the issue of p-values in 'Table 1'. Turkiewicz et al. say the following here:
Similarly, a P-value >0.05 can never be used to support a statement that the null hypothesis is true (often expressed as “there was no difference …”) because absence of evidence is not evidence of absence. It is important to recognize that the P-value is a measure for inferential purposes, not descriptive ones. Thus, P-values in ‘Table 1’ (which usually describes the study sample) are useless.
Predictiveness in a general sense is almost identical to statistical significance testing - a claim whose justification is too long to include in this answer - and testing the statistical significance of a covariate in a bivariate logistic regression model amounts to running an ordinary categorical analysis, such as a Pearson chi-square test of independence.
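To see that near-equivalence concretely, here is a minimal sketch on simulated data (all variable names invented for illustration): with a single binary covariate, the test of its slope in a logistic regression and Pearson's chi-square test of the 2x2 table ask the same question and give essentially the same p-value in large samples.
set.seed(1)
z <- rbinom(200, 1, 0.5)   # hypothetical randomization assignment
x <- rbinom(200, 1, 0.4)   # hypothetical binary baseline covariate
summary(glm(z ~ x, family = binomial))$coefficients   # Wald test for the x coefficient
chisq.test(table(x, z), correct = FALSE)              # Pearson chi-square test of independence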
The problem with using cross-validation to assess this is that you still have not controlled for multiple comparisons. Three covariates of age, sex, and BMI "tested" against a randomization assignment for balance will have a family-wise false positive error rate of $1-(1-0.05)^3 \approx 14\%$. So you would, at a minimum, need to apply a correction such as Bonferroni, which leaves the reader to ask how much power you actually have to detect imbalance with such a method.
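The arithmetic, for anyone who wants to verify it:
alpha <- 0.05
k <- 3                # age, sex, BMI
1 - (1 - alpha)^k     # family-wise false positive rate: 0.142625
alpha / k             # Bonferroni-corrected per-test level: ~0.0167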
Dr. Stephen Senn nicely summarizes the issue of "obsessing with balance" here and here. To summarize: any attention given to inspecting the relation of a covariate to the randomization assignment is a completely lost cause. However, covariates that are known to be strong prognostic factors should be adjusted for regardless of whether they're balanced, provided the study has sufficient power to do so.
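In code, that advice amounts to pre-specifying the prognostic covariates in the outcome model rather than screening them for balance first. A sketch with invented variable names:
# adjust for known prognostic factors directly (all names hypothetical)
fit <- glm(outcome ~ treatment + age + sex + bmi,
           family = binomial, data = trial)
summary(fit)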
Best Answer
Both tests implicitly model the age-response relationship, but they do so in different ways. Which one to select depends on how you choose to model that relationship. Your choice ought to depend on an underlying theory, if there is one; on what kind of information you want to extract from the results; and on how the sample is selected. This answer discusses these three aspects in order.
I will describe the t-test and logistic regression using language that supposes you are studying a well-defined population of people and wish to make inferences from the sample to this population.
In order to support any kind of statistical inference we must assume the sample is random.
A t-test assumes the people in the sample responding "no" are a simple random sample of all no-respondents in the population and that the people in the sample responding "yes" are a simple random sample of all yes-respondents in the population.
A t-test makes additional technical assumptions about the distributions of the ages within each of the two groups in the population. Various versions of the t-test exist to handle the likely possibilities.
Logistic regression assumes all people of any given age are a simple random sample of the people of that age in the population. The separate age groups may exhibit different rates of "yes" responses. These rates, when expressed as log odds (rather than as straight proportions), are assumed to be linearly related with age (or with some determined functions of age).
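In R the two analyses look like this - a sketch assuming a data frame d with a numeric age and a 0/1 response:
t.test(age ~ response, data = d)   # Welch t-test by default; var.equal=TRUE gives the classical pooled version
fit <- glm(response ~ age, family = binomial, data = d)
summary(fit)                       # log odds of a "yes" modelled as a linear function of age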
Logistic regression is easily extended to accommodate non-linear relationships between age and response. Such an extension can be used to evaluate the plausibility of the initial linear assumption. It is practicable with large datasets, which afford enough detail to display non-linearities, but is unlikely to be of much use with small datasets. A common rule of thumb--that regression models should have ten times as many observations as parameters--suggests that substantially more than 20 observations are needed to detect nonlinearity (which needs a third parameter in addition to the intercept and slope of a linear function).
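A sketch of that extension, continuing with the hypothetical data frame d: fit a quadratic in age and compare it to the linear fit with a likelihood-ratio test.
fit1 <- glm(response ~ age, family = binomial, data = d)
fit2 <- glm(response ~ poly(age, 2), family = binomial, data = d)
anova(fit1, fit2, test = "Chisq")   # likelihood-ratio test for the quadratic term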
A t-test detects whether the average ages differ between no- and yes-respondents in the population. A logistic regression estimates how the response rate varies with age. As such it is more flexible and capable of supplying more detailed information than the t-test. On the other hand, it tends to be less powerful than the t-test for the basic purpose of detecting a difference between the average ages in the groups.
It is possible for the pair of tests to exhibit all four combinations of significance and non-significance. Two of these are problematic:
The t-test is not significant but the logistic regression is. When the assumptions of both tests are plausible, such a result is practically impossible, because the t-test is not trying to detect such a specific relationship as posited by logistic regression. However, when that relationship is sufficiently nonlinear to cause the oldest and youngest subjects to share one opinion and the middle-aged subjects another, then the extension of logistic regression to nonlinear relationships can detect and quantify that situation, which no t-test could detect.
The t-test is significant but the logistic regression is not, as in the question. This often happens, especially when there is a group of younger respondents, a group of older respondents, and few people in between. This may create a great separation between the response rates of no- and yes-responders. It is readily detected by the t-test. However, logistic regression would either have relatively little detailed information about how the response rate actually changes with age or else it would have inconclusive information: the case of "complete separation" where all older people respond one way and all younger people another way--but in that case both tests would usually have very low p-values.
Note that the experimental design can invalidate some of the test assumptions. For instance, if you selected people according to their age in a stratified design, then the t-test's assumption (that each group reflects a simple random sample of ages) becomes questionable. This design would suggest relying on logistic regression. If instead you had two pools, one of no-responders and one of yes-responders, and selected randomly from those to ascertain their age, then the sampling assumptions of logistic regression are doubtful while those of the t-test will hold. That design would suggest using some form of a t-test.
(The second design might seem silly here, but in circumstances where "age" is replaced by some characteristic that is difficult, costly, or time-consuming to measure it can be appealing.)