Solved – Should sampling for logistic regression reflect the real ratio of 1’s and 0’s?


Suppose I want to build a logistic regression model that estimates the probability of occurrence of some tree-dwelling animal species based on characteristics of the trees (e.g., height). As always, my time and money are limited, so I can collect only a limited sample size.

I have the following questions:
Should the ratio of 1's and 0's in my sample reflect the true ratio of 1's and 0's (at least approximately)? I have noticed that it is common practice to fit logistic regression models on balanced samples (equal numbers of 1's and 0's), but such models give unrealistically high probabilities of occurrence, right?

Are there any articles/textbooks I can use to **support the notion that models which do not reflect the true ratio of 1's and 0's are "wrong"?**

And finally: is it possible to perform 1:1 sampling and subsequently correct the model with tau according to Imai et al. (2007)?

Kosuke Imai, Gary King, and Olivia Lau. 2007. “relogit: Rare Events Logistic Regression for Dichotomous Dependent Variables,” in Kosuke Imai, Gary King, and Olivia Lau, “Zelig: Everyone’s Statistical Software,” http://gking.harvard.edu/zelig.

[Figure: trees sampled under different strategies.] Dots represent trees (red = occupied, grey = unoccupied). I can identify all occupied trees (the 1's) with 100% accuracy, but I cannot measure every tree in the forest. The fitted model differs for each sampling strategy (ratio).

Best Answer

If the goal of such a model is prediction, then you cannot use unweighted logistic regression to predict outcomes: you will overpredict risk. The strength of logistic models is that the odds ratio (OR), the "slope" that measures the association between a risk factor and a binary outcome, is invariant to outcome-dependent sampling. Whether cases are sampled relative to controls at a 10:1, 5:1, 1:1, 1:5, or 1:10 ratio simply doesn't matter: the OR remains unchanged in every scenario, so long as sampling is unconditional on the exposure (which would otherwise introduce Berkson's bias). Indeed, outcome-dependent sampling is a cost-saving endeavor when complete simple random sampling is simply infeasible.
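A quick simulation sketch makes the invariance concrete (my own toy example, not from the question; the covariate name `height` and the coefficient values are assumptions):

```r
## Toy illustration: the OR (slope) is invariant to outcome-dependent
## sampling, while the intercept is not.
set.seed(42)
N  <- 1e5
df <- data.frame(height = rnorm(N, 10, 2))
df$y <- rbinom(N, 1, plogis(-9 + 0.5 * df$height))   # rare-ish outcome (~2%)
full <- glm(y ~ height, family = binomial, data = df)

## 1:1 case-control subsample: all cases plus an equal number of
## controls, sampled unconditionally on height (no Berkson's bias)
idx <- c(which(df$y == 1), sample(which(df$y == 0), sum(df$y)))
cc  <- glm(y ~ height, family = binomial, data = df[idx, ])

rbind(full = coef(full), case_control = coef(cc))
## the "height" slopes agree up to sampling noise; the intercepts
## differ by roughly the log-odds shift induced by the design
```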

Why are risk predictions from logistic models biased under outcome-dependent sampling? Outcome-dependent sampling affects the intercept of the model. It makes the S-shaped association curve slide along the x-axis by the difference between the log-odds of sampling a case in a simple random sample from the population and the log-odds of sampling a case in the pseudo-population created by your design. (With 1:1 cases to controls, there is a 50% chance of sampling a case in that pseudo-population.) For rare outcomes this is quite a big difference, a factor of 2 or 3.
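In symbols, the standard prior correction (the tau correction from King and Zeng's rare-events work, which the relogit software in the question implements) adjusts only the intercept. With $\tau$ the population proportion of 1's and $\bar{y}$ the sample proportion:

$$\hat{\beta}_0^{\text{corrected}} = \hat{\beta}_0 - \ln\left[\left(\frac{1-\tau}{\tau}\right)\left(\frac{\bar{y}}{1-\bar{y}}\right)\right]$$

The slope coefficients are left untouched, so 1:1 sampling followed by this correction is legitimate, provided $\tau$ is known (or well estimated).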

When you speak of such models being "wrong," then, you must focus on whether the objective is inference (right) or prediction (wrong). This also addresses your question about the ratio of 1's to 0's. The language you tend to see around this topic is that of the "case-control" study, which has been written about extensively. Perhaps my favorite publication on the topic is Breslow and Day's landmark work, which characterized risk factors for rare cancers (previously infeasible to study because of the rarity of the events). Case-control studies do spark some controversy: in particular, the frequent conflation of the OR with the RR (which exaggerates findings) and confusion about the "study base" as an intermediary between the sample and the population. Miettinen provides an excellent critique. No critique, however, has claimed that case-control studies are inherently invalid; how could it? They have advanced public health in innumerable ways. Miettinen also points out that you can even use relative-risk models, or other models, with outcome-dependent sampling and describe the discrepancies between those results and population-level findings in most cases; this is not really worse, since the OR is typically a hard parameter to interpret anyway.

Probably the best and easiest way to overcome oversampling bias in risk predictions is to use weighted likelihood. Scott and Wild discuss weighting and show that it corrects the intercept term and, with it, the model's risk predictions. This is the best approach when there is a priori knowledge about the proportion of cases in the population. If the prevalence of the outcome is actually 1:100 and you sample cases to controls 1:1, you simply weight controls by a factor of roughly 100 to obtain population-consistent parameters and unbiased risk predictions. The downside of this method is that it does not account for uncertainty in the population prevalence when that prevalence has been estimated with error elsewhere. This is a huge area of open research; Lumley and Breslow came very far with theory on two-phase sampling and the doubly robust estimator, which I think is tremendously interesting stuff. Zelig's program seems to be simply an implementation of this weighting (which seems a bit redundant, as R's glm function already accepts weights).
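Here is a minimal sketch of that weighting fix (again a toy example of mine; `pi_pop` is the assumed known population prevalence, and `quasibinomial` is used only to suppress the non-integer-weights warning without changing the point estimates):

```r
## Weighted logistic fit correcting a 1:1 case-control sample,
## assuming the population prevalence of 1's is known to be 1%.
set.seed(7)
n  <- 500
cc <- data.frame(
  y      = rep(c(1, 0), each = n),
  height = c(rnorm(n, 12, 2), rnorm(n, 10, 2))  # cases sit on taller trees
)
pi_pop  <- 0.01          # assumed known prevalence in the population
pi_samp <- mean(cc$y)    # 0.5 by design under 1:1 sampling
cc$w <- ifelse(cc$y == 1,
               pi_pop / pi_samp,               # ~0.02: down-weight cases
               (1 - pi_pop) / (1 - pi_samp))   # ~1.98: up-weight controls
wfit <- glm(y ~ height, family = quasibinomial, data = cc, weights = w)
predict(wfit, newdata = data.frame(height = 11), type = "response")
```

The control-to-case weight ratio here is about 99:1, which matches the "weight controls by roughly 100" heuristic above.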
