When modeling rare events with logistic regression, oversampling is a common way to reduce the computational burden: keep all the rare positive cases but only a subsample of the negative cases. After fitting the model, adding an offset to the intercept term is a common way to correct the predicted event probabilities so that they reflect the original sample proportion. The offset is log( r1*(1-p1) / ((1-r1)*p1) ), where r1 is the proportion of rare events in the oversampled data and p1 is the proportion in the original data (sketched in R below). What is the equivalent formula for multinomial logistic regression, where one or more classes are oversampled?
Solved – Oversampling correction for multinomial logistic regression
Tags: logistic, multinomial-distribution, oversampling, regression
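For the binary case described above, a minimal simulation sketch of the correction might look like the following (the data, the true coefficients, and the 10% sampling fraction are all invented for illustration; in this parameterisation the correction term is subtracted from the fitted intercept):

set.seed(1)
n <- 100000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-4 + x))                                    # rare event; true intercept -4, slope 1
keep <- c(which(y == 1), sample(which(y == 0), sum(y == 0) %/% 10))  # keep all positives, 10% of negatives
fit <- glm(y[keep] ~ x[keep], family = binomial)
p1 <- mean(y)                                    # event proportion in the original data
r1 <- mean(y[keep])                              # event proportion in the oversampled data
offset <- log(r1 * (1 - p1) / ((1 - r1) * p1))   # correction term from the question
coef(fit)[1] - offset                            # corrected intercept, close to -4; the slope needs no correction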
Related Solutions
1.
Yes, the proportions do matter. In logistic regression, the estimated coefficients are log-odds, and when you exponentiate the beta coefficients of the independent variables you get the odds ratio, not the relative risk or something similar. Odds ratio, by definition, is similar to relative risk when the event is uncommon but as the event becomes more common, the odds ratio will be greatly distorted compared to the relative risk. I'll give you two examples, and for each example I'll first calculate the odds ratio and relative risk manually, then estimate the odds ratio via logistic regression:
In the first example we compare the association between an uncommon disease (0/1) and sex (0 = female, 1 = male). We'll assume that the disease is twice as common in males than in females. This gives a 2x2 cross table with the following cell counts:
a. Diseased males: 100
b. Diseased females: 50
c. Healthy males: 1900
d. Healthy females: 1950
The relative risk is (a/(a+c)) / (b/(b+d)) = 2. It is twice as common for men to have the disease.
The odds ratio is (a/c)/(b/d) = 2.05. Quite close to the relative risk. Now a logistic regression model gives:
disease <- c(rep(1, 150), rep(0, 3850))                         # 150 diseased, 3850 healthy
sex <- c(rep(1, 100), rep(0, 50), rep(1, 1900), rep(0, 1950))   # 1 = male, 0 = female, matching the table above
summary(glm(disease ~ sex, family = binomial))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.6636 0.1432 -25.580 < 2e-16 ***
sex 0.7191 0.1762 4.082 4.47e-05 ***
The estimated beta coefficient for sex is 0.7191. When we exponentiate to get the odds ratio we get exp(0.7191) = 2.05.
Ok, so now we'll try another disease condition that is more common, and still more common in males. We'll assume that 30 percent of females and 60 percent of males have the disease, just to make the results extreme:
a. Diseased males: 1200
b. Diseased females: 600
c. Healthy males: 800
d. Healthy females: 1400
The relative risk is (a/(a+c)) / (b/(b+d)) = 2. So the relative risk is unchanged though this disease is far more common.
The odds ratio is (a/c)/(b/d) = 3.5. So now the odds ratio is clearly distorted, almost twice the relative risk!
Running a logistic regression model (in R) confirms this:
disease <- c(rep(1, 1800), rep(0, 2200))                          # 1800 diseased, 2200 healthy
sex <- c(rep(1, 1200), rep(0, 600), rep(1, 800), rep(0, 1400))    # 1 = male, 0 = female, matching the table above
summary(glm(disease ~ sex, family = binomial))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.84730 0.04879 -17.36 <2e-16 ***
sex 1.25276 0.06682 18.75 <2e-16 ***
exp(1.25276) = 3.5
So yes, the proportions do matter because the model estimates (log) odds, not probabilities. The relationship between probability and odds is odds = p/(1-p), and p = odds/(1+odds).
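As a quick check of those identities, converting the fitted log-odds from the second model back to odds and probabilities in R:

b0 <- -0.84730; b1 <- 1.25276   # estimates from the model above
exp(b0)                         # odds of disease for females: p/(1-p) = 0.3/0.7 ≈ 0.43
plogis(b0)                      # probability for females: odds/(1+odds) = 0.30
plogis(b0 + b1)                 # probability for males: 0.60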
2.
I don't understand the idea to "standardize the proportions". The approach suggested by your colleague may or may not be appropriate, depending on what it is that you want to study. I would like to know more about the dependent variable, the independent variables of interest, etc.
I might be in for a real learning treat here, but it seems to me that you're trying to model a problem using two very different distributions.
Poisson-distributed output is a count: integer, non-negative and, in a sense, unbounded. Logistic regression is intended for binary outcomes, i.e. binomial data. The output can look the same at a quick glance, but you have to consider whether you can reasonably define how many trials you're conducting and assign a probability of success to every trial, in which case you have a binomial distribution.
Consider two examples: 1) Model the survival probability of passengers on the Titanic: binomial. You know the number of passengers in every class, i.e. the number of distinct trials, and you know how many survived.
2) Model the number of ear infections per year among different kinds of swimmers: Poisson with an offset. You DO know the number of swimmers in every group (this is the offset in the Poisson model), but you can't reasonably ask how many times you've tested whether a swimmer caught an ear infection or not; you can only count the infections once your chosen time interval is up.
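A minimal sketch of the two model calls in R (all numbers and variable names are invented, purely to contrast the two forms):

# Binomial: a known number of trials per group (passengers per class), successes out of trials
titanic <- data.frame(class = factor(1:3),
                      n     = c(325, 285, 706),
                      surv  = c(200, 119, 181))
glm(cbind(surv, n - surv) ~ class, family = binomial, data = titanic)

# Poisson with an offset: infection counts per group, with the group size as exposure
swim <- data.frame(type       = rep(c("ocean", "pool"), each = 4),
                   n_swimmers = c(60, 80, 45, 70, 120, 95, 150, 110),
                   infections = c(20, 31, 18, 24, 22, 15, 28, 19))
glm(infections ~ type + offset(log(n_swimmers)), family = poisson, data = swim)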
It seems to me that you should clarify what kind of output you're looking at, and after reasoning about what you could expect from that output, decide on the correct family of distributions to model from.
If this does not point you in the right direction then I'm very eager to learn some new statistics tricks.
edit: Literature recommendations seem to be anything related to generalized linear models (not to be confused with general linear models).
Best Answer
Off the cuff, I presume one could proceed as in logistic regression: a generalisation to $K>2$ categories with base category $K$ would be to set the $i$-th correction term to $$\log \frac{r_i \, p_K}{r_K \, p_i},$$ corresponding to the $i$ vs $K$ contrast. For $K=2$, $p_1$ is as before and $p_K = p_2 = 1-p_1$, so it reduces to $$\log \frac{r_1 (1-p_1)}{(1-r_1)\, p_1}.$$
However, I'd be happy to be corrected on this one.
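A simulation sketch of this proposed correction, using nnet::multinom with the first factor level ("A") as the base category (the class labels, true coefficients, and 10% sampling fraction are all invented for illustration):

library(nnet)
set.seed(3)
n <- 50000
x <- rnorm(n)
eta <- cbind(A = 0, B = -1 + 0.5 * x, C = -4 + x)   # class C is the rare one
pr  <- exp(eta) / rowSums(exp(eta))
y   <- factor(apply(pr, 1, function(q) sample(c("A", "B", "C"), 1, prob = q)))

keep <- c(which(y == "C"), sample(which(y != "C"), sum(y != "C") %/% 10))  # keep all of C, 10% of A and B
fit  <- multinom(y[keep] ~ x[keep], trace = FALSE)

p <- prop.table(table(y))          # class proportions in the original data
r <- prop.table(table(y[keep]))    # class proportions in the oversampled data

corr <- log(r[c("B", "C")] * p["A"] / (r["A"] * p[c("B", "C")]))   # log( r_i * p_base / (r_base * p_i) )
cbind(biased = coef(fit)[, 1], corrected = coef(fit)[, 1] - corr)  # compare corrected intercepts to -1 and -4

If the formula above is right, the corrected intercepts should land near the true values, while the slopes are unaffected by the class-wise subsampling.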