Solved – Oversampling correction for multinomial logistic regression

logistic, multinomial-distribution, oversampling, regression

When modeling rare events with logistic regression, oversampling is a common way to reduce computational cost (i.e., keep all the rare positive cases but only a subsample of the negative cases). After fitting the model, a common way to correct the event probability so that it reflects the original sample proportion is to adjust the intercept term by an offset. The offset is equal to log( r1*(1-p1) / ((1-r1)*p1) ), where r1 is the proportion of rare events in the oversampled data and p1 is the proportion in the original data; it is subtracted from the fitted intercept. What is the equivalent formula for multinomial logistic regression, where one or more classes are oversampled?

Best Answer

Off the cuff, I presume one could proceed as in logistic regression: a generalisation to $K>2$ categories with base category $K$ would be to set the correction term for category $i$ to $$\log \frac{r_i p_K}{r_K p_i},$$ corresponding to the $i$ vs $K$ contrast, where $r_i$ and $p_i$ are the proportions of category $i$ in the oversampled and original data respectively. For $K=2$, $p_1$ is as before and $p_K = p_2 = 1-p_1$ (likewise $r_2 = 1-r_1$), so it reduces to $$\log \frac{r_1 (1-p_1)}{(1-r_1) p_1}.$$
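
One way to sanity-check this, assuming the resampling depends only on the class label and not on the covariates $x$: write $P$ for probabilities in the original data and $P_s$ for probabilities in the resampled data (both symbols are introduced here for illustration). Bayes' rule then gives $$P_s(y=i \mid x) \propto P(y=i \mid x)\,\frac{r_i}{p_i},$$ so that $$\log \frac{P_s(y=i \mid x)}{P_s(y=K \mid x)} = \log \frac{P(y=i \mid x)}{P(y=K \mid x)} + \log \frac{r_i p_K}{r_K p_i}.$$ Under that assumption only the intercept of each contrast shifts, and subtracting the term from the intercept fitted on the resampled data should bring it back to the original scale.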

However, I'd be happy to be corrected on this one.
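
For anyone who wants to try this numerically, here is a minimal sketch in plain Python. It assumes the resampling depends only on the class label, that the base category is listed last, and that the correction is subtracted from each fitted intercept. The function names `intercept_corrections` and `softmax_with_base`, the class proportions, and the linear predictors are all made up for illustration; they do not come from any particular library or dataset.

```python
import math


def intercept_corrections(r, p):
    """Correction term for each class i versus the base class (last entry).

    r: class proportions in the resampled training data
    p: class proportions in the original data
    Returns log((r_i * p_K) / (r_K * p_i)) for i = 1..K-1.
    """
    r_K, p_K = r[-1], p[-1]
    return [math.log((r_i * p_K) / (r_K * p_i))
            for r_i, p_i in zip(r[:-1], p[:-1])]


def softmax_with_base(logits):
    """Class probabilities from K-1 logits, with the base class logit fixed at 0."""
    full = list(logits) + [0.0]
    m = max(full)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in full]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical proportions: classes 1 and 2 are rare, class 3 is the base.
p = [0.01, 0.04, 0.95]  # original data
r = [0.25, 0.25, 0.50]  # resampled data

corrections = intercept_corrections(r, p)

# Hypothetical linear predictors (intercept + x'beta) from a model fitted
# on the resampled data, one per non-base class.
sampled_logits = [0.3, -0.2]

# Subtract each correction term from the corresponding linear predictor.
corrected_logits = [z - c for z, c in zip(sampled_logits, corrections)]
corrected_probs = softmax_with_base(corrected_logits)

# Cross-check: reweight the resampled-scale probabilities by p_i / r_i
# and renormalise; this should give the same probabilities.
sampled_probs = softmax_with_base(sampled_logits)
weighted = [q * p_i / r_i for q, p_i, r_i in zip(sampled_probs, p, r)]
reweighted_probs = [w / sum(weighted) for w in weighted]
```

The two routes (shifting the intercepts, or reweighting the predicted probabilities by $p_i / r_i$ and renormalising) agree, which is at least consistent with the proposed formula. With two classes, `intercept_corrections([r1, 1 - r1], [p1, 1 - p1])` returns the familiar binary offset.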