Solved – Is a logistic regression biased when the outcome variable is split 5% – 95%

logistic, modeling

I am building a propensity model using logistic regression for a utility client.
My concern is that out of the total sample my 'bad' accounts are just 5%, and the rest are all good.
I am predicting 'bad'.

  • Will the result be biased?
  • What is the optimal 'bad to good' proportion for building a good model?

Best Answer

I disagreed with the other answers in the comments, so it's only fair I give my own. Let $Y$ be the response (good/bad accounts), and $X$ be the covariates.

For logistic regression, the model is the following:

$\log\left(\frac{p(Y=1|X=x)}{p(Y=0|X=x)}\right)= \alpha + \sum_{i=1}^k x_i \beta_i $
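To make the formula concrete, here is a minimal Python sketch that inverts the log-odds to recover $p(Y=1|X=x)$. The function name `prob_bad`, the single covariate, and all numeric values are illustrative assumptions, not taken from the question.

```python
import math

def prob_bad(alpha, betas, x):
    """Return p(Y=1 | X=x) implied by the logistic model above."""
    log_odds = alpha + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values: an intercept of log(0.05/0.95) gives a 5% probability
# of 'bad' when every covariate is zero.
alpha = math.log(0.05 / 0.95)
baseline = prob_bad(alpha, [0.8], [0.0])
shifted = prob_bad(alpha, [0.8], [1.0])  # same account, covariate raised by one unit
print(round(baseline, 4), round(shifted, 4))
```

Note that the 5% base rate only enters through $\alpha$; the $\beta_i$ describe how the odds change with the covariates.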

Think about how the data might be collected:

  • You could select the observations randomly from some hypothetical "population"
  • You could select the data based on $X$, and see what values of $Y$ occur.

Both of these are okay for the above model, as you are only modelling the distribution of $Y|X$. Either design would be called a prospective study.

Alternatively:

  • You could select the observations based on $Y$ (say 100 of each), and see the relative prevalence of $X$ (i.e. you are stratifying on $Y$). This is called a retrospective or case-control study.

(You could also select the data based on $Y$ and certain variables of $X$: this would be a stratified case-control study, and is much more complicated to work with, so I won't go into it here).

There is a nice result from epidemiology (see Prentice and Pyke (1979)) that for a case-control study, the maximum likelihood estimates for $\beta$ can be found by logistic regression, that is, by using the prospective model on retrospective data.
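A small simulation can illustrate this result. The sketch below is only a rough check under assumed settings: one standard-normal covariate, made-up "population" parameters `TRUE_A` and `TRUE_B` chosen so that roughly 5% of accounts are bad, and a hand-written Newton-Raphson fitter (`fit_logistic`) standing in for whatever software you would normally use.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, iters=25):
    """Fit log-odds = a + b*x by Newton-Raphson; return (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

random.seed(0)
TRUE_A, TRUE_B = -3.3, 1.0  # assumed population parameters (hypothetical)
xs = [random.gauss(0.0, 1.0) for _ in range(20000)]
ys = [1 if random.random() < sigmoid(TRUE_A + TRUE_B * x) else 0 for x in xs]

# Retrospective sample: keep every bad account and an equal number of good ones.
bad = [x for x, y in zip(xs, ys) if y == 1]
good = [x for x, y in zip(xs, ys) if y == 0]
good_sub = random.sample(good, len(bad))
cc_xs = bad + good_sub
cc_ys = [1] * len(bad) + [0] * len(good_sub)

a_full, b_full = fit_logistic(xs, ys)
a_cc, b_cc = fit_logistic(cc_xs, cc_ys)
print("full sample  : alpha=%.2f beta=%.2f" % (a_full, b_full))
print("case-control : alpha=%.2f beta=%.2f" % (a_cc, b_cc))
```

In a run like this the two slope estimates should be close to each other, while the case-control intercept is shifted upward because bad accounts are over-represented.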

So how is this relevant to your problem?

Well, it means that if you are able to collect more data, you could just look at the bad accounts and still use logistic regression to estimate the $\beta_i$'s (but you would need to adjust $\alpha$ to account for the over-representation). Say each extra account costs one dollar to collect; then this might be more cost-effective than simply looking at all accounts.
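The adjustment to $\alpha$ follows from the sampling fractions. If a fraction $\pi_1$ of the bad accounts and a fraction $\pi_0$ of the good accounts end up in the sample (with selection depending only on $Y$), the log-odds in the sample equal the population log-odds plus $\log(\pi_1/\pi_0)$, so the fitted intercept estimates $\alpha + \log(\pi_1/\pi_0)$. A minimal sketch of the correction, where the function name `correct_intercept` and all numbers are hypothetical:

```python
import math

def correct_intercept(alpha_sample, frac_bad_kept, frac_good_kept):
    """Undo the intercept shift caused by sampling on Y.

    frac_bad_kept / frac_good_kept: share of the population's bad / good
    accounts that ended up in the modelling sample (hypothetical inputs).
    """
    return alpha_sample - math.log(frac_bad_kept / frac_good_kept)

# Hypothetical example: the population is 5% bad; all bad accounts are kept
# and good accounts are thinned until the sample is 50/50.
frac_bad_kept = 1.0
frac_good_kept = 0.05 / 0.95
alpha_sample = 0.0  # made-up intercept fitted on the balanced sample
alpha_population = correct_intercept(alpha_sample, frac_bad_kept, frac_good_kept)
print(round(alpha_population, 3))
```

This requires knowing, or being able to estimate, the sampling fractions; the $\beta_i$ need no adjustment.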

But on the other hand, if you already have ALL possible data, there is no point in stratifying: you would simply be throwing away data (giving worse estimates), and then be left with the problem of trying to estimate $\alpha$.
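To get a feel for how much is lost, the sketch below compares the approximate standard error of a slope when all accounts are used versus a balanced subsample. It is a rough illustration only: the parameters `A` and `B`, the single standard-normal covariate, and the helper `slope_se` are assumptions, and the standard errors are read off the Fisher information at the assumed true parameters rather than from a fitted model.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def slope_se(xs, a, b):
    """Approximate SE of the slope from the Fisher information at (a, b)."""
    h00 = h01 = h11 = 0.0
    for x in xs:
        p = sigmoid(a + b * x)
        w = p * (1.0 - p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
    # Slope entry of the inverse 2x2 information matrix.
    return math.sqrt(h00 / (h00 * h11 - h01 * h01))

random.seed(1)
A, B = -3.3, 1.0  # assumed population parameters (hypothetical), roughly 5% bad
xs = [random.gauss(0.0, 1.0) for _ in range(20000)]
ys = [random.random() < sigmoid(A + B * x) for x in xs]
bad = [x for x, y in zip(xs, ys) if y]
good = [x for x, y in zip(xs, ys) if not y]
good_sub = random.sample(good, len(bad))  # discard most of the good accounts

# In the balanced subsample the intercept is shifted by the log ratio of
# sampling fractions (all bad kept, a fraction of good kept).
shift = math.log(1.0 / (len(good_sub) / len(good)))
se_full = slope_se(xs, A, B)
se_sub = slope_se(bad + good_sub, A + shift, B)
print("slope SE, all data          : %.3f" % se_full)
print("slope SE, balanced subsample: %.3f" % se_sub)
```

Under these assumptions the full data set gives the smaller standard error, which is the sense in which discarding good accounts gives worse estimates.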
