Regression Comparison – Probit vs. Logistic Regression in Machine Learning

classification · logistic · machine learning · probit

Why is the probit model not as popular as logistic regression for binary classification in the machine learning community? It is hardly mentioned, if at all, in serious textbooks on the topic.

Best Answer

The probit link arose, in ancient times, from the idea of a latent continuous variable with a Normal distribution. This was natural in toxicology, for example, where you might think about the reason some flies died and others survived in terms of differences in individual sensitivity drawn from some distribution.

In maths: $$Y^*=\alpha+\beta X+\epsilon$$ with $\epsilon\sim N(0,1)$, followed by $Y=\mathbb{1}\{Y^*>0\}$, which gives $$\Phi^{-1}(P(Y=1\mid X=x))=\alpha+\beta x$$
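A quick way to see this formulation in action is to simulate it. The sketch below is my own illustration, not part of the original answer; the parameter values and the use of `statsmodels` are assumptions. It generates data from the latent-variable model and checks that a probit fit roughly recovers the coefficients.

```python
import numpy as np
import statsmodels.api as sm

# Simulate the latent-variable probit model; alpha, beta, n are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n, alpha, beta = 10_000, -0.5, 1.2

x = rng.normal(size=n)
eps = rng.normal(size=n)                        # eps ~ N(0, 1)
y = (alpha + beta * x + eps > 0).astype(int)    # Y = 1{Y* > 0}

# A probit fit should recover (alpha, beta) approximately.
fit = sm.Probit(y, sm.add_constant(x)).fit(disp=0)
print(fit.params)                               # roughly [-0.5, 1.2]
```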

Back then, looking up a probit wasn't that much slower than computing a logit (on a mechanical calculator or slide rule), so there wasn't much computational difference, and a Normal latent variable seemed natural.

The logit link was known to be not all that different from the probit, apart from a scale factor of roughly $\pi/\sqrt{3}$ in the coefficients, so there wasn't much need to have both models lying around. Some fields used probits, some used logits (e.g. epidemiology, because of the nice properties of the odds ratio with respect to case-control sampling and the arguably simpler coefficient interpretation).
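To make the "not all that different" point concrete, here is a small numerical check (the grid of points is just an illustrative choice): the logistic CDF is close to a Normal CDF rescaled by $\pi/\sqrt{3}\approx 1.81$, the standard deviation of the standard logistic distribution, which is where the rough correspondence between logit and probit coefficients comes from.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

# The standard logistic distribution has standard deviation pi/sqrt(3),
# so expit(x) should be close to the Normal CDF evaluated at x / (pi/sqrt(3)).
x = np.linspace(-6, 6, 241)
scale = np.pi / np.sqrt(3)
print(np.max(np.abs(expit(x) - norm.cdf(x / scale))))  # about 0.02 at worst
```

In practice a factor closer to 1.6–1.7 matches the two curves a little better near the middle, which is why fitted logit coefficients typically come out around 1.6–1.8 times the corresponding probit coefficients.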

When we got to generalised linear models and computers, the probit was a bit inconvenient: logs and exponentials are going to be readily available in your favourite programming language, but the Normal quantile and CDF may not be. There are also speed issues: log and exp became available in hardware floating point units, and later in GPUs.

Because the logit and probit are not very different, it's hard to find applications where it makes a lot of practical difference which one you use. The main sanctuary for the endangered probit model was in settings where the Normal latent variable makes calculations easier.

For example, Charles McCulloch fitted random-effects probit models by writing the model in terms of latent variables $$Y^*=\alpha+\beta X+u+\epsilon$$ with $\epsilon\sim N(0,1)$ and $u\sim N(0,\tau^2)$, followed by $Y=\mathbb{1}\{Y^*>0\}$, which gives $$\Phi^{-1}(P(Y=1\mid X=x, U=u))=\alpha+\beta x+u$$ There's a clever EM algorithm for fitting this model, treating $u$ as missing data, where the M-step looks like a linear mixed model and the E-step samples the latent variables from their distribution conditional on the observed variables.
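As a concrete illustration of why the Normal latent variable helps here: conditional on the observed $Y$ and the current linear predictor, $Y^*$ is just a truncated Normal, so that sampling step is easy. The sketch below shows only that ingredient; the function name and example values are made up for illustration, and this is not McCulloch's full algorithm.

```python
import numpy as np
from scipy.stats import truncnorm

# Conditional on Y and the current linear predictor eta = alpha + beta*x + u,
# Y* ~ N(eta, 1) truncated to (0, inf) if Y = 1 and to (-inf, 0] if Y = 0.
def sample_latent(y, eta, rng):
    lower = np.where(y == 1, -eta, -np.inf)   # bounds on the standardised N(0,1) scale
    upper = np.where(y == 1, np.inf, -eta)
    return eta + truncnorm.rvs(lower, upper, random_state=rng)

rng = np.random.default_rng(1)
eta = np.array([-0.3, 0.8, 1.5])
y = np.array([0, 1, 1])
print(sample_latent(y, eta, rng))   # draws consistent with the observed 0/1 outcomes
```

Updating $u$ given a draw of $Y^*$ is then a standard Normal–Normal calculation, which is exactly the "simple conditional distributions" point made next.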

It's easier to do this sort of thing with a probit link because the latent Normal distribution of $\epsilon$ has simple convolutions and conditional distributions when combined with other Normal latent variables. And since the model won't be very different from (a rescaling of) a logit model, it still makes sense to use the probit model in settings where you'd otherwise want a logit model.

Machine learning doesn't seem to have a lot of models where it's necessary to do this sort of clever maths with latent variables, so there's not as much to balance the computational and interpretation advantages of the logit link.
