To decide whether to use a logit, probit, or linear probability model, I compared the marginal effects of the logit/probit models to the coefficients of the variables in the linear probability model. However, since they are not similar, I am not sure how to choose the model that fits best.
Solved – How to choose between logit, probit or linear probability model
econometrics, generalized linear model, logistic, probit
Related Solutions
I think a better way to see the marginal effect of a given variable, say $X_j$, is to produce a scatter plot with the predicted probability on the vertical axis and $X_j$ on the horizontal axis. This is the most "layman" way I can think of to indicate how influential a given variable is. No maths, just pictures. If you have a lot of data points, then a boxplot or scatterplot smoother may help show where most of the data is (as opposed to just a cloud of points).
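A minimal sketch of that plot in Python, using simulated data and made-up logit coefficients (`beta0`, `beta1` are hypothetical, standing in for whatever your fitted model produced):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=500)                         # the covariate X_j
beta0, beta1 = -0.3, 1.2                         # hypothetical fitted coefficients
p_hat = 1 / (1 + np.exp(-(beta0 + beta1 * x)))   # predicted probabilities

fig, ax = plt.subplots()
ax.scatter(x, p_hat, s=8, alpha=0.4)
ax.set_xlabel("$X_j$")
ax.set_ylabel("Predicted probability")
fig.savefig("marginal_effect_scatter.png")
```

The slope of the resulting S-shaped point cloud at any value of $X_j$ is exactly the marginal effect discussed below, which is why the picture is informative without any calculus.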
Not sure how "Layman" the next section is, but you may find it useful.
If we look at the marginal effect, call it $m_j$, noting that $g(p)=\sum_kX_k\beta_k$, we get
$$m_j=\frac{\partial p}{\partial X_j}=\frac{\beta_j}{g'\left[g^{-1}(X^T\beta)\right]}=\frac{\beta_j}{g'(p)}$$
So the marginal effect depends on the estimated probability and the gradient of the link function in addition to the beta. The dividing by $g'(p)$, comes from the chain rule for differentiation, and the fact that $\frac{\partial g^{-1}(z)}{\partial z}=\frac{1}{g'\left[g^{-1}(z)\right]}$. This can be shown by differentiating both sides of the obviously true equation $z=g\left[g^{-1}(z)\right]$. We also have that $g^{-1}(X^T\beta)=p$ by definition. For a logit model, we have $g(p)=\log(p)-\log(1-p)\implies g'(p)=\frac{1}{p}+\frac{1}{1-p}=\frac{1}{p(1-p)}$, and the marginal effect is:
$$m_j^{logit}=\beta_jp(1-p)$$
What does this mean? Well, $p(1-p)$ is zero at $p=0$ and at $p=1$, and it reaches its maximum value of $0.25$ at $p=0.5$. So the marginal effect is greatest when the probability is near $0.5$, and smallest when $p$ is near $0$ or near $1$. However, $p(1-p)$ still depends on $X_j$, so the marginal effects are complicated. In fact, because it depends on $p$, you will get a different marginal effect for different values of $X_k,\;k\neq j$. Possibly one good reason to just do that simple scatter plot - you don't need to choose which values of the covariates to use.
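The "hump" shape of the logit marginal effect is easy to verify numerically; a quick sketch (with $\beta_j = 1$ for illustration):

```python
import numpy as np

def logit_marginal_effect(beta_j, p):
    """Marginal effect of X_j in a logit model: beta_j * p * (1 - p)."""
    return beta_j * p * (1.0 - p)

# Evaluate on a grid of probabilities from 0.001 to 0.999 in steps of 0.001
p = np.linspace(0.001, 0.999, 999)
me = logit_marginal_effect(1.0, p)

print(round(float(me.max()), 4))        # → 0.25, the peak of the hump
print(round(float(p[me.argmax()]), 2))  # → 0.5, the probability where it peaks
```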
For a probit model, we have $g(p)=\Phi^{-1}(p)\implies g'(p)=\frac{1}{\phi\left[\Phi^{-1}(p)\right]}$ where $\Phi(\cdot)$ is the standard normal CDF and $\phi(\cdot)$ is the standard normal pdf. So we get:
$$m_j^{probit}=\beta_j\phi\left[\Phi^{-1}(p)\right]$$
Note that this has most of the properties of the $m_j^{logit}$ marginal effect I discussed earlier, and the same is true of any link function which is symmetric about $0.5$ (and sane, of course, e.g. $g(p)=\tan\left(\frac{\pi}{2}[2p-1]\right)$). The dependence on $p$ is more complicated, but still has the general "hump" shape (highest point at $0.5$, lowest at $0$ and $1$). The link function changes the maximum height (e.g. the probit maximum is $\frac{1}{\sqrt{2\pi}}\approx 0.4$, the logit maximum is $0.25$) and how quickly the marginal effect tapers towards zero.
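The two hump heights quoted above can be checked directly; a short sketch comparing the weight that multiplies $\beta_j$ under each link:

```python
import numpy as np
from scipy.stats import norm

p = np.linspace(0.001, 0.999, 999)
logit_w = p * (1 - p)              # logit weight on beta_j: p(1-p)
probit_w = norm.pdf(norm.ppf(p))   # probit weight on beta_j: phi(Phi^{-1}(p))

print(round(float(logit_w.max()), 4))   # → 0.25, at p = 0.5
print(round(float(probit_w.max()), 4))  # → 0.3989 ≈ 1/sqrt(2*pi), at p = 0.5
```

Both weights peak at $p=0.5$ and vanish at the endpoints; only the height and the rate of tapering differ.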
The question of what model to use has to do with the objective of the analysis.
If the objective is to develop a classifier to predict binary outcomes, then (as you can see) these three models are all approximately the same and give you approximately the same classifier. That makes the choice a moot point: you don't care which model produces your classifier, and you might use cross-validation or split-sample validation to determine which model performs best on similar data.
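As an illustration of how little the choice matters for classification, here is a sketch with simulated data comparing a logistic classifier to a linear probability model thresholded at 0.5 (the data-generating coefficients are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

# Simulate binary outcomes from a logit data-generating process
rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 1))
p = 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 0])))
y = rng.binomial(1, p)

# Classifier 1: logistic regression, predicted class at p > 0.5
logit_pred = LogisticRegression().fit(X, y).predict(X)

# Classifier 2: linear probability model, thresholded at 0.5
lpm_pred = (LinearRegression().fit(X, y).predict(X) > 0.5).astype(int)

agreement = (logit_pred == lpm_pred).mean()
print(round(float(agreement), 3))  # typically very close to 1
```

In practice one would compare out-of-sample performance (cross-validation) rather than in-sample agreement, but the point stands: the classifiers are nearly interchangeable.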
For inference, the three models estimate different parameters. All three regression models are special cases of GLMs, which use a link function and a variance structure to determine the relationship between a binary outcome and (in this case) a continuous predictor. NLS and logistic regression use the same link function (the logit), but NLS minimizes squared error when fitting the S-curve, whereas logistic regression is a maximum likelihood estimate under the assumption of a linear model for the log odds and a binomial distribution for the observed outcomes. I can't think of a reason why we'd consider NLS useful for inference.
Probit regression uses a different link function: the inverse of the cumulative normal distribution function. Its tails "taper" faster than the logit's, and it is often used to make inference on binary data that arises as a binary threshold of unobserved, continuous, normally distributed outcomes.
Empirically, the logistic regression model is used far more often for the analysis of binary data, since the exponentiated coefficients (odds ratios) are easy to interpret, it is a maximum likelihood technique, and it has good convergence properties.
Best Answer
Modeling a dichotomous outcome using linear regression is a big no-no: the error terms will not be normally distributed, there will be heteroskedasticity, and predicted values can fall outside the logical bounds of 0 and 1.
Logit and probit differ in the assumed underlying distribution. Logit assumes the latent distribution is logistic; probit assumes it is normal. In both cases the observed outcome either happens or it doesn't, but this reflects whether a certain threshold is met by an underlying, continuously distributed latent variable.
In practice the end result of these different distributional assumptions is that coefficients differ, usually by a factor of about 1.6. However, if you look at marginal effects (meaning the effects on the predicted mean of the outcome holding other covariates at the mean or averaging over observed values) the logit and probit models will make essentially the same predictions. So if you're looking at marginal effects the choice probably doesn't matter.
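Both claims - coefficients differing by roughly 1.6, marginal effects nearly identical - can be checked on simulated data. A sketch fitting both models by maximum likelihood with `scipy` (the data-generating coefficients are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulate binary data from a logit data-generating process
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0]))))
y = rng.binomial(1, p_true)

def neg_loglik(beta, cdf):
    """Negative Bernoulli log-likelihood under a given link CDF."""
    p = np.clip(cdf(X @ beta), 1e-10, 1 - 1e-10)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

logit_cdf = lambda z: 1 / (1 + np.exp(-z))
b_logit = minimize(neg_loglik, [0.0, 0.0], args=(logit_cdf,)).x
b_probit = minimize(neg_loglik, [0.0, 0.0], args=(norm.cdf,)).x

# Coefficient ratio: typically around 1.6
print(round(float(b_logit[1] / b_probit[1]), 2))

# Average marginal effects of x under each model: nearly identical
p_l = logit_cdf(X @ b_logit)
ame_logit = float((b_logit[1] * p_l * (1 - p_l)).mean())
ame_probit = float((b_probit[1] * norm.pdf(X @ b_probit)).mean())
print(round(ame_logit, 3), round(ame_probit, 3))
```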
On the other hand, if you're not going to calculate the margins, then logit has the obvious advantage of generating coefficients that can be transformed into the familiar odds ratio by exponentiating the coefficient. Probit coefficients are essentially uninterpretable - given a probit model I would report average marginal effects for this very reason. Of course, most people improperly interpret odds ratios as probabilities, which is a big no-no. The odds of an outcome occurring is the ratio of successes to failures (an odds of 1 corresponds to a probability of .5). Odds RATIOS, then, reflect the predicted multiplicative change in the odds given a 1-unit change in the predictor. Thus, the odds ratio reflects change relative to the base odds of the outcome occurring. For an outcome that either rarely occurs or almost always occurs, a small change in probability can correspond to a large odds ratio. Odds ratios are a ratio of ratios, which can be quite confusing, and so we arrive at another reason to report marginal effects in the context of a logit model.
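To make the odds-ratio arithmetic concrete, here is a small sketch with a hypothetical logit coefficient showing how the same odds ratio translates into very different probability changes at different base rates:

```python
import math

beta = 0.7                       # hypothetical logit coefficient
odds_ratio = math.exp(beta)      # multiplicative change in the odds per 1-unit change
print(round(odds_ratio, 2))      # → 2.01: the odds roughly double

# The same odds ratio applied at two different base probabilities
diffs = []
for p0 in (0.5, 0.95):
    odds0 = p0 / (1 - p0)        # base odds
    odds1 = odds0 * odds_ratio   # new odds after a 1-unit change
    p1 = odds1 / (1 + odds1)     # back to a probability
    diffs.append(round(p1 - p0, 3))
print(diffs)  # a large probability change at p0=0.5, a tiny one at p0=0.95
```

This is exactly the "change relative to the base odds" point above: a doubling of the odds moves the probability a lot near 0.5 and barely at all near 1.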
So, to summarize, don't use a linear probability model. Use logit or probit and report the marginal effects. The choice is, perhaps, of theoretical significance but probably of no practical consequence if reporting marginal effects. If you're not going to report marginal effects then use logit but be sure to properly interpret the odds ratios so you don't look like an uninformed idiot.