Regression – How to Assess Violations of Gauss-Markov Assumptions in a Linear Probability Model

regression

When fitting a multiple-regressor Linear Probability Model (LPM), i.e., predicting a DV that is either 0 or 1 and interpreting the LPM's prediction as a probability, I can use an OLS estimator to calculate the regression coefficients.

$$Y = \alpha + \sum_j \beta_j X_j + \varepsilon\\
\widehat Y = \widehat\alpha + \sum_j \widehat\beta_j X_j$$
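
For concreteness, here is a minimal sketch of what I mean, with simulated data (the variable names and coefficient values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(0, 1, n)
p = 0.2 + 0.3 * x1 + 0.3 * x2          # true P(Y = 1 | X), stays inside [0, 1]
y = rng.binomial(1, p)                  # binary DV

# OLS: beta_hat minimizes ||y - X beta||^2, solved by least squares
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat                    # fitted values, read as probabilities

print(beta_hat)                         # roughly [0.2, 0.3, 0.3]
```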

The Gauss-Markov theorem states that OLS estimators are BLUE provided that the error random variables $\varepsilon_i$:

1.) are uncorrelated,

2.) have mean zero $\operatorname E[\varepsilon_i] = 0$,

3.) are homoscedastic.

Question 1: Regarding an LPM, I know that $\operatorname{Var}(\varepsilon) = \widehat Y (1 - \widehat Y)$. Therefore, the variance of the error term depends on the value of $\widehat Y$ and thereby also on the values of $X_j$. Hence, the error term is heteroscedastic and an LPM inevitably violates 3.). Is that correct?

Question 2: What about 1.) and 2.)? Can an LPM comply with these conditions of the Gauss-Markov theorem or are they also violated in all cases?

Best Answer

Indeed, an LPM necessarily violates Assumption 3. Because that assumption is violated, the OLS estimator of an LPM is not efficient and hence not BLUE: it is no longer the Best among Linear Unbiased Estimators.
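
A minimal simulation makes the violation visible (the data-generating process and all parameter values below are illustrative assumptions): generate binary outcomes from a linear probability, fit OLS, and compare the residual variance across ranges of the fitted value with the theoretical $\widehat Y (1 - \widehat Y)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(0, 1, n)
p = 0.1 + 0.8 * x                      # true P(y = 1 | x), linear by construction
y = rng.binomial(1, p)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
resid = y - y_hat

# Empirical residual variance within bins of y_hat versus the theoretical
# y_hat * (1 - y_hat): the variance clearly moves with the fitted value.
for lo, hi in [(0.1, 0.3), (0.4, 0.6), (0.7, 0.9)]:
    m = (y_hat >= lo) & (y_hat < hi)
    print(f"y_hat in [{lo}, {hi}): var = {resid[m].var():.3f}, "
          f"theory = {(y_hat[m] * (1 - y_hat[m])).mean():.3f}")
```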

Nothing about the LPM necessarily violates Assumptions 1 or 2; the LPM estimator will still be unbiased. You might ask: how can it be unbiased when it produces predicted values outside the 0-1 range? Great question.

That the LPM estimator is unbiased means the following: if you were to take infinitely many samples of the size of your sample and estimate the betas for each of those samples using the LPM, the sampling distribution of each beta would be centered at its true population value. In other words, the expected value of the LPM estimator is the population value. Given that the LPM produces predicted values outside the 0-1 range, doesn't that mean the estimator is biased? No. The reason is that the LPM is a linear model, and assuming $\operatorname E(u \mid x_1, x_2, \ldots, x_k) = 0$ (where $u$ is the error term) means you are assuming that the true population regression is:

$$\operatorname E(y \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,$$

which, when y is binary, is equivalent to assuming that:

$$P(y = 1 \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,$$

which, in words, means you are assuming that $P(y = 1 \mid x)$ is a linear function of the x's. But because a probability can neither exceed 1 nor fall below 0, the effect of x on the probability that $y = 1$ must eventually be nonlinear, and so your LPM is mis-specified. The expected value of the LPM estimator is still the population value, just for a model that is mis-specified. Think of it this way: the LPM still does its job well (it is unbiased), but not the job you wanted it to do.

Note that the LPM is nothing but OLS. It has a different name because, with a binary dependent variable, we can interpret $\operatorname E(y \mid x)$ as $P(y = 1 \mid x)$. Only the lens through which we look at OLS changes; nothing about the procedure itself changes. It is the same old OLS, applied to a binary dependent variable. So in the LPM we switch to an interpretation in terms of probabilities, but we use the same old OLS, and OLS will return predicted values that are perfectly legitimate from its own point of view, even if they fall outside the 0-1 range.
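
As a sanity check on the unbiasedness claim, here is a minimal Monte Carlo sketch (the sample size, number of replications, and coefficients are illustrative assumptions): when the true $P(y = 1 \mid x)$ really is linear, the average of the LPM estimates across repeated samples sits at the population values.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 5_000
b0, b1 = 0.2, 0.5                      # true P(y = 1 | x) = 0.2 + 0.5 x on x in [0, 1]
estimates = np.empty((reps, 2))

for r in range(reps):
    x = rng.uniform(0, 1, n)
    y = rng.binomial(1, b0 + b1 * x)
    X = np.column_stack([np.ones(n), x])
    estimates[r], *_ = np.linalg.lstsq(X, y, rcond=None)

# The mean of the sampling distribution sits at the population values,
# which is exactly what "unbiased" means.
print(estimates.mean(axis=0))          # approximately [0.2, 0.5]
```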

The main reason for not using the LPM is that it produces predicted values outside the 0-1 range. A related issue is that a probability, which must lie between 0 and 1, cannot be linearly related to the independent variables for all of their possible values. Nevertheless, there are valid applications of the LPM.
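
To see the out-of-range predictions concretely, here is a minimal sketch (simulated data; the logistic true model and all parameter values are illustrative assumptions): when the true probability follows a nonlinear curve, the OLS line must escape the unit interval for extreme regressor values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-2 * x))           # logistic truth, which the LPM mis-specifies
y = rng.binomial(1, p)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

# A nontrivial share of fitted "probabilities" falls outside [0, 1].
print(f"share below 0: {(y_hat < 0).mean():.3f}, "
      f"share above 1: {(y_hat > 1).mean():.3f}")
```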