Solved – Linear Regression with Maximum Likelihood or OLS + Logistic Regression

logistic, maximum likelihood, regression

As far as I know, regression parameters can be estimated, among other methods, with maximum likelihood (MLE) or with ordinary least squares (OLS). If the assumption of normally distributed residuals is fulfilled, both methods lead to identical parameter estimates.

This raises some questions for me:

  1. For efficient parameter estimation by OLS, the residuals must be normally distributed. However, this is not a necessary assumption of MLE estimation. Is MLE estimation therefore always preferable to OLS estimation when the residuals are not normally distributed?

  2. Suppose I want to analyze the relationship between the sales price of my car and its mileage with a linear regression analysis, and I set up the following model: $\text{Sales price}_i = \beta_0 + \beta_1 x_i + \epsilon_i$
    Let's assume that the assumptions of the OLS estimation are satisfied. Consequently, OLS and MLE estimation would lead to the same estimation result. Are the coefficients of an MLE estimation interpreted in the same way as those of an OLS estimation?

  3. I can use the MLE method to estimate the parameters of an assumed probability distribution. To do this, a certain probability distribution is assumed (e.g., the normal distribution). Which distribution is assumed when the parameters of a regression model are estimated by MLE? In logistic regression, is it the Bernoulli distribution, because the Y variable can only take two values?
  4. MLE estimation also plays a role in logistic regression, where the Y variable is binary and can only take two values. Am I correct in my understanding of the procedure: first, the MLE method is used to estimate the coefficients of a linear regression; MLE is used here because the residuals are not normally distributed for a binary random variable. In the next step, the estimated $Y_i$ values are plugged into the logistic distribution function. The logistic function can be understood as a transformation function; it ensures that the function values lie between 0 and 1.

Thanks!

Best Answer

I think there is a lot of confusion here. First, I want to remind you that OLS and MLE are statistical algorithms for estimating parameters from data. OLS says: to get the parameter estimates for a linear model, find those that minimize the sum of the squared residuals. MLE says: to get the parameter estimates for a model, find those that maximize the likelihood, which is a function that depends on distributional assumptions proposed by the analyst.

It turns out that for a linear model, the model coefficients estimated by OLS are identical to those estimated using MLE, because maximizing the likelihood is equivalent to minimizing the sum of the squared residuals when the user programs MLE in a specific way, that is, assuming the conditional density of the outcome (i.e., the density of the error) is normal. I'll call this specific application of MLE $\text{MLE}_{lge}$, i.e., MLE for a linear model with Gaussian errors. $\text{MLE}_{lge}$ corresponds to finding the parameter estimates $\beta$ and $\sigma$ such that $L_{lge}(\beta,\sigma)$ is maximized, where $$L_{lge}(\beta,\sigma)= \prod\limits_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i-\mu_i)^2}{2\sigma^2}\right)$$ and $\mu_i=g(X_i)=X_i\beta$. It turns out that the $\beta$ estimates that maximize $L_{lge}(\beta,\sigma)$ are exactly the same ones that minimize the sum of the squared residuals (though the $\sigma$ estimates are different).
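
To make the equivalence concrete, here is a minimal sketch (mine, not part of the original answer) that fits the same simulated data by OLS and by numerically maximizing $L_{lge}$; the simulated data and true coefficients are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative simulated data (assumption): y = 2 + 3x + normal noise
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])   # design matrix with an intercept column
y = 2 + 3 * x + rng.normal(0, 1.5, n)

# OLS: minimize the sum of squared residuals (least squares solution)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# MLE_lge: maximize the Gaussian likelihood, i.e., minimize the negative log-likelihood
def neg_log_lik(params):
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)          # optimize log(sigma) to keep sigma positive
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

start = np.concatenate([np.zeros(X.shape[1]), [0.0]])
beta_mle = minimize(neg_log_lik, start, method="BFGS").x[:-1]

print(beta_ols)   # intercept and slope from OLS
print(beta_mle)   # essentially identical intercept and slope from MLE_lge
```

The $\beta$ estimates agree to numerical precision; only the $\sigma$ estimates differ slightly, as noted above.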

  1. MLE is consistent when the likelihood is correctly specified. For linear regression, the likelihood is usually specified assuming a normal distribution for the errors (i.e., as $L_{lge}(\beta,\sigma)$ above). $\text{MLE}_{lge}$ is not even necessarily consistent when the errors are not normally distributed. OLS is at least consistent (and unbiased) even when the errors are not normally distributed. Because the $\beta$ estimates resulting from OLS and $\text{MLE}_{lge}$ are identical, it doesn't matter which one you use in the face of non-normality (though, again, the $\sigma$ estimates will differ).

  2. The interpretation of parameter estimates has nothing to do with the method used to estimate them. I could pull a number out of a hat and call it the slope and it would have the same interpretation as an estimate resulting from a more legitimate method (like OLS). I go into detail about this here. The interpretation of parameter estimates comes from the model, not the method used to estimate them.

  3. The consistency of MLE depends on correct specification of the likelihood function, which is related to the density of the outcome given the covariates. For $\text{MLE}_{lge}$, we assume the density of each outcome is a normal distribution with mean $X_i\beta$ and variance $\sigma^2$. For binary outcomes, it often makes the most sense to think that each outcome has a Bernoulli distribution with probability parameter $p_i = g(X_i)$, where $g(X_i)$ is the logistic (inverse-logit) function $\frac{1}{1+\exp(-X_i \beta)}$ for logistic regression or the normal CDF for probit regression, but one can also think that the outcome has a Poisson distribution with mean parameter $\lambda_i = g(X_i)$, as done in Chen et al. (2018).

  4. What you described is not how logistic regression works. First, you specify a likelihood function assuming a specific density, which in this case is a Bernoulli distribution with probability parameter $p_i = g(X_i) = \frac{1}{1+\exp(-X_i \beta)}$. The likelihood is then $L(\beta) = \prod\limits_{i=1}^N p_i^{y_i} (1-p_i)^{1-y_i}$. Then you find the values of $\beta$ that maximize the likelihood (which you can do using various algorithms). Statistically, it is a one-step procedure (though the actual method of estimation is an iterative process).
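
As a minimal numerical sketch of this one-step procedure (my own illustration, with simulated data and arbitrary "true" coefficients), the Bernoulli log-likelihood can be maximized directly with a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative simulated data (assumption): true coefficients are (-1, 0.8)
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
p_true = 1 / (1 + np.exp(-(X @ np.array([-1.0, 0.8]))))
y = rng.binomial(1, p_true)

# Negative Bernoulli log-likelihood with p_i = 1 / (1 + exp(-X_i beta)),
# written in the numerically stable form sum(log(1 + exp(X beta))) - sum(y * X beta)
def neg_log_lik(beta):
    xb = X @ beta
    return np.sum(np.logaddexp(0, xb)) - np.sum(y * xb)

fit = minimize(neg_log_lik, np.zeros(X.shape[1]), method="BFGS")
print(fit.x)   # coefficient estimates, close to (-1, 0.8)
```

Dedicated routines (e.g., iteratively reweighted least squares in standard software) perform the same maximization more efficiently, but the statistical procedure is exactly this: specify the likelihood, then maximize it.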

Here are the general steps for maximum likelihood estimation:

  1. Propose a distribution for each individual's outcome $y_i$. For a continuous outcome, we might think it is drawn from a normal distribution with mean $\mu_i$ and variance $\sigma^2$, and for a binary outcome, we might think it is drawn from a Bernoulli distribution with probability $p_i$.
  2. Propose a relationship between the distribution parameters and the collected variables. For a continuous outcome, we might think the mean is a linear function of the predictors, i.e., $\mu_i = g(X_i) = X_i\beta$, and the variance is constant. For a binary outcome, we might think the probability parameter is a logistic function of a linear combination of the predictors, i.e., $p_i = g(X_i) = \frac{1}{1+\exp(-X_i \beta)}$. This is called logistic regression. If we instead thought $p_i$ was the probit function (the normal CDF) of that linear combination, it would be probit regression.
  3. Specify the likelihood function as a product of the individual contributions to the likelihood, which essentially is a re-write of the proposed density functions.
  4. Find the parameter values that maximize the likelihood; these values are the parameter estimates. If the proposed distributions for the outcomes were correct, the estimates will be consistent for their true values. (Note that maximizing the likelihood is equivalent to maximizing the log of the likelihood, so that is often done instead because computation is easier.)
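
For example, taking the log of $L_{lge}$ from above makes the connection to OLS explicit: $$\log L_{lge}(\beta,\sigma) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - X_i\beta)^2,$$ so for any fixed $\sigma$, maximizing over $\beta$ is the same as minimizing the sum of squared residuals. Likewise, the log of the Bernoulli likelihood in point 4 of my answer is $\sum_{i=1}^N \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]$, which is what logistic regression software maximizes.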

To recap: OLS and MLE are both ways of estimating model parameters from data. MLE requires certain specifications by the user about the distribution of the outcome; if those specifications are correct, the estimates are consistent. $\text{MLE}_{lge}$ is one form of MLE, with a specific (normal) distribution specified. $\text{MLE}_{lge}$ and OLS yield the same slope estimates regardless of the true nature of the data (i.e., whether assumptions about normality are met). The estimates from each method are interpreted the same because the interpretation doesn't come from the estimation method. MLE for logistic regression is performed by specifying a different distribution for the outcomes (which are binary).