If you are creating a regression model where the response variable is a numerical value, but one of the variables is a dummy (binary), can you use OLS-method?
Do you only use logistic regression if your response variable is binary?
least-squares, logistic, regression
The interpretation for categorical variables with more than 2 levels is very similar to the binary case you mention; for a $k$-level categorical variable, you will have $k-1$ regression coefficients, each of which compares the odds of the outcome to the reference group. For the example you state, ethnicity (Caucasian, African-American, Hispanic, and Asian), let us assume your referent (baseline) group is African-American. Most software packages for logistic regression will give you three odds ratios (for a 4-level categorical predictor) once you run the regression. Let us quickly look at how this is done in R using a simulated dataset:
###########Simulate Data###########
set.seed(123) # set seed to reproduce the simulation results
x1 <- sample(c("AF","AS","HI","CA"), 10000, replace = TRUE)
# African-American (AF), Asian (AS), Hispanic (HI), and Caucasian (CA)
x1 <- factor(x1, levels = c("AF","AS","HI","CA"))
# ensure the ordering by setting AF as the reference level
x1.fac <- model.matrix(~ x1) # generate dummy variables for simulation
# purposes (in practice you may not need to do this)
betas <- c(.2, .5, .53) # log odds comparing the three groups to the
# referent level of AF (these are just made-up values for
# illustration and simulation purposes!)
xbeta <- x1.fac[,-1] %*% betas # need only k-1 dummies for a
# variable with k levels
y <- rbinom(n = 10000, size = 1,
            prob = exp(xbeta)/(1 + exp(xbeta))) # simulate outcome (Y)
#Finally we have the following sample data:
example_data <- data.frame(y,x1)
####Run regression of outcome against ethnicity
model1 <- glm(y~x1,family = binomial,data = example_data)
exp(coef(model1))[-1] ###Odds Ratios comparing each group
#with the reference group of AF
    x1AS     x1HI     x1CA
1.229610 1.800985 1.796416
So what does the odds ratio of 1.23 for Asians mean? It means that, compared to African-Americans, Asians have 23% higher odds of the outcome. Equivalently, you can interpret it as Asians having 1.23 times the odds of the outcome compared to the referent group of African-Americans. The odds ratios of 1.801 and 1.796 for Hispanics and Caucasians, respectively, are interpreted in the same manner. The most important part of modeling categorical variables is identifying the proper referent group. You can always change the reference group using the relevel() command in R; see the sketch below.
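A minimal sketch of relevel(), reusing the example_data simulated above; it relevels a copy so model1 and its original factor coding stay intact:
example_data2 <- example_data # copy so model1's coding is untouched
example_data2$x1 <- relevel(example_data2$x1, ref = "CA") # CA is now the referent
model2 <- glm(y ~ x1, family = binomial, data = example_data2)
exp(coef(model2))[-1] # odds ratios now compare AF, AS, and HI with CA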
In order to compare two groups when neither is the referent group, there are a few ways to go:
Use the relevel() function and re-run the regression, changing the reference group to one of the levels of interest (not my favorite approach when there are many levels in your categorical predictor)
Use packages that already implement this comparison.
I am not sure how this is done in Stata or SAS (probably the CONTRAST statement in SAS), but you can easily do it in R using the car package. For example, if you want to test whether the odds of the outcome differ between Caucasians and Hispanics, use the following commands:
library(car)
linearHypothesis(model1, c("x1CA - x1HI = 0"))
Linear hypothesis test
Hypothesis:
- x1HI + x1CA = 0
Model 1: restricted model
Model 2: y ~ x1
  Res.Df Df  Chisq Pr(>Chisq)
1   9997
2   9996  1 0.0018     0.9658
In this case, we fail to reject the null hypothesis of no difference in the odds of the outcome between Caucasians and Hispanics (p-value=0.9658).
I think there is a lot of confusion here. First, I want to remind you that OLS and MLE are statistical algorithms for estimating parameters from data. OLS says: to get the parameter estimates for a linear model, find those that minimize the sum of the squared residuals. MLE says: to get the parameter estimates for a model, find those that maximize the likelihood, a function whose form the analyst must specify.
It turns out that for a linear model, the model coefficients estimated by OLS are identical to those estimated using MLE, because maximizing the likelihood is equivalent to minimizing the sum of the squared residuals when the user programs MLE in a specific way, that is, assuming the conditional density of the outcome (i.e., the density of the errors) is normally distributed. I'll call this specific application of MLE $\text{MLE}_{lge}$, i.e., MLE for a linear model with Gaussian errors. $\text{MLE}_{lge}$ corresponds to finding the parameter estimates $\beta$ and $\sigma$ such that $L_{lge}(\beta,\sigma)$ is maximized, where $$L_{lge}(\beta,\sigma)= \prod\limits_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i-\mu_i)^2}{2\sigma^2}\right)$$ and $\mu_i=g(X_i)=X_i\beta$. It turns out that the $\beta$ estimates that maximize $L_{lge}(\beta,\sigma)$ are exactly the same ones that minimize the sum of the squared residuals (though the $\sigma$ estimates are different).
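To see this equivalence numerically, here is a minimal sketch on simulated data (the variable names, true coefficients, and optim() starting values are my own choices): it fits the same line by OLS via lm() and by $\text{MLE}_{lge}$ by minimizing the Gaussian negative log-likelihood with optim().
set.seed(1) # simulate a simple linear model with Gaussian errors
x <- rnorm(200)
y <- 1 + 2*x + rnorm(200) # true intercept 1, true slope 2
negloglik <- function(par) { # par = (beta0, beta1, log(sigma))
  mu <- par[1] + par[2]*x
  sigma <- exp(par[3]) # log-parameterize so sigma stays positive
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
mle <- optim(c(0, 0, 0), negloglik)$par
coef(lm(y ~ x)) # OLS beta estimates
mle[1:2] # near-identical MLE_lge beta estimates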
MLE is consistent when the likelihood is correctly specified. For linear regression, the likelihood is usually specified assuming a normal distribution for the errors (i.e., as $L_{lge}(\beta,\sigma)$ above). $\text{MLE}_{lge}$ is not even necessarily consistent when the errors are not normally distributed. OLS is at least consistent (and unbiased) even when the errors are not normally distributed. Because the $\beta$ estimates resulting from OLS and $\text{MLE}_{lge}$ are identical, it doesn't matter which one you use in the face of non-normality (though, again, the $\sigma$ estimates will differ).
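As a quick illustration of that robustness (a sketch with made-up skewed errors, not a proof):
set.seed(2) # repeat the simulation many times with non-normal errors
slopes <- replicate(1000, {
  x <- rnorm(500)
  e <- rexp(500) - 1 # centered exponential errors: mean 0, heavily skewed
  y <- 1 + 2*x + e
  coef(lm(y ~ x))[2]
})
mean(slopes) # centered on the true slope of 2 despite non-normality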
The interpretation of parameter estimates has nothing to do with the method used to estimate them. I could pull a number out of a hat and call it the slope and it would have the same interpretation as an estimate resulting from a more legitimate method (like OLS). I go into detail about this here. The interpretation of parameter estimates comes from the model, not the method used to estimate them.
The consistency of MLE depends on correct specification of the likelihood function, which is related to the density of the outcome given the covariates. For $\text{MLE}_{lge}$, we assume the density of each outcome is a normal distribution with mean $X_i\beta$ and variance $\sigma^2$. For binary outcomes, it often makes the most sense to think that each outcome has a Bernoulli distribution with probability parameter $p_i = g(X_i)$, where $g(X_i)$ is the inverse logit (logistic) function $\frac{1}{1+\exp(-X_i \beta)}$ for logistic regression or the normal CDF for probit regression, but one can also think that the outcome has a Poisson distribution with mean parameter $\lambda_i = g(X_i)$, as done in Chen et al. (2018).
What you described is not how logistic regression works. First, you specify a likelihood function assuming a specific density, which in this case is a Bernoulli distribution with probability parameter $p_i = g(X_i) = \frac{1}{1+\exp(-X_i \beta)}$. The likelihood is then $L(\beta) = \prod\limits_{i=1}^N p_i^{y_i} (1-p_i)^{1-y_i}$. Then you find the values of $\beta$ that maximize the likelihood (which you can do using various algorithms). Statistically, it is a one-step procedure (though the actual method of estimation is an iterative process).
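Here is a minimal sketch of that one-step procedure on simulated data (the names and true coefficients are made up): it maximizes the Bernoulli likelihood directly with optim() and compares the result with glm().
set.seed(3) # simulate binary outcomes from a logistic model
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(-0.5 + 1.5*x))
negloglik <- function(beta) {
  p <- plogis(beta[1] + beta[2]*x) # p_i = 1/(1 + exp(-X_i beta))
  -sum(dbinom(y, size = 1, prob = p, log = TRUE))
}
optim(c(0, 0), negloglik)$par # direct MLE of (beta0, beta1)
coef(glm(y ~ x, family = binomial)) # near-identical glm() estimates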
Here are the general steps for maximum likelihood estimation:
1. Propose a density for the outcome given the covariates (e.g., normal for a linear model, Bernoulli for a binary one).
2. Write the likelihood as the product of that density evaluated at each observation.
3. Find the parameter values that maximize the likelihood (in practice, usually with an iterative numerical algorithm).
To recap: OLS and MLE are both ways of estimating model parameters from data. MLE requires certain specifications by the user about the distribution of the outcome; if those specifications are correct, the estimates are consistent. $\text{MLE}_{lge}$ is one form of MLE with a specific distributional form specified. $\text{MLE}_{lge}$ and OLS yield the same slope estimates regardless of the true nature of the data (i.e., whether assumptions about normality are met). The estimates from each method are interpreted the same because the interpretation doesn't come from the estimation method. MLE for logistic regression is performed by specifying a different distribution for the outcomes (which are binary).
Best Answer
In short: yes to both. OLS handles a binary predictor just fine, and binary logistic regression is designed specifically for a binary response (though multinomial logistic regression extends it to categorical responses).
Why is this? To your first question, OLS handles binary X variables just fine. For simplicity, consider a model with two X variables. (Assume that they aren't perfectly correlated.) $$ y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i $$ If $x_{1i}$ is binary, then we interpret $\beta_1$ as follows: holding $x_{2i}$ constant, it is the predicted change in $y_i$ from observing $x_{1i} = 1$ instead of 0. (This holds whether $x_{2i}$ is continuous or binary, by the way.)
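A minimal sketch of that interpretation on simulated data (the effect sizes are made up):
set.seed(4)
x1 <- rbinom(300, size = 1, prob = 0.5) # binary predictor
x2 <- rnorm(300) # continuous predictor
y <- 1 + 3*x1 + 2*x2 + rnorm(300) # true dummy effect is 3
coef(lm(y ~ x1 + x2)) # the x1 coefficient recovers roughly 3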
To your second question, binary logistic regression is designed specifically to model a binary response. Recall that the underlying model is the following: for a vector $x_i$, the probability that $y_i = 1$ is $$ P(y_i = 1 \mid x_i) = \frac{\exp(\beta' x_i)}{1 + \exp(\beta' x_i)}. $$ This doesn't generalize to continuous $y$. (It does generalize to categorical $y$; this is called multinomial logistic regression.)
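For concreteness, a small sketch with made-up coefficients showing the probability the model assigns to $y_i = 1$:
beta <- c(-1, 2) # (intercept, slope), made-up values
x_i <- c(1, 0.5) # covariate vector, including the intercept term
exp(sum(beta*x_i)) / (1 + exp(sum(beta*x_i))) # same as plogis(sum(beta*x_i))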