Solved – Does linear regression need the target variable to be normally distributed? (GLM context)

exponential-family, generalized linear model, normal distribution, regression

I came across a list of linear regression assumptions that included:
-> The residuals should be normally distributed.

A GLM (generalized linear model) assumes that the target variable follows a distribution from the exponential family.

So does linear regression need the residuals as well as the target variable to be normally distributed?

EDIT

https://online.stat.psu.edu/stat504/node/216/

The page linked above says:

There are three components to any GLM:

  1. Random Component – refers to the probability distribution of the response variable (Y); e.g. normal distribution for Y in the linear regression, or binomial distribution for Y in the binary logistic regression.

Moreover, in the assumptions section:

The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal,…)

I am new to machine learning, so forgive me if I'm asking a stupid question.

Best Answer

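It depends on what you’re doing. If you just want to predict, then it doesn’t matter. The Gauss-Markov theorem, which says OLS is the best linear unbiased estimator when the errors have mean zero, constant variance, and no correlation, does not say anything about a normal error term.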

However, when the error term is normal, then the OLS estimator $\hat{\beta}$ is the maximum likelihood estimator. If you don’t know about MLEs, you’ll see them over and over as you dive into statistics, but maximum likelihood is a nice property for many reasons.
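A short sketch of why this holds, assuming the standard model $y_i = x_i^\top \beta + \varepsilon_i$ with independent errors $\varepsilon_i \sim N(0, \sigma^2)$ for $i = 1, \dots, n$ (the symbols $x_i$, $\varepsilon_i$, $\sigma^2$, and $n$ are notation introduced here for illustration, not taken from the answer). Under that assumption the log-likelihood is

$$
\ell(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 .
$$

Only the last term depends on $\beta$, and it enters with a negative sign, so maximizing $\ell$ over $\beta$ is the same as minimizing $\sum_{i=1}^{n}(y_i - x_i^\top \beta)^2$. That minimizer is the OLS estimator $\hat{\beta}$.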

One reason maximum likelihood matters is that the standard inferential methods, such as p-values on coefficients and F-tests of nested models, are derived under the normal-error assumption.

So if you want to do some kind of ANOVA, for example, the normality of the error term matters because you’re doing hypothesis testing, not prediction.

The pooled distribution of the response variable (all of your $y$s) definitely does not have to be normal, even to get that maximum likelihood property and do inference. What the normal-error assumption implies is that $y$ is normal conditional on the predictors, not marginally. The predictor variables definitely don’t have to be normal either. Predictors often cannot be normal, such as when they are categorical variables, e.g. male/female, treatment/control, etc.
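To make this concrete, here is a minimal simulation sketch using only the Python standard library. The data, the coefficients (10 and 8), the sample size, and the helper `excess_kurtosis` are all made up for illustration. It generates a response from a binary predictor plus a normal error term, then compares the shape of the pooled $y$ values with the shape of the residuals. Excess kurtosis is used as a rough shape summary: it is close to 0 for normal data and clearly negative for a two-humped distribution.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def excess_kurtosis(values):
    """Sample excess kurtosis: near 0 for normal data, negative for bimodal data."""
    mean = statistics.fmean(values)
    centered = [v - mean for v in values]
    m2 = sum(c ** 2 for c in centered) / len(values)
    m4 = sum(c ** 4 for c in centered) / len(values)
    return m4 / m2 ** 2 - 3.0


# Hypothetical data: a binary predictor (e.g. treatment/control) and a normal error term.
n_per_group = 2000
x = [0] * n_per_group + [1] * n_per_group
y = [10.0 + 8.0 * xi + random.gauss(0.0, 1.0) for xi in x]

# With a single binary predictor, the OLS fit reduces to the two group means.
group0 = [yi for xi, yi in zip(x, y) if xi == 0]
group1 = [yi for xi, yi in zip(x, y) if xi == 1]
intercept = statistics.fmean(group0)
slope = statistics.fmean(group1) - intercept

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

y_kurtosis = excess_kurtosis(y)
residual_kurtosis = excess_kurtosis(residuals)
print("excess kurtosis of pooled y: ", round(y_kurtosis, 2))
print("excess kurtosis of residuals:", round(residual_kurtosis, 2))
```

The pooled $y$ values form two well-separated humps, so they are far from normal, while the residuals stay close to normal because the error term is.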

EDIT

We often talk about normal residuals. This is casual language, and experienced statisticians know what is meant, but the residuals are a finite set of computed values, so their empirical distribution is discrete and cannot be exactly normal. What we assume is a normal error term, and we use the residuals to gauge whether that is a good assumption.
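The distinction between errors and residuals can be sketched with a small simulation, again using only the Python standard library. Everything here is hypothetical: the model $y = 2 + 3x + \text{error}$, the sample size, and the helper `pearson`. In a simulation the true errors are known, which is never the case with real data. The residuals are then checked against normal quantiles, which is the numeric counterpart of a normal Q-Q plot.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical model: y = 2 + 3x + error, with a normal error term.
n = 500
x = [random.uniform(0.0, 10.0) for _ in range(n)]
errors = [random.gauss(0.0, 2.0) for _ in range(n)]  # unobservable in real data
y = [2.0 + 3.0 * xi + ei for xi, ei in zip(x, errors)]

# Closed-form simple OLS: slope = cov(x, y) / var(x).
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Residuals come from the fitted line; they estimate the errors but are not the errors.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Numeric Q-Q check: correlate sorted residuals with theoretical normal quantiles.
std_normal = statistics.NormalDist()
theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
qq_corr = pearson(theoretical, sorted(residuals))

print("sum of residuals:", sum(residuals))  # forced to ~0 by the intercept
print("sum of errors:   ", sum(errors))     # not constrained
print("Q-Q correlation: ", qq_corr)
```

A Q-Q correlation close to 1 suggests the normal-error assumption is reasonable. In practice a Q-Q plot of the residuals conveys the same information visually.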
