Solved – Linear regression with strongly non-normal response variable

Tags: assumptions, regression, residuals

I have carried out a linear regression. The plot below shows the distribution of the response variable:

[Figure: distribution of the response variable]

I believe the response variable is beta distributed, therefore virtually the exact opposite of normally distributed. However, when including all my predictor variables in the linear regression, the residuals turn out to be quite normally distributed, as shown in this plot:

[Figure: distribution of the residuals]

Has my model satisfied the assumptions of linear regression? Might there be a better model to use?

Best Answer

The marginal distribution of the response is irrelevant. Inference based on small samples requires the errors to be approximately normal (it is better to look at a QQ-plot of the residuals than at their density, because the tails are important). If you are interested only in descriptive results, or if the sample size is not too small, you do not need to worry about normality.
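
As a rough illustration of the first point, the sketch below simulates a response whose marginal distribution is U-shaped (driven by a beta-distributed predictor) while the errors are normal. The data, the coefficients and the names `x`, `y`, `fit` and `res` are invented for this example and are not taken from the question.

```r
set.seed(1)
n <- 500

# Hypothetical predictor piled up near 0 and 1, which makes y U-shaped too
x <- rbeta(n, 0.3, 0.3)
# Linear model with normal errors
y <- 2 + 3 * x + rnorm(n, sd = 0.3)

fit <- lm(y ~ x)
res <- residuals(fit)

# The response is far from normal ...
hist(y, breaks = 30, main = "Response")

# ... but the residuals follow the reference line closely
qqnorm(res)
qqline(res)
```

The histogram of `y` is strongly non-normal, yet the QQ-plot of the residuals should look close to a straight line, because only the errors (not the response) need to be normal.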

Much more important are the other assumptions of linear regression (correct model structure, no large outliers in the predictors and, if you are interested in inference, homoscedastic and uncorrelated errors).
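
A minimal sketch of how these other assumptions might be checked in base R, on simulated data. The data are made up, the cutoff of twice the mean leverage is only a common rule of thumb, and the autocorrelation plot is meaningful only if the rows have a natural order (for example time).

```r
set.seed(2)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Model structure: look for curvature in residuals vs fitted values
plot(fit, which = 1)

# Homoscedasticity: the scale-location plot should show a roughly flat trend
plot(fit, which = 3)

# Outliers in the predictors: leverage (hat) values
h <- hatvalues(fit)
high_leverage <- which(h > 2 * mean(h))  # rule-of-thumb cutoff (assumption)
print(high_leverage)

# Uncorrelated errors: autocorrelation of residuals in row order
acf(residuals(fit))
```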

Related Solutions

Solved – Residuals correlated positively with response variable strongly in linear regression

1) Residuals do correlate positively with observed values in many, many cases. Think of it this way: a very large positive error (the "error" being the "true residual", to misuse the language) means that the corresponding observation is, all other things being equal, likely to be very large in a positive direction. A very large negative error means that the corresponding observation is likely to be very large in a negative direction. If the $R^2$ of the regression is not large, then the variability of the errors will dominate the variability of the target variable, and you will see this effect in your plots and correlations.

For example, consider the model $y_i = a + x_i + e_i$, which we'll fit as $y_i = a + bx_i + e_i$ (which is correct for $b = 1$). Here's the result of a regression with 100 observations:

e <- rnorm(100)
x <- rnorm(100)
y <- 1 + x + e

foo <- lm(y~x)
plot(residuals(foo)~y, xlab="y", ylab="Residuals")

> summary(foo)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3292 -0.8280 -0.0448  0.8213  2.9450 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.8498     0.1288   6.600 2.12e-09 ***
x             0.8929     0.1316   6.787 8.81e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.286 on 98 degrees of freedom
Multiple R-squared: 0.3197, Adjusted R-squared: 0.3128 
F-statistic: 46.06 on 1 and 98 DF,  p-value: 8.813e-10

[Figure: residuals plotted against y]

Note that we achieved a fairly respectable (in some fields) $R^2$ of 0.32.

We can obscure this effect with a different model:

y <- 1 + 5*x + e

foo <- lm(y~x)
plot(residuals(foo)~y, xlab="y", ylab="Residuals")

which has an $R^2$ of 0.93 and the following residual plot:

[Figure: residuals plotted against y for the second model]

Here the correlation between $y$ and the residuals is about 0.25, but it's a lot less obvious on the plot.
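
The size of this correlation follows from $R^2$. For a least-squares fit with an intercept, the residuals are uncorrelated with the fitted values, so the correlation between the residuals and $y$ equals $\sqrt{1 - R^2}$: roughly 0.82 for $R^2 = 0.32$ and roughly 0.26 for $R^2 = 0.93$. A small check on freshly simulated data (the seed and the helper `check` are invented for this sketch, so the numbers will differ slightly from the output above):

```r
set.seed(42)
e <- rnorm(100)
x <- rnorm(100)

# Hypothetical helper: fit y = 1 + slope * x + e and compare the two quantities
check <- function(slope) {
  y <- 1 + slope * x + e
  fit <- lm(y ~ x)
  c(cor_resid_y = cor(residuals(fit), y),
    sqrt_1_minus_R2 = sqrt(1 - summary(fit)$r.squared))
}

res1 <- check(1)  # weak signal: residuals track y closely
res5 <- check(5)  # strong signal: much smaller correlation
print(rbind(slope_1 = res1, slope_5 = res5))
```

In each row the two columns agree up to floating-point rounding, so a higher $R^2$ always means a weaker correlation between the residuals and the response.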

2) Residuals have correlation zero with fitted values in a linear regression, by construction. Is your statement "... weakly correlated with fitted Y negatively" based solely upon looking at the plot, or did you actually calculate the correlation? If the former, appearances can be deceiving... if the latter, something is wrong; possibly you aren't looking at what you think you're looking at.
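
If in doubt, the correlation can be computed directly. A minimal sketch on simulated data (the model and seed are arbitrary choices for illustration):

```r
set.seed(7)
x <- rnorm(100)
y <- 1 + x + rnorm(100)
fit <- lm(y ~ x)

# Zero up to floating-point rounding for any least-squares fit with an intercept
r_fitted <- cor(residuals(fit), fitted(fit))
print(r_fitted)
```

A value that is clearly different from zero would suggest that the residuals and fitted values being compared do not come from the same model.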

Solved – Normality of residuals in a regression model with a categorical IV

(Note that a regression model with only 1 explanatory variable that is categorical and has just 2 levels is equivalent to a t-test; there's nothing wrong with calling it a regression, but it would most commonly be discussed / referred to as a t-test.)
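
A quick sketch of that equivalence on simulated data (the group labels, effect size and seed are invented). Note that the match is with the pooled-variance t-test, so `var.equal = TRUE` is needed; R's `t.test` defaults to the Welch version.

```r
set.seed(3)
g <- factor(rep(c("a", "b"), each = 30))
y <- ifelse(g == "b", 0.5, 0) + rnorm(60)

fit <- lm(y ~ g)
tt <- t.test(y ~ g, var.equal = TRUE)

# Slope row of the regression table vs the t-test
t_lm <- coef(summary(fit))["gb", "t value"]
p_lm <- coef(summary(fit))["gb", "Pr(>|t|)"]
t_tt <- unname(tt$statistic)

# The signs differ only because the two functions subtract in opposite orders
print(c(lm = abs(t_lm), t_test = abs(t_tt)))
print(c(lm = p_lm, t_test = tt$p.value))
```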

You check the distribution of all the residuals simultaneously. There are tests for normality, but I'm not a huge fan of them (I listed some in my answer to your previous question). I think the best option is to make a qq-plot. You can find a really nice version (qqPlot, called qq.plot in older releases) in John Fox's car package. Among other features, it'll give you a 95% confidence band, which can help you interpret the plot.
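
A hedged sketch of such a plot on simulated two-group data. It assumes the car package may or may not be installed, and falls back to base R (without a confidence band) if it is not. The data and seed are invented.

```r
set.seed(4)
g <- factor(rep(c("a", "b"), each = 30))
y <- ifelse(g == "b", 0.5, 0) + rnorm(60)
fit <- lm(y ~ g)

if (requireNamespace("car", quietly = TRUE)) {
  # QQ-plot of studentized residuals with a confidence envelope
  car::qqPlot(fit)
} else {
  # Base R fallback: all residuals pooled, no envelope
  res <- residuals(fit)
  qqnorm(res)
  qqline(res)
}
```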

On a different note, from looking at your plot, I don't know if you have more data in the second group, but you should also check to ensure you have homogeneity of variance.
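
One possible way to check this in base R, sketched on simulated data in which the second group is deliberately more variable. The groups, standard deviations and seed are made up. Bartlett's test is used only because it ships with R; it is sensitive to non-normality, so the per-group variances and a plot are often just as informative.

```r
set.seed(5)
g <- factor(rep(c("a", "b"), each = 50))
# Group "b" gets three times the standard deviation of group "a"
y <- rnorm(100, mean = 0, sd = ifelse(g == "b", 3, 1))
fit <- lm(y ~ g)

# Residual variance within each group
group_var <- tapply(residuals(fit), g, var)
print(group_var)

# Formal test of equal variances across groups
bt <- bartlett.test(y ~ g)
print(bt)

# Visual check: spread of the residuals by group
boxplot(residuals(fit) ~ g, xlab = "Group", ylab = "Residuals")
```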
