Solved – Residual Analysis and ANOVA Model

Tags: anova, residuals

I am very new to residual analysis and ANOVA. To my understanding, the residuals should not show any obvious pattern in a residual plot, so if they look random the linear model is a reasonable fit. I have generated some random noise in R, fitted an ANOVA model, and plotted the residuals, and now I am trying to understand what the residual plot is telling me about the model and how good it is, but I cannot analyze the plot in depth and am not sure whether it shows a pattern. Should a pattern be judged relative to the horizontal line, or does any kind of pattern count? I would really appreciate a detailed explanation.

P.S. Both of the plots are showing exactly the same thing, just one of them is produced in a "fancier" way!

# fit a one-way ANOVA of Measurement on Treatment
anova_model <- aov(Measurement ~ Treatment, data = Data)

# extract the residuals and plot them against the raw measurements
residuals <- resid(anova_model)
plot(Data$Measurement, residuals, xlab = "Measurement", ylab = "Residuals")
abline(h = 0)   # horizontal reference line at zero

[Plot: residuals vs. measurement values (base R)]

library(ggplot2)   # qplot() and labs() come from ggplot2

# the same residuals-vs-measurement plot, coloured and shaped by treatment
qplot(Data$Measurement, residuals, colour = Data$Treatment, 
      shape = Data$Treatment, size = I(3.9), 
      xlab = "Measurement Values", ylab = "Residuals") + 
      labs(colour = "Treatment Categories", shape = "Treatment Categories")

[Plot: residuals vs. measurement values, coloured and shaped by treatment]

# the standard diagnostic view: residuals against fitted values, coloured by treatment
Model <- data.frame(Fitted = fitted(anova_model), 
                    Residuals = resid(anova_model), 
                    Treatment = Data$Treatment)
ggplot(Model, aes(Fitted, Residuals, colour = Treatment)) + geom_point()

[Plot: residuals vs. fitted values, coloured by treatment (ggplot2)]

Data was generated using the following code in R:

# 5 treatments with 5 observations each; the measurements are pure N(0, 1) noise,
# so there is no real treatment effect by construction
X <- matrix(rep(1:5, each=5), nrow=5, ncol=5, byrow=FALSE)      # column j holds treatment label j
Y <- matrix(rnorm(X, mean=0, sd=1), nrow=5, ncol=5, byrow=FALSE) # rnorm(X) draws length(X) = 25 values
Treatment <- as.vector(X)
Measurement <- as.vector(Y)
Data <- data.frame(Measurement, Treatment)
Data$Treatment <- as.factor(Data$Treatment)

Best Answer

Since you did not call set.seed, I was unable to reproduce your results verbatim, but I picked a seed for replicability (set.seed(1)) and left everything else unchanged.
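
For anyone who wants to replicate the results exactly, this is all it amounts to: the data-generation code from the question, reused verbatim, with only the seed prepended.

set.seed(1)   # chosen arbitrarily, only so the simulation can be reproduced

X <- matrix(rep(1:5, each=5), nrow=5, ncol=5, byrow=FALSE)
Y <- matrix(rnorm(X, mean=0, sd=1), nrow=5, ncol=5, byrow=FALSE)
Treatment <- as.vector(X)
Measurement <- as.vector(Y)
Data <- data.frame(Measurement, Treatment)
Data$Treatment <- as.factor(Data$Treatment)

anova_model <- aov(Measurement ~ Treatment, data = Data)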

In the initial exploratory phase, the first thing that jumps out at you is the variability of the boxplots from one treatment to the next, given the small number of random ($\sim N(0,1)$) data points in each treatment:

[Plot: boxplots of the measurements by treatment]
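
The answer does not show the plotting call, but a boxplot like the one above can be produced with something along these lines (just a sketch in base R; the labels are my own choice):

# side-by-side boxplots of the measurements, one box per treatment
boxplot(Measurement ~ Treatment, data = Data,
        xlab = "Treatment", ylab = "Measurement")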

In particular, notice the IQR of treatment $3$, as well as its extreme values.
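
You can put numbers on that impression with a quick check of my own (not part of the original answer):

# per-treatment interquartile range and min/max of the measurements
tapply(Data$Measurement, Data$Treatment, IQR)
tapply(Data$Measurement, Data$Treatment, range)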

It is only when you aggregate the data points across treatments that you start to get a glimpse of the underlying (by design) normal distribution:

[Plot: distribution of all measurements pooled across treatments]
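
One way to look at the pooled data is simply a histogram of all 25 measurements with the true density overlaid (a sketch; the exact plot used above may differ):

# all measurements pooled, ignoring the treatment labels
hist(Data$Measurement, breaks = 10, freq = FALSE,
     main = "All measurements pooled", xlab = "Measurement")
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, lty = 2)   # the true N(0, 1) density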

So it is not surprising that the Residuals vs Fitted plot tends to reflect the between-group variation produced by these small samples, showing "patterns" that we know are not really there:

[Plot: residuals vs. fitted values, coloured by treatment]

You can see how the vertical coloured lines of dots (one per treatment) sit at the fitted values of the ANOVA (or OLS) fit, which are simply the means of each treatment group. For instance, in the boxplot above you can see that all the medians happen to be positive. The vertical spread of the dots within each category mirrors the spread of the corresponding boxplot; notice, for example, the spread of treatment 3 (green).
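
That the fitted values are nothing but the per-treatment sample means is easy to verify (my own quick check):

# per-treatment sample means ...
tapply(Data$Measurement, Data$Treatment, mean)
# ... match the distinct fitted values of the ANOVA model
tapply(fitted(anova_model), Data$Treatment, unique)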

In your plots above, you have depicted (among other things) the residuals against the measurements rather than against the fitted values. Logically, the farther a measurement lies from zero (in either direction), the farther it will tend to lie from its group mean (which, by construction, is close to zero), and hence you end up with approximately diagonal lines.
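
To make that explicit: within treatment $j$ the residual is just $y_{ij} - \bar{y}_j$, so against the measurement each group lies exactly on a line of slope $1$ with intercept $-\bar{y}_j$, and overlaying those five lines produces the diagonal streaks you see. A quick check of the identity:

# group mean repeated for each observation
group_means <- ave(Data$Measurement, Data$Treatment)
# residuals are exactly measurement minus the group mean
all.equal(unname(resid(anova_model)), Data$Measurement - group_means)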

One final point: in R you can get the standard diagnostic plots simply by calling `plot(anova_model)`, although you already managed to generate a prettier one with ggplot:

[Plot: base-R Residuals vs Fitted diagnostic plot]
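
For reference, the relevant base-R calls (plot.lm also handles aov objects):

# first diagnostic panel only: Residuals vs Fitted
plot(anova_model, which = 1)

# or the four default diagnostic panels at once
op <- par(mfrow = c(2, 2))
plot(anova_model)
par(op)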

So there are no patterns in these residuals: the data are centred at zero and drawn from a normal distribution. In this simple case with a single categorical predictor, the residuals behave accordingly, and the small number of points per treatment is all that accounts for the apparent variability across treatments.
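
If you want a formal complement to the visual check, a homogeneity-of-variance test is one option (my addition, not part of the original answer; with only five points per group its power is very low):

# Bartlett's test of equal variances across treatments
# (it assumes normality, which holds here by construction)
bartlett.test(Measurement ~ Treatment, data = Data)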

If you were to increase the number of data points to $50$ per group, any suspicion of heteroscedasticity would go away:

[Plot: residuals vs. fitted values with 50 observations per treatment]
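
A sketch of how the larger simulation might be set up, with 50 observations per treatment (the answer does not show its code, so the details here are my own):

set.seed(1)
n_per_group <- 50
Data_big <- data.frame(
  Treatment   = factor(rep(1:5, each = n_per_group)),
  Measurement = rnorm(5 * n_per_group, mean = 0, sd = 1)
)

anova_big <- aov(Measurement ~ Treatment, data = Data_big)
plot(anova_big, which = 1)   # Residuals vs Fitted with the larger groups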