So, I would recommend using the standard method for comparing nested models. In your case, you consider two alternative models, the cubic fit being the more "complex" one. An F- or $\chi^2$-test tells you whether the residual sum of squares or the deviance decreases significantly when you add further terms. It is very much like comparing a model including only the intercept (in which case you have residual variance only) with another one that includes one meaningful predictor: does this added predictor account for a sufficient part of the variance in the response? In your case, it amounts to asking: does modeling a cubic relationship between X and Y decrease the unexplained variance (equivalently, increase the $R^2$), and thus provide a better fit to the data than a linear fit?
Such a test is often used as a test of linearity between the response variable and the predictor, and this is the reason why Frank Harrell advocates the use of restricted cubic splines instead of assuming a strictly linear relationship between Y and the continuous Xs (e.g., age).
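As a quick sketch of that simpler intercept-only vs. one-predictor comparison (using made-up data, not the simulation below), the same F-test is obtained with anova():
set.seed(1)
x0 <- rnorm(100)
y0 <- 1 + 0.5*x0 + rnorm(100)
fit.null <- lm(y0 ~ 1)     # intercept only: residual variance of y0
fit.x <- lm(y0 ~ x0)       # adds one predictor
anova(fit.null, fit.x)     # does x0 account for a significant part of the variance?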
The following example comes from a book I was reading some months ago (High-dimensional data analysis in cancer research, Chap. 3, p. 45), but it may well serve as an illustration. The idea is simply to fit different kinds of models to a simulated data set that clearly exhibits a non-linear relationship between the response variable and the predictor. The true generative model is shown in black (dashed). The other colors are for different models (restricted cubic spline, a B-spline close to yours, and a cross-validated smoothing spline).
library(rms)
library(splines)
set.seed(101)
f <- function(x) sin(sqrt(2*pi*x))  # true (non-linear) mean function
n <- 1000
x <- runif(n, 0, 2*pi)
eps <- rnorm(n, 0, 0.25)  # Gaussian noise (SD = 0.25)
y <- f(x) + eps
plot(x, y, cex=.4)
curve(f, 0, 6, lty=2, add=TRUE)  # true generative model (dashed black)
# linear fit
lm00 <- lm(y~x)
# restricted cubic spline, 3 knots (2 Df)
lm0 <- lm(y~rcs(x,3))
lines(seq(0, 6, length=1000),
      predict(lm0, data.frame(x=seq(0, 6, length=1000))),
      col="red")
# use B-spline and a single knot at x=1.13 (4 Df)
lm1 <- lm(y~bs(x, knots=1.13))
lines(seq(0, 6, length=1000),
      predict(lm1, data.frame(x=seq(0, 6, length=1000))),
      col="green")
# cross-validated smoothed spline (approx. 20 Df)
xy.spl <- smooth.spline(x, y, cv=TRUE)
lines(xy.spl, col="blue")
legend("bottomleft", c("f(x)","RCS {rms}","BS {splines}","SS {stats}"),
col=1:4, lty=c(2,rep(1,3)),bty="n", cex=.6)
Now, if you want to compare the linear fit (lm00) and the model relying on B-splines (lm1), you just have to do an F-test to see that the latter provides a better fit:
> anova(lm00, lm1)
Analysis of Variance Table
Model 1: y ~ x
Model 2: y ~ bs(x, knots = 1.13)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1    998 309.248
2    995  63.926  3    245.32 1272.8 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
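The same kind of comparison serves as the test of linearity mentioned above: comparing the linear fit to the restricted cubic spline fit (lm0, not shown in the output above) asks whether the non-linear spline component significantly reduces the residual sum of squares.
anova(lm00, lm0)  # F-test of the non-linear (spline) component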
Likewise, it is quite usual to compare a GLM with a GAM based on the results of a $\chi^2$-test.
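For instance, here is a sketch assuming the mgcv package is available (it is not loaded above): with the same simulated data, a Gaussian GLM that is linear in x can be compared to a GAM with a smooth term.
library(mgcv)
fit.glm <- gam(y ~ x)     # equivalent to a Gaussian GLM, linear in x
fit.gam <- gam(y ~ s(x))  # penalized smooth of x
anova(fit.glm, fit.gam, test="Chisq")  # approximate chi-squared comparison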
Best Answer
Your wording seems to imply that the error term exists because we deal with samples, and that it captures information about the non-sampled part of the population. That's not correct. The error term in a regression model represents factors other than the observed variables included in the model as $X$'s (explanatory/independent variables) that affect the dependent variable $Y$. A regression model (e.g., $y = \beta_{0} + \beta_{1}x + \epsilon$) begins by assuming what the relationship between the $X$ and $Y$ variables is in the population, so the error term exists even in the population model. The model you end up estimating with sample data allows you to estimate the parameters of that population model.
So the error term is NOT the difference between observed and predicted values of $Y$. Repeating myself, it represents unobserved factors affecting $Y$.
Once the population regression model is assumed, we proceed to estimating it with randomly sampled data. The estimation/fitting procedure estimates the values of the $\beta$'s, and we can then compute predicted/fitted values of $Y$ based on those estimated $\beta$'s and the observed values of the $X$'s. The estimated regression equation takes the form $y_{i} = \hat\beta_{0} + \hat\beta_{1}x_{i} + \hat\epsilon_{i}$, with the hats denoting estimated values. The $\hat\epsilon_{i}$'s are the residuals in the estimated equation (differences between observed and predicted values of $Y$ for each individual in the sample), while the $\epsilon$'s are the errors in the equation containing the population parameters $\beta_{0}$ and $\beta_{1}$. The errors are not observable, while the residuals are computed from the data.
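A small simulation makes the distinction concrete (a sketch with arbitrary parameter values): the errors are drawn when the data are generated and are never observed, while the residuals are computed from the fitted model and only approximate them.
set.seed(42)
n <- 200
x <- runif(n)
eps <- rnorm(n)          # unobservable errors of the population model
y <- 2 + 3*x + eps       # population model: beta0 = 2, beta1 = 3
fit <- lm(y ~ x)         # estimates beta0-hat and beta1-hat from the sample
head(cbind(error = eps, residual = residuals(fit)))  # close, but not identical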