Is the difference between the residual and the error term in a regression just the ability to observe it?

error, regression, residuals

According to what I have read online, "error term" and "residual" are often used interchangeably. Please let me know if my understanding below is correct:

However, as I understand it, the error term is the difference between our predicted value and the ACTUAL population value. For instance, if we want to measure the relationship between salary and experience for 35-year-old males in the US, we cannot get data for ALL of the millions of 35-year-old males in the US, but we can draw a sample from that population. Our regression of Y would therefore include an error term that captures the difference between the value we have and the ACTUAL population value, which we CANNOT OBSERVE. The RESIDUAL, on the other hand, is the difference between the fitted regression line and the actual POINTS in the scatter plot of the data that we have actually OBSERVED by collecting it from some source.

Is my understanding correct?

Best Answer

Your wording seems to imply that the error term exists because we deal with samples, and that it captures information about the non-sampled part of the population. That's not correct. The error term in a regression model represents all factors that affect the dependent variable $Y$ other than the observed variables included in the model as $X$'s (explanatory/independent variables). A regression model (e.g., $y = \beta_{0} + \beta_{1}x + \epsilon$) starts from an assumption about how the $X$ and $Y$ variables are related in the population, so the error term exists even in the population model. Fitting the model to sample data is what allows you to estimate the parameters of that population model.

So the error term is NOT the difference between observed and predicted values of $Y$. Repeating myself, it represents unobserved factors affecting $Y$.
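To make this concrete, here is a minimal simulation sketch in Python (NumPy). Everything in it is an assumption made up for illustration: the "true" parameters `beta0` and `beta1`, the population size `N`, and the `unobserved` variable standing in for whatever else drives salary. No sampling takes place, yet every unit in the simulated population still has an error, because the error is simply the part of $Y$ that the $X$ in the model does not account for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" population parameters (assumed for illustration only).
beta0, beta1 = 30.0, 2.0

# Treat this large array as the ENTIRE population; nothing is sampled.
N = 100_000
experience = rng.uniform(0, 20, size=N)

# Stand-in for every factor that affects salary but is not in the model.
unobserved = rng.normal(0, 5, size=N)

# Population model: y = beta0 + beta1 * x + epsilon
salary = beta0 + beta1 * experience + unobserved

# The error is the deviation of y from the TRUE population line.
errors = salary - (beta0 + beta1 * experience)

# The errors coincide with the unobserved factors (up to rounding).
print(np.allclose(errors, unobserved))
# Their spread is clearly nonzero even though we use the whole population.
print(errors.std())
```

In real data you never know `beta0` and `beta1`, so `errors` cannot be computed; the simulation can do it only because it invents the truth.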

Once the population regression model is assumed, we proceed to estimate it with randomly sampled data. The estimation/fitting procedure estimates the values of the $\beta$'s, and we can then compute predicted/fitted values of $Y$ from those estimates and the observed values of the $X$'s. The estimated regression equation takes the form $y_{i} = \hat\beta_{0} +\hat\beta_{1}x_{i} + \hat\epsilon_{i}$, with the hats denoting estimated values. The $\hat\epsilon_{i}$'s are the residuals in the estimated equation, i.e. the differences between observed and predicted values of $Y$ for each individual in the sample, $\hat\epsilon_{i} = y_{i} - \hat\beta_{0} - \hat\beta_{1}x_{i}$, while the $\epsilon$'s are the errors in the equation containing the population parameters $\beta_{0}$ and $\beta_{1}$. The errors are not observable, while the residuals are computed from the data.
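The distinction can be checked numerically with a short sketch, again under invented assumptions: the true parameters `beta0` and `beta1`, the sample size `n`, and the normal error distribution are all hypothetical. The code draws a sample from the population model, fits the line by ordinary least squares with `np.linalg.lstsq`, and compares the residuals with the true errors, which are known here only because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population parameters (unknown in any real application).
beta0, beta1 = 30.0, 2.0

# Draw a random sample of size n from the population model.
n = 200
x = rng.uniform(0, 20, size=n)
eps = rng.normal(0, 5, size=n)      # true errors: unobservable in practice
y = beta0 + beta1 * x + eps

# Ordinary least squares fit with an intercept column.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals: observed y minus fitted y, computable from the data alone.
fitted = X @ beta_hat
residuals = y - fitted

# Residuals differ from the errors by exactly the estimation error in the betas:
# resid_i = eps_i - (b0_hat - beta0) - (b1_hat - beta1) * x_i
implied = eps - (beta_hat[0] - beta0) - (beta_hat[1] - beta1) * x

print("estimates:", beta_hat)
print("identity holds:", np.allclose(residuals, implied))
print("residuals sum to ~0:", abs(residuals.sum()) < 1e-6)
print("mean of true errors:", eps.mean())
```

With an intercept in the model, the residuals sum to zero by construction, whereas the true errors in a given sample generally do not. The residuals are estimates of the errors, and they get closer to them as the estimated $\hat\beta$'s get closer to the population $\beta$'s.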
