[Math] Residual analysis in Python

regression, regression analysis

When doing residual analysis, do we first fit our model on the entire training set and calculate residuals between fitted values and actual values? Or do we first fit our model on the combined training+testing set?

I'm having trouble wrapping my head around the concept of residual variance. Does the variance mean that if we fit our linear regression model on multiple (varying) datasets, our residuals would vary according to a normal distribution with mean 0 and this variance?

When would we use prediction versus estimation? Predictions have more variance because they involve a new data point, but it seems that we are always estimating/predicting new data points?

How do you deal with leverage points?

Does anybody know any good Python packages to do residual analysis?

Best Answer

You shouldn't ask so many questions at once. That said:

  • Fit a model on the training set, then see how well that fit performs on the testing set; a minimal sketch follows this list.
  • Your conjectured meaning of residual variance is incorrect; the usual explanation is in terms of Bessel's correction (a degrees-of-freedom adjustment). In short, residuals are how wrong the line of best fit is in its estimates, and those residuals have a sample variance. But this is all done with the one dataset used to fit the model; see the second sketch below.
  • I'm not sure what you're asking about re: prediction vs. estimation. Bear in mind one estimates the coefficients in a regression, but having done so predicts $y$ values from $x$ values. A prediction at a new point $x_0$ carries the error term's noise on top of the coefficient uncertainty, $\operatorname{Var}(\hat{y}_0) = \sigma^2\left(1 + x_0^\top (X^\top X)^{-1} x_0\right)$, versus $\sigma^2\, x_0^\top (X^\top X)^{-1} x_0$ when merely estimating the mean response at $x_0$.
  • Points of high leverage reduce the noise in their own residuals. One way to deal with them is to studentize the residuals, which accounts for this heteroscedasticity; see the third sketch below. There's an outside chance that what you really wanted to know is how to deal with outliers, which is a complicated issue.
  • You can analyse residuals with NumPy alone, as the sketches below show.
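
A minimal sketch of the first point: fit a simple linear regression on the training set only, then compute residuals both in-sample and on the held-out test set. The synthetic data and the 80/20 split here are assumptions purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, assumed for illustration: y = 2 + 3x + noise.
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=100)

# 80/20 train/test split.
idx = rng.permutation(len(x))
train, test = idx[:80], idx[80:]

# Fit on the training set only, using the design matrix [1, x].
X_train = np.column_stack([np.ones(train.size), x[train]])
beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

# In-sample residuals: these are what residual analysis examines.
resid_train = y[train] - X_train @ beta

# Out-of-sample residuals: how the same fit performs on new data.
X_test = np.column_stack([np.ones(test.size), x[test]])
resid_test = y[test] - X_test @ beta

print("train RMSE:", np.sqrt(np.mean(resid_train**2)))
print("test  RMSE:", np.sqrt(np.mean(resid_test**2)))
```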
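For the residual-variance point, the usual unbiased estimator divides the residual sum of squares by $n - p$, the regression analogue of Bessel's correction. Continuing from the arrays in the sketch above:

```python
# Unbiased estimate of the error variance sigma^2: divide the residual
# sum of squares by n - p degrees of freedom (p = 2: intercept + slope).
# This is the regression analogue of Bessel's correction.
n, p = X_train.shape
sigma2_hat = np.sum(resid_train**2) / (n - p)
print("estimated residual variance:", sigma2_hat)
```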
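And for leverage: the leverages are the diagonal of the hat matrix $H = X(X^\top X)^{-1}X^\top$, and internally studentized residuals rescale each residual by its own standard error $\hat\sigma\sqrt{1 - h_{ii}}$, which is exactly how high-leverage points get less noisy residuals. Again continuing from the arrays above; the $2p/n$ cutoff is a common rule of thumb, not a hard threshold.

```python
# Leverages: diagonal of the hat matrix H = X (X'X)^{-1} X'.
XtX_inv = np.linalg.inv(X_train.T @ X_train)
leverage = np.einsum("ij,jk,ik->i", X_train, XtX_inv, X_train)

# Internally studentized residuals: each residual divided by its own
# standard error, which shrinks as leverage grows.
studentized = resid_train / np.sqrt(sigma2_hat * (1.0 - leverage))

# Common rule of thumb: flag leverages above 2p/n for inspection.
flagged = leverage > 2 * p / n
print("high-leverage points:", np.flatnonzero(flagged))
```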