Solved – Assumptions of linear regression and gradient descent

generalized linear model, gradient descent, machine learning, multiple regression, regression

I have been reading about linear regression (from Andrew Ng's lectures and ISLR) and about estimating the coefficients using gradient descent. This is what I've understood of gradient descent:

  • Include a dummy variable whose value is one for every sample point, so that we can estimate the intercept.
  • Assign a random initial weight to each of the variables and make predictions using those weights.
  • Using a cost function (squared error, for example), compute the loss and the derivative of the cost function with respect to the weights (the gradient).
  • Adjust each weight by subtracting its gradient (times the learning rate) from it.
  • Iterate until the change in the weights is insignificant or a set number of iterations has been reached.
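
The steps above can be sketched in NumPy roughly as follows. This is only a minimal illustration, not code from the lectures or ISLR: the function name fit_linear_gd, the default learning rate and tolerance, and the toy data are all assumptions made for the example.

```python
import numpy as np

def fit_linear_gd(X, y, learning_rate=0.1, max_iters=10_000, tol=1e-10, seed=0):
    """Fit linear regression by batch gradient descent on squared error.

    Hypothetical helper for illustration; returns [intercept, w_1, ..., w_p].
    """
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Step 1: prepend a column of ones so the first weight acts as the intercept.
    X_aug = np.column_stack([np.ones(n_samples), X])
    # Step 2: start from small random weights.
    w = rng.normal(scale=0.01, size=X_aug.shape[1])
    for _ in range(max_iters):
        residuals = X_aug @ w - y
        # Step 3: gradient of (1 / (2n)) * sum(residuals**2) with respect to w.
        grad = X_aug.T @ residuals / n_samples
        # Step 4: move against the gradient, scaled by the learning rate.
        step = learning_rate * grad
        w -= step
        # Step 5: stop once the update is negligible.
        if np.max(np.abs(step)) < tol:
            break
    return w

# Made-up toy data: y = 2 + 3*x1 - 1*x2 + small noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

w_gd = fit_linear_gd(X, y)

# Closed-form least squares on the same design matrix, for comparison.
w_ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(200), X]), y, rcond=None)

print("gradient descent:", w_gd)
print("least squares:   ", w_ols)
```

On this toy data the two sets of coefficients should agree closely, since gradient descent is just one way of minimizing the same squared-error cost that least squares minimizes in closed form.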

Now, coming to the assumptions of linear regression: nowhere in this whole process have we had to assume anything like normality and constant variance of the errors, no autocorrelation, or independence of the features. I agree that if the response variable is linearly related to the predictor variables, the model fit will be better. But what of the other assumptions? My question is: where do these assumptions stem from? What is the basis or justification for making them?

Best Answer

The typical assumptions of linear regression don't really have much to do with how you optimize the model; they're about whether the learned model will be "right." That is:

  • Given some data, you can always run a linear regression model, get some coefficients out, and use them to make predictions.
  • For the estimated coefficients to approximate the "true" coefficients well, we need to assume both that these "true" coefficients exist and some things about the distribution of the data we observe.
  • Similarly, for the predictions to converge to "the best possible predictions," we need to make some of the same assumptions about what the distributions look like.
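
A small simulation can make these points concrete. The sketch below is illustrative only: the data-generating processes, sample size, and variable names are assumptions chosen for the example. It fits the same linear model to data that really does come from a linear model with independent, constant-variance noise, and to data whose true relationship is quadratic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
A = np.column_stack([np.ones(n), x])  # intercept column plus the single predictor

# Case 1: the data really come from a linear model, y = 1 + 2x + noise,
# so "true" coefficients (1 and 2) exist.
y_linear = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
coef_linear, *_ = np.linalg.lstsq(A, y_linear, rcond=None)

# Case 2: the true relationship is quadratic, y = x**2 + noise,
# so there are no "true" linear coefficients to recover.
y_quad = x**2 + rng.normal(scale=0.5, size=n)
coef_quad, *_ = np.linalg.lstsq(A, y_quad, rcond=None)

# In-sample mean squared error of each fitted line.
mse_linear = np.mean((A @ coef_linear - y_linear) ** 2)
mse_quad = np.mean((A @ coef_quad - y_quad) ** 2)

print("well-specified coefficients:", coef_linear)
print("misspecified coefficients:  ", coef_quad)
print("MSE (well-specified, misspecified):", mse_linear, mse_quad)
```

Both fits run without complaint and return coefficients, which is the first bullet. Only in the first case do those coefficients correspond to anything "true," and only there is the prediction error close to the noise level. In the second case the fitted slope is near zero even though y depends strongly on x.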