Solved – Why do we assume that the error is normally distributed?

normality-assumption, pac-learning, regression

I wonder why we use the Gaussian assumption when modelling the error. In Stanford's ML course, Prof. Ng describes it basically in two ways:

  1. It is mathematically convenient (it's related to least-squares fitting and is easy to solve with the pseudoinverse).
  2. Due to the Central Limit Theorem, we may assume that there are lots of underlying factors affecting the process, and the sum of these individual errors will tend to behave like a zero-mean normal distribution. In practice, this seems to be the case (a quick simulation of this idea is sketched below).
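Just to make point 2 concrete, here is a quick toy simulation (my own setup, with arbitrarily chosen component distributions): each observation's error is the sum of many small, independent, non-Gaussian contributions, and the aggregate ends up looking roughly Gaussian.

```
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Each observation's error is the sum of many small, independent,
# non-Gaussian contributions (here: uniform and centred exponential parts).
n_obs, n_components = 5_000, 200
uniform_part = rng.uniform(-1, 1, size=(n_obs, n_components))
exp_part = rng.exponential(1.0, size=(n_obs, n_components)) - 1.0  # centred at 0
errors = (uniform_part + exp_part).sum(axis=1)

# Skewness and excess kurtosis of a Gaussian are both 0;
# the summed errors should come out close to that.
print("skewness:", stats.skew(errors))
print("excess kurtosis:", stats.kurtosis(errors))
```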

I'm interested in the second part actually. The Central Limit Theorem works for iid samples, as far as I know, but we cannot guarantee that the underlying samples are iid.

Do you have any ideas about the Gaussian assumption of the error?

Best Answer

I think you've basically hit the nail on the head in the question, but I'll see if I can add something anyway. I'm going to answer this in a bit of a roundabout way ...

The field of Robust Statistics examines the question of what to do when the Gaussian assumption fails (in the sense that there are outliers):

> It is often assumed that the data errors are normally distributed, at least approximately, or that the central limit theorem can be relied on to produce normally distributed estimates. Unfortunately, when there are outliers in the data, classical methods often have very poor performance.

These have been applied in ML too, for example in Mika et al. (2001), A Mathematical Programming Approach to the Kernel Fisher Algorithm, they describe how Huber's Robust Loss can be used with KFDA (along with other loss functions). Of course this is a classification loss, but KFDA is closely related to the Relevance Vector Machine (see section 4 of the Mika paper).

As implied in the question, there is a close connection between loss functions and Bayesian error models (see here for a discussion).
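To spell that connection out (my notation, not taken from the linked discussion): if the noise is modelled as $y = f(x; w) + \varepsilon$ with density $p(\varepsilon) \propto \exp(-\rho(\varepsilon))$, then maximising the likelihood is the same as minimising the corresponding loss $\rho$:

$$
-\log p(\mathbf{y} \mid \mathbf{X}, w) = \sum_i \rho\bigl(y_i - f(x_i; w)\bigr) + \text{const},
\qquad
\rho(\varepsilon) =
\begin{cases}
\varepsilon^2 / (2\sigma^2) & \text{Gaussian noise (squared loss)} \\
|\varepsilon| / b & \text{Laplace noise (absolute loss)}
\end{cases}
$$

So picking a heavier-tailed noise density is the same thing as picking a more outlier-tolerant loss.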

However, it tends to be the case that as soon as you start incorporating "funky" loss functions, optimisation becomes tough (note that this happens in the Bayesian world too). So in many cases people resort to standard loss functions that are easy to optimise, and instead do extra pre-processing to ensure that the data conforms to the model.
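As a minimal sketch of that trade-off (my own toy example: plain linear regression rather than KFDA, and scipy's general-purpose optimiser standing in for a specialised solver): with a few gross outliers, the squared loss gets dragged around while the Huber loss largely ignores them.

```
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Toy 1-D regression data with a few gross outliers.
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)
y[::10] += 15.0  # corrupt every 10th point

X = np.column_stack([x, np.ones_like(x)])  # design matrix with intercept

def huber(r, delta=1.0):
    """Huber's robust loss: quadratic near 0, linear in the tails."""
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * (np.abs(r) - 0.5 * delta))

def objective(w, loss):
    return loss(y - X @ w).sum()

# Squared loss (Gaussian error model) vs Huber loss (outlier-tolerant).
w_sq = minimize(objective, x0=np.zeros(2), args=(lambda r: 0.5 * r**2,)).x
w_hub = minimize(objective, x0=np.zeros(2), args=(huber,)).x

print("squared-loss fit:", w_sq)   # slope/intercept pulled by the outliers
print("Huber-loss fit:  ", w_hub)  # closer to the true (2, 1)
```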

The other point that you mention is that the CLT only applies to samples that are IID. This is true, but then the assumptions (and the accompanying analysis) of most algorithms are the same. When you start looking at non-IID data, things get a lot trickier. One example is temporal dependence, in which case the typical approach is to assume that the dependence only spans a certain window, so that samples can be considered approximately IID outside of this window (see for example this brilliant but tough paper, Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes), after which normal analysis can be applied.
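As a toy illustration of the "dependence only spans a window" idea (my own example; nothing to do with the PAC-Bayes machinery in that paper): with AR(1) errors, samples that are far enough apart are nearly uncorrelated, so thinning the series by a large enough gap gives approximately independent draws.

```
import numpy as np

rng = np.random.default_rng(2)

# AR(1) errors: each error depends on the previous one, so the series is not iid.
n, phi = 20_000, 0.9
e = np.empty(n)
e[0] = rng.normal()
for t in range(1, n):
    e[t] = phi * e[t - 1] + rng.normal()

def lag_corr(x, k):
    """Sample autocorrelation at lag k."""
    return np.corrcoef(x[:-k], x[k:])[0, 1]

print("lag-1 autocorrelation:  ", lag_corr(e, 1))   # ~0.9: strongly dependent
print("lag-50 autocorrelation: ", lag_corr(e, 50))  # ~0.9**50: essentially 0

# Keeping only every 50th sample gives a sequence that is approximately iid.
thinned = e[::50]
print("thinned lag-1 autocorrelation:", lag_corr(thinned, 1))
```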

So, yes, it comes down in part to convenience, and in part because in the real world, most errors do look (roughly) Gaussian. One should of course always be careful when looking at a new problem to make sure that the assumptions aren't violated.
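One concrete way of being careful (my addition, not part of the original answer): fit the model, then look at the residuals with a normality test or a Q-Q plot before trusting the Gaussian error model.

```
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Fit ordinary least squares and inspect the residuals.
x = np.linspace(0, 10, 200)
X = np.column_stack([x, np.ones_like(x)])
y = 2.0 * x + 1.0 + rng.standard_t(df=3, size=x.size)  # heavy-tailed noise

w, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ w

# A small p-value suggests the Gaussian error assumption is questionable.
print(stats.shapiro(residuals))
# stats.probplot(residuals, dist="norm", plot=...) would give the Q-Q plot.
```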
