Solved – reducible vs irreducible error

error, machine-learning, modeling

I've been reading an introductory text on statistical modelling, in particular the concepts of reducible and irreducible error.

As I understand it, reducible error is the bit we can have any hope of minimising by fitting better models or making our models richer by capturing more of the underlying phenomenon.

The thing that confused me is that the author says irreducible error can arise from measurement error, from not being able to capture all the features relevant to modelling the underlying phenomenon, and from inherent variability in the data. This last phrase has me confused. I thought the underlying variability would arise from measurement error, so I'm not sure I understand what is meant by it.

There's also the issue of not being able to capture all the relevant features. Shouldn't this fall under reducible error, since if we did measure these features we could perhaps produce a better model?

Best Answer

Assume for the moment that the world is inherently deterministic -- probably a fine approximation if we're operating at levels above the quantum. Any given outcome $y$ is some unknown function of $\mathbf{X}$. What is in $\mathbf{X}$? Everything. We obviously don't observe everything. Hopefully we observe some of the most important things, and the things we miss are not among the most important. But if we don't observe everything that has an effect, then we simply can't find the deterministic relationship between $\mathbf{X}$ and $y$. There will always be noise -- which corresponds to the part of $y$ that is determined by things we don't observe.

So viewed this way, there is no such thing as inherent variability -- inherent variability is just deterministic movement based on factors that you don't observe.
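A small simulation makes this concrete (a sketch with made-up coefficients, not from the text): generate a fully deterministic outcome from an observed feature `x` and an unobserved factor `z`. Regressing on `x` alone, the omitted `z`-part shows up as "irreducible" residual noise; once `z` is included, the residuals vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)   # observed feature
z = rng.normal(size=n)   # unobserved factor
y = 2.0 * x + 3.0 * z    # fully deterministic outcome -- no randomness anywhere

# Regress y on x alone: the 3*z component masquerades as noise.
X_obs = np.column_stack([np.ones(n), x])
beta_obs = np.linalg.lstsq(X_obs, y, rcond=None)[0]
resid_obs = y - X_obs @ beta_obs
print(resid_obs.std())   # roughly 3, the scale of the omitted 3*z term

# Regress y on both x and z: residuals are essentially zero.
X_full = np.column_stack([np.ones(n), x, z])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
resid_full = y - X_full @ beta_full
print(resid_full.std())  # essentially zero -- nothing was inherently random
```

The "irreducible" error here is irreducible only relative to the information set: given only `x`, no model can shrink it; given `z` as well, it disappears entirely.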

Concretely: you have data on the sales of a hotdog stand, and your $\mathbf{X}$'s include features of that hotdog stand. These can explain much of the stand's sales. But features of the stand won't explain everything. Maybe the weather affects hotdog sales. Maybe there was a mustard shortage. Maybe Trump tweeted something nasty about hotdogs. These are deterministic factors that you don't observe, but which could be interpreted as inherent variability.

From this perspective, measurement error is itself a process you don't observe. If you knew what caused the error, you could predict it and put it in your model, and there would be nothing inherently variable about it.

Finally, if the number of $\mathbf{X}$'s is infinite and your observations of $y$ are finite, you'll never be able to identify the deterministic relationship, even if you know the proper functional form -- which you won't. There will be different combinations of parameters that produce identical outcomes under the same structure.
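A toy illustration of that last point (an invented model, purely for illustration): if the functional form only depends on parameters through their sum, distinct parameter settings are indistinguishable from any amount of data on $y$.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)

def model(x, a, b):
    # Only the sum a + b enters the output, so a and b are not
    # separately identified -- data on y can never tell them apart.
    return (a + b) * x

y1 = model(x, a=1.0, b=2.0)
y2 = model(x, a=0.5, b=2.5)
print(np.allclose(y1, y2))  # True: different parameters, identical outcomes
```

Real models are rarely this transparently degenerate, but with infinitely many candidate regressors and finitely many observations, some version of this ambiguity is unavoidable.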
