Applications of Truncated Probability Distributions

regression

Recently, I was wondering how to "restrict" a statistical model so that it does not make predictions beyond a certain range (Preventing Illogical Interoperations of Models?).

For example, in this video (https://www.youtube.com/watch?v=h5aPo5wXN8E&list=PLDcUM9US4XdNM4Edgs7weiyIguLSToZRI&index=3 @ 56:40), a Bayesian model of human heights is built on the Log Normal Distribution, because heights cannot take negative values.

After spending some more time reading about this, I came across the idea of "Truncated Probability Distributions" (https://en.wikipedia.org/wiki/Truncated_normal_distribution). As I understand it, a Truncated Probability Distribution is a Probability Distribution that is defined only on a "limited range" (i.e. it is "restricted"). For example, consider the Normal Distribution: we can "truncate" this distribution to the range $a \le x \le b$:

$$f(x; \mu, \sigma, a, b) = \frac{1}{\sigma} \cdot \frac{\phi\left(\frac{x-\mu}{\sigma}\right)}{\Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right)}$$

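Where: $$\phi(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)$$ is the standard normal density, $\Phi$ is the standard normal cumulative distribution function, and $f$ is zero outside $a \le x \le b$.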
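As a rough numerical check of the density above, here is a minimal Python sketch using only the standard library. The function names and the example values ($\mu = 70$, $\sigma = 15$, $a = 0$, $b = 122$) are illustrative assumptions, not taken from any data. The check only confirms that the truncated density integrates to about 1 over $[a, b]$ and is zero outside it.

```python
import math


def std_normal_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Standard normal CDF Phi(z), written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def truncnorm_pdf(x, mu, sigma, a, b):
    """Density of a normal(mu, sigma) truncated to the interval [a, b]."""
    if x < a or x > b:
        return 0.0
    # Probability mass that the untruncated normal puts on [a, b].
    mass = std_normal_cdf((b - mu) / sigma) - std_normal_cdf((a - mu) / sigma)
    return std_normal_pdf((x - mu) / sigma) / (sigma * mass)


def integrate(f, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


# Illustrative values only (assumed, not estimated from data).
mu, sigma, a, b = 70.0, 15.0, 0.0, 122.0
total = integrate(lambda x: truncnorm_pdf(x, mu, sigma, a, b), a, b)
print(f"integral over [a, b] = {total:.6f}")
print(truncnorm_pdf(150.0, mu, sigma, a, b))  # a point outside the range
```

Dividing by the mass on $[a, b]$ is what rescales the ordinary normal density so that the restricted version is still a proper probability distribution.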

This leads me to my question: Suppose I collect some data on how long different people lived and the average yearly income they earned over their lives. Suppose I am interested in modelling (e.g. with regression) the effect of income on life expectancy. In this problem, an upward trend is quite likely, since people with higher incomes could probably afford better healthcare and thus lived longer. However, if I use this model to predict the life expectancy of a billionaire, the prediction might be around 200 years, and we know that in modern history no human has ever been recorded as living that long.

Thus, suppose I found out the maximum age a human has ever reached. To avoid making such illogical predictions, could I create a GLM Regression Model based on a Truncated Normal Probability Distribution between $a = 0$ and $b$ = max_age_ever_recorded, and thus address this problem of illogical predictions? Is this a statistically valid approach? Or is it illogical or unnecessary?

Thanks!

Best Answer

There's a famous quote from George Box:

"All models are wrong, but some are useful."

Sure, you can use a truncated distribution, or another distribution that has a restricted range out of the box. But what would the upper bound be? If you get it wrong, your model will be wrong as well!

However, let's suppose that you didn't restrict the range. So what? Yes, your model could say that life expectancy as a function of income is 200 years for a billionaire. So what? First of all, life expectancy is not a function of income: a billionaire may die like any of us in a scenario where their wealth would not change anything. So your model is obviously wrong, as life expectancy as a function of wealth is not the "true" explanation. The model could still be useful in some scenarios, as long as you remember its limited applicability. Truncating the distribution would just be putting lipstick on a pig.

Of course, if we have good reasons to use things like truncated distributions in a model, we do so. But covering up the fact that the model does not work in some scenarios is not a good reason. In fact, it may hide the problems with the model, giving you a false sense that it works properly. It would only force your linear predictions to fit the square hole.
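To make the last point concrete, here is a minimal Python sketch (standard library only). It uses the standard formula for the mean of a truncated normal, $\mu + \sigma\,\frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}$ with $\alpha = (a-\mu)/\sigma$ and $\beta = (b-\mu)/\sigma$. The bounds ($a = 0$, $b = 122$), $\sigma = 10$, and the location values are made-up numbers for illustration, and treating the linear predictor as the location parameter $\mu$ is an assumption about how such a regression might be set up.

```python
import math


def std_normal_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Standard normal CDF Phi(z), written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def truncnorm_mean(mu, sigma, a, b):
    """Mean of a normal(mu, sigma) truncated to [a, b]."""
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    mass = std_normal_cdf(beta) - std_normal_cdf(alpha)
    return mu + sigma * (std_normal_pdf(alpha) - std_normal_pdf(beta)) / mass


# Hypothetical bounds and scale, chosen only for illustration.
a, b, sigma = 0.0, 122.0, 10.0

# Pretend the linear predictor (used here as the location mu) extrapolates
# further and further as income grows.
for mu in (70.0, 100.0, 150.0, 200.0):
    print(f"location {mu:6.1f} -> truncated mean {truncnorm_mean(mu, sigma, a, b):7.2f}")
```

Every predicted mean stays inside the bounds however far the location parameter extrapolates, so the output looks plausible even where the linear relationship has clearly stopped making sense.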
