Solved – Predict probabilities for continuous variable

categorical data, continuous data, probability, regression

Usually, with a continuous dependent variable, we can apply linear regression and then predict values based on new data.

For instance, defaults on loans: let's say we know an individual will default on his loan, and we want to estimate how long it takes him to default (1 year, 2 years, 3 years… after he took the loan).

With linear regression, we can predict for a new individual that, based on his characteristics, he will default after X years.

But what I'm looking for is a model which will give me probabilities for each of the values.

Here, it would be: for a new individual that we know is going to default, what is the probability he will default after 1 year vs the probability he will default after 2 years…

One possibility would be to consider that the dependent variable is categorical, and regress a logit / probit model to get probabilities.

But 1) there is some loss of information. Multinomial logit does not treat the categories as related. At best, ordered logit will order them, but it still does not take into account that the increment between adjacent categories is the same (1 year).

And 2) if we want to consider defaults over more than a few years, the number of categories of the dependent variable quickly increases, which will affect the performance of the predictions.

So if anyone has an idea on how to tackle this problem, I'd like to know your thoughts! I feel like I'm not approaching it right at the moment, and maybe I need another kind of modeling altogether.

Thank you very much!

Best Answer

If you want to predict things like the probability of default as a function of time, then you are interested in survival analysis models, so check the questions on that topic.
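To make the survival-analysis idea concrete, here is a minimal sketch of a discrete-time (life-table style) estimate of the probability of default in each year. The `loans` data and the `discrete_survival` helper are made up for illustration, and the sketch deliberately leaves out borrower characteristics; with covariates you would move to a regression-type survival model.

```python
# Hypothetical data: (years observed, defaulted?).
# False means the loan was still being repaid when observation ended (censored).
loans = [(1, True), (2, True), (2, False), (3, True), (3, True),
         (4, False), (1, True), (5, True), (2, True), (4, True)]

def discrete_survival(data):
    """Return ({year: P(default in that year)}, P(no default within horizon))."""
    horizon = max(t for t, _ in data)
    surv = 1.0  # probability of having survived all earlier years
    pmf = {}
    for t in range(1, horizon + 1):
        # Loans still under observation at the start of year t.
        at_risk = sum(1 for d, _ in data if d >= t)
        # Loans that defaulted in year t.
        events = sum(1 for d, e in data if d == t and e)
        hazard = events / at_risk if at_risk else 0.0
        pmf[t] = surv * hazard   # survive to year t, then default in it
        surv *= 1.0 - hazard
    return pmf, surv

pmf, surv_left = discrete_survival(loans)
for year, p in pmf.items():
    print(f"P(default in year {year}) = {p:.3f}")
```

The yearly probabilities plus the leftover survival probability sum to one, so this gives a full distribution over "when" rather than a single predicted time.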

As for your general question: with binary data we use logistic regression, which lets us predict the probability of success by assuming a Bernoulli distribution; with multiple categories we assume a multinomial distribution; and for continuous data we assume an appropriate continuous distribution. In the case of linear regression, the probabilistic model behind it assumes a normal distribution, so once you have estimated the parameters of that distribution, you can compute the probability density of a particular outcome (or, via the cumulative distribution function, the probability that the outcome falls in a given interval). The same holds for other distributions, so basically all you need is a probabilistic model.
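As a rough illustration of the linear regression case, the sketch below fits a one-variable least-squares line to synthetic data, treats the prediction for a new individual as normal with the fitted mean and the residual standard deviation, and turns that into probabilities for yearly bins. The data, the bin edges and the `interval_probs` helper are all assumptions made for the example, and plugging in the estimates this way ignores the uncertainty in the fitted parameters.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(0)

# Synthetic data (an assumption): x is some borrower characteristic,
# y is the time to default in years.
n = 500
x = [random.uniform(0, 10) for _ in range(n)]
y = [1.0 + 0.4 * xi + random.gauss(0, 0.8) for xi in x]

# Least-squares fit of y = a + b * x.
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

# Residual standard deviation (n - 2 degrees of freedom).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sigma = math.sqrt(sse / (n - 2))

def interval_probs(x_new, edges):
    """Probability that Y falls in each bin delimited by `edges`,
    including the two open-ended bins, under Y ~ N(a + b*x_new, sigma^2)."""
    dist = NormalDist(mu=a + b * x_new, sigma=sigma)
    cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

# Bins: <=1.5, (1.5, 2.5], (2.5, 3.5], (3.5, 4.5], >4.5 years.
edges = [1.5, 2.5, 3.5, 4.5]
probs = interval_probs(5.0, edges)
print([round(p, 3) for p in probs])
```

Rounding the continuous outcome into bins like this keeps the ordering and equal spacing of the years, which is what the categorical models in the question lose.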
