Solved – How to predict a distribution (from a set of predictors) that I can simulate from

Tags: machine learning, quantile regression, sampling, simulation

Let's say I have the following regression problem:

Given a person's age and height, I want to predict how many years they've spent playing basketball. However, instead of producing a single point estimate from these features, I'd like to predict a full distribution for each test sample (for instance, the parameters of a normal distribution) that I can then draw samples from for simulation.

More explicit example: after training my model, if I have a test sample with [age=25, height=72"], I'd like to predict the distribution for how many years that person has spent playing basketball, so I can draw samples from that predicted distribution for simulation purposes. Seems similar to quantile regression but I'd like to predict an explicit distribution that I can sample from…

Any tips for how to go about solving this?

Best Answer

You can do this with linear regression! Maximum likelihood under Gaussian errors and minimizing the mean squared error give the same coefficients. Either way you are modeling the mean response, conditional on the predictors. If you also assume that the response is normally distributed around that mean, with a variance you estimate from the residuals, you have an explicit distribution to sample from for any test point. For other response distributions, you would use a generalized linear model.
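As a rough sketch of that idea, the snippet below fits ordinary least squares with NumPy, estimates the residual standard deviation, and samples from the resulting normal distribution for one test point. The training data, the coefficients used to generate it, and the helper name `simulate_years` are all made up for illustration; none of them come from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (an assumption for illustration only):
# age in years, height in inches, response = years spent playing basketball.
n = 500
age = rng.uniform(15, 45, size=n)
height = rng.uniform(60, 80, size=n)
years = 0.3 * (age - 15) + 0.1 * (height - 60) + rng.normal(0, 1.5, size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), age, height])

# Ordinary least squares: same coefficients as the Gaussian MLE.
beta, *_ = np.linalg.lstsq(X, years, rcond=None)

# Unbiased estimate of the residual standard deviation.
residuals = years - X @ beta
sigma_hat = np.sqrt(residuals @ residuals / (n - X.shape[1]))

def simulate_years(age, height, size=1000):
    """Draw samples from the fitted Normal(mean(x), sigma_hat) for one person."""
    mu = beta @ np.array([1.0, age, height])
    return rng.normal(mu, sigma_hat, size=size)

# Predicted distribution for the test sample [age=25, height=72"].
draws = simulate_years(25, 72, size=10_000)
print(draws.mean(), draws.std())
```

Two caveats on this sketch: it plugs in the point estimates and so ignores uncertainty in the fitted coefficients, and a normal distribution can produce negative "years played", which is one reason a generalized linear model with a different response distribution may fit this problem better.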

What you're looking for also arises naturally as the "posterior predictive distribution" in a Bayesian model. It might be helpful to think of the maximum-likelihood estimate as the posterior mode of a Bayesian model with flat priors.
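To make the posterior predictive idea concrete, here is a minimal sketch for normal linear regression. It assumes the standard noninformative prior (flat on the coefficients and on log sigma), under which the posterior can be sampled in closed form: first draw the noise variance, then the coefficients given that variance, then a new response. The data and the helper name `posterior_predictive` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data (illustrative assumption, not from the question).
n = 200
age = rng.uniform(15, 45, size=n)
height = rng.uniform(60, 80, size=n)
years = 0.3 * (age - 15) + 0.1 * (height - 60) + rng.normal(0, 1.5, size=n)

X = np.column_stack([np.ones(n), age, height])
k = X.shape[1]

# Least-squares fit and residual variance estimate.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ years)
s2 = np.sum((years - X @ beta_hat) ** 2) / (n - k)

# Cholesky factor of X'X, used to draw coefficients with covariance
# proportional to (X'X)^{-1}.
C = np.linalg.cholesky(XtX)

def posterior_predictive(x_new, size=5000):
    """Sample new responses at x_new, integrating over parameter uncertainty."""
    # sigma^2 | y  ~  scaled inverse chi-square with n - k degrees of freedom.
    sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=size)
    # beta | sigma^2, y  ~  Normal(beta_hat, sigma^2 (X'X)^{-1}).
    z = rng.standard_normal((k, size))
    dev = np.linalg.solve(C.T, z)  # columns have covariance (X'X)^{-1}
    betas = beta_hat[:, None] + np.sqrt(sigma2)[None, :] * dev
    # y_new | beta, sigma^2  ~  Normal(x_new . beta, sigma^2).
    mu = x_new @ betas
    return rng.normal(mu, np.sqrt(sigma2))

# Posterior predictive draws for the test sample [age=25, height=72"].
x_test = np.array([1.0, 25.0, 72.0])
draws = posterior_predictive(x_test)
print(draws.mean(), draws.std())
```

Compared with plugging the point estimates into a normal distribution, these draws are slightly more spread out because they also carry the uncertainty in the coefficients and the noise variance. With informative priors or non-Gaussian likelihoods there is usually no closed form, and one would typically sample the posterior with MCMC instead.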

Fully nonparametric density estimation is very difficult in general, except in low-dimensional cases. I asked an involved question on this topic about a year ago, but I can't find it now.
