Solved – How to predict a distribution (from a set of predictors) that I can simulate from

machine learning, quantile regression, sampling, simulation

Let's say I have the following regression problem:

Given a person's age and height, I want to predict how many years they've spent playing basketball. However, instead of producing only a point estimate from these features, I'd like to predict a distribution for each test sample (for instance, the parameters of a normal distribution) that I can then draw samples from for simulation.

More explicit example: after training my model, if I have a test sample with [age=25, height=72"], I'd like to predict the distribution for how many years that person has spent playing basketball, so I can draw samples from that predicted distribution for simulation purposes. Seems similar to quantile regression but I'd like to predict an explicit distribution that I can sample from…

Any tips for how to go about solving this?

Best Answer

You can do this with linear regression! Whether you fit by maximum likelihood or by minimizing the mean squared error, you are modeling the mean response conditional on the predictors, and you are implicitly assuming that the response itself is normally distributed around that mean with constant variance. The fitted mean for a test sample, together with the estimated residual variance, then gives you a normal distribution you can draw from. For other distributions, you would use a generalized linear model.
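As a minimal sketch of that idea in base R: fit `lm()`, take the predicted mean for a test sample and the residual standard deviation, and draw from the resulting normal. The training data here are simulated purely for illustration, and the names `train`, `new_person`, `mu_hat` and `sigma_hat` are made up for this example.

```r
# Minimal sketch: a linear model read as a conditional normal distribution.
# The data below are simulated for illustration only (an assumption).
set.seed(1)
n <- 500
train <- data.frame(age = runif(n, 15, 60), height = rnorm(n, 70, 4))
train$years <- 0.2 * train$age + 0.1 * (train$height - 70) + rnorm(n, sd = 2)

fit <- lm(years ~ age + height, data = train)

# Predicted mean for one test sample and the residual standard deviation
new_person <- data.frame(age = 25, height = 72)
mu_hat <- unname(predict(fit, newdata = new_person))
sigma_hat <- summary(fit)$sigma

# Draw from the predicted distribution N(mu_hat, sigma_hat^2)
draws <- rnorm(10000, mean = mu_hat, sd = sigma_hat)
```

This plug-in approach ignores the uncertainty in the estimated coefficients, and a normal distribution can produce negative values, which may not be sensible for a quantity like years played.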

What you're looking for also arises naturally as the "posterior predictive distribution" in a Bayesian model. It might be helpful to think of maximum-likelihood-based inference as Bayesian inference with flat priors.

Fully nonparametric density estimation is very difficult in general except in low-dimensional cases. I asked an involved question about it about a year ago, but I can't find it now.

Related Solutions

Solved – Simulate from a truncated mixture normal distribution

Simulation from a truncated normal is easily done if you have access to a proper normal quantile function. For instance, simulating from $$ \mathcal{N}_a^b(\mu,\sigma^2)$$ where $a$ and $b$ denote the lower and upper bounds can be done by inverting the cdf $$\dfrac{\Phi(\sigma^{-1}\{x-\mu\})-\Phi(\sigma^{-1}\{a-\mu\})}{\Phi(\sigma^{-1}\{b-\mu\})-\Phi(\sigma^{-1}\{a-\mu\})} $$ e.g., in R

x = mu + sigma * qnorm( pnorm(a,mu,sigma) + 
     runif(1)*(pnorm(b,mu,sigma) - pnorm(a,mu,sigma)) )

Otherwise, I developed a truncated normal accept-reject algorithm twenty years ago.

If we consider the truncated mixture problem, with density $$ f(x;\theta) \propto \left\{p\varphi(x;\mu_1,\sigma_1)+(1-p)\varphi(x;\mu_2,\sigma_2)\right\}\mathbb{I}_{[a,b]}(x) $$ it is a mixture of truncated normal distributions but with different weights: $$ f(x;\theta) \propto p\left\{\Phi(\sigma_1^{-1}\{b-\mu_1\})-\Phi(\sigma_1^{-1}\{a-\mu_1\}) \right\}\dfrac{\sigma_1^{-1}\phi(\sigma_1^{-1}\{x-\mu_1\})}{\Phi(\sigma_1^{-1}\{b-\mu_1\})-\Phi(\sigma_1^{-1}\{a-\mu_1\})} \\[15pt] +(1-p)\left\{\Phi(\sigma_2^{-1}\{b-\mu_2\})-\Phi(\sigma_2^{-1}\{a-\mu_2\}) \right\}\dfrac{\sigma_2^{-1}\phi(\sigma_2^{-1}\{x-\mu_2\})}{\Phi(\sigma_2^{-1}\{b-\mu_2\})-\Phi(\sigma_2^{-1}\{a-\mu_2\})} $$

Therefore, to simulate from a truncated normal mixture, it is sufficient to take $$x=\begin{cases} x_1\sim\mathcal{N}_a^b(\mu_1,\sigma_1^2) &\text{with probability }\\ &\qquad p\left\{\Phi(\sigma_1^{-1}\{b-\mu_1\})-\Phi(\sigma_1^{-1}\{a-\mu_1\}) \right\}\big/\mathfrak{s}\\ x_2\sim\mathcal{N}_a^b(\mu_2,\sigma_2^2) &\text{with probability }\\ &\qquad(1-p)\left\{\Phi(\sigma_2^{-1}\{b-\mu_2\})-\Phi(\sigma_2^{-1}\{a-\mu_2\}) \right\}\big/\mathfrak{s} \end{cases} $$ where \begin{align} \mathfrak{s}=&p\left\{\Phi(\sigma_1^{-1}\{b-\mu_1\})-\Phi(\sigma_1^{-1}\{a-\mu_1\}) \right\}+ \\ &(1-p)\left\{\Phi(\sigma_2^{-1}\{b-\mu_2\})-\Phi(\sigma_2^{-1}\{a-\mu_2\}) \right\} \end{align}
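The two-step recipe above (pick a component with the reweighted probabilities, then draw from the corresponding truncated normal by inverting the cdf) could be sketched in base R as follows. The function names `rtnorm1()` and `rtmixnorm()` are illustrative, not taken from any package, and the parameter values in the call are arbitrary.

```r
# Sketch: simulate from a two-component normal mixture truncated to [a, b].
# rtnorm1() and rtmixnorm() are hypothetical helper names for this example.
rtnorm1 <- function(n, mu, sigma, a, b) {
  # Inverse-cdf draw from N(mu, sigma^2) restricted to [a, b]
  lo <- pnorm(a, mu, sigma)
  hi <- pnorm(b, mu, sigma)
  mu + sigma * qnorm(lo + runif(n) * (hi - lo))
}

rtmixnorm <- function(n, p, mu1, sigma1, mu2, sigma2, a, b) {
  # Mixture weight times the mass each component places on [a, b]
  w1 <- p * (pnorm(b, mu1, sigma1) - pnorm(a, mu1, sigma1))
  w2 <- (1 - p) * (pnorm(b, mu2, sigma2) - pnorm(a, mu2, sigma2))
  # Choose component 1 with probability w1 / (w1 + w2)
  from1 <- runif(n) < w1 / (w1 + w2)
  x <- numeric(n)
  x[from1] <- rtnorm1(sum(from1), mu1, sigma1, a, b)
  x[!from1] <- rtnorm1(sum(!from1), mu2, sigma2, a, b)
  x
}

set.seed(42)
x <- rtmixnorm(10000, p = 0.3, mu1 = -1, sigma1 = 1,
               mu2 = 2, sigma2 = 0.5, a = 0, b = 3)
```

Note that the inverse-cdf step can lose precision when `[a, b]` lies far in the tail of a component, which is the kind of case where an accept-reject scheme is likely the better choice.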

Poisson Regression – How to Simulate from a Zero-Inflated Poisson Distribution

You can get the probability of zero-inflation by

p <- predict(object, ..., type = "zero")

and the mean of the count distribution by

lambda <- predict(object, ..., type = "count")

See Appendix C of vignette("countreg", package = "pscl") for a few more details.

To simulate the distribution, you can either do it manually with

ifelse(rbinom(n, size = 1, prob = p) > 0, 0, rpois(n, lambda = lambda))

or you can use rzipois() from the VGAM package

library("VGAM")
rzipois(n, lambda = lambda, pstr0 = p)

which essentially also does an ifelse() as above but adds a few sanity checks etc.
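To see the manual version at work without a fitted model, here is a small base-R sketch with fixed parameters. The values of `p` and `lambda` are assumptions chosen for illustration; in practice they would come from the `predict()` calls above. The sample zero fraction and mean are compared with the theoretical values $p + (1-p)e^{-\lambda}$ and $(1-p)\lambda$.

```r
# Sketch: manual zero-inflated Poisson draws with fixed, illustrative parameters.
set.seed(123)
n <- 100000
p <- 0.3       # probability of a structural zero (assumed value)
lambda <- 4    # mean of the Poisson count component (assumed value)

y <- ifelse(rbinom(n, size = 1, prob = p) > 0, 0, rpois(n, lambda = lambda))

# Theoretical values to compare against the sample
zero_prob <- p + (1 - p) * exp(-lambda)   # P(Y = 0)
zip_mean <- (1 - p) * lambda              # E[Y]
c(mean(y == 0), zero_prob, mean(y), zip_mean)
```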
