Solved – Inverse CDF sampling for a mixed distribution

Tags: copula, distributions, r, sampling, simulation

The out-of-context short version

Let $y$ be a random variable with CDF
$$
F(y) \equiv \begin{cases} \theta & y = 0 \\ \theta + (1-\theta) \, \text{CDF}_{\text{log-normal}}(y; \mu, \sigma) & y > 0 \end{cases}
$$

Let's say I wanted to simulate draws of $y$ using the inverse CDF method. Is that possible? Because of the point mass at $y = 0$, this function doesn't have an inverse in the usual sense. Then again, Inverse transformation sampling for mixture distribution of two normal distributions suggests that there is a known way to apply inverse transform sampling here.

I'm aware of the two-step method, but I don't know how to apply it to my situation (see below).
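
For what it's worth, the generalized inverse $F^{-1}(u) \equiv \inf\{y : F(y) \ge u\}$ of the CDF above has a simple closed form: any $u \le \theta$ maps to $0$, and any $u > \theta$ maps through the log-normal quantile function after rescaling. A minimal R sketch (the helper name qzilnorm is mine, not from any package):

qzilnorm <- function(u, theta, mu, sigma) {
    # u <= theta falls in the point mass at zero;
    # otherwise rescale u to (0, 1) and invert the log-normal CDF
    ifelse(u <= theta, 0,
           qlnorm((u - theta) / (1 - theta), meanlog = mu, sdlog = sigma))
}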


The long version with background

I fitted the following model for a vector-valued response, $y^i = \left( y_1 , \dots , y_K \right)^i$, using MCMC (specifically, Stan):

$$
\theta_k^i \equiv \operatorname{logit}^{-1}\left( \alpha_k x^i \right), \quad \mu_k^i \equiv \beta_k x^i - \frac{ \sigma^2_k }{ 2 } \\
F(y) \equiv \begin{cases} \theta & y = 0 \\ \theta + (1-\theta) \, \text{CDF}_{\text{log-normal}}(y; \mu, \sigma) & y > 0 \end{cases} \\
u_k \equiv F(y_k), \quad z_k \equiv\Phi^{-1}{\left( u_k \right)} \\
z \sim \mathcal{N}(\mathbf{0}, R) \times \prod_k f(y_k) \\
\left( \alpha, \beta, \sigma, R \right) \sim \text{priors}
$$

where $i$ indexes $N$ observations, $R$ is a correlation matrix, and $x$ is a vector of predictors/regressors/features.
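
In code terms, the marginal step maps each observed $y_k$ to a uniform $u_k$ and then to a Gaussian score $z_k$. A small R sketch of that transform under this model (names are illustrative, not taken from my Stan program):

pzilnorm <- function(y, theta, mu, sigma) {
    # F(0) = theta; F(y) = theta + (1 - theta) * CDF_log-normal(y) for y > 0
    ifelse(y == 0, theta,
           theta + (1 - theta) * plnorm(y, meanlog = mu, sdlog = sigma))
}

u <- pzilnorm(y, theta, mu, sigma)  # u_k = F(y_k)
z <- qnorm(u)                       # z_k = Phi^{-1}(u_k)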

That is, my model is a regression model in which the conditional distribution of the response is assumed to be a Gaussian copula with zero-inflated log-normal marginals. I've posted about this model before; it turns out that Song, Li, and Yuan (2009, gated) have developed it and they call it a vector GLM, or VGLM. The following is their specification as close to verbatim as I could get it:
$$
f(\mathbf{y}; \mathbf{\mu}, \mathbf{\varphi}, \Gamma) = c\{ G_1(y_1), \dots, G_m(y_m) | \Gamma \} \prod_{i=1}^m g(y_i; \mu_i, \varphi_i) \\
c(\mathbf{u} | \Gamma) = \left| \Gamma \right|^{-1/2}\exp\left( \frac{1}{2} \mathbf{q}^T \left( I_m - \Gamma^{-1} \right) \mathbf{q} \right) \\
\mathbf{q} = \left( q_1, \dots, q_m \right)^T, \quad q_i = \Phi^{-1}(u_i)
$$
My $F_K$ corresponds to their $G_m$, my $z$ corresponds to their $\mathbf{q}$, and my $R$ corresponds to their $\Gamma$; the details are on page 62 (page 3 of the PDF file) but they're otherwise identical to what I wrote here.
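
For reference, the copula density above is straightforward to evaluate; a minimal R transcription of the formula (my own helper, not from Song et al.) is:

log_copula_density <- function(u, Gamma) {
    # log of c(u | Gamma) = -1/2 log|Gamma| + 1/2 q' (I - Gamma^{-1}) q
    q <- qnorm(u)
    m <- length(q)
    log_det <- as.numeric(determinant(Gamma, logarithm = TRUE)$modulus)
    -0.5 * log_det + 0.5 * drop(t(q) %*% (diag(m) - solve(Gamma)) %*% q)
}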

The zero-inflated part roughly follows the specification of Liu and Chan (2010, ungated).

Now I would like to simulate data from the estimated parameters, but I'm a little confused as to how to go about it. First I thought I could just simulate $y$ directly (in R code):

Y_hat <- matrix(0, nrow = N, ncol = K)
for (i in 1:N) {
    for (k in 1:K) {
        # Bernoulli draw for "non-zero", then a log-normal draw for the size
        if (rbinom(1, 1, 1 - theta[i, k]) == 1)
            Y_hat[i, k] <- rlnorm(1, mu[i, k], sigma[k])
    }
}

which doesn't use $R$ at all. I'd like to try to actually use the correlation matrix I estimated.

My next idea was to take draws of $z$ and then convert them back to $y$. This also seems to coincide with the answers in Generating samples from Copula in R and Bivariate sampling for distribution expressed in Sklar's copula theorem?. But what the heck is my $F^{-1}$ here? Inverse transformation sampling for mixture distribution of two normal distributions makes it sound like this is possible, but I have no idea how to do it.
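
If the generalized inverse sketched in the short version is acceptable, that is exactly the missing $F^{-1}$: draw correlated Gaussian scores, push them through $\Phi$, and invert each marginal. A hedged R sketch, reusing the hypothetical qzilnorm helper from above and assuming the mvtnorm package:

library(mvtnorm)  # for rmvnorm; MASS::mvrnorm would work equally well

Y_hat <- matrix(0, nrow = N, ncol = K)
for (i in 1:N) {
    z <- drop(rmvnorm(1, sigma = R))  # correlated draws z ~ N(0, R)
    u <- pnorm(z)                     # uniforms u_k = Phi(z_k)
    for (k in 1:K)
        Y_hat[i, k] <- qzilnorm(u[k], theta[i, k], mu[i, k], sigma[k])
}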

Best Answer

The answer to the long version with background:

An earlier version of this answer addressed a somewhat different issue; since we seem to have had difficulty formulating the model and the problem, I rephrase it here, hopefully correctly.

For $1\le i\le I$, the goal is to simulate vectors $y^i=(y^i_1,\ldots,y^i_K)$ such that, conditional on a covariate $x^i$, $$ y_k^i = \begin{cases} 0 &\text{ with probability }\operatorname{logit}^{-1}\left( \alpha_k x^i \right)\\ \exp\left( \sigma_k z_k^i + \mu_k^i \right) &\text{ with probability }1-\operatorname{logit}^{-1}\left( \alpha_k x^i \right)\end{cases} $$ with $\mu_k^i = \beta_k x^i - \sigma_k^2/2$ and $z^i=(z^i_1,\ldots,z^i_K)\sim\mathcal{N}_K(0,R)$, so that the non-zero part is log-normal as in the question. Hence, if one wants to simulate data from this model, one could proceed as follows:

For $1\le i\le I$,

  1. Generate $z^i=(z^i_1,\ldots,z^i_K)\sim\mathcal{N}_K(0,R)$
  2. Generate $u^i_1,\ldots,u^i_K \stackrel{\text{iid}}{\sim} \mathcal{U}(0,1)$
  3. Derive $y^i_k=\mathbb{I}\left\{u^i_k>\operatorname{logit}^{-1}\left( \alpha_k x^i \right)\right\}\, \exp\left\{ \sigma_k z_k^i + \mu_k^i \right\}$ for $1\le k\le K$ (implemented in the R sketch after this list)
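
A minimal R sketch of these three steps, assuming the mvtnorm package and treating alpha and beta as $K \times P$ coefficient matrices and X as an $I \times P$ predictor matrix (all names are illustrative):

library(mvtnorm)

simulate_y <- function(alpha, beta, sigma, R, X) {
    I <- nrow(X); K <- nrow(R)
    y <- matrix(0, nrow = I, ncol = K)
    for (i in 1:I) {
        z <- drop(rmvnorm(1, sigma = R))             # step 1: z^i ~ N_K(0, R)
        u <- runif(K)                                # step 2: u^i_k iid U(0, 1)
        theta <- plogis(drop(alpha %*% X[i, ]))      # logit^{-1}(alpha_k x^i)
        mu <- drop(beta %*% X[i, ]) - sigma^2 / 2    # mu^i_k
        y[i, ] <- (u > theta) * exp(sigma * z + mu)  # step 3
    }
    y
}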

If one is interested in generating from the posterior of $(\alpha,\beta,\mu,\sigma,R)$ given the $y^i_k$, this is a harder problem, albeit feasible by Gibbs sampling or ABC.
