Solved – Inverse CDF sampling for a mixed distribution

Tags: copula, distributions, r, sampling, simulation

The out-of-context short version

Let $y$ be a random variable with CDF
$$
F(y) \equiv \begin{cases} \theta & y = 0 \\ \theta + (1-\theta) \, \text{CDF}_{\text{log-normal}}(y; \mu, \sigma) & y > 0 \end{cases}
$$

Let's say I wanted to simulate draws of $y$ using the inverse CDF method. Is that possible? Because of the point mass at $y = 0$, this function doesn't have an inverse in the usual sense. Then again, Inverse transformation sampling for mixture distribution of two normal distributions suggests that there is a known way to apply inverse transform sampling here.

I'm aware of the two-step method, but I don't know how to apply it to my situation (see below).
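
For what it's worth, the generalized inverse $F^{-1}(u) \equiv \inf\{y : F(y) \ge u\}$ of the CDF above has a simple closed form: any $u \le \theta$ maps to $0$, and any $u > \theta$ maps through the log-normal quantile function after rescaling. A minimal R sketch (the helper name qzilnorm is mine, not from any package):

qzilnorm <- function(u, theta, mu, sigma) {
    # u <= theta falls in the point mass at zero;
    # otherwise rescale u to (0, 1) and invert the log-normal CDF
    ifelse(u <= theta, 0,
           qlnorm((u - theta) / (1 - theta), meanlog = mu, sdlog = sigma))
}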


The long version with background

I fitted the following model for a vector-valued response, $y^i = \left( y_1 , \dots , y_K \right)^i$, using MCMC (specifically, Stan):

$$
\theta_k^i \equiv \operatorname{logit}^{-1}\left( \alpha_k x^i \right), \quad \mu_k^i \equiv \beta_k x^i - \frac{ \sigma^2_k }{ 2 } \\
F(y) \equiv \begin{cases} \theta & y = 0 \\ \theta + (1-\theta) \, \text{CDF}_{\text{log-normal}}(y; \mu, \sigma) & y > 0 \end{cases} \\
u_k \equiv F(y_k), \quad z_k \equiv\Phi^{-1}{\left( u_k \right)} \\
z \sim \mathcal{N}(\mathbf{0}, R) \times \prod_k f(y_k) \\
\left( \alpha, \beta, \sigma, R \right) \sim \text{priors}
$$

where $i$ indexes $N$ observations, $R$ is a correlation matrix, and $x$ is a vector of predictors/regressors/features.
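
In code terms, the marginal step maps each observed $y_k$ to a uniform $u_k$ and then to a Gaussian score $z_k$. A small R sketch of that transform under this model (names are illustrative, not taken from my Stan program):

pzilnorm <- function(y, theta, mu, sigma) {
    # F(0) = theta; F(y) = theta + (1 - theta) * CDF_log-normal(y) for y > 0
    ifelse(y == 0, theta,
           theta + (1 - theta) * plnorm(y, meanlog = mu, sdlog = sigma))
}

u <- pzilnorm(y, theta, mu, sigma)  # u_k = F(y_k)
z <- qnorm(u)                       # z_k = Phi^{-1}(u_k)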

That is, my model is a regression model in which the conditional distribution of the response is assumed to be a Gaussian copula with zero-inflated log-normal marginals. I've posted about this model before; it turns out that Song, Li, and Yuan (2009, gated) have developed it and they call it a vector GLM, or VGLM. The following is their specification as close to verbatim as I could get it:
$$
f(\mathbf{y}; \mathbf{\mu}, \mathbf{\varphi}, \Gamma) = c\{ G_1(y_1), \dots, G_m(y_m) | \Gamma \} \prod_{i=1}^m g(y_i; \mu_i, \varphi_i) \\
c(\mathbf{u} | \Gamma) = \left| \Gamma \right|^{-1/2}\exp\left( \frac{1}{2} \mathbf{q}^T \left( I_m - \Gamma^{-1} \right) \mathbf{q} \right) \\
\mathbf{q} = \left( q_1, \dots, q_m \right)^T, \quad q_i = \Phi^{-1}(u_i)
$$
My $F_K$ corresponds to their $G_m$, my $z$ corresponds to their $\mathbf{q}$, and my $R$ corresponds to their $\Gamma$; the details are on page 62 (page 3 of the PDF file) but they're otherwise identical to what I wrote here.
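
For reference, the copula density above is straightforward to evaluate; a minimal R transcription of the formula (my own helper, not from Song et al.) is:

log_copula_density <- function(u, Gamma) {
    # log of c(u | Gamma) = -1/2 log|Gamma| + 1/2 q' (I - Gamma^{-1}) q
    q <- qnorm(u)
    m <- length(q)
    log_det <- as.numeric(determinant(Gamma, logarithm = TRUE)$modulus)
    -0.5 * log_det + 0.5 * drop(t(q) %*% (diag(m) - solve(Gamma)) %*% q)
}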

The zero-inflated part roughly follows the specification of Liu and Chan (2010, ungated).

Now I would like to simulate data from the estimated parameters, but I'm a little confused as to how to go about it. First I thought I could just simulate $y$ directly (in R code):

Y_hat <- matrix(0, nrow = N, ncol = K)
for (i in 1:N) {
    for (k in 1:K) {
        # Bernoulli draw for "non-zero", then a log-normal draw for the size
        if (rbinom(1, 1, 1 - theta[i, k]) == 1)
            Y_hat[i, k] <- rlnorm(1, mu[i, k], sigma[k])
    }
}

which doesn't use $R$ at all. I'd like to try to actually use the correlation matrix I estimated.

My next idea was to take draws of $z$ and then convert them back to $y$. This also seems to coincide with the answers in Generating samples from Copula in R and Bivariate sampling for distribution expressed in Sklar's copula theorem?. But what the heck is my $F^{-1}$ here? Inverse transformation sampling for mixture distribution of two normal distributions makes it sound like this is possible, but I have no idea how to do it.
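
If the generalized inverse sketched in the short version is acceptable, that is exactly the missing $F^{-1}$: draw correlated Gaussian scores, push them through $\Phi$, and invert each marginal. A hedged R sketch, reusing the hypothetical qzilnorm helper from above and assuming the mvtnorm package:

library(mvtnorm)  # for rmvnorm; MASS::mvrnorm would work equally well

Y_hat <- matrix(0, nrow = N, ncol = K)
for (i in 1:N) {
    z <- drop(rmvnorm(1, sigma = R))  # correlated draws z ~ N(0, R)
    u <- pnorm(z)                     # uniforms u_k = Phi(z_k)
    for (k in 1:K)
        Y_hat[i, k] <- qzilnorm(u[k], theta[i, k], mu[i, k], sigma[k])
}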

Best Answer

The answer to the long version with background:

An earlier version of this answer addressed a somewhat different issue; since we seem to have had difficulty formulating the model and the problem, I rephrase it here, hopefully correctly.

For $1\le i\le I$, the goal is to simulate vectors $y^i=(y^i_1,\ldots,y^i_K)$ such that, conditional on a covariate $x^i$, $$ y_k^i = \begin{cases} 0 &\text{ with probability }\operatorname{logit}^{-1}\left( \alpha_k x^i \right)\\ \exp\left( \sigma_k z_k^i + \mu_k^i \right) &\text{ with probability }1-\operatorname{logit}^{-1}\left( \alpha_k x^i \right)\end{cases} $$ with $\mu_k^i = \beta_k x^i - \sigma_k^2/2$ and $z^i=(z^i_1,\ldots,z^i_K)\sim\mathcal{N}_K(0,R)$, so that the non-zero part is log-normal as in the question. Hence, if one wants to simulate data from this model, one could proceed as follows:

For $1\le i\le I$,

  1. Generate $z^i=(z^i_1,\ldots,z^i_K)\sim\mathcal{N}_K(0,R)$
  2. Generate $u^i_1,\ldots,u^i_K \stackrel{\text{iid}}{\sim} \mathcal{U}(0,1)$
  3. Derive $y^i_k=\mathbb{I}\left\{u^i_k>\operatorname{logit}^{-1}\left( \alpha_k x^i \right)\right\}\, \exp\left\{ \sigma_k z_k^i + \mu_k^i \right\}$ for $1\le k\le K$ (implemented in the R sketch after this list)
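
A minimal R sketch of these three steps, assuming the mvtnorm package and treating alpha and beta as $K \times P$ coefficient matrices and X as an $I \times P$ predictor matrix (all names are illustrative):

library(mvtnorm)

simulate_y <- function(alpha, beta, sigma, R, X) {
    I <- nrow(X); K <- nrow(R)
    y <- matrix(0, nrow = I, ncol = K)
    for (i in 1:I) {
        z <- drop(rmvnorm(1, sigma = R))             # step 1: z^i ~ N_K(0, R)
        u <- runif(K)                                # step 2: u^i_k iid U(0, 1)
        theta <- plogis(drop(alpha %*% X[i, ]))      # logit^{-1}(alpha_k x^i)
        mu <- drop(beta %*% X[i, ]) - sigma^2 / 2    # mu^i_k
        y[i, ] <- (u > theta) * exp(sigma * z + mu)  # step 3
    }
    y
}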

If one is interested in generating from the posterior of $(\alpha,\beta,\mu,\sigma,R)$ given the $y^i_k$, this is a harder problem, albeit feasible by Gibbs sampling or ABC.
