Obtain $p(x)$ given samples from $p(y\mid x)$ and $p(y)$

Assume both $p(y\mid x)$ and $p(y)$ are too complicated to admit closed forms, and that we can only draw samples from them. Is there any way to estimate $p(x)$, or to draw samples from it?

Tags: bayesian, conditional-probability, marginal-distributions, sampling
Related Solutions
As indicated in the earlier comments, once you have a sample from the joint distribution of $(X_1,X_2,X_3)$, $$(x_1^1,x_2^1,x_3^1),\ldots,(x_1^t,x_2^t,x_3^t)$$ the marginal sample $$(x_1^1,x_2^1),\ldots,(x_1^t,x_2^t)$$ is indeed a sample from the joint distribution of $(X_1,X_2)$, and you can simply ignore the simulated $x_3^j$'s. They can however be useful in Monte Carlo evaluations through a technique called Rao-Blackwellisation, since the average $$\frac{1}{t}\sum_{i=1}^t h(x_1^i,x_2^i)$$ is improved upon by the average $$\frac{1}{t}\sum_{i=1}^t \mathbb{E}[h(x_1^i,x_2^i)\mid x_3^i],$$ as the conditional expectation shares the same expectation as the original function but reduces the variance. See for instance this discussion on Cross Validated.
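As a toy illustration of this variance reduction (the hierarchical model below is an assumption chosen for demonstration, not part of the original answer): take $X_3\sim\mathcal N(0,1)$ and $X_1,X_2\mid X_3\sim\mathcal N(X_3,1)$ conditionally independent, with $h(x_1,x_2)=x_1x_2$, so that $\mathbb{E}[h(X_1,X_2)\mid X_3]=X_3^2$:

```r
set.seed(1)
t  <- 1e4
x3 <- rnorm(t)
x1 <- rnorm(t, mean = x3)
x2 <- rnorm(t, mean = x3)

crude <- mean(x1 * x2) # plain Monte Carlo average of h(x1, x2)
raobl <- mean(x3^2)    # average of E[h | x3]: same target, lower variance
c(crude, raobl)        # both estimate E[X1 X2] = E[X3^2] = 1
```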
To quote from Wikipedia:
The probability density function of the Rayleigh distribution is$$f(x;\sigma) = \frac{x}{\sigma^2} e^{-x^2/(2\sigma^2)}, \quad x \geq 0,$$where $\sigma$ is the scale parameter of the distribution. The cumulative distribution function is$$F(x;\sigma) = 1 - e^{-x^2/(2\sigma^2)}$$for $x \in (0,\infty).$
Therefore the density of $X_{(N)}$, the largest order statistic of a sample of $N$ iid Rayleigh variates, is $$f_{(N)}(x)=N\frac{x}{\sigma^2} e^{-x^2/(2\sigma^2)}\left[1 - e^{-x^2/(2\sigma^2)}\right]^{N-1},\quad x\ge 0.\tag{1}$$ Simulating from $f_{(N)}$ can proceed by
1. acceptance-rejection for (1), provided a tight upper bound can be found (see option 3.);
2. direct simulation of a sample of $N$ Rayleigh variates and extraction of the largest value;
3. expansion of the term $[1 - e^{-x^2/(2\sigma^2)}]^{N-1}$ in (1) into $$\sum_{i=0}^{N-1} (-1)^i{N-1 \choose i}e^{-ix^2/(2\sigma^2)}$$ and exploitation of the resulting density as a signed mixture, whose positive components can serve as proposals in an acceptance-rejection algorithm;
4. taking advantage of the representation of a Rayleigh random variate as$^*$ $$X_i=\sigma\sqrt{-2\log U_i},\qquad U_i\sim\mathcal U(0,1),$$ since it implies that the largest value $$X_{(N)}=\sigma\sqrt{-2\log U_{(1)}},\qquad U_{(1)}\sim\mathcal Be(1,N),$$ corresponds to the smallest Uniform variate, $U_{(1)}$, which is distributed from the $\mathcal Be(1,N)$ Beta distribution;
5. inverting the cdf of $X_{(N)}$, $$F_{(N)}(x)=\left[1 - e^{-x^2/(2\sigma^2)}\right]^{N},\quad x\ge 0,$$ into $$X_{(N)}=F_{(N)}^{-1}(U)=\sigma\sqrt{-2\log(1-U^{1/N})},\qquad U\sim\mathcal U(0,1).$$
Options 4 and 5 are equivalent and obviously the most efficient, since their computing time does not grow with $N$ and no rejection step is involved.
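As a sanity check before timing them, here is a short sketch (assuming $\sigma=1$ and reusing $N=339$ from the benchmark below; the variable names are mine) verifying that the Beta representation (option 4) and the inverse cdf (option 5) agree in distribution with a brute-force maximum of $N$ Rayleigh draws (option 2):

```r
set.seed(42)
N <- 339; sig <- 1; t <- 1e4

# option 2: maximum of N Rayleigh variates, via X = sigma * sqrt(-2 log U)
direct <- replicate(t, max(sig * sqrt(-2 * log(runif(N)))))
# option 4: Beta representation of the smallest Uniform variate
beta4  <- sig * sqrt(-2 * log(rbeta(t, 1, N)))
# option 5: inverse cdf of the largest order statistic
invcdf <- sig * sqrt(-2 * log(1 - runif(t)^(1 / N)))

# the three empirical quantile functions should (approximately) agree
qs <- c(.1, .25, .5, .75, .9)
rbind(direct = quantile(direct, qs),
      beta   = quantile(beta4, qs),
      invcdf = quantile(invcdf, qs))
```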
A quick benchmark, run with the `rbenchmark` package (whose output format matches the tables below),

```r
library(rbenchmark)
benchmark(
  "beta"={
    N=339
    F=0;for(t in 1:1e3)F=F+rbeta(1,1,N)
  },
  "inverse"={
    N=339;ooN=1/N
    F=0;for(t in 1:1e3)F=F+sqrt(-log(1-runif(1)**ooN))
    F=F*sqrt(2)
  },
  "inverse+"={
    N=339
    F=0;for(t in 1:1e3)F=F+sqrt(-log(1-exp(-rexp(1,N))))
    F=F*sqrt(2)
  },
  replications=1e3)
```
gives the Beta simulation as fastest (in R):
```
      test replications elapsed relative user.self sys.self
1     beta         1000   4.208     1.00     4.203    0.003
2  inverse         1000   5.091     1.21     5.092    0.000
3 inverse+         1000   5.304     1.26     5.306    0.000
```
However, a vectorised version,

```r
benchmark(
  "beta"={
    N=339
    mean(rbeta(1e3,1,N))
  },
  "inverse"={
    N=339;ooN=1/N
    mean(sqrt(-log(1-runif(1e3)**ooN)))*sqrt(2)
  },
  "inverse+"={
    N=339
    mean(sqrt(-log(1-exp(-rexp(1e3,N)))))*sqrt(2)
  },
  replications=1e3)
```
leads to the exact opposite ranking:
```
      test replications elapsed relative user.self sys.self
1     beta         1000   0.142    1.797     0.143        0
2  inverse         1000   0.081    1.025     0.081        0
3 inverse+         1000   0.079    1.000     0.079        0
```
$^*$ This representation follows from the interpretation of the standard Rayleigh distribution as the length of a bivariate standard Normal vector, $X=\sqrt{X_1^2+X_2^2}$, which, using the Box-Muller algorithm, can equally be written as $$X=\sqrt{-2\cos^2(2\pi V)\log(U)-2\sin^2(2\pi V)\log(U)}=\sqrt{-2\log(U)}.$$
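This identity is easy to check numerically; a minimal sketch (with $\sigma=1$; the two-sample Kolmogorov-Smirnov comparison is my choice of check):

```r
set.seed(7)
t <- 1e5
# length of a bivariate standard Normal vector
len <- sqrt(rnorm(t)^2 + rnorm(t)^2)
# the sqrt(-2 log U) representation
ray <- sqrt(-2 * log(runif(t)))
# the KS statistic should be close to 0 if the two distributions match
ks.test(len, ray)$statistic
```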
Best Answer
This problem is equivalent to solving a Fredholm integral equation of the first kind. That is, solving for $p(x)$ such that:
$$ p(y) = \int_{\text{supp}(X)} p(y\mid x)\, p(x)\, \text{d}x,\quad \forall y\in \text{supp}(Y) $$
In general, this is an ill-posed inverse problem, and thus it is challenging for both analytical and numerical approaches. As @Glen_b mentions, a characterization of the support of $X$ is needed, as some approaches start off with its discretization.
For instance, notice that:
$$ p(x) = \int p(x\mid y)\,p(y)\,\text{d}y = \int \frac{p(y\mid x)\, p(x)}{\int p(y\mid \tilde{x})\, p(\tilde{x})\,\text{d}\tilde{x}}\,p(y)\,\text{d}y $$
This motivates a simple fixed-point iteration to solve for $p(x)$ in the case of $X$ being discrete with support $\{c_1,\cdots, c_d\}$, $d\geq 2$, $p(y\mid x)$ being available for direct evaluation, and $y_1,\ldots,y_m$ being a sample from $p(y)$:
$$ p_{n}(x) = \frac{1}{m}\sum_{j=1}^m \frac{p(y_j\mid x)\, p_{n-1}(x)}{\sum_{i=1}^d p(y_j\mid c_i)\, p_{n-1}(c_i)},\quad \forall x\in\{c_1,\cdots, c_d\} $$
The motivation is that, for a sufficiently large $n$, $p_n(x)\approx p(x)$ on its support.
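A minimal R sketch of this iteration, under an assumed toy model (a three-point support for $X$ and Gaussian $p(y\mid x)$; these choices are mine for illustration, not part of the answer):

```r
set.seed(3)
cs <- c(-1, 0, 2)    # support {c_1, ..., c_d} of X
w  <- c(.2, .5, .3)  # true (unknown) probabilities p(c_i)
m  <- 1e4
# sample y_1, ..., y_m from p(y), here with p(y | x) = N(x, 1)
y  <- rnorm(m, mean = sample(cs, m, replace = TRUE, prob = w))

lik <- outer(y, cs, dnorm)              # m x d matrix of p(y_j | c_i)
p   <- rep(1 / length(cs), length(cs))  # uniform initialisation p_0
for (n in 1:200) {
  post <- t(t(lik) * p)        # numerators p(y_j | c_i) p_{n-1}(c_i)
  post <- post / rowSums(post) # normalise over the support of X
  p    <- colMeans(post)       # average over the y_j's
}
rbind(estimate = p, truth = w) # the iterates approach the true weights
```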
There are more refined approaches as well.