Probability – Distinguishing Between Known Quantities and Random Variables

Tags: effect-size, normal-distribution, probability, random-variable

For two random variables $X_1$, $X_2$, the probability of superiority ($PS$) is defined as the probability that a randomly chosen value of $X_1$ exceeds a randomly chosen value of $X_2$:

$PS \equiv Pr(X_1 > X_2) \tag{1}$

Suppose $X_1$, $X_2$ follow independent normal distributions with known parameters.

Since the parameters are known,

$PS = \Phi\left(\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}\right) \tag{2}$

Specifically, because distribution parameters are known, $PS$ is a single known quantity, not a random variable.
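As a concrete check of $(2)$, here is a minimal Python sketch (the function names are mine; $\Phi$ is computed from the error function so only the standard library is needed):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prob_superiority(mu1, sigma1, mu2, sigma2):
    """PS = Pr(X1 > X2) for independent normals with known
    parameters, per equation (2)."""
    return normal_cdf((mu1 - mu2) / sqrt(sigma1**2 + sigma2**2))

# Example: X1 ~ N(1, 1), X2 ~ N(0, 1), so PS = Phi(1 / sqrt(2))
ps = prob_superiority(1.0, 1.0, 0.0, 1.0)
```

With equal means the formula returns $1/2$, as symmetry requires, which is a quick sanity check on the sign convention in the numerator.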

Suppose now that the distribution parameters are not known, but we observe some data on $X_1$, $X_2$. My objective is to find $PS$ conditional on the data. How do we calculate $PS$? I can see two approaches.

Approach 1. Conditional on the data, the distribution parameters have known distributions (normal for $\mu$; for $\sigma^2$, a scaled inverse chi-square, since $(n-1)s^2/\sigma^2$ is chi-square). Draw parameters from these distributions and, for each set of parameters drawn, calculate $PS$ per $(2)$.

This gives us a simulated (parametric-bootstrap) distribution of $PS$: under this approach, $PS$ is a random variable with a distribution.
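Approach 1 can be sketched in Python as follows. The summary statistics (sample means, sample standard deviations, sample sizes) are hypothetical, and I assume $\mu \mid \sigma^2 \sim N(\bar{x}, \sigma^2/n)$ with $\sigma^2 = (n-1)s^2/W$, $W \sim \chi^2_{n-1}$; only the standard library is used:

```python
import random
from math import erf, sqrt

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def draw_ps(xbar1, s1, n1, xbar2, s2, n2, rng):
    """One draw of PS under Approach 1: sample (mu, sigma^2) for each
    group from its distribution given the data, then apply (2)."""
    def draw_params(xbar, s, n):
        # sigma^2 = (n-1) s^2 / W, with W ~ chi-square(n-1),
        # generated as Gamma((n-1)/2, 2)
        w = rng.gammavariate((n - 1) / 2.0, 2.0)
        var = (n - 1) * s**2 / w
        mu = rng.gauss(xbar, sqrt(var / n))
        return mu, var

    mu1, v1 = draw_params(xbar1, s1, n1)
    mu2, v2 = draw_params(xbar2, s2, n2)
    return normal_cdf((mu1 - mu2) / sqrt(v1 + v2))

rng = random.Random(42)
# Hypothetical summary statistics for the two samples
draws = [draw_ps(1.0, 1.0, 30, 0.0, 1.0, 30, rng) for _ in range(5000)]
# 'draws' is the simulated distribution of PS
ps_mean = sum(draws) / len(draws)
```

Each element of `draws` is a probability, so the whole collection lies in $(0,1)$ and can be summarized with quantiles to get an interval for $PS$.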

Approach 2. Per the definition $(1)$, we are actually interested in the distributions of future data, not in the sampling distributions of the parameters. Conditional on the data (not the parameters), $X$'s have a known distribution. The posterior predictive of each $X$ is the 3-parameter Student t distribution (see here, equation 100).

We could work out the distribution of the difference of two such location-scale Student t variables; the result will be similar to $(2)$ but with heavier tails, and with sample statistics in the equation in place of parameters.

Under this approach, because the exact distributions of $X_1$, $X_2$ conditional on the data are known, $PS$ turns out to be a single known quantity, not a random variable, just as when the distribution parameters were known.
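Approach 2 can be sketched by Monte Carlo over the posterior predictive draws themselves. I assume the standard result that the predictive of a future observation is a location-scale Student t with $n-1$ degrees of freedom, location $\bar{x}$, and scale $s\sqrt{1+1/n}$ (the "equation 100" form referenced above); the summary statistics are again hypothetical, and a standard t draw is built from a normal and a chi-square:

```python
import random
from math import sqrt

def draw_predictive(xbar, s, n, rng):
    """One draw from the posterior predictive of X: location-scale
    Student t with df = n-1, location xbar, scale s*sqrt(1 + 1/n)."""
    df = n - 1
    z = rng.gauss(0.0, 1.0)
    w = rng.gammavariate(df / 2.0, 2.0)   # chi-square(df) draw
    t = z / sqrt(w / df)                  # standard t(df) draw
    return xbar + s * sqrt(1.0 + 1.0 / n) * t

rng = random.Random(0)
n_sim = 20000
# Hypothetical summary statistics, same as before
hits = sum(
    draw_predictive(1.0, 1.0, 30, rng) > draw_predictive(0.0, 1.0, 30, rng)
    for _ in range(n_sim)
)
ps_marginal = hits / n_sim  # a single number, not a distribution
```

Note that the output here is one number, up to Monte Carlo error, which is exactly the point of Approach 2.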

Question. Which approach is correct?

It feels to me like Approach 2 is correct, as it integrates out distribution parameters, which are not even part of the definition of $PS$. The "problem" with this approach is that since $PS$ is known, it has no confidence interval and one cannot do hypothesis tests on it.

Best Answer

There is no essential difference between the two approaches. If we collect your unknown parameters into $\theta = (\mu_1,\mu_2,\sigma^2_1,\sigma^2_2)$, then in your first approach you calculate the conditional probability:

$$ P(X_1 > X_2 | \theta )$$

and then sample $\theta$ to obtain a distribution, while in the second approach you calculate the marginal probability:

$$ P(X_1 > X_2) = \int d\theta \pi(\theta)P(X_1 > X_2 | \theta ) $$

If you consider $\theta$ a random variable, then the conditional probability is a random variable as well, and the marginal probability is its expectation.
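This relationship can be verified numerically: averaging the Approach 1 draws of $P(X_1 > X_2 \mid \theta)$ recovers the Approach 2 marginal, up to Monte Carlo error. A self-contained sketch, with the same hypothetical summary statistics and assumed sampling scheme as above:

```python
import random
from math import erf, sqrt

rng = random.Random(7)

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def draw_theta(xbar, s, n):
    """Sample (mu, sigma^2) for one group given the data (Approach 1)."""
    w = rng.gammavariate((n - 1) / 2.0, 2.0)   # chi-square(n-1)
    var = (n - 1) * s**2 / w
    return rng.gauss(xbar, sqrt(var / n)), var

def conditional_ps():
    """P(X1 > X2 | theta) for one draw of theta, via equation (2)."""
    mu1, v1 = draw_theta(1.0, 1.0, 30)
    mu2, v2 = draw_theta(0.0, 1.0, 30)
    return normal_cdf((mu1 - mu2) / sqrt(v1 + v2))

def predictive_ps(n_sim):
    """Marginal P(X1 > X2) from posterior predictive t draws (Approach 2)."""
    def draw_x(xbar, s, n):
        df = n - 1
        t = rng.gauss(0.0, 1.0) / sqrt(rng.gammavariate(df / 2.0, 2.0) / df)
        return xbar + s * sqrt(1.0 + 1.0 / n) * t
    return sum(draw_x(1.0, 1.0, 30) > draw_x(0.0, 1.0, 30)
               for _ in range(n_sim)) / n_sim

mean_conditional = sum(conditional_ps() for _ in range(20000)) / 20000
marginal = predictive_ps(20000)
# The two estimates agree up to Monte Carlo error
```

This is the law of total expectation in action: the distribution from Approach 1 and the single number from Approach 2 are answers to the same question at different levels of conditioning.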
