Distributions – Understanding the Distribution of the Difference Between Two Correlated Non-Central T Distributions

Tags: differences, distributions, non-central, t-distribution

Suppose a binormal population $\{X, Y\}$ with means $\mathbf{\mu} = \{\mu_1,\mu_2\} \ne \{0,0\}$ and covariance $\Sigma= \sigma^2\begin{bmatrix}1 & \rho\\ \rho &1 \end{bmatrix}$. Let $S^2$ be an estimator of $\sigma^2$ with $f$ degrees of freedom.

It is known that the variates $\{X/S, Y/S\}$ follow a bivariate non-central $t$ distribution (e.g., Kshirsagar, 1961).

Walgren (1980) derived the distribution of the product $X/S \times Y/S$. Is there a derivation for their difference, $X/S - Y/S$?

Edit: In the present context, I use the pooled standard deviation, that is, the square root of the mean of the separate sample variances,

$$S_X^2 = \sum_{i=1}^n (X_i-\bar{X})^2 / (n-1)$$
$$S_Y^2 = \sum_{i=1}^n (Y_i-\bar{Y})^2 / (n-1)$$

with $S = \sqrt{ (S_X^2 + S_Y^2)/2 }$. This estimate of $\sigma$ is independent of both $X$ and $Y$ and, as shown here, $S^2/\sigma^2$ approximately follows a chi-square distribution divided by its degrees of freedom, with $2(n-1)/(1+\rho^2)$ degrees of freedom.
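For readers who want to see where the degrees of freedom come from, here is a sketch of one standard route (a Satterthwaite-type moment match). It is not necessarily the argument used in the linked post, and the rotated variables $U$ and $V$ are introduced here only for illustration. Write $S_X^2$ and $S_Y^2$ for the sample variances of the $X_i$ and $Y_i$, and let $U_i = (X_i+Y_i)/\sqrt{2}$ and $V_i = (X_i-Y_i)/\sqrt{2}$. These are independent normals with variances $\sigma^2(1+\rho)$ and $\sigma^2(1-\rho)$, and their sample variances satisfy $S_U^2 + S_V^2 = S_X^2 + S_Y^2$, so that

$$S^2 = \frac{S_U^2 + S_V^2}{2}, \qquad E[S^2] = \sigma^2, \qquad \operatorname{Var}(S^2) = \frac{1}{4}\left[\frac{2\sigma^4(1+\rho)^2}{n-1} + \frac{2\sigma^4(1-\rho)^2}{n-1}\right] = \frac{\sigma^4(1+\rho^2)}{n-1}.$$

Matching the first two moments of $S^2$ to those of $\sigma^2\chi^2_f/f$, which has variance $2\sigma^4/f$, gives

$$f = \frac{2\,(E[S^2])^2}{\operatorname{Var}(S^2)} = \frac{2(n-1)}{1+\rho^2}.$$

The chi-square form is exact when $\rho = 0$ (then $f = 2(n-1)$) and is an approximation otherwise.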

Best Answer

Here is an approach that shows how to obtain a numerical approximation to the probability density function of $Z=X/S-Y/S$. (I haven't been successful in finding an analytic solution.)

Using the joint density of $X/S$ and $Y/S$ found in Kshirsagar 1961 (as given in the question):

r = {{1, \[Rho]}, {\[Rho], 1}}; (* Correlation matrix *)
\[Mu] = {\[Mu]x, \[Mu]y};
t = {x, y};
f = 2 (n - 1)/(1 + \[Rho]^2); (* Degrees of freedom for estimate of S^2 *)
jointPDF = (Exp[-\[Mu] . Inverse[r] . \[Mu]/(2 \[Sigma]^2)]/(\[Pi] f Sqrt[ Det[r]] Gamma[f/2]))*
  Sum[(2^(\[Alpha]/2) (t . Inverse[r] . \[Mu])^\[Alpha]  Gamma[(f + 2 + \[Alpha])/2])/
  (\[Sigma]^\[Alpha] f^(\[Alpha]/2) \[Alpha]! 
  (1 + t . Inverse[r] . t/f)^((f + 2 + \[Alpha])/2)), {\[Alpha], 0, \[Infinity]}];
jointPDF = FullSimplify[jointPDF, Assumptions -> {-1 < \[Rho] < 1, \[Sigma] > 0, n > 1, 
    n \[Element] Integers, \[Mu]x \[Element] Reals, \[Mu]y \[Element] 
     Reals, x \[Element] Reals, y \[Element] Reals}]

[Images in the original post: a typeset version of the code above and the simplified joint pdf it returns.]

The pdf of the difference $Z = X/S - Y/S$ can be found numerically by replacing $y$ with $x-z$ and then integrating over $x$:
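The substitute-and-integrate step can be checked on a case where the answer is known. The sketch below is in Python rather than Mathematica and uses a plain bivariate normal as a stand-in joint density (not the Kshirsagar density), because then $X - Y$ is exactly normal with mean $\mu_x - \mu_y$ and variance $2\sigma^2(1-\rho)$. The function names, integration limits, and step count are illustrative choices, not part of the original answer.

```python
import math

def binormal_pdf(x, y, mux, muy, sigma, rho):
    """Joint density of a bivariate normal with common sd sigma and correlation rho."""
    dx = (x - mux) / sigma
    dy = (y - muy) / sigma
    q = (dx * dx - 2.0 * rho * dx * dy + dy * dy) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * sigma * sigma * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * q) / norm

def difference_pdf(z, joint, lo=-40.0, hi=40.0, steps=4000):
    """Density of Z = X - Y at z: replace y by x - z and integrate over x.

    Uses the trapezoid rule on [lo, hi]; the limits are assumed wide enough
    that the joint density is negligible outside them.
    """
    h = (hi - lo) / steps
    total = 0.5 * (joint(lo, lo - z) + joint(hi, hi - z))
    for i in range(1, steps):
        x = lo + i * h
        total += joint(x, x - z)
    return total * h

# Same parameter values as in the answer's plots.
mux, muy, sigma, rho = 1.0, 3.0, 2.0, 0.5

def joint(x, y):
    return binormal_pdf(x, y, mux, muy, sigma, rho)

z = -2.0
numeric = difference_pdf(z, joint)

# Closed form for comparison: X - Y ~ Normal(mux - muy, 2 sigma^2 (1 - rho)).
var_z = 2.0 * sigma ** 2 * (1.0 - rho)
exact = math.exp(-0.5 * (z - (mux - muy)) ** 2 / var_z) / math.sqrt(2.0 * math.pi * var_z)
print(numeric, exact)
```

Swapping `joint` for a function that evaluates the non-central bivariate $t$ density would give the density of $X/S - Y/S$ in the same way, although the heavier tails would call for wider limits or an adaptive integrator.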

(* Numerical estimate of pdf of X/S - Y/S for a few values of n \
(sample size for estimating \[Sigma])*)
pdfz100 = 
  Table[{z, 
    NIntegrate[
     jointPDF /. {y -> x - z, 
       n -> 100, \[Sigma] -> 2, \[Rho] -> 1/2, \[Mu]x -> 1, \[Mu]y -> 
        3}, {x, -\[Infinity], \[Infinity]}]}, {z, -6, 3, 1/10}];
pdfz4 = Table[{z, 
    NIntegrate[
     jointPDF /. {y -> x - z, 
       n -> 4, \[Sigma] -> 2, \[Rho] -> 1/2, \[Mu]x -> 1, \[Mu]y -> 
        3}, {x, -\[Infinity], \[Infinity]}]}, {z, -6, 3, 1/10}];
pdfz2 = Table[{z, 
    NIntegrate[
     jointPDF /. {y -> x - z, 
       n -> 2, \[Sigma] -> 2, \[Rho] -> 1/2, \[Mu]x -> 1, \[Mu]y -> 
        3}, {x, -\[Infinity], \[Infinity]}]}, {z, -6, 3, 1/10}];
ListPlot[{pdfz100, pdfz4, pdfz2}, Joined -> True, ImageSize -> Large, 
 PlotLegends -> {"n = 100", "n = 4", "n = 2"},
 PlotLabel -> 
  Style["\[Sigma] = 2, \[Rho] = 1/2, \!\(\*SubscriptBox[\(\[Mu]\), \
\(x\)]\) = 1, \!\(\*SubscriptBox[\(\[Mu]\), \(y\)]\) = 3", Bold, 18]]

[Images in the original post: a typeset version of the code above and the resulting plot of the three estimated densities.]

As a check one should perform some simulations.

parms = {\[Sigma] -> 2, \[Rho] -> 1/2, \[Mu]x -> 1, \[Mu]y -> 3, n -> 2};
nsim = 100000; (* Number of simulations *)
(* Data for x and y *)
data = RandomVariate[
  BinormalDistribution[{\[Mu]x, \[Mu]y}, {\[Sigma], \[Sigma]}, \[Rho]] /. parms, nsim];
(* Data for estimating S; the means must be the symbols that parms replaces *)
xy = RandomVariate[
   BinormalDistribution[{\[Mu]x, \[Mu]y}, {\[Sigma], \[Sigma]}, \[Rho]] /. 
    parms, {nsim, n /. parms}];
s = Sqrt[(Variance[#[[All, 1]]]/2 + Variance[#[[All, 2]]]/2) & /@ xy];
(* Z = X/S - Y/S *)
zz = data[[All, 1]]/s - data[[All, 2]]/s;

(* Numerically estimate the pdf of z *)
pdfz = Table[{z, NIntegrate[jointPDF /. y -> x - z /. parms, 
  {x, -\[Infinity], \[Infinity]}]},
   {z, Quantile[zz, 0.005], Quantile[zz, 0.995], (Quantile[zz, 0.995] - Quantile[zz, 0.005])/200}];

(* Plot the results *)
Show[Histogram[zz, "FreedmanDiaconis", "PDF"], 
 ListPlot[pdfz, Joined -> True, PlotRange -> All]]

There seems to be a match.
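The same simulation can be sketched outside Mathematica. The Python version below uses only the standard library; the function names, seed, and number of replications are illustrative assumptions. It uses a large $n$ so that $S \approx \sigma$ and $Z$ should be close to normal with mean $(\mu_x-\mu_y)/\sigma$ and variance $2(1-\rho)$, which gives a simple sanity check.

```python
import math
import random
import statistics

def draw_pair(rng, mux, muy, sigma, rho):
    """One draw (x, y) from a bivariate normal with common sd sigma and correlation rho."""
    u = rng.gauss(0.0, 1.0)
    v = rng.gauss(0.0, 1.0)
    x = mux + sigma * u
    y = muy + sigma * (rho * u + math.sqrt(1.0 - rho * rho) * v)
    return x, y

def sample_variance(values):
    """Unbiased sample variance (divisor n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def simulate_z(nsim, n, mux, muy, sigma, rho, seed=0):
    """Simulate Z = X/S - Y/S, with S pooled from an independent sample of n pairs."""
    rng = random.Random(seed)
    zs = []
    for _ in range(nsim):
        x, y = draw_pair(rng, mux, muy, sigma, rho)
        sample = [draw_pair(rng, mux, muy, sigma, rho) for _ in range(n)]
        s2 = 0.5 * (sample_variance([p[0] for p in sample])
                    + sample_variance([p[1] for p in sample]))
        zs.append((x - y) / math.sqrt(s2))
    return zs

# Large n: S is close to sigma, so Z is roughly Normal(-1, 1) for these parameters.
zs = simulate_z(nsim=2000, n=100, mux=1.0, muy=3.0, sigma=2.0, rho=0.5)
print(statistics.mean(zs), statistics.stdev(zs))
```

With small $n$ (such as the $n = 2$ case above) the simulated values are far more dispersed, which is why the answer compares against the numerically integrated density rather than a normal curve.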

Related Solutions

Difference of t-Distributions – What Is the Distribution of the Difference Between Two t-Distributions?

The sum (or difference) of two independent t-distributed random variables is not t-distributed. Hence you cannot talk about the degrees of freedom of this distribution, since the resulting distribution has no degrees of freedom in the sense that a t-distribution does.

Solved – Confidence interval for the mean – Normal distribution or Student’s t-distribution

1. Normal data, variance known: If you have observations $X_1, X_2, \dots, X_n$ sampled at random from a normal population with unknown mean $\mu$ and known standard deviation $\sigma,$ then a 95% confidence interval (CI) for $\mu$ is $\bar X \pm 1.96 \sigma/\sqrt{n}.$ This is the only situation in which the z interval is exactly correct.

2. Nonnormal data, variance known: If the population distribution is not normal and the sample is 'large enough', then $\bar X$ is approximately normal and the same formula provides an approximate 95% CI. The rule that $n \ge 30$ is 'large enough' is unreliable here. If the population distribution is heavy-tailed, then $\bar X$ may not have a distribution that is close to normal (even if $n \ge 30).$ The Central Limit Theorem often provides reasonable approximations for moderate values of $n,$ but it is a limit theorem, with guaranteed results only as $n \rightarrow \infty.$

3. Normal data, variance unknown: If you have observations $X_1, X_2, \dots, X_n$ sampled at random from a normal population with unknown mean $\mu$ and unknown standard deviation $\sigma,$ with $\mu$ estimated by the sample mean $\bar X$ and $\sigma$ estimated by the sample standard deviation $S,$ then a 95% confidence interval (CI) for $\mu$ is $\bar X \pm t^* S/\sqrt{n},$ where $t^*$ cuts probability $0.025$ from the upper tail of Student's t distribution with $n - 1$ degrees of freedom. This is the only situation in which the t interval is exactly correct.

Examples: If $n=10$, then $t^* = 2.262$ and if $n = 30,$ then $t^* = 2.045.$ (Computations from R below; you could also use a printed 't table'.)

qt(.975, 9);  qt(.975, 29)
[1] 2.262157  # for n = 10
[1] 2.04523   # for n = 30

Notice that 2.045 and 1.96 (from Part 1 above) both round to 2.0. If $n \ge 30$ then $t^*$ rounds to 2.0. That is the basis for the 'rule of 30', often mindlessly parroted in other contexts where it is not relevant.

There is no similar coincidental rounding for CIs with confidence levels other than 95%. For example, in Part 1 above a 99% CI for $\mu$ is obtained as $\bar X \pm 2.58 \sigma/\sqrt{n}.$ However, $t^*=2.76$ for $n = 30$ and $t^* = 2.65$ for $n = 70.$

qnorm(.995)
[1] 2.575829
qt(.995, 29)
[1] 2.756386
qt(.995, 69)
[1] 2.648977
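The normal multipliers used for the z intervals (1.96 at the 95% level, 2.58 at the 99% level) can also be reproduced without R. The sketch below uses Python's standard library; `z_interval` and its example inputs are illustrative helpers, not part of the original answer. The t multipliers are left out because the standard library has no Student's t quantile function.

```python
from statistics import NormalDist

def z_interval(xbar, sigma, n, level=0.95):
    """Known-sigma CI for the mean: xbar +/- z* sigma / sqrt(n)."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z_star * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width

z95 = NormalDist().inv_cdf(0.975)  # upper 2.5% point, same quantity as qnorm(.975)
z99 = NormalDist().inv_cdf(0.995)  # upper 0.5% point, same quantity as qnorm(.995)
print(round(z95, 3), round(z99, 3))

# Hypothetical inputs: sample mean 10, known sigma 2, n = 25.
print(z_interval(10.0, 2.0, 25))
```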

4. Nonnormal data, variance unknown: Confidence intervals based on the t distribution (as in Part 3 above) are known to be 'robust' against moderate departures from normality. (If $n$ is very small, there should be no far outliers or evidence of severe skewness.) Then, to a degree that is difficult to predict, a t CI may provide a useful CI for $\mu.$ By contrast, if the type of distribution is known, it may be possible to find an exact form of CI.

For example, if $n = 30$ observations from a (distinctly nonnormal) exponential distribution with unknown mean $\mu$ have $\bar X = 17.24,\, S = 15.33,$ then the (approximate) 95% t CI is $(11.33, 23.15).$

t.test(x)

        One Sample t-test

data:  x
t = 5.9654, df = 29, p-value = 1.752e-06
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 11.32947 23.15118
sample estimates:
mean of x 
 17.24033

However, $$\frac{\bar X}{\mu} \sim \mathsf{Gamma}(\text{shape}=n,\text{rate}=n),$$ so that $$P(L \le \bar X/\mu < U) = P(\bar X/U < \mu < \bar X/L)=0.95$$ and an exact 95% CI for $\mu$ is $(\bar X/U,\, \bar X/L) = (12.42, 25.55).$

qgamma(c(.025,.975), 30, 30)
[1] 0.6746958 1.3882946
mean(x)/qgamma(c(.975,.025), 30, 30)
[1] 12.41835 25.55274

Addendum on bootstrap CI: If data seem non-normal, but the actual population distribution is unknown, then a 95% nonparametric bootstrap CI may be the best choice. Suppose we have $n=20$ observations from an unknown distribution, with $\bar X = 13.54$ and values shown in the stripchart below.

The observations seem distinctly right-skewed and fail a Shapiro-Wilk normality test with P-value 0.001. If we assume the data are exponential and use the method in Part 4, the 95% CI is $(9.13, 22.17),$ but we have no way to know whether the data are exponential.

Accordingly, we find a 95% nonparametric bootstrap CI by approximating $L^*$ and $U^*$ such that $P(L^* < D = \bar X/\mu < U^*) \approx 0.95.$ In the R code below the suffix .re indicates random 're-sampled' quantities based on $B$ samples of size $n$ randomly chosen with replacement from among the $n = 20$ observations. The resulting 95% CI is $(9.17, 22.71).$ [There are many styles of bootstrap CIs. This one treats $\mu$ as if it is a scale parameter. Other choices are possible.]

B = 10^5; a.obs = 13.54
d.re = replicate(B, mean(sample(x, 20, rep=T))/a.obs)
UL.re = quantile(d.re, c(.975,.025))
a.obs/UL.re
    97.5%      2.5%
 9.172171 22.714980
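The same scale-type bootstrap can be sketched in Python with the standard library. The data values below are hypothetical (the original 20 observations are not given in the post), and the function name, seed, and the simple order-statistic quantiles are illustrative choices, so the resulting interval will not match the one above.

```python
import random

def bootstrap_scale_ci(data, b=10000, level=0.95, seed=0):
    """Bootstrap CI for the mean, treating it as a scale parameter.

    Resamples with replacement, forms ratios d* = mean(resample) / mean(data),
    and inverts their empirical quantiles: (a_obs / U*, a_obs / L*).
    """
    rng = random.Random(seed)
    n = len(data)
    a_obs = sum(data) / n
    ratios = sorted(sum(rng.choices(data, k=n)) / n / a_obs for _ in range(b))
    alpha = (1.0 - level) / 2.0
    lower_q = ratios[int(alpha * b)]                # approximates L*
    upper_q = ratios[int((1.0 - alpha) * b) - 1]    # approximates U*
    return a_obs / upper_q, a_obs / lower_q

# Hypothetical right-skewed sample of 20 positive values.
x = [1.2, 2.5, 3.1, 4.0, 4.8, 5.5, 6.9, 7.7, 8.4, 9.6,
     10.9, 12.3, 14.0, 16.2, 18.5, 21.7, 25.3, 30.1, 36.8, 48.9]

ci_low, ci_high = bootstrap_scale_ci(x)
print(ci_low, ci_high)
```

Here `rng.choices` draws with replacement, which plays the role of `sample(x, 20, rep=T)` in the R code.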
