To start, let's clear up two points of confusion:
(1) "[W]e use the t-distribution if the sample size is small."
Not exactly. If the variances $\sigma_1^2,\, \sigma_2^2$ are unknown
and estimated by $S_1^2,\, S_2^2,$ respectively, then you always
use the t-distribution. (If sample sizes are large enough for
degrees of freedom to exceed 30, then in some circumstances
it is OK to use a normal approximation. But with modern software
or printed t tables,
the normal approximation is not necessary. The approximation works
best for tests at the 5% level, not so well at 1%.)
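A quick check of critical values in R illustrates the point: with 30 degrees of freedom the 5%-level cutoffs are close to the normal ones, the 1%-level cutoffs less so.
qt(c(0.975, 0.995), df = 30)   # about 2.04 and 2.75
qnorm(c(0.975, 0.995))         # about 1.96 and 2.58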
(2) "[A]ssuming that the true standard deviations are not equal, ... then the degrees of freedom is given [by the Welch–Satterthwaite equation]."
No. This equation works whether or not $\sigma_1 = \sigma_2.$ However, if the variances are not equal, you must use the Welch–Satterthwaite equation (not the pooled-variance formula with degrees of freedom $\nu = n_1 + n_2 - 2$).
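For reference, the Welch–Satterthwaite equation estimates the degrees of freedom as
$$\nu = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^{2}}{\frac{(S_1^2/n_1)^2}{n_1 - 1} + \frac{(S_2^2/n_2)^2}{n_2 - 1}}.$$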
Pooled 2-sample t test: If data are normal and population variances are equal, then the test statistic
for testing $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \ne \mu_2$ is:
$$T = \frac{\bar X_1 - \bar X_2}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}},$$
where $S_p^2 =\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1 + n_2 - 2}.$
If $H_0$ is true, then $T$ has Student's t distribution with
degrees of freedom $\nu = n_1 + n_2 - 2.$
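A minimal R sketch of this computation from summary statistics (the sample sizes, means, and SDs below are made up for illustration):
n1 = 12; xbar1 = 103.5; s1 = 8.2
n2 = 15; xbar2 = 98.1; s2 = 7.9
sp2 = ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1 + n2 - 2)   # pooled variance
t.pool = (xbar1 - xbar2)/(sqrt(sp2)*sqrt(1/n1 + 1/n2))
2*pt(-abs(t.pool), df = n1 + n2 - 2)              # two-sided p-value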
Welch 'separate variances' 2-sample t test: More generally, if $H_0$ is true, then the test statistic
$$T^\prime = \frac{\bar X_1 - \bar X_2}{\sqrt{\frac{S_1^2}{n_1} +\frac{S_2^2}{n_2}}}$$
is approximately distributed according to Student's t distribution with degrees of freedom $\nu$ given by the Welch–Satterthwaite equation. This is true whether or not the population variances are equal.
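Continuing the made-up numbers above, a sketch of $T^\prime$ and the Welch–Satterthwaite degrees of freedom:
se2 = s1^2/n1 + s2^2/n2
t.welch = (xbar1 - xbar2)/sqrt(se2)
nu.w = se2^2/((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1))
2*pt(-abs(t.welch), df = nu.w)   # two-sided p-value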
One can show that the degrees of freedom $\nu$ from the Welch–Satterthwaite equation satisfies
$$\min(n_1 - 1, n_2 - 1) \le \nu \le n_1 + n_2 - 2.$$ So if the smaller of the two sample sizes exceeds 30, then $\nu \ge 30$ and (testing at
the 5% level) it is OK to use a normal approximation for the
distribution of $T^\prime.$
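For the sketch above, the bounds can be checked numerically (an illustration, not a proof):
min(n1, n2) - 1 <= nu.w   # TRUE
nu.w <= n1 + n2 - 2       # TRUE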
Whatever the sample sizes, $T^\prime$ has very nearly Student's t distribution with the Welch–Satterthwaite degrees of freedom.
(This is known from probability theory and from many simulation studies.)
Which to use? The bottom line is that most statisticians use the $T^\prime$-statistic and the Welch-Satterthwaite degrees of freedom to do
2-sample t tests unless they have very strong prior evidence that
population variances are equal (rarely the case). Most modern
statistical software packages use the Welch 2-sample t test by default. Some programs will use $T$ with the pooled SD $S_p$ if
the user overrides the default.
Notes: (a) If $n_1 = n_2,$ then one can show that $T = T^\prime$
numerically, but one should still use the Welch-Satterthwaite degrees of freedom unless the population variances are known to be equal.
(b) If sample variances $S_1^2$ and $S_2^2$ are nearly equal,
then the Welch-Satterthwaite $\nu$ is near $n_1 + n_2 - 2.$ If the
sample variances are far apart then $\nu$ may be considerably smaller---perhaps as small as $n_1 -1$ or $n_2 - 1.$
(c) Especially if $n_1 \ll n_2$ and $\sigma_2 \ll \sigma_1,$ then results from the pooled 2-sample test using $T$ and $S_p$ can be very misleading. (The notation $\ll$ means 'much smaller than'.)
(d) It is not a good idea to test whether $\sigma_1^2 = \sigma_2^2$ in order to decide whether to use $T$ or $T^\prime.$ The test for equal variances has poor power, and simulation studies have shown
that the 'hybrid' test (using $T^\prime$ only if the equal-variances test rejects) can give misleading results.
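A rough R simulation sketch of note (d); the sample sizes and SDs are made up, and the estimated rate will vary from run to run:
set.seed(42)
B = 10000; reject = 0
for (b in 1:B) {
  y1 = rnorm(10, 100, 10)   # small sample, larger SD
  y2 = rnorm(40, 100, 5)    # large sample, smaller SD; H0 is true
  eq = var.test(y1, y2)$p.value > 0.05   # hybrid: pool unless F test rejects
  if (t.test(y1, y2, var.equal = eq)$p.value < 0.05) reject = reject + 1
}
reject/B   # estimated type I error; in settings like this it tends to exceed 0.05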
Demonstration of note (c). Using R statistical software:
Small sample from $\mathsf{Norm}(\mu_1=150,\sigma_1=30);$
larger sample from $\mathsf{Norm}(\mu_2=150,\sigma_2=5).$
The null hypothesis is true, and so should not be rejected.
x1 = rnorm(10, 150, 30); x2 = rnorm(50, 150, 5)
mean(x1); sd(x1)
[1] 139.3158
[1] 31.34551
mean(x2); sd(x2)
[1] 150.1088
[1] 5.246149
Welch 2-sample test properly fails to reject:
t.test(x1, x2)
Welch Two Sample t-test
data: x1 and x2
t = -1.0858, df = 9.1011, p-value = 0.3055
alternative hypothesis: true difference in means is not equal to 0
sample estimates:
mean of x mean of y
139.3158 150.1088
Pooled two-sample t test improperly rejects at the 5% level, 'finding' a
difference in population means that does not actually exist.
(The small sample with the large SD gives a misleading sample mean.)
t.test(x1, x2, var.equal=TRUE)
Two Sample t-test
data: x1 and x2
t = -2.3504, df = 58, p-value = 0.02217
alternative hypothesis: true difference in means is not equal to 0
sample estimates:
mean of x mean of y
139.3158 150.1088
I solved the problem using the MOSEK solver and got similar results.
import numpy as np
import cvxpy as cp

K = 600   # number of samples per distribution
D = 25    # dimension
# K samples each from N(0, I_D) and N(0, 4 I_D); columns are samples
train = np.random.multivariate_normal(np.zeros(D), np.eye(D), size=K).T
test = np.random.multivariate_normal(np.zeros(D), 4*np.eye(D), size=K).T

# Cost matrix of pairwise Euclidean distances
M = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        M[i, j] = np.linalg.norm(train[:, i] - test[:, j], 2)

# Transport plan: nonnegative coupling with uniform marginals 1/K
Tra = cp.Variable((K, K), nonneg=True)
constraints = [Tra @ np.ones((K, 1)) == np.ones((K, 1))/K,
               Tra.T @ np.ones((K, 1)) == np.ones((K, 1))/K]
optprob = cp.Problem(cp.Minimize(cp.trace(Tra.T @ M)), constraints)
optprob.solve(solver=cp.MOSEK)
print(np.trace(Tra.value.T @ M))
The result is around 8.3 in dimension 25, which is not surprising: for continuous random variables, the Wasserstein distance is a continuous optimization problem over all possible joint probability distributions, while for the empirical distributions we optimize over the finite coupling matrix $T$, which typically yields a higher optimal objective value than the continuous case.
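For comparison (an aside, not part of the original computation): because the cost matrix $M$ holds unsquared Euclidean norms, the objective above is an empirical 1-Wasserstein distance. The 2-Wasserstein distance between the two underlying Gaussians has the well-known closed form
$$W_2^2\big(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)\big) = \|m_1 - m_2\|^2 + \mathrm{tr}\left(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\big)^{1/2}\right),$$
which for $\mathcal{N}(0, I_{25})$ and $\mathcal{N}(0, 4I_{25})$ gives $W_2^2 = \mathrm{tr}(I + 4I - 4I) = 25,$ i.e. $W_2 = 5.$ Since $W_1 \le W_2,$ the empirical value near 8.3 overshoots the population distance, consistent with the well-documented tendency of empirical Wasserstein estimates to overshoot in high dimension.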
Best Answer
Although a bit old, this is indeed a good question. Here is my bit on the matter:
Regarding Gaussian Mixture Models: A Wasserstein-type distance in the space of Gaussian Mixture Models, Julie Delon and Agnes Desolneux, https://arxiv.org/pdf/1907.05254.pdf
Using the 2-Wasserstein metric, Mallasto and Feragen geometrize the space of Gaussian processes with $L_2$ mean and covariance functions over compact index spaces: Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes, Anton Mallasto, Aasa Feragen https://papers.nips.cc/paper/7149-learning-from-uncertain-curves-the-2-wasserstein-metric-for-gaussian-processes.pdf
The Wasserstein space of elliptical distributions is characterized by Muzellec and Cuturi. The authors show that for elliptical probability distributions, the Wasserstein distance can be computed via a simple Riemannian descent procedure: Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions, Boris Muzellec and Marco Cuturi https://arxiv.org/pdf/1805.07594.pdf (Not closed form)
Tree metrics as ground metrics yield negative definite OT metrics that can be computed in closed form. The Sliced-Wasserstein distance is then a special case (the tree is a chain): Tree-Sliced Variants of Wasserstein Distances, Tam Le, Makoto Yamada, Kenji Fukumizu, Marco Cuturi https://arxiv.org/pdf/1902.00342.pdf
Sinkhorn distances/divergences (Cuturi, 2013) are now treated as new forms of distances in their own right (i.e. not as approximations to $\mathcal{W}_2^2$) (Genevay et al, 2019). Recently, this entropy-regularized optimal transport distance was found to admit a closed form for Gaussian measures: Janati et al (2020). This fascinating finding also extends to the unbalanced case.
I would be happy to keep this list up to date and evolving.