I solved the problem using the MOSEK solver and got similar results:
import numpy as np
import cvxpy as cp

K = 600  # number of samples from each distribution
D = 25   # dimension

# Samples from N(0, I) and from N(0, 4I).
train = np.random.multivariate_normal(np.zeros(D), np.eye(D), size=K).T
test = np.random.multivariate_normal(np.zeros(D), 4 * np.eye(D), size=K).T

# Cost matrix of pairwise Euclidean distances.
M = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        M[i, j] = np.linalg.norm(train[:, i] - test[:, j], 2)

# Transport plan with uniform marginals 1/K on both sides.
Tra = cp.Variable((K, K), nonneg=True)
constraints = [Tra @ np.ones((K, 1)) == np.ones((K, 1)) / K,
               Tra.T @ np.ones((K, 1)) == np.ones((K, 1)) / K]
optprob = cp.Problem(cp.Minimize(cp.trace(Tra.T @ M)), constraints)
optprob.solve(solver=cp.MOSEK)
print(np.trace(Tra.value.T @ M))
The result is around 8.3 in dimension 25, which is not surprising: for continuous random variables the Wasserstein distance is a continuous optimization problem over all possible joint probability distributions, while for the empirical distributions we in fact optimize over the matrix $T$, which has a higher optimal objective value than the continuous case.
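As a quick cross-check (assuming the POT package, installed via pip install pot, is available; it is not used in the computation above), the same linear program can be solved with ot.emd2:

import numpy as np
import ot  # POT: Python Optimal Transport (an assumption, not part of the original setup)

K, D = 600, 25
train = np.random.multivariate_normal(np.zeros(D), np.eye(D), size=K)
test = np.random.multivariate_normal(np.zeros(D), 4 * np.eye(D), size=K)

M = ot.dist(train, test, metric='euclidean')  # K x K matrix of pairwise distances
a = b = np.ones(K) / K                        # uniform marginals
print(ot.emd2(a, b, M))                       # exact LP optimum; should also be around 8.3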
Fix measures $\mu_0,\mu_1,\nu_0,\nu_1$ on the same probability space with $\mu_0\ll\nu_0$ and $\mu_1\ll\nu_1$, i.e. the interesting case. For $\alpha\in[0,1]$ let $\mu_\alpha=(1-\alpha)\mu_0+\alpha\mu_1$ and $\nu_\alpha=(1-\alpha)\nu_0+\alpha\nu_1$, and notice that $\mu_\alpha\ll\nu_\alpha$. We want to prove the convexity inequality $\mathcal H(\mu_\alpha\|\nu_\alpha)\le(1-\alpha)\mathcal H(\mu_0\|\nu_0)+\alpha\mathcal H(\mu_1\|\nu_1)$.
For the following argument, I prefer to work with random variables, so let $B_\alpha\in\{0,1\}$ be Bernoulli with success probability $\alpha$.
Let $X_\alpha$ given $B_\alpha$ have the law $\mu_{B_\alpha}$, and let $Y_\alpha$ given $B_\alpha$ have the law $\nu_{B_\alpha}$.
Using this machinery, I can simply write $\mathcal H(X_\alpha\|Y_\alpha)=\mathcal H((1-\alpha)\mu_0+\alpha\mu_1\|(1-\alpha)\nu_0+\alpha\nu_1)$.
The right-hand side of the inequality is the conditional relative entropy of $X_\alpha$ given $B_\alpha$ with respect to $Y_\alpha$ given $B_\alpha$. Since $\mathcal H(B_\alpha\|B_\alpha)=0$ trivially, the conditional relative entropy equals the joint relative entropy; that is, we have $\mathcal H(B_\alpha,X_\alpha\|B_\alpha,Y_\alpha)=(1-\alpha)\mathcal H(X_0\|Y_0)+\alpha\mathcal H(X_1\|Y_1)$, which we verify by establishing that the Radon-Nikodym derivative of the law of $(B_\alpha,X_\alpha)$ with respect to the law of $(B_\alpha,Y_\alpha)$ coincides with the $(X_0,Y_0)$-derivative on $\{B_\alpha=0\}$ and with the $(X_1,Y_1)$-derivative on $\{B_\alpha=1\}$.
Now let us introduce some notation for the conditional relative entropy, say $\mathcal H(X_\alpha|B_\alpha\|Y_\alpha|B_\alpha)$; what we just discussed is that
$\mathcal H(B_\alpha,X_\alpha\|B_\alpha,Y_\alpha)=\mathcal H(B_\alpha\|B_\alpha)+\mathcal H(X_\alpha|B_\alpha\|Y_\alpha|B_\alpha)=\mathcal H(X_\alpha|B_\alpha\|Y_\alpha|B_\alpha)$, by the chain rule for the relative entropy. This chain rule goes two ways, so we also have
$\mathcal H(B_\alpha,X_\alpha\|B_\alpha,Y_\alpha)=\mathcal H(X_\alpha\|Y_\alpha)+\mathcal H(B_\alpha|X_\alpha\|B_\alpha|Y_\alpha)$.
So the convexity of the relative entropy is just a special case of the chain rule, using the non-negativity of the conditional relative entropy, $\mathcal H(B_\alpha|X_\alpha\|B_\alpha|Y_\alpha)\ge 0$.
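Combining the two decompositions gives the whole argument in one line:
$$\mathcal H(X_\alpha\|Y_\alpha)=\mathcal H(B_\alpha,X_\alpha\|B_\alpha,Y_\alpha)-\mathcal H(B_\alpha|X_\alpha\|B_\alpha|Y_\alpha)\le(1-\alpha)\mathcal H(X_0\|Y_0)+\alpha\mathcal H(X_1\|Y_1).$$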
Unfortunately, the conditional relative entropy is usually only defined for finite supports, via kernels, or under other restrictions, since it seems we need access to some kind of conditional distribution. However, this is not the case: the conditional relative entropy can be defined in general, as discussed here, namely by using the chain rule as the definition. Non-negativity is also shown there, which amounts to a straightforward application of Jensen's inequality.
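On finite supports the inequality is also easy to sanity-check numerically; here is a minimal sketch using scipy.special.rel_entr (the test distributions below are arbitrary, not from the argument above):

import numpy as np
from scipy.special import rel_entr  # elementwise x*log(x/y)

def kl(p, q):
    # relative entropy H(p||q) of discrete distributions
    return rel_entr(p, q).sum()

rng = np.random.default_rng(0)
mu0, mu1, nu0, nu1 = (rng.dirichlet(np.ones(5)) for _ in range(4))
for alpha in np.linspace(0, 1, 11):
    lhs = kl((1 - alpha) * mu0 + alpha * mu1, (1 - alpha) * nu0 + alpha * nu1)
    rhs = (1 - alpha) * kl(mu0, nu0) + alpha * kl(mu1, nu1)
    assert lhs <= rhs + 1e-12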
Best Answer
This does not quite answer your question, as your question concerns $W_1$ instead of $W_2^2$, but since you have not gotten an answer so far, I feel it may be helpful to provide some related results.
The following is an excerpt from Lectures on Optimal Transport (Lecture 8, Section 1) by Ambrosio, Brué and Semola; it asserts that no such bound can hold between the $L_2$ distance of the densities and the $W_2^2$ distance of the associated measures:
Perhaps you can find a similar argument to show the same result for $W_1$. Once I find anything else relating to your problem, I will edit my answer to provide more information.
EDIT
I have now found a similar argument which shows that no such bound can hold for $W_1$: Let $\mu$ be supported on $[a,b-H]$ with density $\phi$, and for $h<H$ define the shifted measure $\mu_h$ to be the measure with density $\phi_h(x):=\phi(x-h)$. Then all such measures are concentrated on $[a,b]$. Note that, just as in the excerpt I posted above, we have $W_1(\mu,\mu_h)=h$. Similarly, if we choose $\phi$ and $h$ such that $\operatorname{supp}\phi\cap \operatorname{supp}\phi_h=\emptyset$ (for instance, $\phi$ supported on an interval of length less than $h$), then $$\Vert \phi-\phi_h\Vert_{L_1}=2\Vert\phi_h\Vert_{L_1}=2.$$ Now for any $C>0$ we can pick $h$ small enough that $$\Vert \phi-\phi_h\Vert_{L_1}=2>CW_1(\mu,\mu_h)=C\cdot h,$$ and hence there cannot exist a $C>0$ such that the desired inequality holds for all $\mu,\nu$ supported on $[a,b]$.
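A quick numerical illustration of this blow-up, assuming SciPy is available (scipy.stats.wasserstein_distance computes the 1D $W_1$ between empirical samples; the particular width and shift below are arbitrary choices):

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
w, h = 0.005, 0.01                 # support width w < shift h, so the densities are disjoint
x = rng.uniform(0.0, w, 100_000)   # samples from mu, a uniform density phi on [0, w]
y = x + h                          # samples from the shifted measure mu_h
print(wasserstein_distance(x, y))  # ~ h = 0.01, since mu_h is a translation of mu
# ||phi - phi_h||_{L_1} = 2 exactly (disjoint supports), so the ratio 2/h
# exceeds any fixed C once h is small enough.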