From Hollander & Wolfe, pp. 106-7:
Let $F$ be the distribution function corresponding to population 1 and
$G$ be the distribution function corresponding to population 2. The
null hypothesis is: $H_0: F(t)=G(t)$ for every $t$. The null
hypothesis asserts that the $X$ variable and the $Y$ variable have the
same probability distribution, but the common distribution is not
specified.
Strictly speaking, this describes the Wilcoxon rank-sum test, but $U=W-\frac{n(n+1)}{2}$, where $W$ is the rank sum of the sample of size $n$, so the two tests are equivalent.
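As a quick check of that identity, here is a minimal Python sketch. The data and function names are made up for illustration, and it assumes there are no ties. It computes the rank sum $W$ of one sample and confirms that $W-\frac{n(n+1)}{2}$ equals the number of pairs in which a value from that sample exceeds a value from the other.

```python
def rank_sum(x, y):
    """Wilcoxon rank sum W: sum of the ranks of x in the pooled sample (no ties assumed)."""
    pooled = sorted(x + y)
    # Rank = 1-based position in the sorted pooled sample.
    return sum(pooled.index(v) + 1 for v in x)


def u_by_counting(x, y):
    """Mann-Whitney U: number of (x_i, y_j) pairs with x_i > y_j."""
    return sum(1 for xi in x for yj in y if xi > yj)


# Hypothetical data, chosen only so that there are no ties.
x = [1.1, 3.4, 5.0, 7.2]
y = [0.5, 2.8, 4.1, 6.3, 9.9]

n = len(x)
w = rank_sum(x, y)
u = w - n * (n + 1) // 2

print(w, u, u_by_counting(x, y))  # prints: 20 10 10
```

The two routes to $U$ agree, which is why results stated for one statistic carry over to the other.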
A number of papers examine this issue, and most of them conclude that Welch's version of the t-test can be used safely in most circumstances.
The only situation in which the test seems to perform poorly is with very small sample sizes.
Here are some quotes from two papers which examine t-test performance with small sample sizes:
The t-test with the unequal variances option (i.e.,
the Welch test) was generally not preferred either. Only
in the case of unequal variances combined with unequal
sample sizes, where the small sample was drawn from
the small variance population, did this approach
provide a power advantage compared to the regular t-test.
In the other cases, a substantial amount of
statistical power was lost compared to the regular t-test.
The power loss of the Welch test can be explained by
its lower degrees of freedom determined from the
Welch-Satterthwaite equation.$^1$
Results suggest that the Welch t test is indeed
inflated, according to Bradley's (1978) fairly
stringent criterion, when sample sizes are
unequal – even when assumptions for the t test
are met in the population. The inflation rate
seems to be dependent more on the size of the
smaller group than on the total sample size, but
sample size ratio does seem to play a small
role.$^2$
If you read through those papers, though, you'll see that this is much of an issue only with very small samples, in particular when the smaller of the two groups is very small. "Small" here means that, according to both papers, the effects are really troublesome only when a group contains around five subjects or fewer; see the references for a more thorough discussion. In that case, the obvious suggestion is to collect more data, although that may not be feasible when experiments are prohibitively expensive.
Otherwise Welch's is probably fine.
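The first quote attributes the Welch test's power loss to the lower degrees of freedom from the Welch-Satterthwaite equation. A minimal sketch of that calculation, using only the Python standard library, is below. The function name and the two small samples are hypothetical, chosen just to show that the Welch degrees of freedom fall below the pooled value $n_1+n_2-2$ when the spreads differ.

```python
import math
from statistics import mean, variance


def welch_t_and_df(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    # Squared standard error contributed by each group (sample variance / n).
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Hypothetical small samples with unequal spread.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

t, df = welch_t_and_df(x, y)
pooled_df = len(x) + len(y) - 2
print(round(t, 3), round(df, 3), pooled_df)  # prints: -1.897 5.882 8
```

Here the regular t-test would use 8 degrees of freedom while Welch's uses about 5.9, which widens the reference distribution and costs some power.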
$^1$: J.C.F. de Winter (2013), "Using the Student's t-test with extremely small sample sizes."

$^2$: Albert K. Adusah and Gordon P. Brooks (2011), "Type I Error Inflation of the Separate-Variances Welch t test with Very Small Sample Sizes when Assumptions Are Met."
Best Answer
The Mann-Whitney test is a special case of a permutation test (the distribution under the null is derived by looking at all the possible permutations of the data) and permutation tests have the null as identical distributions, so that is technically correct.
One way of thinking of the Mann-Whitney test statistic is as a count of the number of times a value from one group exceeds a value from the other group, over all pairs. So the null $P(X>Y)=0.5$ also makes sense, and it is technically a property of the equal-distributions null (assuming continuous distributions, where the probability of a tie is 0): if the two distributions are the same, then the probability of $X$ being greater than $Y$ is 0.5, since both are drawn from the same distribution.
The stated case of two distributions having the same mean but widely different variances matches the second null hypothesis, but not the first (identical distributions). We can run a simulation to see what happens to the p-values in this case (in theory, under a true null, they should be uniformly distributed):
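The original simulation code is not reproduced here, so the following is a stand-in sketch in plain Python rather than the author's code. The sample sizes, standard deviations, number of replications, and seed are all assumptions, and the p-value uses the large-sample normal approximation to $U$ (no ties, no continuity correction) instead of an exact test.

```python
import math
import random


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney p-value via the normal approximation (assumes no ties)."""
    m, n = len(x), len(y)
    # Rank the pooled sample, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    w = sum(rank for rank, (_, group) in enumerate(pooled, start=1) if group == 0)
    u = w - m * (m + 1) / 2
    # Mean and standard deviation of U when both samples share one distribution.
    mean_u = m * n / 2
    sd_u = math.sqrt(m * n * (m + n + 1) / 12)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def rejection_rate(n_reps=2000, n_x=30, n_y=30, sd_x=1.0, sd_y=10.0,
                   alpha=0.05, seed=1):
    """Fraction of simulated data sets (equal means, given SDs) with p < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        x = [rng.gauss(0.0, sd_x) for _ in range(n_x)]
        y = [rng.gauss(0.0, sd_y) for _ in range(n_y)]
        if mann_whitney_p(x, y) < alpha:
            rejections += 1
    return rejections / n_reps


if __name__ == "__main__":
    print("same mean, SD 1 vs 10:", rejection_rate())
    print("same mean, same SD:   ", rejection_rate(sd_y=1.0))
```

With uniformly distributed p-values the first number would sit near 0.05, as the second one should; under these assumed settings it should instead come out above that level.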
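So clearly the test is rejecting more often than it should if $P(X>Y)=0.5$ were the null, even though that statement is true here. The null hypothesis that is false is equality of distributions, so the test's behavior matches the equality-of-distributions null, but not the $P(X>Y)=0.5$ null.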
Thinking in terms of probability of X > Y also runs into some interesting problems if you ever compare populations that are based on Efron's Dice.
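To make that concrete, here is a small sketch that computes $P(X>Y)$ exactly for each adjacent pair of Efron's dice, using the standard face values for the four dice. The helper name `prob_greater` is just for illustration.

```python
from fractions import Fraction
from itertools import product

# Efron's dice: four six-sided dice with these face values.
DICE = {
    "A": [4, 4, 4, 4, 0, 0],
    "B": [3, 3, 3, 3, 3, 3],
    "C": [6, 6, 2, 2, 2, 2],
    "D": [5, 5, 5, 1, 1, 1],
}


def prob_greater(x_faces, y_faces):
    """Exact P(X > Y) for one roll of each die, all faces equally likely."""
    wins = sum(1 for x, y in product(x_faces, y_faces) if x > y)
    return Fraction(wins, len(x_faces) * len(y_faces))


for first, second in [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]:
    print(f"P({first} > {second}) = {prob_greater(DICE[first], DICE[second])}")
# Each line prints a probability of 2/3.
```

Each die beats the next with probability 2/3, yet D also beats A, so "X tends to be larger than Y" is not a transitive relation between populations.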