(too long for a comment, so I guess it's an answer)
I'm not sure what makes you assert there's a substantive difference between the two cases. When you use the Mann-Whitney to test location-shift alternatives, the assumption is of identical distributions apart from the possible location shift. It's not actually necessary to assume identical distributions in general: the Mann-Whitney is, for example, perfectly appropriate for testing scale-shift alternatives, or a host of other alternatives, as long as you can compute the distribution of the test statistic under the null. If your rank-based ANOVA is to have a distribution you can compute under $H_0$, you'll need at least some assumptions for the null case there as well.
If your assumptions for both are the same (such as both being applied to shift alternatives) and you compute the null distribution of an ANOVA on the ranks of 2 groups correctly, your p-values will be identical to those of the equivalent two-tailed Mann-Whitney, in the same way that $t^2 = F$ when an ordinary 2-group ANOVA is compared to a two-tailed two-sample t-test (the equal-variance version).
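The $t^2 = F$ identity can be checked directly on ranks. A minimal sketch (the sample sizes and distributions are made up for illustration; note the p-value here is the normal-theory one on ranks, not the exact Mann-Whitney p-value):

```python
# Demonstrate that t^2 = F for two groups, applied to the pooled ranks,
# so a 2-group ANOVA on ranks matches a two-tailed equal-variance t on ranks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.5, 1.0, size=25)

# Pool, rank, and split back into the two groups of ranks.
ranks = stats.rankdata(np.concatenate([a, b]))
ra, rb = ranks[:30], ranks[30:]

t, p_t = stats.ttest_ind(ra, rb)   # equal-variance two-sample t on ranks
F, p_F = stats.f_oneway(ra, rb)    # one-way ANOVA on the same ranks

print(t**2, F)    # identical up to floating point
print(p_t, p_F)   # identical two-tailed p-values
```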
If I had two groups and both had different, non-normal distributions, but I only wanted to test for a difference in location, what test would be preferable? I was under the impression I could use a t-test on ranks, or a Welch t-test on ranks. However, if these tests are similar to a Mann-Whitney U test, then I guess this is not the case.
It's somewhat of a tricky question, because if they're different shapes 'location difference' doesn't have an obvious meaning in the way it does when they're the same shape.
If you define some measure of location difference (like difference in means or difference in medians or median of pairwise differences or difference in minimum or whatever) then you can do something with it - e.g. try to compute a resampling based distribution, like a bootstrap distribution. It's important to be clear about what you are prepared to assume though.
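One way to "do something with it" is sketched below: a bootstrap distribution for a chosen location-difference measure, here the difference in medians. The distributions and sample sizes are invented for illustration; the choice of measure and the percentile interval are just one of many reasonable options.

```python
# Bootstrap distribution of the difference in medians for two samples
# of different shapes (illustrative data only).
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(1.0, size=80)    # right-skewed sample
y = rng.normal(1.0, 0.5, size=60)    # symmetric sample

def boot_median_diff(x, y, n_boot=5000, rng=rng):
    """Resample each group with replacement; record median differences."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        xb = rng.choice(x, size=x.size, replace=True)
        yb = rng.choice(y, size=y.size, replace=True)
        diffs[i] = np.median(xb) - np.median(yb)
    return diffs

diffs = boot_median_diff(x, y)
ci = np.percentile(diffs, [2.5, 97.5])   # crude percentile interval
print(ci)
```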
A Mann-Whitney can be used for more general alternatives than a simple location shift. For example, for continuous distributions, you can write the null in the form:
$P(X>Y) = \frac{1}{2}$
and the alternative as
$P(X>Y) \neq \frac{1}{2}\quad$ (for a two tailed test)
or
$P(X>Y) < \frac{1}{2}\quad$ (or "$>$", in either case as a one tailed test)
If I recall correctly, Conover's Practical Nonparametric Statistics presents them this way, for example.
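This $P(X>Y)$ formulation connects directly to the U statistic: scaled by $nm$, U estimates $P(X>Y)$ (counting ties as half). A small sketch with made-up samples drawn from the same distribution, so the estimate should sit near $\frac{1}{2}$:

```python
# U / (n*m) estimates P(X > Y); under identical continuous distributions
# this probability is 1/2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = rng.normal(size=400)   # same distribution, so P(X > Y) = 1/2

u, p = stats.mannwhitneyu(x, y, alternative="two-sided")
print(u / (500 * 400))     # typically close to 0.5
```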
In scipy.stats, the Mann-Whitney U test compares two populations:
Computes the Mann-Whitney rank test on samples x and y.
but the Wilcoxon test compares two PAIRED populations:
The Wilcoxon signed-rank test tests the null hypothesis that two related paired samples come from the same distribution. In particular, it tests whether the distribution of the differences x - y is symmetric about zero. It is a non-parametric version of the paired T-test.
EDITED / CORRECTED in response to ttnphns' comments.
Note that the paired t does not test whether the distribution of the differences is symmetric about zero, so the Wilcoxon signed-rank test is not truly a non-parametric counterpart of the paired t test.
The Mann-Whitney test, on the other hand, assumes that all the observations are independent of each other (no basis for pairing here!). It also assumes that the two distributions are the same, and the alternative is that one is stochastically greater than the other. If we make the additional assumption that the only difference between the two distributions is their location, and the distributions are continuous, then "stochastically greater than" is equivalent to such statements as "the medians are different", so you can, with the extra assumption(s), interpret it that way.
The Mann-Whitney uses a continuity correction by default, but the Wilcoxon doesn't.
The Mann-Whitney handles ties using the midrank, but the Wilcoxon offers three options for handling ties in the paired values (i.e., zero difference between the two elements of the pair.)
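The two preceding points can be seen in the scipy.stats call signatures. A sketch with invented paired data: `mannwhitneyu` has `use_continuity=True` by default, and `wilcoxon` exposes the three zero-handling choices through `zero_method` (with continuous data there are no zero differences, so all three give the same answer here):

```python
# scipy.stats options: continuity correction for Mann-Whitney,
# zero_method choices for Wilcoxon signed-rank.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(10, 2, size=30)
after = before + rng.normal(0.5, 1, size=30)   # paired measurements

# Independent-samples form (continuity correction is on by default):
u, p_u = stats.mannwhitneyu(before, after, alternative="two-sided",
                            use_continuity=True)

# Paired form, with each of the three ways of handling zero differences:
for zm in ("wilcox", "pratt", "zsplit"):
    w, p_w = stats.wilcoxon(before, after, zero_method=zm)
    print(zm, p_w)
```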
It sounds like the Wilcoxon test is the more appropriate for your purposes, since you do have that lack of independence between all observations. However, one might imagine that requests with similar, but not equal, lengths might exhibit similar behavior, whereas the Wilcoxon would assume that if they aren't paired, they are independent. A logistic regression model might serve you better in this case.
Quotes are from the scipy.stats doc pages, which we aren't supposed to link to, apparently.
Best Answer
In no way will the difference in sample sizes adversely affect the Mann-Whitney-Wilcoxon test. It's explicitly suitable for groups of different sizes, and how different the sizes are doesn't impact the essential properties of the test.
There's little more to say without some clearer indication of what your colleague thinks the problem is (aside from a burning desire to get a different outcome... which, when taken to the point of action, is called p-hacking -- even if they were hoping not to reject).
Even if the ratio of sample sizes had been much more extreme -- say, $n_1=100000$ and $n_2=2$ -- there's literally no issue, and no justification I can discern for reducing the larger sample size.
[On the other hand if you're in a position to choose sample sizes beforehand, and you can make the smaller sample size nearer in size to the larger one it may be worth trading a lot of values from the larger sample to get some more in the smaller sample (increasing the smaller sample size from 700 to 770 would be worthwhile even if it meant you could only afford 5000 in the larger sample rather than 14000). That's not what we're discussing here though.]
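The bracketed trade can be checked numerically via the $1/n_1 + 1/n_2$ factor that drives the standard error of a two-sample comparison (a normal-theory heuristic, but the same $nm/(n+m)$ factor governs the Mann-Whitney's large-sample variance):

```python
# Smaller 1/n1 + 1/n2 means a smaller standard error for the comparison,
# and, other things equal, more power.
keep = 1/14000 + 1/700    # original design: 14000 and 700
trade = 1/5000 + 1/770    # after the trade: 5000 and 770
print(keep, trade)        # the trade comes out (slightly) ahead
```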
If your colleague is hoping that a reduction in the larger sample size (by randomly choosing a smaller sample) will make the test more likely to reject, it won't. The power will decrease somewhat (you'll increase the smallest effect size you're able to detect at a given level of power). e.g. if there was 50% power for a given effect size with 14000 and 700, reducing the sample sizes to 700 and 700 would in many situations reduce the power to under 30%.
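A rough simulation makes the power loss concrete. The sizes are scaled down from the question's 14000/700 to keep it fast, and the shift and normal distributions are made up for illustration:

```python
# Monte Carlo power of the Mann-Whitney: keeping the full larger sample
# versus subsampling it down to the size of the smaller one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def power(n1, n2, shift=0.5, reps=400, alpha=0.05):
    """Fraction of simulated datasets where the test rejects at level alpha."""
    hits = 0
    for _ in range(reps):
        x = rng.normal(0, 1, n1)
        y = rng.normal(shift, 1, n2)
        _, p = stats.mannwhitneyu(x, y, alternative="two-sided")
        hits += p < alpha
    return hits / reps

p_full = power(1000, 50)     # keep the full larger sample
p_reduced = power(50, 50)    # subsample it down to the smaller size
print(p_full, p_reduced)     # power drops after subsampling
```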
As a result, if you failed to reject with a large sample size, unless your colleague plans to fudge the sampling there's very little chance of obtaining a rejection (but fudging or not, there's no justification for doing this on the present information).
However - while it sounds like this isn't the problem here - it's often the case that when people are bothered by a large sample size, they're bothered by a rejection at a large sample size. When that happens, it's usually because they have confused statistical significance with any kind of practical meaningfulness. [If they expect a procedure to identify only practical / meaningful differences they really have no business using an ordinary null-hypothesis-type significance test in the first place. That's not what it does and that's not what it's for. With very large sample sizes it will identify very small differences.]
Another common problem with the Mann-Whitney-Wilcoxon is that people have often picked up a misunderstanding (usually straight out of one popular text or another) of what this test actually tests for (for example, expressing amazement that a test they believe compares medians rejects even when the sample medians are identical).
Without more details it's hard to suggest anything more, but with more details there might have been a way to offer some additional thoughts.