"$t$ procedures can be used even for clearly skewed distributions."
I don't think it's automatically reasonable to give a specific $n$ for this advice, because there will be distributions/data sets that break it. There may well be some sufficient $n$ for a given distribution, but in my experience it doesn't kick in anywhere close to $n=40$.
Does this mean that if $n$ for one of my samples is $\geq 20$ (or $\geq 40$?), I will get more accurate/powerful results using the two-sample t-test (assuming unequal variances) instead?
In general, the answer is a very clear 'no'. The t-test tends to have reasonably good power relative to the Mann-Whitney for light-tailed distributions, and can have really bad power for heavy-tailed ones. Skewness tends to be compounded with heavy tails, so if power is your main motivation for using the t-test, you should probably avoid it in this case.
The Mann-Whitney test should be sensitive to the sorts of departures you're trying to pick up (but note that scale and location will be confounded).
Another possibility is to do a permutation, randomization or bootstrap test.
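As a sketch of the permutation approach, here is a minimal two-sample permutation test on the difference in means (the function name and the choice of test statistic are mine, not from the original answer; other statistics such as a trimmed-mean difference work the same way):

```python
import numpy as np

def perm_test(x, y, n_perm=10000, seed=0):
    """Two-sample permutation test on the difference in means.

    Repeatedly shuffles the pooled data into two groups of the
    original sizes and counts how often the shuffled difference
    is at least as extreme as the observed one.
    """
    rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[:len(x)].mean() - perm[len(x):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # add-one adjustment so the reported p-value is never exactly zero
    return (count + 1) / (n_perm + 1)
```

The randomization version is identical except that the group labels, rather than the pooled values, are shuffled; a bootstrap test resamples with replacement instead.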
This is not a problem with the t-test specifically, but with any test whose power grows with the sample size: with a large enough $n$, even a tiny, practically irrelevant difference will come out statistically significant. Such a study is sometimes called "overpowered". And yes, switching to Mann-Whitney will not help.
Therefore, apart from asking whether the results are statistically significant, you need to ask yourself whether the observed effect size is significant in the common sense of the word (i.e., meaningful). This requires more than statistical knowledge, but also your expertise in the field you are investigating.
In general, there are two ways you can look at the effect size. One way is to scale the difference between the means in your data by its standard deviation. Since the standard deviation is in the same units as your means and describes the dispersion of your data, you can express the difference between your groups in terms of standard deviations. Also, when you estimate the variance / standard deviation of your data, it does not systematically decrease with the number of samples (unlike the standard error of the mean, which shrinks as $1/\sqrt{n}$).
This is, for example, the reasoning behind Cohen's $d$:
$$d = \frac{ \bar{x}_1 - \bar{x}_2 }{ s}$$
...where $s$ is the square root of the pooled variance.
$$s = \sqrt{\frac{ s_1^2\cdot(n_1-1) + s_2^2\cdot(n_2 - 1) }{ N - 2 } }$$
(where $N=n_1+n_2$ and $s_1$ and $s_2$ are the standard deviations in group 1 and 2, respectively; that is, $s_1 = \sqrt{ \frac{\sum(x_i-\bar{x_1})^2 }{n_1 -1 }} $).
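The formulas above translate directly into code. A minimal sketch (the function name is mine):

```python
import math

def cohens_d(x, y):
    """Cohen's d: difference in means scaled by the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    # unbiased sample variances of each group
    s1_sq = sum((v - m1) ** 2 for v in x) / (n1 - 1)
    s2_sq = sum((v - m2) ** 2 for v in y) / (n2 - 1)
    # square root of the pooled variance, as defined above
    s = math.sqrt((s1_sq * (n1 - 1) + s2_sq * (n2 - 1)) / (n1 + n2 - 2))
    return (m1 - m2) / s
```

For example, `cohens_d([1, 2, 3], [4, 5, 6])` gives $d = -3$: the group means differ by three pooled standard deviations.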
Another way of looking at the effect size -- and frankly, one that I personally prefer -- is to ask what part (percentage) of the variability in the data can be explained by the estimated effect. You can estimate the variance between and within the groups and see how they relate (this is actually what ANOVA is, and the t-test is in principle a special case of ANOVA). This is the reasoning behind the coefficient of determination, $r^2$, and the related $\eta^2$ and $\omega^2$ statistics. Now, in a t-test, $\eta^2$ can easily be calculated from the $t$ statistic itself:
$$\eta^2 = \frac{ t^2}{t^2 + n_1 + n_2 - 2 }$$
This value can be directly interpreted as "fraction of variance in the data which is explained by the difference between the groups". There are different rules of thumb to say what is a "large" and what is a "small" effect, but it all depends on your particular question. 1% of the variance explained can be laughable, or can be just enough.
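To illustrate where this formula comes from: for an equal-variance two-sample t-test, $\eta^2$ computed from $t$ coincides exactly with the ANOVA ratio $SS_{\text{between}}/SS_{\text{total}}$. A sketch verifying this on toy data (the data and function name are my own):

```python
import math

def eta_squared_from_t(t, n1, n2):
    """Eta-squared computed from the t statistic, as in the formula above."""
    return t ** 2 / (t ** 2 + n1 + n2 - 2)

# toy data for the cross-check
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 6.0, 6.0]
n1, n2 = len(x), len(y)
m1, m2 = sum(x) / n1, sum(y) / n2
grand = (sum(x) + sum(y)) / (n1 + n2)

# pooled (equal-variance) t statistic
s1_sq = sum((v - m1) ** 2 for v in x) / (n1 - 1)
s2_sq = sum((v - m2) ** 2 for v in y) / (n2 - 1)
sp = math.sqrt((s1_sq * (n1 - 1) + s2_sq * (n2 - 1)) / (n1 + n2 - 2))
t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))

# ANOVA decomposition: between-group and total sums of squares
ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
ss_total = sum((v - grand) ** 2 for v in x + y)
# eta_squared_from_t(t, n1, n2) equals ss_between / ss_total
```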
Best Answer
With large sample sizes the statistic may be quite large.
A typical definition of the U statistic is the number of cross-sample pairs where an observation from the first sample exceeds an observation from the second sample.
[That definition would result in an expected value (if the null hypothesis were true) of $n_1n_2/2$.]
There are several different (but closely related) definitions of the statistic in use across packages, and they may have different expected values under the null. For example, some make it equivalent to the Wilcoxon rank-sum statistic by adding $n_1(n_1+1)/2$. Others consider both the first statistic I mentioned and the number of times a value from the second sample exceeds one from the first sample (i.e. swapping which sample is considered "the first" and which "the second"), and then take the smaller of the two statistics. Others may subtract the expected value.
You would have to consult the help for your package to see which exact definition is used.
Looking at the manual for PAST here, it says (p47):
This is one of the possible statistics I mentioned earlier.
(I note that elsewhere on this page of the manual, it makes a number of incorrect statements about the Mann-Whitney test. Exercise a great degree of caution when reading this manual.)
This definition for the U statistic would make the expected statistic somewhat below 13800 (about 13083). If the p-value is larger than 0.05 then I think you should not be seeing a U value below 12039.
I can't see how you're getting a value as small as 2829 without getting a very small p-value.
So in fact something does seem wrong, but it's not that the statistic is too large -- if everything you've said is correct, your U statistic is much too small.