I wouldn't call 'exponential' particularly highly skew. Its log is distinctly left-skew, for example, and its moment-skewness is only 2.
1) Using the t-test with exponential data and $n$ near 500 is fine:
a) The numerator of the test statistic should be fine: if the data are independent exponential with common scale (and not substantially heavier-tailed than that), then their averages are gamma-distributed with shape parameter equal to the number of observations. That gamma distribution looks very close to normal for shape parameters above about 40 or so (depending on how far out into the tail you need accuracy).
This is capable of mathematical proof, but mathematics is not science. You can check it empirically via simulation, of course, but if you're wrong about the exponentiality you may need larger samples. This is what the distribution of sample sums (and hence, sample means) of exponential data looks like when $n=40$:
Very slightly skew. This skewness decreases in inverse proportion to the square root of the sample size. So at $n=160$ it's half as skew; at $n=640$ it's one quarter as skew:
That this is effectively symmetric can be seen by flipping it over about the mean and plotting it over the top:
Blue is the original, red is flipped. As you see, they're almost coincident.
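If you want to check that skewness behaviour yourself, here's a minimal R sketch (my own illustration, not the code behind the figures above); since the mean of $n$ unit exponentials is gamma with shape $n$, its exact skewness is $2/\sqrt{n}$:

set.seed(1)
skew = function(x) mean((x-mean(x))^3)/sd(x)^3     # rough moment skewness
for (n in c(40,160,640)) {
  xbar = replicate(20000, mean(rexp(n)))           # means of n unit exponentials
  cat("n =", n, " simulated skewness:", round(skew(xbar),3),
      " theoretical 2/sqrt(n):", round(2/sqrt(n),3), "\n")
}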
-
b) Even more importantly, the difference of two such gamma-distributed variables (such as you'd get with means of exponentials) is more nearly normal, and under the null (which is where you need it) the skewness will be zero. Here's that for $n=40$:
That is, the numerator of the t-statistic is very close to normal at far smaller sample sizes than $n=500$.
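Here's a minimal sketch of that point (again my own illustration, not the code behind the figure): under the null the difference of two such means is symmetric, so its skewness is essentially zero and a normal Q-Q plot is very nearly straight even at $n=40$:

set.seed(2)
d = replicate(20000, mean(rexp(40)) - mean(rexp(40)))   # difference of means under the null
mean((d-mean(d))^3)/sd(d)^3                              # skewness: essentially 0
qqnorm(d); qqline(d)                                     # very close to a straight line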
-
c) What really matters, however, is the distribution of the entire statistic under the null. Normality of the numerator is not sufficient to make the t-statistic have a t-distribution. However, in the exponential-data case, that's also not much of a problem:
The red curve is the density of the t-distribution with df=78; the histogram is what the Welch t-statistic on exponential samples gives you (under the null of equal means; the actual Welch-Satterthwaite degrees of freedom in a given sample will tend to be a little smaller than 78). In particular, the tail areas in the region of your significance level should be similar (and unless you use some very unusual significance levels, they are). Remember, this is at $n=40$, not $n=500$; it's much better at $n=500$. (A small simulation sketch along these lines follows below.)
Note, however, that for genuinely exponential data the standard deviations will only differ if the means differ. If the exponential assumption holds, then under the null there's no particular need to worry about unequal population variances, since they only arise under the alternative. So an equal-variance t-test should still be okay (in which case the good approximation you see in the histogram may even be slightly better).
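Here's a small sketch of that comparison (a rough reconstruction, not the exact code behind the histogram): simulate the Welch t-statistic for exponential samples of size 40 under the null and compare its tail behaviour with a t-distribution on 78 df:

set.seed(3)
tstat = replicate(20000, t.test(rexp(40), rexp(40))$statistic)  # Welch t under the null
hist(tstat, breaks=60, freq=FALSE)
curve(dt(x, df=78), add=TRUE, col="red")     # reference t density
mean(abs(tstat) > qt(.975, df=78))           # rejection rate at nominal 5%; close to 0.05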
2) Taking logs may still allow you to make sense of it, though
If the null is true, and you have exponential distributions, you're testing equality of the scale parameters. Location-testing the means of the logs will test equality of logs of the scale parameters against a location shift alternative in the logs (change of scale in the original values). If you conclude that $\log\lambda_1\neq\log\lambda_2$ in a location test in the logs, that's logically the same as concluding that $\lambda_1\neq\lambda_2$. So testing the logs with a t-test works perfectly well as a test of the original hypothesis.
[If you do that test in the logs, I'd be inclined to suggest doing an equal-variance test in that case.]
So - with the mere intervention of perhaps a sentence or two justifying the connection, similar to what I have above - you should be able to write your conclusions not about the log of the participation metric, but about the participation metric itself.
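As a minimal sketch of that log-scale approach (assuming genuinely exponential - i.e. continuous, positive - data rather than counts, with the scale of 2 below purely illustrative):

set.seed(4)
x = rexp(500, rate=1/2); y = rexp(500, rate=1/2)   # both scale 2, so the null is true here
t.test(log(x), log(y), var.equal=TRUE)             # tests equality of the log-scales
# rejecting here is logically a rejection of equal scales on the original measurement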
3) There are plenty of other things you can do!
a) you can do a test suitable for exponential data. It's easy to derive a likelihood-ratio-based test. As it happens, for exponential data you get a small-sample F-test (based on the ratio of means) for this situation in the one-tailed case; the two-tailed LRT would not generally put an equal proportion in each tail at small sample sizes. (This should have better power than the t-test, but the power of the t-test should be quite reasonable, and I'd expect there not to be much difference at your sample sizes.)
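A sketch of one way to carry that out (a helper of my own, not a standard function): under the null of equal scales, the ratio of the two sample means of exponential samples has an F distribution with $2n_1$ and $2n_2$ degrees of freedom:

exp_ratio_test = function(x, y) {
  Fstat = mean(x)/mean(y)
  df1 = 2*length(x); df2 = 2*length(y)
  p1 = pf(Fstat, df1, df2, lower.tail=FALSE)   # one-tailed: scale of x exceeds scale of y
  p2 = 2*min(p1, 1-p1)                         # simple equal-tailed two-sided version,
                                               # not the exact two-tailed LRT mentioned above
  list(F=Fstat, df=c(df1,df2), p.one.sided=p1, p.two.sided=p2)
}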
b) you can do a permutation-test - even base it on the t-test if you like. So the only thing that changes is the computation of the p-value. Or you might do some other resampling test such as a bootstrap-based test. This should have good power, though it will depend partly on what test statistic you choose relative to the distribution you have.
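A minimal sketch of a permutation test built on the Welch t-statistic (a generic helper of my own, for two numeric vectors x and y):

perm_t_test = function(x, y, nrep=9999) {
  obs = t.test(x, y)$statistic
  pooled = c(x, y); n1 = length(x)
  perm = replicate(nrep, {
    idx = sample(length(pooled), n1)            # random relabelling of the group members
    t.test(pooled[idx], pooled[-idx])$statistic
  })
  mean(c(abs(perm), abs(obs)) >= abs(obs))      # two-sided permutation p-value
}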
c) you can do a rank-based nonparametric test (such as the Wilcoxon-Mann-Whitney). If you assume that if the distributions differ, then they differ only by a scale factor (appropriate for a variety of skewed distributions including the exponential), then you can even obtain a confidence interval for the ratio of the scale parameters.
[For that purpose, I'd suggest working on the log-scale (the location shift in the logs being the log of the scale shift). It won't change the p-value, but it will allow you to exponentiate the point estimate and the CI limits to obtain an interval for the scale shift; see the sketch just below.]
This, too, should tend to have pretty good power if you're in the exponential situation, but likely not as good as using the t-test.
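A minimal sketch of that interval construction (assuming positive, continuous data x and y that differ by at most a scale factor; conf.int=TRUE gives the Hodges-Lehmann estimate and interval for the shift in the logs):

wt = wilcox.test(log(x), log(y), conf.int=TRUE)
wt$p.value        # same p-value as a Wilcoxon test on the original values
exp(wt$estimate)  # point estimate of the ratio of scale parameters
exp(wt$conf.int)  # confidence interval for that scale ratio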
A reference which considers a considerably broader set of cases for the location shift alternative (with both variance and skewness heterogeneity under the null, for example) is
Fagerland, M.W. and L. Sandvik (2009), "Performance of five two-sample location tests for skewed distributions with unequal variances," Contemporary Clinical Trials, 30, 490–496.
It generally tends to recommend the Welch U-test (a particular one of the several tests considered by Welch and the only one they tested). If you're not using exactly the same Welch statistic the recommendations may vary somewhat (though probably not by much). [Note that if your distributions are exponential you're interested in a scale alternative unless you take logs ... in which case you won't have unequal variances.]
Highly discrete and skew variables can exhibit some particular issues in their t-statistics:
For example, consider something like this:
(it has a bit more of a tail out to the right that's been cut off, going out to 90-something)
The distribution of two-sample t-statistics for samples of size 50 looks something like this:
In particular, there are somewhat short tails and a noticeable spike at 0.
Issues like these suggest that simulation from distributions that look something like your sample might be necessary to judge whether the sample size is 'large enough'.
Your data seem to have somewhat more of a tail than in my example above, but your sample size is much larger (I was hoping for something like a frequency table). It may be okay, but you could simulate from some models in the neighborhood of your sample distribution (or resample your data) to get some idea of whether those sample sizes would be sufficient to treat the distribution of your test statistic as approximately $t$.
Simulation study A - t.test significance level (based on the supplied frequency tables)
Here I resampled your frequency tables to get a sense of the impact of distributions like yours on the inference from a t-test. I did two simulations, both using your sample sizes for the UsersX and UsersY groups, but with both groups drawn from the same table so that $H_0$ is true: in the first instance sampling from the X-data for both groups, and in the second instance from the Y-data for both.
The results were (not surprisingly given the similarity in shape) fairly similar:
The distribution of p-values should look like a uniform distribution. That it doesn't is probably due to the same thing that produces the spike in the histogram of the t-statistic I drew earlier - while the general shape is okay, there's a distinct possibility of a mean difference of exactly zero. This spike inflates the type I error rate, lifting a 5% significance level to roughly 7.5 or 8 percent:
> sum(tpres1<.05)/length(tpres1)
[1] 0.0769
> sum(tpres2<.05)/length(tpres2)
[1] 0.0801
This is not necessarily a problem - if you know about it. You could, for example, (a) do the test "as is", keeping in mind you will get a somewhat higher type I error rate; or (b) drop the nominal type I error rate by about half (or even a bit more, since it affects smaller significance levels relatively more than larger ones).
My suggestion - if you want to do a t-test - would instead be to use the t-statistic but to do a resampling-based test (do a permutation/randomization test or, if you prefer, do a bootstrap test).
--
Simulation study B - Mann-Whitney test significance level (based on the supplied frequency tables)
To my surprise, by contrast, the Mann-Whitney is quite level-robust at this sample size. This contradicts a couple of sets of published recommendations that I've seen (admittedly based on studies conducted at smaller sample sizes).
> sum(mwpres1<.05)/length(mwpres1)
[1] 0.0509
> sum(mwpres2<.05)/length(mwpres2)
[1] 0.0482
(the histograms for this case appear uniform, so this should work similarly at other typical significance levels)
Significance levels of 4.8 and 5.1 percent (with standard error 0.22%) are excellent with distributions like these.
On this basis I'd say that - on significance level at least - the Mann-Whitney is performing quite well. We'd have to do a power study to see the impact on power, but I don't expect it would do too badly compared to, say, the t-test (if we adjust things so they're at about the same actual significance level).
So I have to eat my previous words - my caution on the Mann-Whitney looks to be unnecessary at this sample size.
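If you did want a rough look at power, one possible sketch (not a proper power study) is to sample group 1 from the X table and group 2 from the Y table using the resample() helper defined below, so that whatever difference exists between the two supplied tables plays the role of the alternative. Note that the t-test's rejection rate here is flattered somewhat by its inflated actual level unless you adjust the cutoff:

powt  = replicate(10000, t.test(resample(UsersX), resample(UsersY))$p.value)
powmw = replicate(10000, wilcox.test(resample(UsersX), resample(UsersY))$p.value)
c(t.test = mean(powt < .05), mann.whitney = mean(powmw < .05))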
My R code for reading in the frequency tables
#metric1 sample1
UsersX=data.frame(
count=c(182L, 119L, 41L, 11L, 7L, 5L, 5L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
value=c(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 12L, 17L, 18L, 20L, 29L, 35L, 42L)
)
#metric 1 sample2
UsersY=data.frame(
count=c(5098L, 2231L, 629L, 288L, 147L, 104L, 50L, 39L, 28L, 22L, 12L, 14L, 8L, 8L,
9L, 5L, 2L, 5L, 5L, 4L, 1L, 3L, 2L, 1L, 1L, 4L, 1L, 4L, 1L, 1L, 1L, 1L, 1L, 1L),
value=c(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 18L, 19L, 20L, 21L, 22L, 25L, 26L, 27L, 28L, 31L, 33L, 37L, 40L, 44L, 50L, 76L)
)
My R code for doing simulations
resample=function(tbl,n=sum(tbl$count)) #$
sample(tbl$value,size=n,replace=TRUE,prob=tbl$count) #$
n1=sum(UsersX$count) #$
n2=sum(UsersY$count) #$
tpres1=replicate(10000,t.test(resample(UsersX),resample(UsersX,n2))$p.value) #$
tpres2=replicate(10000,t.test(resample(UsersY,n1),resample(UsersY))$p.value) #$
mwpres1=replicate(10000,wilcox.test(resample(UsersX),resample(UsersX,n2))$p.value)#$
mwpres2=replicate(10000,wilcox.test(resample(UsersY,n1),resample(UsersY))$p.value)#$
# "#$" at end of each line avoids minor issue with rendering R code containing "$"
Best Answer
The answer is "Yes". This is Simpson's paradox applied to mean differences instead of odds ratios. You can read the Wikipedia article (http://en.wikipedia.org/wiki/Simpson%27s_paradox) to understand the mechanisms behind it. It's a projection problem: if you only see a two-dimensional projection of a three-dimensional object, you can get quite a wrong impression of the whole picture. In balanced settings (equal group sizes), this is not possible.
Consider, for instance, the following simple setting: the average of $A = A_1 \cup A_2$ is about 2 and thus much smaller than the average 45 of $B = B_1 \cup B_2$. On the other hand, the average 1 of $A_1$ is larger than the average $-9$ of $B_1$. Similarly, the average 100 of $A_2$ is larger than the average 99 of $B_2$.
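One concrete set of numbers consistent with those averages (my own made-up reconstruction, just to make the arithmetic visible):

A1 = rep(1, 99);  A2 = rep(100, 1)    # subgroup means 1 and 100; A is almost entirely A1
B1 = rep(-9, 50); B2 = rep(99, 50)    # subgroup means -9 and 99; B is split evenly
mean(c(A1, A2))    # about 2
mean(c(B1, B2))    # 45
mean(A1) > mean(B1); mean(A2) > mean(B2)   # both TRUE: A is ahead within each subgroup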