T-Test – Should T-Test Be Used on Highly Skewed Data?

hypothesis testing, mean, nonparametric, skewness, t-test

I have samples from a highly skewed dataset (it looks like an exponential distribution) about users' participation (e.g., number of posts). The samples have different sizes (but none smaller than 200), and I want to compare their means. For that, I'm using two-sample unpaired t-tests (and t-tests with Welch's correction when the samples have different variances), because I have heard that, for really large samples, it doesn't matter that the samples are not normally distributed.

Someone reviewing what I've done said that the tests I am using were not suitable for my data. They suggested log-transforming my samples before using the t-tests.

I am a beginner, so it sounds really confusing to me to answer my research questions with "log of participation metric".

Are they wrong? Am I wrong? If they are wrong, is there a book or scientific paper which I could cite/show them? If I am wrong, which test should I use?

Best Answer

I wouldn't call 'exponential' particularly highly skew. Its log is distinctly left-skew, for example, and its moment-skewness is only 2.

1) Using the t-test with exponential data and $n$ near 500 is fine:

a) The numerator of the test statistic should be fine: If the data are independent exponential with common scale (and not substantially heavier-tailed than that), then their averages are gamma-distributed with shape parameter equal to the number of observations. Its distribution looks very normal for shape parameter greater than about 40 or so (depending on how far out into the tail you need accuracy).

This is capable of mathematical proof, but mathematics is not science. You can check it empirically via simulation, of course, but if you're wrong about the exponentiality you may need larger samples. This is what the distribution of sample sums (and hence, sample means) of exponential data looks like when n=40:

[Figure: distribution of sample sums of exponential data, n=40]

Very slightly skew. This skewness decreases in proportion to the reciprocal of the square root of the sample size. So at n=160, it's half as skew. At n=640 it's one quarter as skew:

[Figure: distribution of sample sums of exponential data, n=640]

That this is effectively symmetric can be seen by flipping it over about the mean and plotting it over the top:

[Figure: the same distribution overlaid with its mirror image about the mean]

Blue is the original, red is flipped. As you see, they're almost coincident.
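The shrinking skewness can be checked numerically. The following is a minimal sketch, assuming unit-scale exponential data; the function name `skewness_of_exponential_means` and the number of replications are illustrative choices, not part of the original answer. It draws the sample sums directly from the gamma distribution mentioned above and compares their sample skewness with the theoretical gamma skewness, $2/\sqrt{n}$.

```python
import numpy as np
from scipy import stats

def skewness_of_exponential_means(n, reps=100_000, seed=0):
    """Sample skewness of `reps` simulated means of n exponential(1) draws."""
    rng = np.random.default_rng(seed)
    # The sum of n iid exponential(1) variables is Gamma(shape=n, scale=1),
    # so the sums can be drawn directly instead of simulating every observation.
    means = rng.gamma(shape=n, scale=1.0, size=reps) / n
    return stats.skew(means)

for n in (40, 160, 640):
    theory = 2 / np.sqrt(n)  # skewness of a gamma with shape n
    simulated = skewness_of_exponential_means(n)
    print(f"n={n:4d}  theoretical skewness={theory:.3f}  simulated={simulated:.3f}")
```

Quadrupling $n$ halves the theoretical value, which is the "half as skew" and "one quarter as skew" pattern described above.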

-

b) Even more importantly, the difference of two such gamma-distributed variables (such as you'd get with means of exponentials) is more nearly normal, and under the null (which is where you need it) the skewness will be zero. Here's that for $n=40$:

[Figure: distribution of the difference of two such gamma-distributed variables, n=40]

That is, the numerator of the t-statistic is very close to normal at far smaller sample sizes than $n=500$.

-

c) What really matters, however, is the distribution of the entire statistic under the null. Normality of the numerator is not sufficient to make the t-statistic have a t-distribution. However, in the exponential-data case, that's also not much of a problem:

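[Figure: histogram of Welch t-statistics from exponential samples under the null, n=40, with the t density (df=78) overlaid in red]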

The red curve is the t-distribution with df=78; the histogram is what using the Welch t-test on exponential samples gets you (under the null of equal means; the actual Welch-Satterthwaite degrees of freedom in a given sample will tend to be a little smaller than 78). In particular, the tail areas in the region of your significance level should be similar (unless you have some very unusual significance levels, they are). Remember, this is at $n=40$, not $n=500$. It's much better at $n=500$.
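One way to check the tail areas for yourself is to simulate the rejection rate under the null. This is a rough sketch, assuming two independent unit-scale exponential samples of 40 observations each and a nominal 5% two-sided level; `welch_rejection_rate` is a hypothetical helper name, and the replication count is arbitrary.

```python
import numpy as np
from scipy import stats

def welch_rejection_rate(n=40, reps=20_000, alpha=0.05, seed=1):
    """Fraction of simulated null datasets in which Welch's t-test rejects."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment; both groups share the same mean.
    x = rng.exponential(scale=1.0, size=(reps, n))
    y = rng.exponential(scale=1.0, size=(reps, n))
    # equal_var=False selects Welch's test; axis=1 runs one test per row.
    result = stats.ttest_ind(x, y, axis=1, equal_var=False)
    return np.mean(result.pvalue < alpha)

rate = welch_rejection_rate()
print(f"Empirical rejection rate at nominal 5%: {rate:.4f}")
```

If the t approximation is adequate, the printed rate should sit close to the nominal level; rerunning with a larger `n` shows how the agreement changes with sample size.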

Note, however, that for actually exponential data, the standard deviation will only be different if the means are different. If the exponential presumption is the case, then under the null, there's no particular need to worry about different population variances, since they only occur under the alternative. So an equal-variance t-test should still be okay (in which case the above good approximation you see in the histogram may even be slightly better).


2) Taking logs may still allow you to make sense of it, though

If the null is true, and you have exponential distributions, you're testing equality of the scale parameters. Location-testing the means of the logs will test equality of logs of the scale parameters against a location shift alternative in the logs (change of scale in the original values). If you conclude that $\log\lambda_1\neq\log\lambda_2$ in a location test in the logs, that's logically the same as concluding that $\lambda_1\neq\lambda_2$. So testing the logs with a t-test works perfectly well as a test of the original hypothesis.

[If you do that test in the logs, I'd be inclined to suggest doing an equal-variance test in that case.]

So, with the mere intervention of perhaps a sentence or two justifying the connection (similar to what I have above), you should be able to write your conclusions not about the log of the participation metric, but about the participation metric itself.
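A minimal sketch of the log-scale test, using simulated data rather than real participation counts: the sample sizes, the true scale ratio of 2, and the variable names are all assumptions made for illustration. Note also that real count data containing zeros cannot be logged directly, which this sketch does not address.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical strictly positive data: exponential with scales 1 and 2.
a = rng.exponential(scale=1.0, size=300)
b = rng.exponential(scale=2.0, size=400)

log_a, log_b = np.log(a), np.log(b)

# Equal-variance t-test on the logs: a pure scale change in the original data
# is a pure location shift in the logs, so the log-variances are equal.
res = stats.ttest_ind(log_a, log_b, equal_var=True)

# Back-transform the difference in mean logs to an estimated scale ratio.
ratio_estimate = np.exp(log_b.mean() - log_a.mean())

print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3g}")
print(f"Estimated scale ratio (b relative to a): {ratio_estimate:.2f}")
```

Reporting the back-transformed ratio is what lets the conclusion be phrased in terms of the participation metric itself rather than its log.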


3) There's plenty of other things you can do!

a) you can do a test suitable for exponential data. It's easy to derive a likelihood-ratio-based test. As it happens, for exponential data you get a small-sample F-test (based on a ratio of means) for this situation in the one-tailed case; the two-tailed LRT would not generally have an equal proportion in each tail for small sample sizes. (This should have better power than the t-test, but the power for the t-test should be quite reasonable, and I'd expect there not to be much difference at your sample sizes.)
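The F-test based on the ratio of means can be sketched as follows. It relies on the fact that, for exponential data with a common scale, the ratio of the two sample means follows an F distribution with $2n_1$ and $2n_2$ degrees of freedom. The function name `exponential_f_test` is hypothetical, and the two-sided p-value here simply doubles the smaller tail (an equal-tailed convention), which is not the two-tailed LRT mentioned above.

```python
import numpy as np
from scipy import stats

def exponential_f_test(x, y):
    """Ratio-of-means F-test, valid if both samples are iid exponential.

    Returns (F statistic, one-sided upper-tail p-value, equal-tailed
    two-sided p-value).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    f_stat = x.mean() / y.mean()
    dfn, dfd = 2 * x.size, 2 * y.size   # each exponential contributes 2 df
    upper = stats.f.sf(f_stat, dfn, dfd)   # P(F >= observed)
    lower = stats.f.cdf(f_stat, dfn, dfd)  # P(F <= observed)
    two_sided = min(1.0, 2 * min(lower, upper))
    return f_stat, upper, two_sided

rng = np.random.default_rng(5)
x = rng.exponential(scale=1.0, size=200)
y = rng.exponential(scale=1.0, size=250)
f_stat, p_upper, p_two_sided = exponential_f_test(x, y)
print(f"F = {f_stat:.3f}, one-sided p = {p_upper:.3f}, two-sided p = {p_two_sided:.3f}")
```

This test leans entirely on the exponential assumption, so it is less forgiving than the t-test if the data are heavier-tailed than exponential.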

b) you can do a permutation-test - even base it on the t-test if you like. So the only thing that changes is the computation of the p-value. Or you might do some other resampling test such as a bootstrap-based test. This should have good power, though it will depend partly on what test statistic you choose relative to the distribution you have.
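A permutation test built on the Welch t-statistic might look like the sketch below. The helper name `permutation_pvalue`, the number of permutations, and the simulated data are illustrative assumptions; only the way the p-value is computed differs from the ordinary t-test.

```python
import numpy as np
from scipy import stats

def permutation_pvalue(x, y, n_perm=2000, seed=3):
    """Two-sided permutation p-value using the Welch t-statistic."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(stats.ttest_ind(x, y, equal_var=False).statistic)
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        # Randomly reassign group labels, keeping the group sizes fixed.
        perm = rng.permutation(pooled)
        t = stats.ttest_ind(perm[:x.size], perm[x.size:], equal_var=False).statistic
        count += abs(t) >= observed
    # Adding 1 to numerator and denominator avoids reporting a p-value of 0.
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(6)
x = rng.exponential(scale=1.0, size=200)
y = rng.exponential(scale=2.0, size=250)
p = permutation_pvalue(x, y)
print(f"Permutation p-value: {p:.4f}")
```

The smallest attainable p-value is `1 / (n_perm + 1)`, so more permutations are needed if very small significance levels matter.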

c) you can do a rank-based nonparametric test (such as the Wilcoxon-Mann-Whitney). If you assume that if the distributions differ, then they differ only by a scale factor (appropriate for a variety of skewed distributions including the exponential), then you can even obtain a confidence interval for the ratio of the scale parameters.

[For that purpose, I'd suggest working on the log-scale (the location shift in the logs being the log of the scale shift). It won't change the p-value, but it will allow you to exponentiate the point estimate and the CI limits to obtain an interval for the scale shift.]

This, too, should tend to have pretty good power if you're in the exponential situation, but likely not as good as using the t-test.
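The rank-based approach with an interval for the scale ratio could be sketched like this. The interval construction is an assumption on my part: it uses the Hodges-Lehmann estimate of the location shift in the logs (the median of all pairwise differences) and a normal approximation to the rank-sum distribution to pick the order statistics for the limits. The function name `scale_ratio_ci` and the simulated data are hypothetical.

```python
import numpy as np
from scipy import stats

def scale_ratio_ci(x, y, conf=0.95):
    """Hodges-Lehmann estimate and approximate CI for scale(y) / scale(x).

    Works on the log scale, where a scale change becomes a location shift,
    then exponentiates the estimate and the interval limits.
    """
    lx, ly = np.log(x), np.log(y)
    n1, n2 = lx.size, ly.size
    # All n1 * n2 pairwise differences of logs, sorted.
    diffs = np.sort((ly[:, None] - lx[None, :]).ravel())
    z = stats.norm.ppf(0.5 + conf / 2)
    # Normal approximation for the rank of the lower limit (1-indexed).
    c = int(np.floor(n1 * n2 / 2 - z * np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)))
    c = max(c, 1)
    lower, upper = diffs[c - 1], diffs[n1 * n2 - c]
    return np.exp(np.median(diffs)), np.exp(lower), np.exp(upper)

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=200)
y = rng.exponential(scale=2.0, size=250)

# Ranks are unchanged by the log, so the test gives the same p-value either way.
p_raw = stats.mannwhitneyu(x, y, alternative="two-sided").pvalue
p_log = stats.mannwhitneyu(np.log(x), np.log(y), alternative="two-sided").pvalue

est, lo, hi = scale_ratio_ci(x, y)
print(f"p (raw) = {p_raw:.3g}, p (logs) = {p_log:.3g}")
print(f"Estimated scale ratio: {est:.2f}, {95}% CI: ({lo:.2f}, {hi:.2f})")
```

The interval is only meaningful under the stated assumption that the two distributions differ by a scale factor alone.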


A reference which considers a considerably broader set of cases for the location shift alternative (with both variance and skewness heterogeneity under the null, for example) is

Fagerland, M.W. and L. Sandvik (2009),
"Performance of five two-sample location tests for skewed distributions with unequal variances,"
Contemporary Clinical Trials, 30, 490–496

It generally tends to recommend the Welch U-test (a particular one of the several tests considered by Welch and the only one they tested). If you're not using exactly the same Welch statistic the recommendations may vary somewhat (though probably not by much). [Note that if your distributions are exponential you're interested in a scale alternative unless you take logs ... in which case you won't have unequal variances.]