Hypothesis Testing – Should t-Test Be Used on Highly Skewed and Discrete Data?

hypothesis-testing, mean, nonparametric, t-test

I have samples from a highly skewed dataset about users' participation (e.g. number of posts). The samples have different sizes (but none smaller than 200), and I want to compare their means. For that I'm using two-sample unpaired t-tests (with Welch's correction when the samples have different variances), since I have heard that for really large samples it doesn't matter that the samples are not normally distributed.
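In R terms, what I'm doing is roughly the following (a minimal sketch; x and y stand for two of my samples):

# pooled-variance two-sample unpaired t-test
t.test(x, y, var.equal = TRUE)

# Welch's version, for unequal variances (R's default)
t.test(x, y)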

My metrics are discrete: they are counts of each user's participation. Of course there are users who participate much more than the others, but I'm not considering them outliers. Here is a description of the data: https://docs.google.com/spreadsheets/d/1WhSKgYIuP35eRsukHVoUFUlITNwO_RRcYoOoR9EmXHg/edit?usp=sharing

My problem: someone reviewing what I've done said that the tests I am using were not suitable for my data, and suggested log-transforming my samples before applying the t-tests.

I know I can't log-transform these samples, because all of them contain zero values. My guess is that, if I can't use the t-test, I should use the Mann-Whitney U test.
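For reference, the Mann-Whitney U test I have in mind is available in R as the two-sample Wilcoxon rank-sum test:

# Mann-Whitney U test; with tied counts R falls back on a normal approximation
wilcox.test(x, y)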

Are they wrong? Am I wrong? If they are wrong, is there a book or scientific paper which I could cite/show them? If I am wrong, which test should I use?

Best Answer

Highly discrete and skewed variables can exhibit some particular issues in their t-statistics.

For example, consider something like this:

[Figure: bar plot of a highly skewed discrete distribution]

(it has a bit more of a tail out to the right, cut off in the plot, going out to 90-something)

The distribution of two-sample t-statistics for samples of size 50 from it looks something like this:

[Figure: histogram of the simulated two-sample t-statistics]

In particular, there are somewhat short tails and a noticeable spike at 0.

Issues like these suggest that simulation from distributions that look something like your sample may be necessary to judge whether the sample size is 'large enough'.

Your data seem to have somewhat more of a tail than my example above, but your sample sizes are much larger (I was hoping for something like a frequency table). It may be okay, but you could either simulate from some models in the neighborhood of your sample distribution, or resample your data, to get some idea of whether those sample sizes are sufficient to treat the distribution of your test statistic as approximately $t$.
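To sketch what such a simulation might look like - here a geometric distribution merely stands in for 'something skewed, discrete and zero-heavy'; it is not the distribution behind the figures above:

# distribution of two-sample t-statistics for a skewed discrete parent
tstats = replicate(10000, {
    x = rgeom(50, prob = 0.5)   # counts 0, 1, 2, ... with mean 1
    y = rgeom(50, prob = 0.5)
    t.test(x, y)$statistic
})
hist(tstats, breaks = 50, freq = FALSE)
curve(dt(x, df = 98), add = TRUE, lty = 2)   # t density as a reference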


Simulation study A - t.test significance level (based on the supplied frequency tables)

Here I resampled your frequency tables to get a sense of the impact of distributions like yours on inference from a t-test. I did two simulations, both using your sample sizes for the UsersX and UsersY groups, but in the first instance sampling from the X-data for both groups and in the second instance sampling from the Y-data for both (so that $H_0$ is true in each case).

The results were (not surprisingly given the similarity in shape) fairly similar:

[Figure: histograms of the simulated t-test p-values for the two resampling schemes]

Under $H_0$ the distribution of p-values should look uniform. The reason it doesn't is probably the same reason we see a spike in the histogram of the t-statistic drawn earlier: while the general shape is okay, there is a distinct probability of a mean difference of exactly zero. This spike inflates the type I error rate, lifting a 5% significance level to roughly 7.5 or 8 percent:

> sum(tpres1<.05)/length(tpres1)
[1] 0.0769

> sum(tpres2<.05)/length(tpres2)
[1] 0.0801
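The same inflation shows up if you plot the simulated p-values against the uniform reference (tpres1 is produced by the simulation code at the end of this answer; 20 bins is an arbitrary choice):

hist(tpres1, breaks = 20, main = "t-test p-values under H0", xlab = "p-value")
abline(h = length(tpres1)/20, lty = 2)   # approximate expected bar height under uniformity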

This is not necessarily a problem - if you know about it. You could, for example, (a) do the test "as is", keeping in mind you will get a somewhat higher type I error rate; or (b) drop the nominal type I error rate by about half (or even a bit more, since it affects smaller significance levels relatively more than larger ones).

My suggestion - if you want to do a t-test - would instead be to use the t-statistic but to do a resampling-based test (do a permutation/randomization test or, if you prefer, do a bootstrap test).
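A minimal sketch of such a permutation test, assuming x and y hold the two observed samples (the function name and the choice of B = 9999 resamples are mine):

perm.t.test = function(x, y, B = 9999) {
    tobs = t.test(x, y)$statistic            # observed Welch t-statistic
    z = c(x, y)
    n = length(x)
    tperm = replicate(B, {
        i = sample(length(z), n)             # random relabelling of the pooled data
        t.test(z[i], z[-i])$statistic
    })
    # two-sided p-value, counting the observed statistic among the resamples
    mean(abs(c(tobs, tperm)) >= abs(tobs))
}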

--

Simulation study B - Mann-Whitney test significance level (based on the supplied frequency tables)

To my surprise, by contrast, the Mann-Whitney is quite level-robust at this sample size. This contradicts a couple of sets of published recommendations I've seen (though those were admittedly based on studies at smaller sample sizes).

> sum(mwpres1<.05)/length(mwpres1)
[1] 0.0509

> sum(mwpres2<.05)/length(mwpres2)
[1] 0.0482

(the histograms for this case appear uniform, so the test should behave similarly at other typical significance levels)

Significance levels of 4.8 and 5.1 percent (with standard error 0.22%) are excellent with distributions like these.
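That standard error is simply the binomial Monte Carlo error of a rejection rate near 5% estimated from 10000 replicates:

sqrt(0.05 * (1 - 0.05) / 10000)   # ~0.0022, i.e. about 0.22%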

On this basis I'd say that - on significance level at least - the Mann-Whitney is performing quite well. We'd have to do a power study to see the impact on power, but I don't expect it would do too badly compared to, say, the t-test (if we adjust things so they're at about the same actual significance level).
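Such a power study could reuse the resample() helper and the frequency tables given at the end of this answer. As one possible sketch (the +1 shift playing the role of the alternative, and all the variable names here, are arbitrary choices of mine):

shifted = UsersX
shifted$value = shifted$value + 1L   # location-shift alternative on the counts

tpow  = replicate(10000, t.test(resample(UsersX, n1), resample(shifted, n2))$p.value)
mwpow = replicate(10000, wilcox.test(resample(UsersX, n1), resample(shifted, n2))$p.value)

mean(tpow < .025)    # t-test at a halved nominal level, per suggestion (b) above
mean(mwpow < .05)    # Mann-Whitney at the nominal 5% level

In practice one would vary the size of the shift to trace out power curves rather than rely on a single alternative.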

So I have to eat my previous words - my caution on the Mann-Whitney looks to be unnecessary at this sample size.


My R code for reading in the frequency tables

# metric 1, sample 1 (value = participation count, count = number of users with that value)
UsersX = data.frame(
    count = c(182L, 119L, 41L, 11L, 7L, 5L, 5L, 3L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
    value = c(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 12L, 17L, 18L, 20L, 29L, 35L, 42L)
)

# metric 1, sample 2
UsersY = data.frame(
    count = c(5098L, 2231L, 629L, 288L, 147L, 104L, 50L, 39L, 28L, 22L, 12L, 14L, 8L, 8L,
              9L, 5L, 2L, 5L, 5L, 4L, 1L, 3L, 2L, 1L, 1L, 4L, 1L, 4L, 1L, 1L, 1L, 1L, 1L, 1L),
    value = c(0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L,
              17L, 18L, 19L, 20L, 21L, 22L, 25L, 26L, 27L, 28L, 31L, 33L, 37L, 40L, 44L, 50L, 76L)
)

My R code for doing simulations

# draw a sample of size n from a frequency table, with probabilities proportional to the counts
resample = function(tbl, n = sum(tbl$count))
    sample(tbl$value, size = n, replace = TRUE, prob = tbl$count)

n1 = sum(UsersX$count)   # size of sample 1
n2 = sum(UsersY$count)   # size of sample 2

# t-test p-values with both groups drawn from the same table (H0 true)
tpres1 = replicate(10000, t.test(resample(UsersX), resample(UsersX, n2))$p.value)
tpres2 = replicate(10000, t.test(resample(UsersY, n1), resample(UsersY))$p.value)

# Mann-Whitney p-values under the same scheme
mwpres1 = replicate(10000, wilcox.test(resample(UsersX), resample(UsersX, n2))$p.value)
mwpres2 = replicate(10000, wilcox.test(resample(UsersY, n1), resample(UsersY))$p.value)