Solved – Using MAD as a way of defining a threshold for significance testing

mad, median, outliers, robust

If I have a set of terms, each with an associated frequency (the number of times the term appears in a fixed corpus of papers), is the following method of significance testing valid?

  1. calculate the median absolute deviation (MAD) of the GO term frequencies in the given corpus,

    for sample $S$ : ${\rm MAD}(S) = 1.4826 \times {\rm median}(|x_{i} - {\rm median}(S)|)$

  2. get ${\rm thresh} = 2.7 \times {\rm MAD}(S) + {\rm median}(S)$

  3. use ${\rm thresh}$ as a threshold above which the GO terms are deemed significantly associated with the given corpus and below which the GO terms are deemed non-significant.
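For concreteness, the three steps above can be sketched in Python as follows. This is only an illustration of the proposed rule, not an endorsement of it; the function name `mad_threshold` and the term frequencies are hypothetical and invented for the example.

```python
from statistics import median

def mad_threshold(freqs, k=2.7, scale=1.4826):
    """Return median(S) + k * MAD(S), where MAD is scaled by 1.4826."""
    freqs = list(freqs)
    med = median(freqs)
    # Step 1: scaled median absolute deviation around the median.
    mad = scale * median(abs(x - med) for x in freqs)
    # Step 2: threshold = k * MAD(S) + median(S).
    return med + k * mad

# Hypothetical term frequencies, made up purely for illustration.
term_freqs = {
    "term_a": 3, "term_b": 5, "term_c": 4, "term_d": 6,
    "term_e": 40, "term_f": 5, "term_g": 4,
}

thresh = mad_threshold(term_freqs.values())
# Step 3: terms strictly above the threshold are flagged.
flagged = sorted(t for t, f in term_freqs.items() if f > thresh)
print(thresh, flagged)
```

The constants `k=2.7` and `scale=1.4826` are exposed as parameters only so that the rule from the question is easy to vary.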

Best Answer

I doubt it. Most probably, the distribution of term frequencies is highly skewed. In that case, a threshold rule that assumes the underlying data are drawn from a symmetric distribution will give highly misleading thresholds (and, as a result, potentially misleading conclusions).
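To illustrate the effect of skewness, here is a small simulation sketch. It applies the median + 2.7 MAD rule to a symmetric (Gaussian) sample and to a right-skewed (exponential) sample. The exponential distribution is only an assumed stand-in for skewed frequency data, and the helper names are hypothetical. In theory, the rule flags roughly 0.3% of a Gaussian population but roughly 7% of an exponential one, even though neither sample contains any "special" points.

```python
import random
from statistics import median

def mad_threshold(sample, k=2.7, scale=1.4826):
    """median + k * scaled MAD, as in the rule from the question."""
    med = median(sample)
    return med + k * scale * median(abs(x - med) for x in sample)

def fraction_above(sample):
    """Share of observations that the rule would flag."""
    t = mad_threshold(sample)
    return sum(x > t for x in sample) / len(sample)

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20000
symmetric = [rng.gauss(0.0, 1.0) for _ in range(n)]   # symmetric data
skewed = [rng.expovariate(1.0) for _ in range(n)]     # right-skewed data

frac_sym = fraction_above(symmetric)
frac_skew = fraction_above(skewed)
print(f"flagged (symmetric): {frac_sym:.4f}")
print(f"flagged (skewed):    {frac_skew:.4f}")
```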

You could try applying the thresholding rule you propose to a transformed version of your data, using a transformation such as the arcsine. The rule you propose is based on order statistics, meaning that the result should not depend on which transformation you use, so long as it is a valid transformation (i.e. a monotone function on the domain of your inputs).

An alternative solution that I personally favor, because it simplifies interpretation, is to use adjusted boxplots.
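For reference, the adjusted boxplot of Hubert and Vandervieren replaces the symmetric boxplot fences with skewness-dependent ones. The following is a sketch of the commonly quoted form, where $Q_1$ and $Q_3$ are the sample quartiles, ${\rm IQR} = Q_3 - Q_1$, and ${\rm MC}$ is the medcouple (a robust skewness measure); the constants should be checked against the original paper before use. Observations outside the interval are flagged:

$$
\begin{cases}
\left[\,Q_1 - 1.5\,e^{-4\,{\rm MC}}\,{\rm IQR},\;\; Q_3 + 1.5\,e^{3\,{\rm MC}}\,{\rm IQR}\,\right] & \text{if } {\rm MC} \ge 0,\\[4pt]
\left[\,Q_1 - 1.5\,e^{-3\,{\rm MC}}\,{\rm IQR},\;\; Q_3 + 1.5\,e^{4\,{\rm MC}}\,{\rm IQR}\,\right] & \text{if } {\rm MC} < 0.
\end{cases}
$$

When ${\rm MC} = 0$ (no skewness), both cases reduce to the usual fences $[Q_1 - 1.5\,{\rm IQR},\; Q_3 + 1.5\,{\rm IQR}]$; for right-skewed data (${\rm MC} > 0$), the upper fence moves outward, so fewer ordinary tail points are flagged.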