How to Report Tiny P-Values, and Why Does R Put a Minimum at 2.22e-16?

p-value, precision, r, reporting

For some tests in R, there is a lower limit of $2.22 \cdot 10^{-16}$ on the reported p-values. I'm not sure why it's this particular number, whether there is a good reason for it or it's just arbitrary. A lot of other stats packages simply bottom out at 0.0001, so this is a much higher level of precision. But I haven't seen many papers reporting $p < 2.22\cdot 10^{-16}$ or $p = 2.22\cdot 10^{-16}$.

Is it common/best practice to report this computed value, or is it more typical to report something else (like p < 0.000000000000001)?
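For example (illustrative code with made-up data, not from any particular analysis), the printed result hits that floor even though the stored p-value is far smaller:

    set.seed(1)                  # arbitrary data, purely to show the printed cutoff
    x <- rnorm(100, mean = 0)
    y <- rnorm(100, mean = 5)
    tt <- t.test(x, y)
    tt                           # printed summary shows "p-value < 2.2e-16"
    tt$p.value                   # the stored value is many orders of magnitude smaller
    format.pval(tt$p.value)      # format.pval() produces the "< ..." form once p drops below .Machine$double.eps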

Best Answer

There's a good reason for it.

The value can be found via noquote(unlist(format(.Machine))):

           double.eps        double.neg.eps           double.xmin 
         2.220446e-16          1.110223e-16         2.225074e-308 
          double.xmax           double.base         double.digits 
        1.797693e+308                     2                    53 
      double.rounding          double.guard     double.ulp.digits 
                    5                     0                   -52 
double.neg.ulp.digits       double.exponent        double.min.exp 
                  -53                    11                 -1022 
       double.max.exp           integer.max           sizeof.long 
                 1024            2147483647                     4 
      sizeof.longlong     sizeof.longdouble        sizeof.pointer 
                    8                    12                     4 

If you look at the help (?".Machine"):

double.eps  

the smallest positive floating-point number x such that 1 + x != 1. It equals 
double.base ^ ulp.digits if either double.base is 2 or double.rounding is 0; 
otherwise, it is (double.base ^ double.ulp.digits) / 2. Normally 2.220446e-16.
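A quick check of that definition at the console:

    eps <- .Machine$double.eps
    eps                 # 2.220446e-16
    1 + eps == 1        # FALSE: eps is still distinguishable when added to 1
    1 + eps / 2 == 1    # TRUE: anything smaller is rounded away next to 1
    2^-52               # the same number, since doubles carry 53 significand bits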

It's essentially a threshold below which you can be quite confident the result is numerically meaningless - in that any smaller number is unlikely to be an accurate calculation of the quantity we were attempting to compute. (Having studied a little numerical analysis, I'd say that, depending on what computations the specific procedure performs, there's a good chance numerical meaninglessness sets in a fair way above that.)
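As a small illustration of how that plays out (my own example, not tied to any particular test): computing an upper-tail probability as 1 minus the CDF throws away everything below double.eps, while asking for the upper tail directly does not.

    1 - pnorm(9)                  # 0: the tiny tail area is lost to rounding in the subtraction
    pnorm(9, lower.tail = FALSE)  # about 1.1e-19, computed without that cancellation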

But statistical meaning will have been lost far earlier. Note that p-values depend on assumptions, and the further out into the extreme tail you go, the more heavily the true p-value (as opposed to the nominal value we calculate) is affected by failures of those assumptions - in some cases even when they're only a little bit wrong. Since the assumptions are simply not all going to be exactly satisfied, middling p-values may be reasonably accurate (in terms of relative accuracy, perhaps only out by a modest fraction), but extremely tiny p-values may be out by many orders of magnitude.

Which is to say that usual practice (something like the "<0.0001" that you say is common in packages, or the APA rule that Jaap mentions in his answer) is probably not so far from sensible practice, but the approximate point at which things lose meaning beyond saying 'it's very, very small' will of course vary quite a lot depending on circumstances.

This is one reason why I can't suggest a general rule: no single rule could be even remotely suitable for everyone in all circumstances. Change the circumstances a little and the broad grey line marking the shift from somewhat meaningful to relatively meaningless will move, sometimes by a long way.

If you were to specify sufficient information about the exact circumstances (e.g. it's a regression, with this much nonlinearity, that amount of variation in this independent variable, this kind and amount of dependence in the error term, that kind and amount of heteroskedasticity, this shape of error distribution), I could simulate 'true' p-values for you to compare with the nominal p-values, so you could see when they became too different for the nominal value to carry any meaning.
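A stripped-down sketch of that kind of simulation (the particular setup here - a one-sample t-test where the null is true but the errors are mildly skewed rather than normal - is just an illustrative choice of mine):

    set.seed(123)
    n    <- 30
    nsim <- 1e5
    tstat <- replicate(nsim, {
      x <- rexp(n) - 1                  # mean 0, but skewed: a fairly mild violation
      mean(x) / (sd(x) / sqrt(n))       # the one-sample t statistic for mu = 0
    })

    # Compare the nominal tail probability with the simulated one at a
    # fairly extreme cutoff; their ratio shows how far the nominal p-value
    # drifts from the 'true' one once you are well out in the tail.
    tc <- 3.5
    c(nominal   = pt(-tc, df = n - 1),
      simulated = mean(tstat < -tc))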

But that leads us to the second reason why - even if you specified enough information for me to simulate the true p-values - I still couldn't responsibly state a cut-off even for those circumstances.

What you report depends on people's preferences - yours, and your audience's. Imagine you told me enough about the circumstances for me to decide that I wanted to draw the line at a nominal $p$ of $10^{-6}$.

All well and good, we might think - except your own preference function (what looks right to you, were you to compare the nominal p-values given by stats packages with the ones resulting from simulation when you suppose a particular set of failures of assumptions) might put it at $10^{-5}$, and the editors of the journal you want to submit to might have a blanket rule cutting off at $10^{-4}$, while the next journal might put it at $10^{-3}$, and the one after that may have no general rule and the specific editor you got might accept even smaller values than I would ... but one of the referees may then have their own specific cut-off!

In the absence of knowledge of their preference functions and rules, and of your own utilities, how can I responsibly suggest any general choice of what actions to take?

I can at least tell you the sorts of things that I do (and I don't suggest this is necessarily a good choice for you):

There are few circumstances (outside of simulating p-values) in which I would make much of a p-value less than $10^{-6}$. I may or may not mention the value reported by the package, but I wouldn't make anything of it other than that it was very small, and I would usually emphasize the meaninglessness of the exact number. Sometimes I take a value somewhere in the region of $10^{-5}$ to $10^{-4}$ and say that p was much less than that. On occasion I actually do as suggested above - perform some simulations to see how sensitive the p-value in the far tail is to various violations of the assumptions, particularly if there's a specific kind of violation I am worried about.

That's certainly helpful in informing a choice - but I am as likely to discuss the results of the simulation as to use them to choose a cut-off value, giving others a chance to choose their own.

An alternative to simulation is to look at some procedures that are more robust* to the various potential failures of assumptions and see how much difference that might make to the p-value (a sketch follows the footnote below). Their p-values will also not be particularly meaningful, but they do at least give some sense of how much impact there might be. If some are very different from the nominal one, that also gives more of an idea of which violations of assumptions to investigate. Even if you don't report any of those alternatives, it gives a better picture of how meaningful your small p-value is.

* Note that here we don't really need procedures that are robust to gross violations of some assumption; ones that are less affected by relatively mild deviations from the relevant assumption should be fine for this exercise.
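As a sketch of that idea (the data here are made up purely for illustration), one might compare a normal-theory test with a rank-based alternative on the same data and see how far apart the two small p-values land:

    set.seed(7)
    x <- rlnorm(60)                 # skewed samples, so the normality assumption is under some strain
    y <- rlnorm(60) * 2
    t.test(x, y)$p.value            # nominal p-value from the normal-theory test
    wilcox.test(x, y)$p.value       # p-value from a rank-based, more robust procedure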

I will say that when/if you do come to do such simulations, even with quite mild violations, in some cases it can be surprising how far off even not-that-small p-values can be. That has done more to change the way I personally interpret a p-value than it has shifted the specific cut-offs I might use.

When submitting the results of an actual hypothesis test to a journal, I try to find out if they have any rule. If they don't, I tend to please myself, and then wait for the referees to complain.