Is it wrong to refer to results as “nearly” or “somewhat” significant?

hypothesis testing, p-value, statistical significance, terminology

The general consensus on a similar question, "Is it wrong to refer to results as being 'highly significant'?", is that "highly significant" is a valid, though non-specific, way to describe the strength of an association whose p-value falls far below your pre-set significance threshold. But what about p-values that are slightly above your threshold? I have seen some papers use terms like "somewhat significant", "nearly significant", "approaching significance", and so on. I find these terms a little wishy-washy, and in some cases a borderline disingenuous way to pull a meaningful result out of a study with negative results. Are these terms acceptable to describe results that "just miss" your p-value cutoff?

Best Answer

If you want to allow "significance" to admit of degrees then fair enough ("somewhat significant", "fairly significant"), but avoid phrases that suggest you're still wedded to the idea of a threshold, such as "nearly significant", "approaching significance", or "at the cusp of significance" (my favourite from "Still Not Significant" on the blog Probable Error), if you don't want to appear desperate.

Related Solutions

ANOVA F-Test vs Multiple T-Tests – How Much Smaller Can P-Values Be?

Assuming equal $n$s [but see note 2 below] for each treatment in a one-way layout, and that the pooled SD from all the groups is used in the $t$ tests (as is done in usual post hoc comparisons), the maximum possible $p$ value for the most significant pairwise $t$ test is $2\Phi(-\sqrt{2}) \approx .1573$ (here, $\Phi$ denotes the $N(0,1)$ cdf). Thus, the smallest $p_t$ can never be as high as $0.5$. Interestingly (and rather bizarrely), the $.1573$ bound holds not just for $p_F=.05$, but for any significance level we require for $F$.
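As a quick numerical check of the bound quoted above, here is a minimal sketch using only Python's standard library (the variable names are mine). It relies on the identity $2\Phi(-z) = \operatorname{erfc}(z/\sqrt{2})$:

```python
import math

# Two-sided standard-normal p-value for |z| = sqrt(2).
# 2 * Phi(-z) equals erfc(z / sqrt(2)), so here it reduces to erfc(1).
z = math.sqrt(2)
p_bound = math.erfc(z / math.sqrt(2))

print(f"{p_bound:.4f}")  # 0.1573
```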

The justification is as follows: For a given range of sample means, $\max_{i,j}|\bar y_i - \bar y_j| = 2a$, the largest possible $F$ statistic is achieved when half the $\bar y_i$ are at one extreme and the other half are at the other. This represents the case where $F$ looks the most significant given that two means differ by at most $2a$.
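The claim about the extreme configuration can be checked numerically. The sketch below is my own illustration, not part of the original argument: it holds $MS_E$ and $n$ fixed, so that $F$ is proportional to the sum of squared deviations of the group means, and compares the half-and-half arrangement with random arrangements that have the same range $2a$. The helper name `between_ss` and the choice $k=4$, $a=1$ are assumptions made for the example.

```python
import random

def between_ss(means):
    """Sum of squared deviations of the group means from their average."""
    grand = sum(means) / len(means)
    return sum((m - grand) ** 2 for m in means)

a, k = 1.0, 4
split = [-a, -a, a, a]        # half the means at each extreme
best = between_ss(split)      # equals k * a**2 for this arrangement

# Random alternatives with the same range 2a: two means are pinned at -a
# and +a, and the remaining k - 2 means fall anywhere in between.
rng = random.Random(0)
others = [
    [-a, a] + [rng.uniform(-a, a) for _ in range(k - 2)]
    for _ in range(10_000)
]
largest_other = max(between_ss(m) for m in others)

print(best, largest_other <= best)
```

A random search is not a proof, but it is consistent with the convexity argument: the sum of squares is maximized at a vertex of the cube $[-a,a]^k$, and among vertices the balanced split gives the largest value.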

So, without loss of generality, suppose that $\bar y_.=0$ so that $\bar y_i=\pm a$ in this boundary case. And again, without loss of generality, suppose that $MS_E=1$, as we can always rescale the data to this value. Now, with $k$ means (where $k$ is even for simplicity [but see note 1 below]), we have $F=\frac{\sum_i n\bar y_i^2/(k-1)}{MS_E}= \frac{kna^2}{k-1}$. Setting $p_F=\alpha$ so that $F=F_\alpha=F_{\alpha,k-1,k(n-1)}$, we obtain $a =\sqrt{\frac{(k-1)F_\alpha}{kn}}$. When all the $\bar y_i$ are $\pm a$ (and still $MS_E=1$), each nonzero $t$ statistic is thus $t=\frac{2a}{1\cdot\sqrt{2/n}} = \sqrt{\frac{2(k-1)F_\alpha}{k}}$. This is the smallest maximum $t$ value possible when $F=F_\alpha$.
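To see these formulas at work on actual numbers, here is a small sketch (my own construction, not the author's) that builds an artificial data set in the boundary case and computes $F$ and the largest pooled-SD $t$ from scratch. The function names, and the trick of placing observations at $\pm d$ around each group mean so that $MS_E=1$, are assumptions made for illustration; the construction needs both $k$ and $n$ to be even.

```python
import math

def boundary_case(k, n, a):
    """k groups of n points (k and n even): group means alternate +a / -a, MS_E = 1."""
    d = math.sqrt((n - 1) / n)   # within-group offset that makes the pooled variance 1
    return [
        [(a if i % 2 == 0 else -a) + (d if j % 2 == 0 else -d) for j in range(n)]
        for i in range(k)
    ]

def f_and_max_t(groups):
    """One-way ANOVA F statistic and the largest pooled-SD pairwise t statistic."""
    k, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_error = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g) / (k * (n - 1))
    f_stat = ms_between / ms_error
    max_t = (max(means) - min(means)) / math.sqrt(ms_error * 2 / n)
    return f_stat, max_t

k, n, a = 4, 6, 0.5
f_stat, max_t = f_and_max_t(boundary_case(k, n, a))
print(f_stat, k * n * a**2 / (k - 1))              # F from the data vs. k n a^2 / (k - 1)
print(max_t, math.sqrt(2 * (k - 1) * f_stat / k))  # t from the data vs. sqrt(2 (k-1) F / k)
```

Each printed pair should agree up to floating-point rounding, which confirms that in this configuration $t^2 = 2(k-1)F/k$ whatever the values of $n$ and $a$.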

So you can just try different cases of $k$ and $n$, and compute $t$ and its associated $p_t$. But notice that for given $k$, $F_\alpha$ is decreasing in $n$ [but see note 3 below]; moreover, as $n\rightarrow\infty$, $(k-1)F_{\alpha,k-1,k(n-1)} \rightarrow \chi^2_{\alpha,k-1}$; so $t \ge t_{min} =\sqrt{2\chi^2_{\alpha,k-1}/k}$. Note that a $\chi^2_{k-1}$ variable divided by $k$, that is $\chi^2_{k-1}/k=\frac{k-1}k\cdot \chi^2_{k-1}/(k-1)$, has mean $\frac{k-1}k$ and SD $\frac{k-1}k\cdot\sqrt{\frac2{k-1}}$. So $\lim_{k\rightarrow\infty}t_{min} = \sqrt{2}$, regardless of $\alpha$, and the result I stated in the first paragraph above is obtained from asymptotic normality.
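To make the last step explicit, here is a sketch based on the usual normal approximation to the chi-square quantile. This is my addition: $z_\alpha$ denotes the upper-$\alpha$ point of $N(0,1)$, a symbol not used above.

$$\chi^2_{\alpha,k-1} \approx (k-1) + z_\alpha\sqrt{2(k-1)} \quad\Longrightarrow\quad t_{min}^2 = \frac{2\chi^2_{\alpha,k-1}}{k} \approx \frac{2(k-1)}{k} + \frac{2z_\alpha\sqrt{2(k-1)}}{k} \;\longrightarrow\; 2 \quad (k\rightarrow\infty).$$

The only term that depends on $\alpha$ shrinks like $k^{-1/2}$, so the limit $\sqrt{2}$ is the same for every $\alpha$, but the approach to it is slow.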

It takes a long time to reach that limit, though. Here are the results (computed using R) for various values of $k$, using $\alpha=.05$:

k       t_min    max p_t   [ Really I mean min(max|t|) and max(min p_t) ]
2       1.960     .0500
4       1.977     .0481   <--  note < .05 !
10      1.840     .0658
100     1.570     .1164
1000    1.465     .1428
10000   1.431     .1526
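The original computation was done in R and is not shown. As a hedged reconstruction, the sketch below recomputes the large-$n$ values $t_{min}=\sqrt{2\chi^2_{\alpha,k-1}/k}$ and their two-sided normal $p$ values using only Python's standard library. The chi-square quantile is found by bisection on a series for the incomplete gamma function; all function names are mine, and the bisection bracket is an assumption that is adequate for $\alpha=.05$ but not for extremely small $\alpha$. With SciPy available, `scipy.stats.chi2.isf` could replace the two helper functions.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function:
    # P(a, z) = z**a * exp(-z) / Gamma(a) * sum_{n>=0} z**n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-17 * total:
        n += 1
        term *= z / (a + n)
        total += term
    log_prefix = a * math.log(z) - z - math.lgamma(a)
    return 1.0 - total * math.exp(log_prefix)

def chi2_upper_quantile(alpha, df):
    """Value q with P(X > q) = alpha, found by bisection on chi2_sf."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 10.0  # assumed bracket
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def t_min_and_p(k, alpha=0.05):
    """Large-n limit of the smallest maximum t, and its two-sided normal p-value."""
    t_min = math.sqrt(2.0 * chi2_upper_quantile(alpha, k - 1) / k)
    return t_min, math.erfc(t_min / math.sqrt(2.0))

for k in (2, 4, 10, 100, 1000, 10000):
    t_min, p = t_min_and_p(k)
    print(f"{k:>6}  {t_min:.3f}  {p:.4f}")
```

The printed rows should agree with the table above up to rounding.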

A few loose ends...

1. When $k$ is odd: The maximum $F$ statistic still occurs when the $\bar y_i$ are all $\pm a$; however, we will have one more at one end of the range than the other, making the mean $\pm a/k$, and you can show that the factor $k$ in the $F$ statistic is replaced by $k-\frac 1k$. The same replacement applies to the $k$ in the denominator of $t$, making $t$ slightly larger and hence decreasing $p_t$.
2. Unequal $n$s: The maximum $F$ is still achieved with the $\bar y_i = \pm a$, with the signs arranged to balance the sample sizes as nearly equally as possible. Then the $F$ statistic for the same total sample size $N = \sum n_i$ will be the same as or smaller than it is for balanced data. Moreover, the maximum $t$ statistic will be larger because it will be the one with the largest $n_i$. So we can't obtain larger $p_t$ values by looking at unbalanced cases.
3. A slight correction: I was so focused on trying to find the minimum $t$ that I overlooked the fact that we are trying to maximize $p_t$, and it is less obvious that a larger $t$ with fewer df won't be less significant than a smaller one with more df. However, I verified that this is the case by computing the values for $n=2,3,4,\ldots$ until the df are high enough to make little difference. For the case $\alpha=.05, k\ge 3$ I did not see any cases where the $p_t$ values did not increase with $n$. Note that $df=k(n-1)$, so the possible df are $k,2k,3k,\ldots$, which get large fast when $k$ is large. So I'm still on safe ground with the claim above. I also tested $\alpha=.25$, and the only case I observed where the $.1573$ threshold was exceeded was $k=3,n=2$.

Statistical Significance – Is It Wrong to Refer to Results as Being ‘Highly Significant’?

I think there is not much wrong in saying that the results are "highly significant" (even though yes, it is a bit sloppy).

It means that if you had set a much smaller significance level $\alpha$, you would still have judged the results as significant. Or, equivalently, if some of your readers have a much smaller $\alpha$ in mind, then they can still judge your results as significant.

Note that the significance level $\alpha$ is in the eye of the beholder, whereas the $p$-value is (with some caveats) a property of the data.

Observing $p=10^{-10}$ is just not the same as observing $p=0.04$, even though both might be called "significant" by standard conventions of your field ($\alpha=0.05$). A tiny $p$-value means stronger evidence against the null (for those who like Fisher's framework of hypothesis testing); it means that the confidence interval around the effect size will exclude the null value with a larger margin (for those who prefer CIs to $p$-values); it means that the posterior probability of the null will be smaller (for Bayesians with some prior). These statements are all equivalent and simply mean that the findings are more convincing. See "Are smaller p-values more convincing?" for more discussion.
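To put rough numbers on the confidence-interval reading, here is a minimal sketch of my own. It assumes a normally distributed estimate with a known standard error, so that a two-sided $p$-value corresponds to a $|z|$ score and the 95% CI is the estimate $\pm 1.96$ SE. The helper name `z_from_two_sided_p` is hypothetical.

```python
from statistics import NormalDist

def z_from_two_sided_p(p):
    """|z| whose two-sided standard-normal p-value equals p."""
    return -NormalDist().inv_cdf(p / 2)

z_crit = z_from_two_sided_p(0.05)   # half-width of a 95% CI, in standard errors

for p in (0.04, 1e-10):
    z = z_from_two_sided_p(p)
    # The estimate sits z standard errors from the null value, so the
    # 95% CI clears the null by (z - z_crit) standard errors.
    print(f"p = {p:g}: |z| = {z:.2f}, CI clears the null by {z - z_crit:.2f} SE")
```

Under these assumptions, $p=0.04$ leaves the interval clear of the null by only about a tenth of a standard error, while $p=10^{-10}$ leaves a margin of roughly four and a half standard errors.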

The term "highly significant" is not precise and does not need to be. It is a subjective expert judgment, similar to observing a surprisingly large effect size and calling it "huge" (or perhaps simply "very large"). There is nothing wrong with using qualitative, subjective descriptions of your data, even in scientific writing, provided, of course, that the objective quantitative analysis is presented as well.

See also some excellent comments above, +1 to @whuber, @Glen_b, and @COOLSerdash.
