Are smaller $p$-values "more convincing"? Yes, of course they are.
In the Fisher framework, the $p$-value quantifies the amount of evidence against the null hypothesis. That evidence can be more or less convincing; the smaller the $p$-value, the more convincing it is. Note that in any given experiment with fixed sample size $n$, the $p$-value is monotonically related to the effect size, as @Scortchi nicely points out in his answer (+1). So smaller $p$-values correspond to larger effect sizes; of course they are more convincing!
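To make that monotone relationship concrete, here is a small sketch of my own (not from the original answer), using a one-sided binomial test of a hypothetical $H_0: p = .5$ with $n = 100$ held fixed:

n = 100
x = c(55, 60, 65, 70)   # increasingly large observed effects against H0: p = .5
sapply(x, function(k) binom.test(k, n, p = .5, alternative = "greater")$p.value)
# the p-values shrink as the observed proportion moves further above .5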
In the Neyman-Pearson framework, the goal is to obtain a binary decision: either the evidence is "significant" or it is not. By choosing the threshold $\alpha$, we guarantee that the long-run rate of false positives will not exceed $\alpha$. Note that different people can have different $\alpha$ in mind when looking at the same data; when I read a paper from a field I am skeptical about, I might not personally consider results with e.g. $p=0.03$ "significant", even though the authors do call them significant. My personal $\alpha$ might be set to $0.001$ or so. Obviously, the lower the reported $p$-value, the more skeptical readers it will be able to convince! Hence, again, lower $p$-values are more convincing.
The currently standard practice is to combine the Fisher and Neyman-Pearson approaches: if $p<\alpha$, the results are called "significant" and the $p$-value is [exactly or approximately] reported and used as a measure of convincingness (by marking it with stars, using expressions such as "highly significant", etc.); if $p>\alpha$, the results are called "not significant" and that's it.
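As an aside of mine (not part of the original answer), R's regression summaries embody exactly this hybrid: they print the $p$-value and also translate it into significance stars, e.g. via symnum:

p = c(.0004, .004, .03, .07, .4)
symnum(p, corr = FALSE, na = FALSE,
       cutpoints = c(0, .001, .01, .05, .1, 1),
       symbols = c("***", "**", "*", ".", " "))
# gives "***", "**", "*", ".", " " -- the familiar Signif. codes from summary.lm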
This is usually referred to as a "hybrid approach", and indeed it is hybrid. Some people argue that this hybrid is incoherent; I tend to disagree. Why would it be invalid to do two valid things at the same time?
Probability computations. In your first computation, your normal approximation is not quite right, for two reasons. First, a normal approximation shouldn't be expected to give more than about two digits of accuracy here. Second, you did not use a continuity correction.
Binomial: The P-value can be based on $X \sim \mathsf{Binom}(n=100, p=.8),$ so that
$P(X \le 73) = 0.0558,$ as computed in R.
pbinom(73, 100, .8)
[1] 0.05583272
It seems natural to me to frame this as a one-sided test of $H_0: p = .8$ against
$H_a: p < .8$ because the CEO would hardly be disappointed at a higher satisfaction
rate. In that formulation, the P-value is 0.0558. If you want to test against
the two-sided alternative $H_a: p \ne .8,$ then the P-value is 0.1117 (double the
P-value for the one-sided test). In either case, the test is not significant at the
5% level.
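The doubling can be checked directly in R (my own one-liner, reusing the pbinom call from above):

2 * pbinom(73, 100, .8)   # two-sided P-value: twice the one-sided tail
[1] 0.1116654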
Normal approximation: Picturing the normal density overlaid on the binomial probability histogram shows why a normal approximation should use a continuity correction, which approximates the one-sided P-value as:
$$P(X \le 73) = P(X \le 73.5) =
P\left(\frac{X - 80}{\sqrt{.8(.2)(100)}} \le \frac{73.5 - 80}{4}\right)\\
\approx P(Z \le -1.625) = 0.0521,$$
where $Z$ is standard normal.
pnorm(-1.625)
[1] 0.05208128
Then the P-value of a two-sided test is .1042. These values are about as good as you
can get with a normal approximation. (If you use printed tables rather than software, you may lose a little more accuracy due to rounding.)
Without the continuity correction, the area under the normal curve between 73.0 and 73.5 is ignored; with the correction, the entire probability $P(X = 73)$ is approximated by the area under the curve between 72.5 and 73.5.
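To see the effect of the correction numerically (a quick check of my own), compare the approximation with and without the half-unit shift against the exact binomial value:

pnorm((73.5 - 80)/4)   # with continuity correction: about 0.0521
pnorm((73 - 80)/4)     # without the correction: about 0.0401, noticeably too small
pbinom(73, 100, .8)    # exact binomial tail probability: about 0.0558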
Monte Carlo approximations. Definitions of 'bootstrapping' seem to differ from one field of application to another. Your Python code very nearly approximates the correct P-value, essentially by sampling repeatedly from the appropriate hypergeometric (approximately binomial) distribution, in a procedure I would call a Monte Carlo approximation rather than a bootstrap. But by any name, the procedure seems to be correct.
Notes: (1) In effect, your code is not much different from the sampling done in the following brief R program. [Your code and mine both have the slight advantage of
sampling 100 customers without replacement from among a million. But for such
a small sample from such a large population, there is hardly any difference
between sampling with replacement (binomial) and without (hypergeometric).]
set.seed(1234)
x = rhyper(5000, 800000, 200000, 100)
mean(x <= 73)
[1] 0.0554
2*sd(x <= 73)/sqrt(5000)
[1] 0.006246138 # approx. 95% margin of simulation error is about 0.006
With more iterations, as suggested by @MartijnWeterings:
set.seed(4321)
m = 10^6
x = rhyper(m, 800000, 200000, 100)
mean(x <= 73)
[1] 0.05603
2*sd(x <= 73)/sqrt(m)
[1] 0.0004599595
(2) Using the hypergeometric distribution, R finds the one-sided P-value to be 0.05582
(compared with the binomial 0.05583 from the beginning of this Answer).
phyper(73, 800000, 200000, 100)
[1] 0.05582364
Best Answer
You should perform all 100 analyses as a single mixed effects model, with your coefficients of interest treated as random effects. That way you can estimate a distribution for those coefficients, including their overall mean, which will give you the sort of interpretation I think you are looking for.
Note that if, as I suspect is the case, you have a time series for each individual, you will also need to account for autocorrelation in the residuals.
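For concreteness, here is a minimal sketch of one way such a model might be fit in R with the nlme package; the data frame dat and the variable names y, x, id, and time are hypothetical placeholders, not taken from the question:

library(nlme)

# dat: one row per observation, with columns
#   y (response), x (predictor), id (individual, 100 levels), time (occasion within individual)
fit = lme(y ~ x,                                     # fixed effect: overall mean coefficient
          random = ~ 1 + x | id,                     # random intercept and slope for each individual
          correlation = corAR1(form = ~ time | id),  # AR(1) correction for within-individual residuals
          data = dat)

summary(fit)    # the fixed-effect estimate for x is the mean of the individual coefficients
intervals(fit)  # the random-slope standard deviation describes how the coefficients vary across individuals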