Some comments first:
- You need a testable hypothesis before you can get p-values. That means describing what the data would look like with and without the effect of interest. You can do parameter estimation without hypothesis testing, but then you need to specify what parameter is of interest, and there are no p-values.
- While your values are counts, they represent the number of reads falling into each category with a fixed total per experimental run. Or, more precisely, the total is random, but you want to condition on it (you keep saying things like "out of a sample of b events"). If so, your counts do not have a negative binomial distribution, but rather a binomial distribution.
The following assumes such conditioning on the total number of reads per experimental run. The right analysis approach still depends on your goal: do you want to estimate the proportion of exon-skipping reads for each junction (with a confidence interval), or do you want to estimate the entire distribution of the number of exon-skipping reads given the total number of reads? You seem to be asking for the latter, but I am not sure that is what you really need, as it is not clear what you are planning to do with the results of the analysis. (I return to the former option below.)
You have very little information to inform the model, so you have to make major assumptions. Based on a preliminary analysis using quasi-binomial regression, it appears that there might be overdispersion: the probability of an exon-skipping read varies somewhat between the replicates (a quick check is sketched after the model output below). If that is true, plain binomial regression cannot be used, but you could consider beta-binomial regression. Here I will show the binomial regression.
First, you have to set up the data so that counts from the same replicate are in the same row.
d1 <- data.frame(Rep=1:3, Skip=c(8,0,0), Normal=c(12,6,8))
And then we can use a binomial glm:
m1 <- glm(cbind(Skip,Normal) ~ 1, data=d1, family=binomial)
summary(m1)
Call:
glm(formula = cbind(Skip, Normal) ~ 1, family = binomial, data = d1)

Deviance Residuals: 
     1       2       3  
 1.634  -1.794  -2.072  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -1.1787     0.4043  -2.915  0.00355 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10.18  on 2  degrees of freedom
Residual deviance: 10.18  on 2  degrees of freedom
AIC: 15.613

Number of Fisher Scoring iterations: 4
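As an aside, the overdispersion check mentioned above is easy to run yourself. A minimal sketch, refitting the same model with a quasi-binomial family (the object name mq is mine):

mq <- glm(cbind(Skip,Normal) ~ 1, data=d1, family=quasibinomial)
summary(mq)$dispersion  # an estimate well above 1 suggests overdispersion

If the dispersion estimate is clearly above 1, the beta-binomial alternative mentioned above becomes the safer choice.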
Note that the p-value shown is probably meaningless for your purposes. It tests whether the intercept is 0, and that corresponds to testing whether the probability of exon-skipping is 0.5.
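To see why, note that the intercept is on the log-odds scale, and a log-odds of 0 maps to a probability of 0.5:

plogis(0)  # inverse logit: exp(0)/(1 + exp(0)) = 0.5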
The predict function can be used to get the predicted probability of exon-skipping for each replicate, and the dbinom function can be used to get the individual response probabilities:
p1 <- predict(m1, type="response")
p1
        1         2         3 
0.2352941 0.2352941 0.2352941 
newtotal <- 10
dbinom(0:newtotal, size=newtotal, prob=p1[1])
 [1] 6.838240e-02 2.104074e-01 2.913333e-01 2.390427e-01 1.287153e-01 4.752565e-02
 [7] 1.218606e-02 2.142605e-03 2.472236e-04 1.690418e-05 5.201286e-07
So, for example, the probability of having 0 exon-skipping reads out of 10 reads at junction 1 is estimated to be about 0.068.
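If, instead, you only need the first goal mentioned above (the estimated proportion of exon-skipping reads with a confidence interval), it can be read off the same fit. A minimal sketch using a Wald interval, back-transformed to the probability scale (the object name ci is mine):

ci <- confint.default(m1)              # 95% Wald CI on the log-odds scale
plogis(cbind(estimate=coef(m1), ci))   # back-transform to probabilities

With counts this small, the interval will be wide, which reflects how little information three replicates carry.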
Best Answer
Your interpretation is not correct. Cohen's d is a measure of effect size: basically, how many standard deviations the outcome (e.g. executive function) changes for the average treatment recipient.
The tl;dr and extremely oversimplified explanation of a p-value is that it shows how surprised you should be at seeing whatever result you had. If there were truly zero difference between the groups, what is the probability of seeing a result at least as extreme as yours, given your sample size? That's a p-value.
If you have a very small study, you need a large effect size for it to register as significant. Conversely, if you have an extremely large study, you can detect a very small effect size, e.g. a d of 0.05, but that effect size may not be practically significant (a quick numerical illustration follows). And that is why they were using Cohen's d: statistical significance and practical significance can diverge. People doing systematic literature reviews need to consider both, and consumers of any research should consider both.
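To put a rough number on that last point, here is a sketch using the built-in power.t.test function (the specific values, d = 0.05 at 80% power, are my illustrative choices):

# Per-group sample size needed to detect a standardized effect of d = 0.05
# with a two-sample t-test at the 5% level and 80% power
power.t.test(delta=0.05, sd=1, sig.level=0.05, power=0.80)

The required sample size runs into the thousands per group, so with a big enough study even a practically negligible effect becomes statistically significant.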