Solved – the deal with $p$-value when generating Pearson’s $r$ correlation coefficient

correlation

I know how to generate Pearson's $r$ correlation values in Excel and in R.

I understand the meaning of the value as it ranges over $-1 \le r \le 1$.

I also understand hypothesis tests, confidence intervals, and $p$-values. ($p$ = the probability that this outcome is due to random chance or natural variation, given that the null hypothesis is true.)

However, I am having trouble making the connection between Pearson's $r$ and the $p$-value. In a hypothesis test there is an element of chance in the outcomes (like coin flips), but a regression is based on actual data points. So what does the $p$-value mean in this context? The odds that the data points are clustered this way just by random chance? Is this directly based on sample size? I ask because I wonder how a formula would know anything about the variability of discrete data points (such as the selling price and mileage of a car).

So, for an $r$ value, the $p$-value is based on sample size?
What I don't understand is that the $p$-value seems to answer a binary question: is there an effect or not?
But for a correlation coefficient, there isn't a yes/no question being asked.

If I get $r = 0.8$ and a $p$-value of $0.20$, what does that mean?
It means there is a 20% chance that the correlation of 0.8 is not true?
But then what IS true? $r = 0.7$? $r = 0.6$?

Or is the rule of thumb that you can only use the $r$ value, regardless of its size, if $p < 0.05$?

If I get $r = 0.1$ and a $p$-value of $0.0001$, what does that mean?
We are very confident there is a very weak correlation? (How ironic?)

If I get $r = 0.5$ and a $p$-value of $0.0001$, what does that mean?
We are very confident there is a moderate correlation?

If I get $r = 0.5$ and a $p$-value of $0.3$, what does that mean?
It means there is a 30% chance that there is a moderate correlation of 0.5?

Best Answer

[Fixed/improved, based on the feedback from @Momo and @whuber]

I believe that in the context of regression the relationship between the $p$-value and Pearson's correlation coefficient is the following: the $p$-value can be interpreted as the probability of obtaining, from a random sample, a correlation coefficient at least as large (in absolute value) as the one determined from the observed data, provided that the null hypothesis is true. In other words, the $p$-value in this context comes from a hypothesis test in which the hypotheses themselves are about the correlation:

\begin{align}
&H_0: \text{the correlation (of the underlying data-generating process) is zero;}\\
&H_A: \text{the correlation is not zero.}
\end{align}

Then the situation, IMHO, boils down to the traditional hypothesis-testing interpretation. If the $p$-value is small (less than an arbitrarily selected significance level $\alpha$, usually 0.05), you can reject the null hypothesis ("the determined correlation is statistically significant"), and, if the $p$-value is greater than $\alpha$, you fail to reject the null ("the correlation is not statistically significant").
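For concreteness, base R's `cor.test()` carries out exactly this test and reports both $r$ and the corresponding $p$-value; below is a minimal sketch (the simulated mileage/price data, the sample size, and the variable names are purely illustrative):

```r
set.seed(42)

# Illustrative data: mileage and selling price of n cars (made-up relationship)
n       <- 30
mileage <- runif(n, 10000, 150000)
price   <- 30000 - 0.1 * mileage + rnorm(n, sd = 4000)

# Pearson's r together with the test of H0: the true correlation is zero
ct <- cor.test(mileage, price, method = "pearson")
ct$estimate  # sample correlation coefficient r
ct$p.value   # p-value for H0: rho = 0
```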

As for the relationship between the $p$-value and the sample size $N$, the following formulae present it in mathematical form.

The Fisher-transformed test statistic of $r$ (a.k.a. Fisher's $z$) is defined as $T(r) = \operatorname{artanh}(r) = \tfrac{1}{2}\ln\!\frac{1+r}{1-r}$.

For a bivariate normal distribution, the standard error of $z$ depends on the sample size $N$ as follows:

\begin{align} SE(T(r)) \approx \frac{1}{\sqrt{N - 3}} \end{align}

Moreover, under the null hypothesis the standardized test statistic is approximately standard normal, while its standard error vanishes as the sample grows:

\begin{align} \frac{T(r)}{SE(T(r))} \sim N(0,1) \quad \text{and} \quad \lim_{N\to\infty} SE(T(r)) = 0, \end{align}

so the standard error in the denominator gets smaller and smaller as $N$ grows; the same value of $r$ therefore produces a larger test statistic, and hence a smaller $p$-value, in a larger sample.
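To make the dependence on $N$ tangible, here is a small R sketch that reproduces an approximate $p$-value by hand from the formulae above (the function name and example numbers are my own; note that `cor.test()` uses an exact $t$-based test, so its $p$-values will differ slightly from this normal approximation):

```r
# Approximate two-sided p-value from r and N via the Fisher transformation
p_from_r <- function(r, N) {
  z  <- atanh(r)            # T(r) = artanh(r)
  se <- 1 / sqrt(N - 3)     # SE(T(r)) is approx. 1 / sqrt(N - 3)
  2 * pnorm(-abs(z / se))   # two-sided p-value from the normal approximation
}

p_from_r(0.5, 10)     # moderate r, small sample -> large p-value
p_from_r(0.5, 200)    # same r, larger sample    -> tiny p-value
p_from_r(0.1, 5000)   # weak r, huge sample      -> "significant" despite weak correlation
```

The same $r = 0.5$ goes from "not significant" to "highly significant" purely because $N$ grows, and a weak $r = 0.1$ can still yield a tiny $p$-value in a large enough sample, which is exactly the $r = 0.1$, $p = 0.0001$ scenario raised in the question.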

P.S. You may also find the following two answers relevant and useful: this and this.
