Statistical Assumptions – Difference Between Assumptions Underlying Correlation and Regression Slope Tests of Significance

Tags: assumptions, correlation, p-value, regression

My question grew out of a discussion with @whuber in the comments of a different question.

Specifically, @whuber 's comment was as follows:

One reason it might surprise you is that the assumptions underlying a correlation test and a regression slope test are different–so even when we understand that the correlation and slope are really measuring the same thing, why should their p-values be the same? That shows how these issues go deeper than simply whether $r$ and $\beta$ should be numerically equal.

This got me thinking about it, and I came across a variety of interesting answers. For example, I found the question "Assumptions of correlation coefficient", but I can't see how it clarifies the comment above.

I also found interesting answers about the relationship between Pearson's $r$ and the slope $\beta$ in a simple linear regression (see here and here, for example), but none of them seem to address what @whuber was referring to in his comment (at least it is not apparent to me).

Question 1: What are the assumptions underlying a correlation test and a regression slope test?

For my second question, consider the following outputs in R:

model <- lm(Employed ~ Population, data = longley)
summary(model)

Call:
lm(formula = Employed ~ Population, data = longley)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4362 -0.9740  0.2021  0.5531  1.9048 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.3807     4.4224   1.895   0.0789 .  
Population    0.4849     0.0376  12.896 3.69e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.013 on 14 degrees of freedom
Multiple R-squared:  0.9224,    Adjusted R-squared:  0.9168 
F-statistic: 166.3 on 1 and 14 DF,  p-value: 3.693e-09

And the output of the cor.test() function:

with(longley, cor.test(Population, Employed))

    Pearson's product-moment correlation

data:  Population and Employed
t = 12.8956, df = 14, p-value = 3.693e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8869236 0.9864676
sample estimates:
      cor 
0.9603906 

As can be seen from the lm() and cor.test() output, the Pearson correlation coefficient $r$ and the slope estimate $\beta_1$ are quite different, 0.96 vs. 0.485 respectively, but the t-values and p-values are the same.

Then I also tried to calculate the t-value myself, for both $r$ and $\beta_1$, to see why the t-values are the same even though $r$ and $\beta_1$ differ. And that's where I get stuck, at least for $r$:

Calculate the slope ($\beta_1$) in a simple linear regression using the sums of squares of $x$ and $y$ and the sum of cross-products:

x <- longley$Population; y <- longley$Employed
xbar <- mean(x); ybar <- mean(y)
ss.x <- sum((x-xbar)^2)
ss.y <- sum((y-ybar)^2)
ss.xy <- sum((x-xbar)*(y-ybar))

Calculate the least-squares estimate of the regression slope, $\beta_{1}$ (there is a proof of this in Crawley's R Book 1st edition, page 393):

b1 <- ss.xy/ss.x                        
b1
# [1] 0.4848781
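
As a side note (a sketch, assuming the objects defined above are still in the workspace): Pearson's $r$ can be computed from the same building blocks as b1, which makes the relationship between the two explicit, $r = ss_{xy}/\sqrt{ss_x\,ss_y} = \beta_1\sqrt{ss_x/ss_y}$:

r <- ss.xy/sqrt(ss.x*ss.y)    # Pearson's r from the same sums of squares
r
# [1] 0.9603906
b1*sqrt(ss.x/ss.y)            # b1 rescaled by the spread ratio gives r back
# [1] 0.9603906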

Calculate the standard error for $\beta_1$:

ss.residual <- sum((y-model$fitted)^2)
n <- length(x) # SAMPLE SIZE
k <- length(model$coef) # NUMBER OF MODEL PARAMETER (i.e. b0 and b1)
df.residual <- n-k
ms.residual <- ss.residual/df.residual # RESIDUAL MEAN SQUARE
se.b1 <- sqrt(ms.residual/ss.x)
se.b1
# [1] 0.03760029

And the t-value and p-value for $\beta_1$:

t.b1 <- b1/se.b1
p.b1 <- 2*pt(-abs(t.b1), df=n-2)
t.b1
# [1] 12.89559
p.b1
# [1] 3.693245e-09

What I don't know at this point, and this is Question 2, is how to calculate the same t-value using $r$ instead of $\beta_1$ (perhaps in baby steps).

Since cor.test()'s alternative hypothesis is that the true correlation is not equal to 0 (see the cor.test() output above), I would expect something like the Pearson correlation coefficient $r$ divided by the "standard error of the Pearson correlation coefficient" (similar to b1/se.b1 above). But what would that standard error be, and why?

Maybe this has something to do with the aforementioned assumptions underlying a correlation test and a regression slope test?!

EDIT (27-Jul-2017):
While @whuber provided a very detailed explanation for Question 1 (and partly for Question 2, see the comments under his answer), I did some further digging and found that these two posts (here and here) do show a specific standard error for $r$, which answers Question 2 well, that is, it reproduces the t-value given $r$:

r <- 0.9603906
# n <- 16
r.se <- sqrt((1-r^2)/(n-2))
r/r.se
# [1] 12.8956
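
For completeness, here is a sketch of why the two routes must agree, using only the quantities defined above together with the standard identities $r^2 = ss_{xy}^2/(ss_x\,ss_y)$ and $ss_{residual} = (1-r^2)\,ss_y$:

$$\frac{\hat\beta_1}{\operatorname{se}(\hat\beta_1)} = \frac{ss_{xy}/ss_x}{\sqrt{\dfrac{(1-r^2)\,ss_y/(n-2)}{ss_x}}} = \frac{ss_{xy}}{\sqrt{ss_x\,ss_y}}\sqrt{\frac{n-2}{1-r^2}} = \frac{r}{\sqrt{(1-r^2)/(n-2)}},$$

which is exactly $r$ divided by the standard error $\sqrt{(1-r^2)/(n-2)}$ used in the code above.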

Best Answer

Introduction

This reply addresses the underlying motivation for this set of questions:

What are the assumptions underlying a correlation test and a regression slope test?

In light of the background provided in the question, though, I would like to suggest expanding this question a little: let us explore the different purposes and conceptions of correlation and regression.

Correlation typically is invoked in situations where

  • Data are bivariate: exactly two distinct values of interest are associated with each "subject" or "observation".

  • The data are observational: neither of the values was set by the experimenter. Both were observed or measured.

  • Interest lies in identifying, quantifying, and testing some kind of relationship between the variables.

Regression is used where

  • Data are bivariate or multivariate: there may be more than two distinct values of interest.

  • Interest focuses on understanding what can be said about a subset of the variables--the "dependent" variables or "responses"--based on what might be known about the other subset--the "independent" variables or "regressors."

  • Specific values of the regressors may have been set by the experimenter.

These differing aims and situations lead to distinct approaches. Because this thread is concerned about their similarities, let's focus on the case where they are most similar: bivariate data. In either case those data will typically be modeled as realizations of a random variable $(X,Y)$. Very generally, both forms of analysis seek relatively simple characterizations of this variable.

Correlation

I believe "correlation analysis" has never been generally defined. Should it be limited to computing correlation coefficients, or could it be considered more extensively as comprising PCA, cluster analysis, and other forms of analysis that relate two variables? Whether your point of view is narrowly circumscribed or broad, perhaps you would agree that the following description applies:

Correlation is an analysis that makes assumptions about the distribution of $(X,Y)$, without privileging either variable, and uses the data to draw more specific conclusions about that distribution.

For instance, you might begin by assuming $(X,Y)$ has a bivariate Normal distribution and use the Pearson correlation coefficient of the data to estimate one of the parameters of that distribution. This is one of the narrowest (and oldest) conceptions of correlation.

As another example, you might begin by assuming $(X,Y)$ could have any distribution and use a cluster analysis to identify $k$ "centers." One might construe that as the beginnings of a resolution of the distribution of $(X,Y)$ into a mixture of unimodal bivariate distributions, one for each cluster.

One thing common to all these approaches is a symmetric treatment of $X$ and $Y$: neither is privileged over the other. Both play equivalent roles.
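
A small illustration of this symmetry, and of the asymmetry of regression, using the longley data from the question (a sketch assuming that data set is available; it ships with base R):

with(longley, cor(Population, Employed))    # 0.9603906
with(longley, cor(Employed, Population))    # identical: correlation is symmetric
coef(lm(Employed ~ Population, data = longley))["Population"]   # about 0.485
coef(lm(Population ~ Employed, data = longley))["Employed"]     # a different slope; the two slopes multiply to r^2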

Regression

Regression enjoys a clear, universally understood definition:

Regression characterizes the conditional distribution of $Y$ (the response) given $X$ (the regressor).

Historically, regression traces its roots to Galton's discovery (c. 1885) that bivariate Normal data $(X,Y)$ enjoy a linear regression: the conditional expectation of $Y$ is a linear function of $X$. At one pole of the special-general spectrum is Ordinary Least Squares (OLS) regression where the conditional distribution of $Y$ is assumed to be Normal$(\beta_0+\beta_1 X, \sigma^2)$ for fixed parameters $\beta_0, \beta_1,$ and $\sigma$ to be estimated from the data.

At the extremely general end of this spectrum are generalized linear models, generalized additive models, and others of their ilk that relax all aspects of OLS: the expectation, variance, and even the shape of the conditional distribution of $Y$ may be allowed to vary nonlinearly with $X$. The concept that survives all this generalization is that interest remains focused on understanding how $Y$ depends on $X$. That fundamental asymmetry is still there.

Correlation and Regression

One very special situation is common to both approaches and is frequently encountered: the bivariate Normal model. In this model, a scatterplot of data will assume a classic "football," oval, or cigar shape: the data are spread elliptically around an orthogonal pair of axes.

  • A correlation analysis focuses on the "strength" of this relationship, in the sense that a relatively small spread around the major axis is "strong."

  • As remarked above, the regression of $Y$ on $X$ (and, equally, the regression of $X$ on $Y$) is linear: the conditional expectation of the response is a linear function of the regressor.

(It is worthwhile pondering the clear geometric differences between these two descriptions: they illuminate the underlying statistical differences.)

Of the five bivariate Normal parameters (two means, two spreads, and one more that measures the dependence between the two variables), one is of common interest: the fifth parameter, $\rho$. It is directly (and simply) related to

  1. The coefficient of $X$ in the regression of $Y$ on $X$.

  2. The coefficient of $Y$ in the regression of $X$ on $Y$.

  3. The conditional variances in either of the regressions $(1)$ and $(2)$.

  4. The spreads of $(X,Y)$ around the axes of an ellipse (measured as variances).

A correlation analysis focuses on $(4)$, without distinguishing the roles of $X$ and $Y$.

A regression analysis focuses on the versions of $(1)$ through $(3)$ appropriate to the choice of regressor and response variables.
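
To make the correspondence concrete, here is a small simulation sketch (the parameter values $\rho = 0.6$, $\sigma_X = 2$, $\sigma_Y = 3$ are arbitrary choices for illustration): for bivariate Normal data, the fitted slope of $Y$ on $X$ approximates $\rho\,\sigma_Y/\sigma_X$, the slope of $X$ on $Y$ approximates $\rho\,\sigma_X/\sigma_Y$, the residual variance approximates the conditional variance $\sigma_Y^2(1-\rho^2)$, and the eigenvalues of the covariance matrix give the spreads around the axes of the ellipse.

set.seed(1)
n.sim <- 1e5
rho <- 0.6; s.x <- 2; s.y <- 3
# simulate bivariate Normal data via the conditional specification of Y given X
X <- rnorm(n.sim, 0, s.x)
Y <- rho*(s.y/s.x)*X + rnorm(n.sim, 0, s.y*sqrt(1 - rho^2))

coef(lm(Y ~ X))["X"]              # (1) close to rho*s.y/s.x = 0.9
coef(lm(X ~ Y))["Y"]              # (2) close to rho*s.x/s.y = 0.4
var(residuals(lm(Y ~ X)))         # (3) close to s.y^2*(1 - rho^2) = 5.76
eigen(cov(cbind(X, Y)))$values    # (4) spreads (variances) along the ellipse axes
cor(X, Y)                         # close to rho = 0.6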

In both cases, the hypothesis $H_0: \rho=0$ enjoys a special role: it indicates no correlation as well as no variation of $Y$ with respect to $X$. Because (in this simplest situation) both the probability model and the null hypothesis are common to correlation and regression, it should be no surprise that both methods share an interest in the same statistics (whether called "$r$" or "$\hat\beta$"); that the null sampling distributions of those statistics are the same; and (therefore) that hypothesis tests can produce identical p-values.

This common application, which is the first one anybody learns, can make it difficult to recognize just how different correlation and regression are in their concepts and aims. It is only when we learn about their generalizations that the underlying differences are exposed. It would be difficult to construe a GAM as giving much information about "correlation," just as it would be hard to frame a cluster analysis as a form of "regression." The two are different families of procedures with different objectives, each useful in its own right when applied appropriately.


I hope that this rather general and somewhat vague review has illuminated some of the ways in which "these issues go deeper than simply whether $r$ and $\hat\beta$ should be numerically equal." An appreciation of these differences has helped me understand what various techniques are attempting to accomplish, as well as to make better use of them in solving statistical problems.
