First, he said he would run a regression analysis; then he showed us
an analysis of variance. Why?
Analysis of variance (ANOVA) is just a technique that compares the variance explained by the model with the variance it leaves unexplained. Since a regression model has both an explained and an unexplained component, it is natural that ANOVA can be applied to it; in many software packages, an ANOVA table is routinely reported alongside linear regression output. Regression is also a very versatile technique: in fact, both the t-test and ANOVA can be expressed in regression form; they are just special cases of regression.
For example, here is a sample regression output. The outcome is miles per gallon of some cars and the independent variable is whether the car was domestic or foreign:
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 13.18
Model | 378.153515 1 378.153515 Prob > F = 0.0005
Residual | 2065.30594 72 28.6848048 R-squared = 0.1548
-------------+------------------------------ Adj R-squared = 0.1430
Total | 2443.45946 73 33.4720474 Root MSE = 5.3558
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.foreign | 4.945804 1.362162 3.63 0.001 2.230384 7.661225
_cons | 19.82692 .7427186 26.70 0.000 18.34634 21.30751
------------------------------------------------------------------------------
You can see the ANOVA table reported at the top left. The overall F-statistic is 13.18, with a p-value of 0.0005, indicating that the model is predictive. And here is the output from running ANOVA directly:
Number of obs = 74 R-squared = 0.1548
Root MSE = 5.35582 Adj R-squared = 0.1430
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 378.153515 1 378.153515 13.18 0.0005
|
foreign | 378.153515 1 378.153515 13.18 0.0005
|
Residual | 2065.30594 72 28.6848048
-----------+----------------------------------------------------
Total | 2443.45946 73 33.4720474
Notice that you recover the same F-statistic and p-value there.
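The figures above are consistent with Stata's shipped auto dataset; assuming that is indeed the source, a minimal sketch to reproduce both tables is:
. sysuse auto, clear
. regress mpg i.foreign
. anova mpg foreign
Incidentally, this example is also the two-sample t-test in regression form: the t-statistic for 1.foreign is 3.63, and 3.63 squared is 13.18 (up to rounding), the overall F-statistic.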
And then he wrote about the correlation coefficient; is that not from
correlation analysis? Or could this term also be used to describe the
regression slope?
Assuming the analysis involved only B and Y, technically I would not agree with the word choice. In most cases, the slope and the correlation coefficient cannot be used interchangeably. They are the same in one special case: when both the independent and dependent variables are standardized (i.e., expressed as z-scores).
For example, let's correlate miles per gallon and the price of the car:
| price mpg
-------------+------------------
price | 1.0000
mpg | -0.4686 1.0000
And here is the same correlation using the standardized variables; you can see the correlation coefficient remains unchanged:
| sdprice sdmpg
-------------+------------------
sdprice | 1.0000
sdmpg | -0.4686 1.0000
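How sdprice and sdmpg were created is not shown above; one way to build such standardized variables in Stata, assuming the same data, is with egen's std() function:
. egen sdprice = std(price)
. egen sdmpg = std(mpg)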
Now, here are the two regression models using the original variables:
. reg mpg price
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 20.26
Model | 536.541807 1 536.541807 Prob > F = 0.0000
Residual | 1906.91765 72 26.4849674 R-squared = 0.2196
-------------+------------------------------ Adj R-squared = 0.2087
Total | 2443.45946 73 33.4720474 Root MSE = 5.1464
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
price | -.0009192 .0002042 -4.50 0.000 -.0013263 -.0005121
_cons | 26.96417 1.393952 19.34 0.000 24.18538 29.74297
------------------------------------------------------------------------------
... and here is the one with standardized variables:
. reg sdmpg sdprice
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 20.26
Model | 16.0295482 1 16.0295482 Prob > F = 0.0000
Residual | 56.9704514 72 .791256269 R-squared = 0.2196
-------------+------------------------------ Adj R-squared = 0.2087
Total | 72.9999996 73 .999999994 Root MSE = .88953
------------------------------------------------------------------------------
sdmpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sdprice | -.4685967 .1041111 -4.50 0.000 -.6761384 -.2610549
_cons | -7.22e-09 .1034053 -0.00 1.000 -.2061347 .2061347
------------------------------------------------------------------------------
As you can see, the slope with the original variables is -0.0009192, while the slope with the standardized variables is -0.4686, which is exactly the correlation coefficient.
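The reason is the standard identity for the simple-regression slope: $\hat\beta_1 = r\,\dfrac{s_Y}{s_X}$, where $s_X$ and $s_Y$ are the sample standard deviations. Standardizing both variables makes $s_X = s_Y = 1$, so the slope collapses to the correlation coefficient $r$ itself.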
So, unless A, B, C, and Y are standardized, I would not agree with the article's use of "correlating." Instead, I would simply say that a one-unit increase in B is associated with the average of Y being 0.27 higher.
In more complicated situations, where more than one independent variable is involved, the equivalence described above no longer holds.
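As a sketch of why (not drawn from the outputs above): with two standardized regressors, the coefficient of $X_1$ is $\hat\beta_1^{*} = \dfrac{r_{Y1} - r_{Y2}\,r_{12}}{1 - r_{12}^{2}}$, which reduces to the simple correlation $r_{Y1}$ only when the regressors are uncorrelated ($r_{12} = 0$).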
Introduction
This reply addresses the underlying motivation for this set of questions:
What are the assumptions underlying a correlation test and a regression slope test?
In light of the background provided in the question, though, I would like to suggest expanding this question a little: let us explore the different purposes and conceptions of correlation and regression.
Correlation typically is invoked in situations where
Data are bivariate: exactly two distinct values of interest are associated with each "subject" or "observation".
The data are observational: neither of the values was set by the experimenter. Both were observed or measured.
Interest lies in identifying, quantifying, and testing some kind of relationship between the variables.
Regression is used where
Data are bivariate or multivariate: there may be more than two distinct values of interest.
Interest focuses on understanding what can be said about a subset of the variables--the "dependent" variables or "responses"--based on what might be known about the other subset--the "independent" variables or "regressors."
Specific values of the regressors may have been set by the experimenter.
These differing aims and situations lead to distinct approaches. Because this thread is concerned about their similarities, let's focus on the case where they are most similar: bivariate data. In either case those data will typically be modeled as realizations of a random variable $(X,Y)$. Very generally, both forms of analysis seek relatively simple characterizations of this variable.
Correlation
I believe "correlation analysis" has never been generally defined. Should it be limited to computing correlation coefficients, or could it be considered more extensively as comprising PCA, cluster analysis, and other forms of analysis that relate two variables? Whether your point of view is narrowly circumscribed or broad, perhaps you would agree that the following description applies:
Correlation is an analysis that makes assumptions about the distribution of $(X,Y)$, without privileging either variable, and uses the data to draw more specific conclusions about that distribution.
For instance, you might begin by assuming $(X,Y)$ has a bivariate Normal distribution and use the Pearson correlation coefficient of the data to estimate one of the parameters of that distribution. This is one of the narrowest (and oldest) conceptions of correlation.
As another example, you might begin by assuming $(X,Y)$ could have any distribution and use a cluster analysis to identify $k$ "centers." One might construe that as the beginnings of a resolution of the distribution of $(X,Y)$ into a mixture of unimodal bivariate distributions, one for each cluster.
One thing common to all these approaches is a symmetric treatment of $X$ and $Y$: neither is privileged over the other. Both play equivalent roles.
Regression
Regression enjoys a clear, universally understood definition:
Regression characterizes the conditional distribution of $Y$ (the response) given $X$ (the regressor).
Historically, regression traces its roots to Galton's discovery (c. 1885) that bivariate Normal data $(X,Y)$ enjoy a linear regression: the conditional expectation of $Y$ is a linear function of $X$. At one pole of the special-general spectrum is Ordinary Least Squares (OLS) regression where the conditional distribution of $Y$ is assumed to be Normal$(\beta_0+\beta_1 X, \sigma^2)$ for fixed parameters $\beta_0, \beta_1,$ and $\sigma$ to be estimated from the data.
At the extremely general end of this spectrum are generalized linear models, generalized additive models, and others of their ilk that relax all aspects of OLS: the expectation, variance, and even the shape of the conditional distribution of $Y$ may be allowed to vary nonlinearly with $X$. The concept that survives all this generalization is that interest remains focused on understanding how $Y$ depends on $X$. That fundamental asymmetry is still there.
Correlation and Regression
One very special situation is common to both approaches and is frequently encountered: the bivariate Normal model. In this model, a scatterplot of data will assume a classic "football," oval, or cigar shape: the data are spread elliptically around an orthogonal pair of axes.
A correlation analysis focuses on the "strength" of this relationship, in the sense that a relatively small spread around the major axis is "strong."
As remarked above, the regression of $Y$ on $X$ (and, equally, the regression of $X$ on $Y$) is linear: the conditional expectation of the response is a linear function of the regressor.
(It is worthwhile pondering the clear geometric differences between these two descriptions: they illuminate the underlying statistical differences.)
Of the five bivariate Normal parameters (two means, two spreads, and one more that measures the dependence between the two variables), one is of common interest: the fifth parameter, $\rho$. It is directly (and simply) related to
The coefficient of $X$ in the regression of $Y$ on $X$.
The coefficient of $Y$ in the regression of $X$ on $Y$.
The conditional variances in either of the regressions $(1)$ and $(2)$.
The spreads of $(X,Y)$ around the axes of an ellipse (measured as variances).
A correlation analysis focuses on $(4)$, without distinguishing the roles of $X$ and $Y$.
A regression analysis focuses on the versions of $(1)$ through $(3)$ appropriate to the choice of regressor and response variables.
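For concreteness, the standard bivariate Normal identities behind $(1)$ through $(4)$ are $\beta_{Y\mid X} = \rho\,\sigma_Y/\sigma_X$, $\beta_{X\mid Y} = \rho\,\sigma_X/\sigma_Y$, and $\operatorname{Var}(Y\mid X) = \sigma_Y^2(1-\rho^2)$, $\operatorname{Var}(X\mid Y) = \sigma_X^2(1-\rho^2)$; when $X$ and $Y$ are standardized, the variances along the major and minor axes of the ellipse are $1+\rho$ and $1-\rho$.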
In both cases, the hypothesis $H_0: \rho=0$ enjoys a special role: it indicates no correlation as well as no variation of $Y$ with respect to $X$. Because (in this simplest situation) both the probability model and the null hypothesis are common to correlation and regression, it should be no surprise that both methods share an interest in the same statistics (whether called "$r$" or "$\hat\beta$"); that the null sampling distributions of those statistics are the same; and (therefore) that hypothesis tests can produce identical p-values.
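Concretely, in this bivariate Normal setting the common test statistic is $t = r\sqrt{\dfrac{n-2}{1-r^2}} = \dfrac{\hat\beta_1}{\operatorname{se}(\hat\beta_1)}$, which follows a Student $t$ distribution with $n-2$ degrees of freedom under $H_0$; this is the standard result behind the identical p-values.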
This common application, which is the first one anybody learns, can make it difficult to recognize just how different correlation and regression are in their concepts and aims. It is only when we learn about their generalizations that the underlying differences are exposed. It would be difficult to construe a GAM as giving much information about "correlation," just as it would be hard to frame a cluster analysis as a form of "regression." The two are different families of procedures with different objectives, each useful in its own right when applied appropriately.
I hope that this rather general and somewhat vague review has illuminated some of the ways in which "these issues go deeper than simply whether $r$ and $\hat\beta$ should be numerically equal." An appreciation of these differences has helped me understand what various techniques are attempting to accomplish, as well as to make better use of them in solving statistical problems.
Best Answer
There are several different questions bundled together here. Neither correlation nor linear regression can prove a causal relationship. But both in your mind and in the model, correlation is undirected whereas regression is directed: correlation makes no distinction as to which variable you think drives the other, whereas the formulation of a linear regression model usually implies a direction. At least with ordinary least squares, it is not the same whether you write $Y = aX+b$ or $X = cY+d$; however, $\operatorname{cor}(X,Y) = \operatorname{cor}(Y,X)$.
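In fact, with OLS the two slopes are $a = r\,s_Y/s_X$ and $c = r\,s_X/s_Y$, so they generally differ, yet their product is $a\,c = r^2$ (a standard identity worth noting here).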
Correlation and linear regression are closely related; one link is the $R^2$ value that results from a simple linear regression, which is indeed the square of the correlation coefficient $r$. You have not mentioned $R^2$ in your post, so maybe this will help you get a better understanding.
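For instance, in the mpg and price example above, $r = -0.4686$ and $(-0.4686)^2 \approx 0.2196$, which matches the R-squared reported in both of those regression outputs.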
The p-value mainly tells you whether your sample is large enough to determine the sign of the correlation coefficient $r$ and of the regression coefficient $a$.