Solved – Does “correlation” also mean the slope in regression analysis?

correlation, regression, terminology

I'm reading a paper and the author wrote:

The effect of A, B, C on Y was studied through the use of multiple regression analysis. A, B, C were entered into the regression equation with Y as the dependent variable. The analysis of variance is presented in Table 3.
The effect of B on Y was significant, with B correlating .27 with Y.

English is not my mother tongue and I got really confused here.

First, he said he would run a regression analysis, but then he showed us an analysis of variance. Why?

And then he wrote about the correlation coefficient. Isn't that from correlation analysis? Or can this word also be used to describe the regression slope?

Best Answer

First, he said he would run a regression analysis, then he showed us the analysis of variance. Why?

Analysis of variance (ANOVA) is just a technique for comparing the variance explained by the model against the variance it leaves unexplained. Since regression models have both an explained and an unexplained component, it's natural that ANOVA can be applied to them, and in many software packages ANOVA results are routinely reported alongside linear regression output. Regression is also a very versatile technique: both the t-test and ANOVA can be expressed in regression form; they are just special cases of regression.
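
You can check the t-test case yourself. Here is a sketch (assuming Stata's built-in auto dataset, which the output below appears to come from); the t statistic from the t-test is identical to the one on the group coefficient in the regression:

. sysuse auto, clear       // load Stata's example dataset
. ttest mpg, by(foreign)   // two-sample t-test comparing group means
. regress mpg i.foreign    // the same comparison expressed as a regression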

For example, here is a sample regression output. The outcome is miles per gallon of some cars and the independent variable is whether the car was domestic or foreign:

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =   13.18
       Model |  378.153515     1  378.153515           Prob > F      =  0.0005
    Residual |  2065.30594    72  28.6848048           R-squared     =  0.1548
-------------+------------------------------           Adj R-squared =  0.1430
       Total |  2443.45946    73  33.4720474           Root MSE      =  5.3558

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   1.foreign |   4.945804   1.362162     3.63   0.001     2.230384    7.661225
       _cons |   19.82692   .7427186    26.70   0.000     18.34634    21.30751
------------------------------------------------------------------------------

You can see the ANOVA reported at the top left. The overall F statistic is 13.18, with a p-value of 0.0005, indicating that the model is predictive. And here is the output of a standalone ANOVA on the same data:

                       Number of obs =      74     R-squared     =  0.1548
                       Root MSE      = 5.35582     Adj R-squared =  0.1430

              Source |  Partial SS    df       MS           F     Prob > F
          -----------+----------------------------------------------------
               Model |  378.153515     1  378.153515      13.18     0.0005
                     |
             foreign |  378.153515     1  378.153515      13.18     0.0005
                     |
            Residual |  2065.30594    72  28.6848048   
          -----------+----------------------------------------------------
               Total |  2443.45946    73  33.4720474   

Notice that you recover exactly the same F statistic and p-value there.
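
(For completeness, a sketch of the commands that would generate the two tables above, assuming the same auto dataset:)

. sysuse auto, clear      // load the example data
. regress mpg i.foreign   // regression, with the ANOVA block at the top left
. anova mpg foreign       // the standalone ANOVA table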


And then he wrote about the correlation coefficient. Isn't that from correlation analysis? Or can this word also be used to describe the regression slope?

Assuming the analysis involved only B and Y, technically I would not agree with the word choice. In most cases, the slope and the correlation coefficient cannot be used interchangeably. There is one special case in which the two are the same: when both the independent and dependent variables are standardized (i.e., expressed as z-scores).
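
The reason is the standard identity for the OLS slope in a simple regression:

    \hat{\beta} = r_{xy} \cdot \frac{s_y}{s_x}

where r_{xy} is the correlation between x and y, and s_x and s_y are their standard deviations. Standardizing both variables makes s_x = s_y = 1, so the slope reduces to the correlation itself.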

For example, let's correlate miles per gallon and the price of the car:

             |    price      mpg
-------------+------------------
       price |   1.0000
         mpg |  -0.4686   1.0000

And here is the same correlation computed on the standardized variables; you can see the correlation coefficient remains unchanged:

             |  sdprice    sdmpg
-------------+------------------
     sdprice |   1.0000
       sdmpg |  -0.4686   1.0000
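
(The standardized variables could be created along these lines; a sketch assuming the auto dataset, with sdprice and sdmpg named to match the output above:)

. sysuse auto, clear
. egen sdprice = std(price)   // z-score of price: (price - mean) / sd
. egen sdmpg = std(mpg)       // z-score of mpg
. correlate sdprice sdmpg     // same -0.4686 as with the raw variables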

Now, here are the two regression models using the original variables:

. reg mpg price

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =   20.26
       Model |  536.541807     1  536.541807           Prob > F      =  0.0000
    Residual |  1906.91765    72  26.4849674           R-squared     =  0.2196
-------------+------------------------------           Adj R-squared =  0.2087
       Total |  2443.45946    73  33.4720474           Root MSE      =  5.1464

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       price |  -.0009192   .0002042    -4.50   0.000    -.0013263   -.0005121
       _cons |   26.96417   1.393952    19.34   0.000     24.18538    29.74297
------------------------------------------------------------------------------

... and here is the one with standardized variables:

. reg sdmpg sdprice

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =   20.26
       Model |  16.0295482     1  16.0295482           Prob > F      =  0.0000
    Residual |  56.9704514    72  .791256269           R-squared     =  0.2196
-------------+------------------------------           Adj R-squared =  0.2087
       Total |  72.9999996    73  .999999994           Root MSE      =  .88953

------------------------------------------------------------------------------
       sdmpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     sdprice |  -.4685967   .1041111    -4.50   0.000    -.6761384   -.2610549
       _cons |  -7.22e-09   .1034053    -0.00   1.000    -.2061347    .2061347
------------------------------------------------------------------------------

As you can see, the slope for the original variables is -0.0009192, while the slope for the standardized variables is -0.4686, which is exactly the correlation coefficient.
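
You can also check the identity above numerically. The auto data's sample standard deviations (from summarize, not shown here) are roughly 5.79 for mpg and 2949.5 for price, so

    -0.4686 × (5.79 / 2949.5) ≈ -0.00092

which recovers the raw-scale slope.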

So, unless A, B, C, and Y are standardized, I would not agree with the article's "correlating." Instead, I'd opt for something like: a one-unit increase in B is associated with a 0.27 increase in the average of Y.

In more complicated situations, where more than one independent variable is involved, the equivalence described above will no longer hold.
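
If you want to see this, here is a sketch (again assuming the auto dataset, with weight as an arbitrary second predictor): once sdweight enters the model, the slope on sdprice is a partial coefficient and will no longer match the pairwise price–mpg correlation.

. sysuse auto, clear
. egen sdmpg = std(mpg)
. egen sdprice = std(price)
. egen sdweight = std(weight)
. regress sdmpg sdprice sdweight   // standardized slopes are now partial effects
. correlate mpg price weight       // compare with the zero-order correlations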
