First, he said he would run a regression analysis; then he showed us
an analysis of variance. Why?
Analysis of variance (ANOVA) is just a technique that compares the variance explained by the model with the variance it leaves unexplained. Since a regression model has both an explained and an unexplained component, it is natural that ANOVA can be applied to it; in many software packages, an ANOVA table is routinely reported alongside linear regression output. Regression is also a very versatile technique: in fact, both the t-test and ANOVA can be expressed in regression form; they are just special cases of regression.
For example, here is a sample regression output. The outcome is miles per gallon of some cars and the independent variable is whether the car was domestic or foreign:
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 13.18
Model | 378.153515 1 378.153515 Prob > F = 0.0005
Residual | 2065.30594 72 28.6848048 R-squared = 0.1548
-------------+------------------------------ Adj R-squared = 0.1430
Total | 2443.45946 73 33.4720474 Root MSE = 5.3558
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.foreign | 4.945804 1.362162 3.63 0.001 2.230384 7.661225
_cons | 19.82692 .7427186 26.70 0.000 18.34634 21.30751
------------------------------------------------------------------------------
You can see the ANOVA table reported at the top left. The overall F-statistic is 13.18, with a p-value of 0.0005, indicating that the model is predictive. And here is the output from running ANOVA directly:
Number of obs = 74 R-squared = 0.1548
Root MSE = 5.35582 Adj R-squared = 0.1430
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 378.153515 1 378.153515 13.18 0.0005
|
foreign | 378.153515 1 378.153515 13.18 0.0005
|
Residual | 2065.30594 72 28.6848048
-----------+----------------------------------------------------
Total | 2443.45946 73 33.4720474
Notice that you recover the same F-statistic and p-value there.
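The figures above are consistent with Stata's shipped auto dataset; assuming that is indeed the source, a minimal sketch to reproduce both tables is:
. sysuse auto, clear
. regress mpg i.foreign
. anova mpg foreign
Incidentally, this example is also the two-sample t-test in regression form: the t-statistic for 1.foreign is 3.63, and 3.63 squared is 13.18 (up to rounding), the overall F-statistic.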
And then he wrote about the correlation coefficient; is that not from
correlation analysis? Or could this term also be used to describe the
regression slope?
Assuming the analysis involved only B and Y, technically I would not agree with the word choice. In most cases, the slope and the correlation coefficient cannot be used interchangeably. They are the same in one special case: when both the independent and dependent variables are standardized (i.e., expressed as z-scores).
For example, let's correlate miles per gallon and the price of the car:
| price mpg
-------------+------------------
price | 1.0000
mpg | -0.4686 1.0000
And here is the same correlation using the standardized variables; you can see the correlation coefficient remains unchanged:
| sdprice sdmpg
-------------+------------------
sdprice | 1.0000
sdmpg | -0.4686 1.0000
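How sdprice and sdmpg were created is not shown above; one way to build such standardized variables in Stata, assuming the same data, is with egen's std() function:
. egen sdprice = std(price)
. egen sdmpg = std(mpg)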
Now, here are the two regression models using the original variables:
. reg mpg price
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 20.26
Model | 536.541807 1 536.541807 Prob > F = 0.0000
Residual | 1906.91765 72 26.4849674 R-squared = 0.2196
-------------+------------------------------ Adj R-squared = 0.2087
Total | 2443.45946 73 33.4720474 Root MSE = 5.1464
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
price | -.0009192 .0002042 -4.50 0.000 -.0013263 -.0005121
_cons | 26.96417 1.393952 19.34 0.000 24.18538 29.74297
------------------------------------------------------------------------------
... and here is the one with standardized variables:
. reg sdmpg sdprice
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 1, 72) = 20.26
Model | 16.0295482 1 16.0295482 Prob > F = 0.0000
Residual | 56.9704514 72 .791256269 R-squared = 0.2196
-------------+------------------------------ Adj R-squared = 0.2087
Total | 72.9999996 73 .999999994 Root MSE = .88953
------------------------------------------------------------------------------
sdmpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sdprice | -.4685967 .1041111 -4.50 0.000 -.6761384 -.2610549
_cons | -7.22e-09 .1034053 -0.00 1.000 -.2061347 .2061347
------------------------------------------------------------------------------
As you can see, the slope with the original variables is -0.0009192, while the slope with the standardized variables is -0.4686, which is exactly the correlation coefficient.
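The reason is the standard identity for the simple-regression slope: $\hat\beta_1 = r\,\dfrac{s_Y}{s_X}$, where $s_X$ and $s_Y$ are the sample standard deviations. Standardizing both variables makes $s_X = s_Y = 1$, so the slope collapses to the correlation coefficient $r$ itself.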
So, unless A, B, C, and Y are standardized, I would not agree with the article's use of "correlating." Instead, I would simply say that a one-unit increase in B is associated with the average of Y being 0.27 higher.
In more complicated situations, where more than one independent variable is involved, the equivalence described above no longer holds.
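As a sketch of why (not drawn from the outputs above): with two standardized regressors, the coefficient of $X_1$ is $\hat\beta_1^{*} = \dfrac{r_{Y1} - r_{Y2}\,r_{12}}{1 - r_{12}^{2}}$, which reduces to the simple correlation $r_{Y1}$ only when the regressors are uncorrelated ($r_{12} = 0$).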
Introduction
This reply addresses the underlying motivation for this set of questions:
What are the assumptions underlying a correlation test and a regression slope test?
In light of the background provided in the question, though, I would like to suggest expanding this question a little: let us explore the different purposes and conceptions of correlation and regression.
Correlation typically is invoked in situations where
Data are bivariate: exactly two distinct values of interest are associated with each "subject" or "observation".
The data are observational: neither of the values was set by the experimenter. Both were observed or measured.
Interest lies in identifying, quantifying, and testing some kind of relationship between the variables.
Regression is used where
Data are bivariate or multivariate: there may be more than two distinct values of interest.
Interest focuses on understanding what can be said about a subset of the variables--the "dependent" variables or "responses"--based on what might be known about the other subset--the "independent" variables or "regressors."
Specific values of the regressors may have been set by the experimenter.
These differing aims and situations lead to distinct approaches. Because this thread is concerned about their similarities, let's focus on the case where they are most similar: bivariate data. In either case those data will typically be modeled as realizations of a random variable $(X,Y)$. Very generally, both forms of analysis seek relatively simple characterizations of this variable.
Correlation
I believe "correlation analysis" has never been generally defined. Should it be limited to computing correlation coefficients, or could it be considered more extensively as comprising PCA, cluster analysis, and other forms of analysis that relate two variables? Whether your point of view is narrowly circumscribed or broad, perhaps you would agree that the following description applies:
Correlation is an analysis that makes assumptions about the distribution of $(X,Y)$, without privileging either variable, and uses the data to draw more specific conclusions about that distribution.
For instance, you might begin by assuming $(X,Y)$ has a bivariate Normal distribution and use the Pearson correlation coefficient of the data to estimate one of the parameters of that distribution. This is one of the narrowest (and oldest) conceptions of correlation.
As another example, you might begin by assuming $(X,Y)$ could have any distribution and use a cluster analysis to identify $k$ "centers." One might construe that as the beginnings of a resolution of the distribution of $(X,Y)$ into a mixture of unimodal bivariate distributions, one for each cluster.
One thing common to all these approaches is a symmetric treatment of $X$ and $Y$: neither is privileged over the other. Both play equivalent roles.
Regression
Regression enjoys a clear, universally understood definition:
Regression characterizes the conditional distribution of $Y$ (the response) given $X$ (the regressor).
Historically, regression traces its roots to Galton's discovery (c. 1885) that bivariate Normal data $(X,Y)$ enjoy a linear regression: the conditional expectation of $Y$ is a linear function of $X$. At one pole of the special-general spectrum is Ordinary Least Squares (OLS) regression where the conditional distribution of $Y$ is assumed to be Normal$(\beta_0+\beta_1 X, \sigma^2)$ for fixed parameters $\beta_0, \beta_1,$ and $\sigma$ to be estimated from the data.
At the extremely general end of this spectrum are generalized linear models, generalized additive models, and others of their ilk that relax all aspects of OLS: the expectation, variance, and even the shape of the conditional distribution of $Y$ may be allowed to vary nonlinearly with $X$. The concept that survives all this generalization is that interest remains focused on understanding how $Y$ depends on $X$. That fundamental asymmetry is still there.
Correlation and Regression
One very special situation is common to both approaches and is frequently encountered: the bivariate Normal model. In this model, a scatterplot of data will assume a classic "football," oval, or cigar shape: the data are spread elliptically around an orthogonal pair of axes.
A correlation analysis focuses on the "strength" of this relationship, in the sense that a relatively small spread around the major axis is "strong."
As remarked above, the regression of $Y$ on $X$ (and, equally, the regression of $X$ on $Y$) is linear: the conditional expectation of the response is a linear function of the regressor.
(It is worthwhile pondering the clear geometric differences between these two descriptions: they illuminate the underlying statistical differences.)
Of the five bivariate Normal parameters (two means, two spreads, and one more that measures the dependence between the two variables), one is of common interest: the fifth parameter, $\rho$. It is directly (and simply) related to
The coefficient of $X$ in the regression of $Y$ on $X$.
The coefficient of $Y$ in the regression of $X$ on $Y$.
The conditional variances in either of the regressions $(1)$ and $(2)$.
The spreads of $(X,Y)$ around the axes of an ellipse (measured as variances).
A correlation analysis focuses on $(4)$, without distinguishing the roles of $X$ and $Y$.
A regression analysis focuses on the versions of $(1)$ through $(3)$ appropriate to the choice of regressor and response variables.
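For concreteness, the standard bivariate Normal identities behind $(1)$ through $(4)$ are $\beta_{Y\mid X} = \rho\,\sigma_Y/\sigma_X$, $\beta_{X\mid Y} = \rho\,\sigma_X/\sigma_Y$, and $\operatorname{Var}(Y\mid X) = \sigma_Y^2(1-\rho^2)$, $\operatorname{Var}(X\mid Y) = \sigma_X^2(1-\rho^2)$; when $X$ and $Y$ are standardized, the variances along the major and minor axes of the ellipse are $1+\rho$ and $1-\rho$.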
In both cases, the hypothesis $H_0: \rho=0$ enjoys a special role: it indicates no correlation as well as no variation of $Y$ with respect to $X$. Because (in this simplest situation) both the probability model and the null hypothesis are common to correlation and regression, it should be no surprise that both methods share an interest in the same statistics (whether called "$r$" or "$\hat\beta$"); that the null sampling distributions of those statistics are the same; and (therefore) that hypothesis tests can produce identical p-values.
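Concretely, in this bivariate Normal setting the common test statistic is $t = r\sqrt{\dfrac{n-2}{1-r^2}} = \dfrac{\hat\beta_1}{\operatorname{se}(\hat\beta_1)}$, which follows a Student $t$ distribution with $n-2$ degrees of freedom under $H_0$; this is the standard result behind the identical p-values.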
This common application, which is the first one anybody learns, can make it difficult to recognize just how different correlation and regression are in their concepts and aims. It is only when we learn about their generalizations that the underlying differences are exposed. It would be difficult to construe a GAM as giving much information about "correlation," just as it would be hard to frame a cluster analysis as a form of "regression." The two are different families of procedures with different objectives, each useful in its own right when applied appropriately.
I hope that this rather general and somewhat vague review has illuminated some of the ways in which "these issues go deeper than simply whether $r$ and $\hat\beta$ should be numerically equal." An appreciation of these differences has helped me understand what various techniques are attempting to accomplish, as well as to make better use of them in solving statistical problems.
Best Answer
There are several different questions bundled together here. Neither correlation nor linear regression can prove a causal relationship. But both in your mind and in the model, correlation is undirected whereas regression is directed: correlation makes no distinction as to which variable you think drives the other, whereas the formulation of a linear regression model usually implies a direction. At least with ordinary least squares, it is not the same whether you write $Y = aX+b$ or $X = cY+d$; however, $\operatorname{cor}(X,Y) = \operatorname{cor}(Y,X)$.
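In fact, with OLS the two slopes are $a = r\,s_Y/s_X$ and $c = r\,s_X/s_Y$, so they generally differ, yet their product is $a\,c = r^2$ (a standard identity worth noting here).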
Correlation and linear regression are closely related; one link is the $R^2$ value that results from a simple linear regression, which is indeed the square of the correlation coefficient $r$. You have not mentioned $R^2$ in your post, so maybe this will help you get a better understanding.
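For instance, in the mpg and price example above, $r = -0.4686$ and $(-0.4686)^2 \approx 0.2196$, which matches the R-squared reported in both of those regression outputs.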
The p-value mainly tells you whether your sample is large enough to determine the sign of the correlation coefficient $r$ and of the regression coefficient $a$.