Solved – How do we know if the correlation is significant

correlation, regression

Suppose that we have continuous data $(X_1,Y_1),\dots,(X_n,Y_n)$. Suppose that $r_{x,y}$ is the Karl Pearson correlation coefficient between the $X_i$'s and the $Y_i$'s. For what range of values of $r_{x,y}$ can we really decide that there may indeed be a linear relationship between the $X_i$'s and the $Y_i$'s and proceed to predict $Y$ by using a linear regression?

I'm sure the topic behind this question is a well-studied one. I did a little search here but couldn't find relevant posts. Any answers to the above question, or pointers to such a study, are greatly appreciated.

Best Answer

For what range of values of $r_{x,y}$, can we [...] proceed to predict $Y$ by using a linear regression?

If the relationship is indeed linear, any value of correlation can work; linear regression behaves as it should across the entire range of correlations, including 0. You don't even need to examine the correlation beforehand (it seems to serve no purpose not already covered by the usual regression calculations).

However, that's a big if. You can get any correlation (except exactly 1 or -1) and not have linearity; a large (magnitude of) correlation doesn't necessarily imply the relationship is actually linear (nor does a small one imply that it isn't); correlation is not of itself a useful way to decide on the suitability of a linear regression model.
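To make this concrete, here is a minimal sketch in base R using simulated data. The data, seed, and variable names are all invented for illustration and are not from the question. It sets up three cases: an exact quadratic relationship whose correlation is essentially zero, a clearly curved relationship with a large correlation, and a genuinely linear relationship with a small correlation.

```r
# Hypothetical simulated data; nothing here comes from the original question.
set.seed(42)

# Case 1: an exact but nonlinear relationship with correlation (numerically) 0.
x1 <- seq(-3, 3, length.out = 201)   # grid symmetric about 0
y1 <- x1^2                           # y is a deterministic function of x
r_quadratic <- cor(x1, y1)

# Case 2: a clearly curved relationship with a large correlation.
x2 <- seq(0, 3, length.out = 201)
y2 <- exp(x2) + rnorm(length(x2), sd = 0.5)
r_exponential <- cor(x2, y2)

# Case 3: a truly linear relationship with a weak correlation (large noise).
x3 <- rnorm(500)
y3 <- 1 + 0.2 * x3 + rnorm(500, sd = 2)
r_linear <- cor(x3, y3)

print(c(quadratic = r_quadratic, exponential = r_exponential, linear = r_linear))
```

In a setup like this, plotting the data or the residuals from a fitted line is what reveals the curvature in the first two cases; the correlation alone does not.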

In the case of multiple regression, examining bivariate correlations is even more problematic, since the marginal bivariate correlations may be quite different from what you get in a multiple regression model. (See the Wikipedia articles on Simpson's paradox and omitted variable bias, for example.)
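As a rough illustration of how a marginal correlation can disagree with a multiple regression, the following base R sketch simulates two correlated predictors. The coefficients, sample size, and names are assumptions made up for the example. The response depends negatively on `x1` once `x2` is held fixed, yet the marginal correlation between `x1` and `y` is positive.

```r
# Hypothetical simulation: x1 and x2 are strongly correlated predictors.
set.seed(1)
n  <- 500
x2 <- rnorm(n)
x1 <- x2 + rnorm(n, sd = 0.5)
y  <- -1 * x1 + 3 * x2 + rnorm(n)    # true coefficient of x1 is -1

# Marginal (bivariate) correlation between x1 and y.
marginal_r <- cor(x1, y)

# Coefficient of x1 after adjusting for x2 in a multiple regression.
fit <- lm(y ~ x1 + x2)
partial_slope <- unname(coef(fit)["x1"])

print(c(marginal_r = marginal_r, partial_slope = partial_slope))
```

Here the sign of the bivariate correlation would point in the opposite direction from the fitted coefficient, which is the kind of discrepancy the omitted-variable-bias discussion describes.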

However, if you're interested in whether the regression is doing something useful in terms of prediction, we'd need to pin down precisely what is intended by "useful". In some cases that might be attributable to correlation values.

On the other hand, if you're instead asking "how do we perform a hypothesis test of a Pearson correlation?" you should probably edit the question to make that explicit. Under suitable assumptions you get a "standard" test readily available in packages - or fairly easily carried out by hand. [However, you're not limited to those specific assumptions; other tests of a Pearson correlation - including nonparametric tests - are possible.]
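For reference, a sketch of the "standard" test in R, on simulated data (the data and names are invented here). It assumes independent observations and normality, and tests $H_0: \rho = 0$ with the statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ on $n-2$ degrees of freedom. The by-hand value is compared with `cor.test` and with the slope $t$-statistic from a simple regression, which coincide in this setting. A simple permutation test is included as one possible nonparametric alternative; it is only one of several options.

```r
# Hypothetical simulated data for illustration only.
set.seed(123)
n <- 50
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)

# By hand: t statistic and two-sided p-value for H0: rho = 0.
r <- cor(x, y)
t_by_hand <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_by_hand <- 2 * pt(abs(t_by_hand), df = n - 2, lower.tail = FALSE)

# The same test from the stats package.
ct <- cor.test(x, y, method = "pearson")

# The slope t statistic from simple linear regression is the same quantity.
slope_t <- summary(lm(y ~ x))$coefficients["x", "t value"]

# A permutation test: shuffle y to break any association with x.
B <- 2000
perm_r <- replicate(B, cor(x, sample(y)))
p_perm <- (1 + sum(abs(perm_r) >= abs(r))) / (B + 1)

print(c(t_by_hand = t_by_hand, t_cor_test = unname(ct$statistic), t_slope = slope_t))
print(c(p_by_hand = p_by_hand, p_cor_test = ct$p.value, p_perm = p_perm))
```

The agreement between the correlation test and the slope test is one way to see why checking the correlation beforehand adds nothing to the usual regression output.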

Related Solutions

Hypothesis Testing – How to Test If the Slopes in the Linear Model Are Equal to a Fixed Value

In linear regression the assumption is that $X$ and $Y$ are not random variables. Therefore, the model

$$Z = a X + b Y + \epsilon$$

is algebraically the same as

$$Z - \frac{1}{2} X - \frac{1}{2} Y = (a - \frac{1}{2})X + (b - \frac{1}{2})Y + \epsilon = \alpha X + \beta Y + \epsilon.$$

Here, $\alpha = a - \frac{1}{2}$ and $\beta =b - \frac{1}{2}$. The error term $\epsilon$ is unaffected. Fit this model, estimating the coefficients as $\hat{\alpha}$ and $\hat{\beta}$, respectively, and test the hypothesis $\alpha = \beta = 0$ in the usual way.
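A minimal sketch of this recipe in R, on simulated data (the data, seed, and the name `W` for the shifted response are assumptions made for illustration). It follows the no-intercept model as displayed above; for a no-intercept fit, the overall F-statistic reported by `summary` compares the fitted model with the model in which all coefficients are zero, which is the hypothesis $\alpha = \beta = 0$.

```r
# Hypothetical simulated data; X and Y are treated as fixed covariates.
set.seed(7)
n <- 200
X <- rnorm(n)
Y <- rnorm(n)
Z <- 0.5 * X + 0.5 * Y + rnorm(n)    # simulated with a = b = 1/2, so H0 holds

# Shifted response: W = Z - X/2 - Y/2 = alpha * X + beta * Y + error.
W <- Z - X / 2 - Y / 2

# Fit the reparameterised model without an intercept, as in the model above.
fit <- lm(W ~ 0 + X + Y)

# Overall F test of alpha = beta = 0 (2 numerator degrees of freedom).
f <- summary(fit)$fstatistic
p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(coef(fit))
print(p_value)
```

If the model of interest includes an intercept, the same idea applies, but the comparison would be between the intercept-only fit and the fit with both slopes, for example via `anova` on the two nested models.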

The statistic written at the end of the question is not a chi-squared statistic, despite its formal similarity to one. A chi-squared statistic involves counts, not data values, and must have expected values in its denominator, not covariates. It's possible for one or more of the denominators $\frac{x_i+y_i}{2}$ to be zero (or close to it), showing that something is seriously wrong with this formulation. If even that isn't convincing, consider that the units of measurement of $Z$, $X$, and $Y$ could be anything (such as drams, parsecs, and pecks), so that a linear combination like $z_i - (x_i+y_i)/2$ is (in general) meaningless. It doesn't test anything.

Solved – Issues on computing Pearson correlation coefficient for two vectors

This should not be a problem, since the mean is explicitly subtracted. Here's a small example (all code is in R):

library(mnormt)
# Create a sample from a bivariate normal distribution with correlation 0.5
df <- rmnorm(n = 100, mean = rep(0, 2), varcov = matrix(c(1, 0.5, 0.5, 1), nrow = 2))

# Compute the correlation
cor(df)
#>           [,1]      [,2]
#> [1,] 1.0000000 0.5605498
#> [2,] 0.5605498 1.0000000

# Scale the first variable by 10000
df[, 1] <- df[, 1] * 10000

# The correlation stays the same
cor(df)
#>           [,1]      [,2]
#> [1,] 1.0000000 0.5605498
#> [2,] 0.5605498 1.0000000

Hope this helps.

Edit, following up on the comments (thanks to whuber): I understood the question as being about the magnitude of the whole vector. From the discussion, I gather that others read it as being about outliers. In that case my solution is, of course, not helpful.
