I produced this plot and regression line in R, and I thought my results were quite odd. Is the strength of the correlation determined by how steep the regression line is? In this case the line isn't very steep, so is it fair to assume the relationship is weak? I also wondered about my regression line: since a lot of the data is well below the line, could the line be incorrect?
[Math] Strong vs weak relationship in this correlation
correlation, regression, regression analysis, statistics
Related Solutions
In general the second will be better, because the regression can account for the fact that you have different numbers of observations in each group.
As an extreme case, imagine you only had two observations for one price and a hundred observations for another. The first method would treat those two prices identically and constrain the line to be as close as possible to both; this might result in it being very close to the mean of the two-observation price and far from the mean of the hundred-observation price. The second method would be much more willing to draw a line which is a long way from the mean of the two-observation price but closer to the mean of the hundred-observation price. This is a good thing: from just two observations you have high uncertainty about the true mean, so it is better to tolerate more error there and less on the more confident estimate.
If you had exactly the same number of observations for each price then the results should be identical, and if you have close to the same number the results should be similar; in that case just do whichever is easiest.
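To make the extreme case above concrete, here is a minimal R sketch (made-up data and variable names, not from the question) contrasting the two methods when one price has two observations and another has a hundred:

# two observations at price 2, a hundred at price 5
set.seed(1)
x <- c(rep(2, 2), rep(5, 100))
y <- c(rnorm(2, mean = 10), rnorm(100, mean = 20))

# method 1: regress the per-price means on price
means <- tapply(y, x, mean)
price <- as.numeric(names(means))
coef(lm(means ~ price))   # treats both prices equally

# method 2: regress every raw observation on price
coef(lm(y ~ x))           # effectively weights each price by its count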
EDIT - re the last point and update 1: What I've said is provably true if you're doing least squares regression. This is because the squared errors of the two fits are linearly related, so the least-squares error (even for a non-linear fit) is minimised by the same fit. The crux of the proof is the identity $\sum_i (y_{i} - x)^2 = \sum_i (y_{i} - \mu)^2 + n(x - \mu)^2$, where $\mu$ is the mean of the $y_{i}$ and $n$ is their number.
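Spelling this out (my notation): for price group $g$ with $n_g$ observations, group mean $\mu_g$, and fitted value $\hat{y}_g$ at that price,
$$\sum_g \sum_{i=1}^{n_g} (y_{gi} - \hat{y}_g)^2 = \sum_g \sum_{i=1}^{n_g} (y_{gi} - \mu_g)^2 + \sum_g n_g (\hat{y}_g - \mu_g)^2.$$
The first term on the right does not depend on the fitted line, so minimising the total squared error is equivalent to minimising $\sum_g n_g (\hat{y}_g - \mu_g)^2$. When every $n_g$ equals the same $n$, that is just $n$ times the squared error of the fit to the means, so both methods pick the same line.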
If R is giving a different answer then either there's something wrong with the R code or R isn't doing least squares linear regression. It might for instance be doing some sort of outlier removal, or a different algorithm altogether. I personally can't stand R (and this is just giving me yet another reason), so I can't motivate myself either to debug your R code or to search R's (best I can tell non-existent) documentation to find out what R is actually doing. What I can do, though, is counter with my own Matlab code, which shows entirely consistent results between the two every time I run it:
% 100 simulated observations at each of four prices
vals = zeros(100, 4);
vals(:, 1) = normrnd(2.5, 1, 100, 1);
vals(:, 2) = normrnd(3, 0.8, 100, 1);
vals(:, 3) = normrnd(4, 1.2, 100, 1);
vals(:, 4) = normrnd(5, 1.6, 100, 1);
valmeans = mean(vals);   % one mean per price (column means)
x_short = [2.5, 3, 4, 5];   % one x value per price, for the means fit
x_long = [repmat(2.5, 1, 100), repmat(3, 1, 100), repmat(4, 1, 100), repmat(5, 1, 100)];   % one x value per raw observation
polyfit(x_long', vals(:), 1)   % linear fit to all 400 raw points
polyfit(x_short, valmeans, 1)   % linear fit to the four means
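For anyone who wants to run the same check in R anyway, a rough translation of the Matlab above (a sketch only; lm stands in for polyfit) could be:

set.seed(42)
# 100 draws at each of four prices, mirroring the Matlab example
vals <- cbind(rnorm(100, 2.5, 1), rnorm(100, 3, 0.8),
              rnorm(100, 4, 1.2), rnorm(100, 5, 1.6))
valmeans <- colMeans(vals)
x_short <- c(2.5, 3, 4, 5)
x_long <- rep(x_short, each = 100)   # matches the column-major order of as.vector(vals)

coef(lm(as.vector(vals) ~ x_long))   # fit to all 400 raw points
coef(lm(valmeans ~ x_short))         # fit to the four means: same line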
Consider two random variables $Y$ and $X$. Assume their correlation is $\rho$. Of course $\rho^2$ is always defined. However, consider a regression
$$Y=a+bX+\epsilon,$$
where $\epsilon$ is white noise. You could also define the regression the other way, with $X$ as the dependent variable.
Assume you estimate the parameters $(\hat{a},\hat{b})$ using OLS. Then you obtain the projection
$$\hat{Y}=\hat{a}+\hat{b}X.$$
This gives the linear relationship that best (in the sense of lowest mean square distance) describes the dependency of $Y$ on $X$. You can essentially always run this regression, and it is well defined as long as the relevant moments exist. Note that this does not require normality of residuals or the Gauss-Markov assumptions; those only give the regression some additional "nice" properties. Of course, this regression might not be the optimal way to model the dependency between $X$ and $Y$, nor might OLS be the optimal way to estimate it, but in that case neither is the correlation coefficient an optimal measure of this dependency.
The coefficient of determination is
$$R^2\equiv\frac{Var(\hat{Y})}{Var(Y)}.$$
Now it turns out that $R^2=\rho^2$.
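One way to see this, using the standard expression for the OLS slope (notation as above, with $\sigma_X$ and $\sigma_Y$ the standard deviations):
$$\hat{b}=\rho\frac{\sigma_Y}{\sigma_X} \quad\Longrightarrow\quad Var(\hat{Y})=\hat{b}^2\,Var(X)=\rho^2\frac{\sigma_Y^2}{\sigma_X^2}\,\sigma_X^2=\rho^2\,Var(Y),$$
so $R^2 = Var(\hat{Y})/Var(Y) = \rho^2$.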
$R^2$ is defined in the context of a regression/projection. This tells you how much of the variance of $Y$ the projection explains. Applying the concept in some other context would be either misusing or extending the original meaning.
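A quick numerical check in R (a minimal sketch with simulated data):

set.seed(1)
x <- rnorm(1000)
y <- 2 + 0.5 * x + rnorm(1000)   # Y = a + bX + noise

fit <- lm(y ~ x)
summary(fit)$r.squared   # R^2 from the regression
cor(x, y)^2              # squared correlation: the same number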
Best Answer
Yes, something is off here. A least-squares regression line always passes through the middle of your data (the point formed by the mean of your x and the mean of your y lies on the line), so your line seems too high compared with the rest of your data.
The correlation determines the slope of your regression line only up to scale (m = r*sy/sx, where sy and sx are your standard deviations for y and x, respectively), so you can't tell the correlation just by looking at how steep the line is. Consider the data (1, 0.001), (2, 0.002), (3, 0.003), ...: the best-fit line y = 0.001x is nearly horizontal (slope 0.001), yet the correlation is perfect (r = 1). A shallow slope just means y changes little per unit of x, not that the relationship is weak.
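Both points are easy to verify in R (a minimal sketch with made-up numbers):

x <- 1:10
y <- 0.001 * x   # perfectly linear, but nearly flat

fit <- lm(y ~ x)
cor(x, y)        # 1: perfect correlation
coef(fit)["x"]   # slope is only 0.001

# the fitted line passes through (mean(x), mean(y))
predict(fit, data.frame(x = mean(x)))   # equals mean(y)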
I would run your regression again and make sure you are including all the points. If you have a number of repeated (overplotted) points above the line it's possible that the fit is correct, but I doubt that's the case.