Solved – How to test the linearity between two non normal distributed variables

linear, pearson-r, regression, spearman-rho

I have two variables $(x_i, y_i), \; i = 1, \dots, 300$, and I would have liked to fit a linear regression to them, but as you can see in the scatterplot below, the linear trend is very weak.

As I need to justify this in an essay, I would like a measure of the amount of linearity. I used the Pearson product-moment correlation coefficient and obtained a value of -0.1559585. However, after testing the normality of the variables with a Shapiro-Wilk test, I found that the X values are not normally distributed, so I cannot use the Pearson coefficient for this. I read that I could compute Spearman's rank correlation coefficient instead, since the X values don't follow a normal distribution, but that coefficient estimates the monotonic association between X and Y, whereas I would like to quantify the linearity between X and Y. Do you know how I can compute a quantity that expresses this, please?

Thank you very much.

Edit: The qqplot of X is the following

Best Answer

Why are you even looking at the distribution of $X$? This has no effect on whether or not the relationship between $X$ and $Y$ is linear. But that aside, Pearson's correlation measures the strength of linear association, period. No distributional assumptions are needed. Just look at a scatterplot of the points (which, by the way, you haven't shown; you've provided a Q-Q plot) to see if the relationship is linear, and report the correlation.
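A minimal MATLAB sketch of this point, using made-up data rather than the asker's: `x` is drawn from a strongly skewed (exponential) distribution, `y` is a hypothetical linear function of `x` plus noise, and the seed, slope, and noise level are arbitrary assumptions. The squared Pearson correlation coincides with the $R^2$ of the least-squares line, so it can be read as the fraction of the variance of `y` that a straight line in `x` accounts for, whatever the distribution of `x`.

```matlab
rng(1);                              % arbitrary fixed seed for reproducibility
n = 300;                             % same sample size as in the question
x = -log(rand(n,1));                 % exponential(1) draws: clearly non-normal
y = 2 - 0.5*x + 0.3*randn(n,1);      % hypothetical linear relation plus noise

R = corrcoef(x, y);                  % 2-by-2 correlation matrix
r = R(1,2);                          % Pearson correlation between x and y

% R^2 of the least-squares line y = b0 + b1*x
b    = polyfit(x, y, 1);             % b(1) = slope, b(2) = intercept
yhat = polyval(b, x);                % fitted values
R2   = 1 - sum((y - yhat).^2) / sum((y - mean(y)).^2);

fprintf('r = %.4f, r^2 = %.4f, R^2 = %.4f\n', r, r^2, R2);
```

Applied to the value reported in the question, $r = -0.1559585$ gives $r^2 \approx 0.024$, so a straight line accounts for roughly 2.4% of the variance of $Y$.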

Also, goodness-of-fit tests will almost always reject the null hypothesis once the sample size is reasonably large, because even tiny departures from the hypothesized distribution become detectable, so they shouldn't be relied on too much.
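The sample-size effect can be sketched with a small simulation. The data below are an assumption for illustration: a standard normal variable plus a small quadratic term (`a = 0.05`), which makes it only slightly skewed. The Shapiro-Wilk test used in the question is not part of base MATLAB, so this sketch swaps in a hand-computed Jarque-Bera statistic with its asymptotic chi-square(2) p-value, which is only a rough approximation for small `n`.

```matlab
rng(2);                                   % arbitrary fixed seed
a  = 0.05;                                % hypothetical, small departure from normality
ns = [50 500 5000 50000 500000];          % sample sizes to compare

for n = ns
    z = randn(n,1);
    x = z + a*z.^2;                       % nearly normal, slightly right-skewed
    d = x - mean(x);                      % centered data
    s = sqrt(mean(d.^2));                 % standard deviation (1/n version)
    S = mean(d.^3) / s^3;                 % sample skewness
    K = mean(d.^4) / s^4;                 % sample kurtosis
    JB = n/6 * (S^2 + (K - 3)^2/4);       % Jarque-Bera statistic
    pval = exp(-JB/2);                    % chi-square(2) upper tail probability
    fprintf('n = %6d   JB = %10.2f   p = %.3g\n', n, JB, pval);
end
```

The departure from normality is the same in every row; only `n` changes, and the p-value typically shrinks toward zero as `n` grows.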

Matlab code

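n = 36;      % number of observations per variable
p = 1000;    % number of independent normal variables

X = randn(n,p);                         % n-by-p matrix of i.i.d. N(0,1) draws
C = corr(X);                            % p-by-p sample correlation matrix
% Keep each pair once. A logical mask is used so that correlations that are
% exactly zero (possible with Spearman) are not dropped.
offDiagElements = C(triu(true(p),1));

figure
step = 0.01;
x = -1:step:1;
h = histc(offDiagElements, x);          % counts per bin on the grid x
stairs(x,h/sum(h)/step)                 % normalize counts to a density
hold on

% Exact density of r for independent normal variables
r = -1:0.01:1;
plot(r, 1/beta(1/2,(n-2)/2)*(1-r.^2).^((n-4)/2), 'r')

% Normal density with the same variance, for comparison
sigma2 = var(offDiagElements);
plot(r, 1/sqrt(sigma2*2*pi)*exp(-r.^2/(2*sigma2)), 'k--')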

Spearman's correlation coefficient

I am not aware of theoretical results about the distribution of the sample Spearman correlation. But in the simulation above it is very easy to replace the Pearson correlations with Spearman ones:

C = corr(X, 'type', 'Spearman');

and this does not seem to change the distribution at all.

Update: @Glen_b pointed out in chat that "the distribution can't be the same because the distribution for the Spearman is discrete while that for the Pearson is continuous". This is true and can be clearly seen with my code for smaller values of $n$. Curiously, if one uses histogram bins wide enough that the discreteness disappears, the histogram starts overlapping perfectly with the Pearson one. I am not sure how to formulate this relationship in a mathematically precise way.
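The discreteness can be seen exactly for small $n$. Assuming continuous data with no ties, every ranking of one variable relative to the other is equally likely under independence, so the null distribution of Spearman's coefficient can be enumerated over all $n!$ permutations. The sketch below (with `n = 5` chosen arbitrarily) counts the distinct attainable values and also checks one partial link to the Pearson case: both coefficients have mean 0 and variance $1/(n-1)$ under independence, which is consistent with the histograms overlapping once the bins are coarse.

```matlab
n = 5;                                   % small n so all n! rankings can be listed
P = perms(1:n);                          % every possible ranking of y (n! rows)

% Sum of squared rank differences against the ranking 1..n of x
d2  = sum((P - repmat(1:n, size(P,1), 1)).^2, 2);
rho = 1 - 6*d2 / (n*(n^2 - 1));          % Spearman's rho when there are no ties

% Distinct attainable values (rounded to absorb floating-point noise)
vals = unique(round(rho*1e12)/1e12);

fprintf('n = %d: %d distinct values of rho\n', n, numel(vals));
fprintf('mean = %.6f, variance = %.6f, 1/(n-1) = %.6f\n', ...
        mean(rho), mean(rho.^2), 1/(n-1));
```

Because the sum of squared rank differences is an even integer between 0 and $n(n^2-1)/3$, there can be at most $n(n^2-1)/6 + 1$ distinct values, whereas the Pearson coefficient of continuous data can take any value in $[-1, 1]$.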

Related Solutions

Solved – How to test for a monotonic relationship between two variables, without assuming a specific functional model

Solved – the distribution of sample correlation coefficients between two uncorrelated normal variables
