Correlation – Does a Statistically Significant Correlation Always Give Predictive Power?

anomaly detection, correlation, p-value

Suppose you're trying to predict anomalies. That is, consider the case where you have a data set that has a column called result. Suppose the data set has 365 rows and result has a value of 1 in only 12 of those rows and 0 in the other rows.

Now suppose you have another column in the data set called val1. Further suppose that the p-value of the correlation between result and val1 is small (say < 0.05). Note that I'm measuring this using the R cor.test method.

Does this imply that we should be able to predict somewhat accurately the value of result given the value of val1?

I naively assumed it did and used logistic regression to predict but got very bad F1 results. (Basically the logistic regression model always predicted 0 for result and thus there were no true positives.)

Best Answer

In simple regression, the p-value of the coefficient is equivalent to a hypothesis test about the correlation, with the null hypothesis being that the correlation equals zero. Having a statistically significant correlation just means that we have a small p-value, and a very small p-value means that the data give strong evidence that the correlation is not zero. However, please notice that strong evidence that the correlation differs from zero doesn't tell us anything about how large the correlation is, and it can be very small.

A very small p-value with a small correlation just tells us that we can be confident that our independent variable explains only a small part of the variance of our response (in simple linear regression that share is the squared correlation), and therefore it has very little predictive power.

In summary, it's possible to get a correlation that is both statistically significant and very small. This is not just possible but quite common when we have large samples.
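
To put rough numbers on this, here is a minimal sketch (not part of the original answer) that computes the smallest correlation that would reach significance for a given number of rows. The function name min_significant_r is made up for illustration, and the sketch assumes a two-sided Pearson test, using a normal approximation in place of the exact t distribution that cor.test uses.

```python
import math
from statistics import NormalDist


def min_significant_r(n, alpha=0.05):
    """Smallest |r| whose two-sided p-value falls below alpha with n rows.

    The test statistic is t = r * sqrt(n - 2) / sqrt(1 - r^2).
    Solving t >= z for r gives r^2 >= z^2 / (z^2 + n - 2).
    Assumption: z comes from the normal distribution, which is close to
    the exact t critical value when n is in the hundreds or more.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / math.sqrt(z * z + (n - 2))


# 365 is the number of rows in the question; the larger sizes are illustrative.
for n in (365, 10_000, 1_000_000):
    r = min_significant_r(n)
    print(f"n={n:>9,}  min |r| ~ {r:.4f}  share of variance ~ {r * r:.6f}")
```

With the 365 rows from the question, a correlation of roughly 0.10 is already enough for p < 0.05, and that corresponds to about 1% of the variance explained. The threshold keeps shrinking as the sample grows.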


Edit to add: This is just an instance of the quite common phenomenon of getting a result with high statistical significance but tiny practical significance, which often happens when the sample size is large.

For example, when doing a t-test to assess whether a drug reduces the probability of cancer, we might get a p-value of 0.00001 for the reduction being larger than zero while estimating that reduction in probability at 0.000000001%. We could be very confident that there is a reduction in the probability of cancer (based on our p-value), but for any practical purpose that reduction is so tiny that we can treat the drug as having no effect.

With correlation it's the same: a small p-value together with a small correlation makes us confident that the correlation exists but that it is small. However, sometimes a correlation is big enough to have practical meaning (the independent variable explains a sizeable part of the variance of the dependent variable) but not big enough to have predictive power.
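
The situation in the question can be reproduced with a small sketch. Everything about the data here is an assumption made for illustration: 353 normal rows whose val1 is roughly standard normal, and 12 anomalous rows whose val1 is shifted up by 0.8. The p-value uses a normal approximation rather than the exact t distribution, and the logistic regression is fitted with a hand-written Newton iteration rather than a library call.

```python
import math
from statistics import NormalDist, mean

N_NEG, N_POS = 353, 12  # 365 rows with 12 anomalies, as in the question
STD = NormalDist()

# Hypothetical, deterministic data (not the asker's): evenly spaced normal
# quantiles, with the anomalous rows shifted up by 0.8.
neg = [STD.inv_cdf((i + 0.5) / N_NEG) for i in range(N_NEG)]
pos = [STD.inv_cdf((i + 0.5) / N_POS) + 0.8 for i in range(N_POS)]
x = neg + pos
y = [0] * N_NEG + [1] * N_POS
n = len(x)

# Pearson correlation between val1 and result, with an approximate p-value.
mx, my = mean(x), mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)
t = r * math.sqrt((n - 2) / (1 - r * r))
p_value = 2 * (1 - STD.cdf(abs(t)))  # normal approximation to the t test


def sigmoid(u):
    return 1 / (1 + math.exp(-u))


# Logistic regression result ~ b0 + b1 * val1, fitted by Newton's method.
b0 = b1 = 0.0
for _ in range(50):
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        w = p * (1 - p)
        g0 += yi - p            # gradient of the log-likelihood
        g1 += (yi - p) * xi
        h00 += w                # negative Hessian (2 x 2, symmetric)
        h01 += w * xi
        h11 += w * xi * xi
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# Classify with the usual 0.5 cutoff and compute F1.
probs = [sigmoid(b0 + b1 * xi) for xi in x]
pred = [1 if p >= 0.5 else 0 for p in probs]
tp = sum(1 for q, yi in zip(pred, y) if q == 1 and yi == 1)
fp = sum(1 for q, yi in zip(pred, y) if q == 1 and yi == 0)
fn = sum(1 for q, yi in zip(pred, y) if q == 0 and yi == 1)
f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(f"r = {r:.3f}, approximate p-value = {p_value:.4f}")
print(f"largest fitted probability = {max(probs):.3f}")
print(f"rows predicted as 1 = {sum(pred)}, F1 = {f1:.2f}")
```

In this sketch the correlation is significant, yet no fitted probability reaches 0.5, so the model predicts 0 for every row and F1 is zero. Because only about 3% of the rows are anomalies, a weak predictor can only nudge the fitted probability a little above that base rate.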
