I can point you to fast feature selection based on R^2 values, i.e. on the squared Pearson correlation coefficient. The method was introduced for multidimensional MEG/EEG data, but it suits any binary classification problem. Put simply: it computes the correlation of each feature with the labels, sorts the resulting scores, and keeps only the best-scoring features.
I implemented the approach for MATLAB and LIBSVM; the code can be found on GitHub. In my implementation you can set the number of features to select, e.g. 100, 1000, or everything scoring above the mean.
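My implementation is MATLAB-based, but the idea is easy to sketch in Python. The function below is a hypothetical illustration of the ranking step, not my actual code: it assumes a samples-by-features matrix `X` and binary labels `y`, scores each feature by its squared Pearson correlation with the labels, and keeps either the top `k` features or everything above the mean score.

```python
import numpy as np

def select_features_by_r2(X, y, k=None):
    """Rank features by squared Pearson correlation with the labels.

    X : (n_samples, n_features) array, y : (n_samples,) binary labels.
    Returns indices of the k best features; if k is None, keeps every
    feature scoring above the mean R^2 (the "above the mean" option).
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of each column of X with y
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    r2 = r ** 2
    if k is None:
        return np.flatnonzero(r2 > r2.mean())
    return np.argsort(r2)[::-1][:k]

# Synthetic demo: 50 noise features, with feature 0 made informative
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 50))
X[:, 0] += 2 * y            # shift feature 0 by the class label
best = select_features_by_r2(X, y, k=5)
```

On data like this, the informative feature 0 should land among the selected indices; for real use you would of course fit your classifier only on the selected columns.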
I'm simplifying a little bit, but basically: the p-value you're probably referring to is the probability that, given a particular sample size, two random sets of numbers would show a correlation at least as strong (in magnitude) as the one you've observed.
Example: Say we roll a pair of dice 6 times. This generates 6 unique x,y points. If the order of the rolls doesn't matter, what are the chances that you'd end up with the following points:
1,1; 2,2; 3,3; 4,4; 5,5; 6,6?
Pretty low, right? This dataset has a correlation coefficient of 1 and an n of 6. If you look up the associated p-value, you'll find it's < 0.00001. In other words, there is less than a 0.001% chance that purely random dice rolls would line up this well.
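You can check both numbers yourself with scipy (a sketch, assuming scipy is available; the data are the six perfectly aligned points above):

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 6]

# Perfectly aligned points: r comes out exactly 1,
# and the two-tailed p-value is vanishingly small.
r, p = pearsonr(x, y)
```

With only 6 points, it's the perfect alignment, not the sample size, that drives the p-value down.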
Now for some actual dice rolls - here are numbers I generated randomly in Excel:
1,6; 2,3; 3,4; 3,2; 5,2; 6,4
The correlation coefficient of this data set is 0.1806. Again, the sample size is 6. The associated p-value for a two-tailed test is 0.3660225.
If these were data from an experiment, we'd say the association between x and y is weak enough to be adequately explained by random chance. More precisely: if there were no true relationship at all, we'd still see a correlation at least this strong about 36.6% of the time, so the (weak) observed correlation gives us no reason to believe in a real effect.
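A quick simulation makes the same point. The snippet below (a hypothetical illustration, not my Excel sheet) repeatedly rolls six x,y pairs with no true relationship, and counts how often the correlation test comes out "significant" at the 0.05 level:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
trials = 1000
significant = 0
for _ in range(trials):
    x = rng.integers(1, 7, size=6)   # six random die rolls for x
    y = rng.integers(1, 7, size=6)   # six independent rolls for y
    if np.std(x) == 0 or np.std(y) == 0:
        continue                      # skip degenerate all-equal samples
    r, p = pearsonr(x, y)
    if p < 0.05:
        significant += 1
```

Since there is no true association by construction, only a few percent of trials cross the 0.05 threshold - which is exactly the false-positive rate the p-value is supposed to control.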
A significant correlation does not necessarily imply causation. To establish a causal relationship, you must use a randomized experiment. That is one reason why tobacco companies were so difficult to prosecute in the 1960s - no scientists were "randomly assigning" humans to be smokers or non-smokers, then waiting to see who got lung cancer. The prosecution's primary evidence came from observational studies, with no randomized treatments. However, after decades of strong correlation between smoking and lung cancer, along with plausible medical explanations for how tobacco smoking harms the body, courts were finally convinced that smokers' lung cancer was caused by smoking and not something else. Statisticians are the same way - in the absence of experimental data, causation is only implied by strong, repeated, and prolonged correlations, usually with a plausible mechanism for how x might be acting on y.
This page might be helpful if you're still wrapping your head around correlation coefficients.
To learn more about one- vs. two-tailed tests, check this Wikipedia page.
Best Answer
In regression, the p-value of a coefficient is the result of a hypothesis test about the correlation, with the null hypothesis that the correlation equals zero. A statistically significant correlation just means we have a small p-value; and a very small p-value means we can be very confident that the correlation is not zero. However, notice that being sure the correlation differs from zero tells us nothing about how large it is - it can still be very small.
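The connection between the regression coefficient's p-value and the correlation test is easy to verify numerically. In simple linear regression, the t-test on the slope (null hypothesis: slope = 0) gives the same p-value as the test that the correlation is zero - a sketch with synthetic data, assuming scipy:

```python
import numpy as np
from scipy.stats import pearsonr, linregress

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.3 * x + rng.normal(size=50)   # weak true relationship plus noise

r, p_corr = pearsonr(x, y)          # test of H0: correlation == 0
fit = linregress(x, y)              # t-test of H0: slope == 0

# fit.pvalue and p_corr agree: in simple regression, testing
# "slope is zero" is the same test as "correlation is zero".
```

This is why talking about "the p-value of the coefficient" and "the p-value of the correlation" interchangeably is harmless in the one-predictor case.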
A very small p-value together with a small correlation just tells us we can be sure our independent variable explains only a small part of the variance of the response, and therefore has very little predictive power.
In summary, it's possible to get a correlation that is both statistically significant and very small. Not only is it possible - it's quite common when samples are large.
Edit to make an addition: This is just an instance of the common phenomenon of a result with great statistical significance but tiny practical significance, which often happens when the sample size is large.
For example, a t-test assessing whether a drug reduces the probability of cancer might give a p-value of 0.00001 for a reduction larger than zero while estimating that reduction at 0.000000001%. We could be very sure the drug reduces the probability of cancer (based on our p-value), yet for any practical purpose the reduction is so tiny that the drug effectively has no effect.
With correlation it's the same: a small p-value and a small correlation make us sure the correlation exists, but also that it is small. However, sometimes a correlation is big enough to have practical meaning (the independent variable explains a sizeable part of the variance of the dependent variable) yet still not big enough to have real predictive power.
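The "significant but tiny" situation is easy to manufacture: take a huge sample with a barely-there relationship. A sketch (hypothetical numbers, assuming numpy/scipy):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
n = 1_000_000
x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)   # true correlation is about 0.01

r, p = pearsonr(x, y)
# r is tiny (around 0.01, so x explains roughly 0.01% of the
# variance of y), yet p is far below 0.05 because n is enormous.
```

With a million observations, even a correlation this negligible is detected with near certainty - statistical significance, essentially zero practical significance.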