Correlation – Best Programmatic Methods for Determining Relationships Between Two Variables

Tags: correlation, data visualization, linear, nonlinear, python

What is the best programmatic way to determine whether two predictor variables are linearly related, non-linearly related, or not related at all, perhaps using scipy, statsmodels, or another Python package?

I know about approaches such as plotting and checking by eye. But I am looking for a programmatic technique that is almost certain to distinguish whether a bivariate plot would show a linear relationship, a non-linear relationship, or no relationship at all.

I have heard of KL divergence somewhere. I do not understand the concept in depth, and I am not sure whether it can really be applied to this sort of problem.

Best Answer

It is very difficult to achieve what you want programmatically because there are so many different forms of nonlinear associations. Even looking at correlation or regression coefficients will not really help. It is always good to refer back to Anscombe's quartet when thinking about problems like this:

Obviously the association between the two variables is completely different in each plot, but all four have essentially the same correlation coefficient (about 0.816).
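To make the point about Anscombe's quartet concrete, here is a minimal sketch that computes the Pearson correlation for each of the four datasets. The values are the standard published quartet; the variable names are only illustrative.

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Pearson correlation for each dataset.
correlations = {
    name: float(np.corrcoef(x, y)[0, 1]) for name, (x, y) in quartet.items()
}

for name, r in correlations.items():
    print(f"dataset {name}: r = {r:.3f}")
```

All four coefficients agree to three decimal places, even though the scatter plots look nothing alike, which is why a single summary number cannot tell you the shape of the relationship.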

If you know a priori what the possible non-linear relations could be, then you could fit a series of nonlinear models and compare the goodness of fit. But if you don't know what the possible non-linear relations could be, then I can't see how it can be done robustly without visually inspecting the data. Cubic splines could be one possibility, but they may not cope well with logarithmic, exponential and sinusoidal associations, and could be prone to overfitting. EDIT: After some further thought, another approach would be to fit a generalised additive model (GAM), which would provide good insight for many nonlinear associations, but probably not sinusoidal ones.
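As a rough illustration of the "fit a series of models and compare goodness of fit" idea, the sketch below fits three candidate shapes by least squares and ranks them by AIC. Everything here is an assumption for illustration: the data are synthetic (generated from a logarithmic relation), the candidate list is arbitrary, and AIC is only one of several possible fit criteria. It covers only the case where the candidate forms are known in advance.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): the true relation is logarithmic plus noise.
x = rng.uniform(1, 50, size=200)
y = 2.0 + 3.0 * np.log(x) + rng.normal(scale=0.3, size=x.size)

# Candidate models, each linear in its parameters, expressed as design matrices.
candidates = {
    "linear": np.column_stack([np.ones_like(x), x]),
    "quadratic": np.column_stack([np.ones_like(x), x, x**2]),
    "logarithmic": np.column_stack([np.ones_like(x), np.log(x)]),
}

def aic(design, response):
    """Gaussian AIC (up to an additive constant) for a least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    rss = float(np.sum((response - design @ beta) ** 2))
    n, k = design.shape
    return n * np.log(rss / n) + 2 * k

scores = {name: aic(design, y) for name, design in candidates.items()}
best = min(scores, key=scores.get)  # lower AIC is better

for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} AIC = {score:8.1f}")
print("preferred model:", best)
```

This only works because the true form happens to be among the candidates; if it were sinusoidal, say, none of the three would fit well and the ranking would be misleading.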

Truly, the best way to do what you want is visually. We can see instantly what the relations are in the plots above, but any programmatic approach such as regression is bound to have situations where it fails miserably.

So my suggestion, if you really need to do this, is to use a classifier based on the image of the bivariate plot:

1. Create a dataset using randomly generated data for one variable, drawn from a randomly chosen distribution.
2. Generate the other variable with a linear association (with random slope) and add some random noise. Then choose a nonlinear association at random and create a new set of values for the other variable. You may want to include purely random associations in this group.
3. Create two bivariate plots, one linear and the other nonlinear, from the data simulated in 1) and 2). Normalise the data first.
4. Repeat the above steps millions of times, or as many times as your time scale will allow.
5. Create a classifier, then train, test and validate it to classify linear vs nonlinear images.
6. For your actual use case, if your sample size differs from that of your simulated data, then sample or re-sample to obtain the same size. Normalise the data, create the image and apply the classifier to it.
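The data-generation part of the steps above could be sketched roughly as follows. This is only an illustration under several assumptions that the answer does not specify: the menu of distributions and nonlinear shapes, the noise level, and the use of a 32x32 2D histogram as a stand-in for a rendered scatter plot are all hypothetical choices. Purely random associations are left out, and training the classifier itself is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical menu of nonlinear shapes (not specified in the answer).
NONLINEAR = [np.square, np.sin, np.exp, lambda v: np.log(np.abs(v) + 1e-3)]

def add_noise(y, level=0.2):
    """Add Gaussian noise scaled to the spread of y."""
    return y + rng.normal(scale=level * y.std() + 1e-6, size=y.size)

def simulate(n):
    """Steps 1-2: one x sample, a linear y and a nonlinear y built from it."""
    x = rng.normal(size=n) if rng.random() < 0.5 else rng.uniform(-2, 2, size=n)
    y_linear = add_noise(rng.uniform(-3, 3) * x)
    shape = NONLINEAR[rng.integers(len(NONLINEAR))]
    y_nonlinear = add_noise(shape(x))
    return x, y_linear, y_nonlinear

def to_image(x, y, bins=32):
    """Step 3: normalise, then rasterise the scatter as a 2D histogram."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    img, _, _ = np.histogram2d(x, y, bins=bins, range=[[-3, 3], [-3, 3]])
    return img / img.max()

def build_dataset(n_pairs, n_points=200):
    """Step 4: repeat; label 0 = linear image, 1 = nonlinear image."""
    images, labels = [], []
    for _ in range(n_pairs):
        x, y_linear, y_nonlinear = simulate(n_points)
        images += [to_image(x, y_linear), to_image(x, y_nonlinear)]
        labels += [0, 1]
    return np.stack(images), np.array(labels)

X, labels = build_dataset(n_pairs=10)
print(X.shape, labels)
```

The resulting array of images and labels is what a classifier in step 5 would be trained on; a real run would need far more than ten pairs.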

I realise that this is probably not the kind of answer you want, but I cannot think of a robust way to do this with regression or another model-based approach.

EDIT: I hope no one is taking this too seriously. My point here is that, in a situation with bivariate data, we should always plot the data. Trying to do anything programmatically, whether it is a GAM, cubic splines or a vast machine learning approach, is basically allowing the analyst to not think, which is a very dangerous thing.

Please always plot your data.

Related Solutions

Multicollinearity Testing – How to Test for Multicollinearity Among Non-linearly Related Independent Variables

Multicollinearity is all about the linear relationship among your independent/explanatory/right-hand-side/x-variables. That you want to use those variables in a non-linear model does not matter. The logic is that if you want to add both variables to your model, then you have to be able to distinguish between a unit change in one variable and a unit change in the other. If the variables are linearly related, then a unit change in one coincides with a change of $k$ units in the other, where $k$ is some constant, so we cannot determine the separate effects of the two variables. If the relationship is non-linear, a unit change in one variable coincides with a varying number of units of change in the other, so we are able to distinguish between the variables. So if you have determined graphically that there is a relationship but that it is non-linear, then that fact alone has already solved most of your problems.

Consider the following example: if we add a quadratic curve, that is, we add a variable $x$ and a variable $x^2$ to our model, then the relationship between the variables $x$ and $x^2$ is extremely strong. Still we can estimate that model. The reason is that that relationship is non-linear.

I find it informative to see a situation where this can break. Suppose we have a study where we want to consider year of birth, which ranges between 1950 and 1990. If we just add that and its square, then we might get into trouble, as the relationship between birthyear and birthyear$^2$ is almost linear over that range, as you can see below. You can solve this by centering at a meaningful value within the range of your data, e.g. 1960. As you can see in the second graph, the relationship is now non-linear, and that is usually enough to solve the problem.

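[Figure: two panels plotting the squared term against x over 1950 to 1990, titled "uncentered" ($x^2$) and "centered" ($(x-1960)^2$)]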

I created that graph with Stata using the following code:

twoway function xsquare = x^2, range(1950 1990) ///
    name(a,replace) title(uncentered) ytitle("x{sup:2}")
twoway function xsquare = (x-1960)^2, range(1950 1990) ///
    name(b, replace) title(centered) ytitle("(x-1960){sup:2}")
graph combine a b, ysize(3)
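For readers working in Python, a small numerical check of the same point: the sketch below compares the correlation between birth year and its square before and after centering at 1960. The variable names are illustrative, and the years are taken as the integers 1950 to 1990, which is an assumption about how the variable is recorded.

```python
import numpy as np

# Birth years 1950..1990 inclusive (assumed to be recorded as whole years).
year = np.arange(1950, 1991, dtype=float)

# Uncentered: year and year^2 are almost perfectly linearly related.
r_raw = np.corrcoef(year, year**2)[0, 1]

# Centered at 1960: the squared term now curves visibly over the data range.
centered = year - 1960
r_centered = np.corrcoef(centered, centered**2)[0, 1]

print(f"uncentered: r = {r_raw:.6f}")
print(f"centered at 1960: r = {r_centered:.3f}")
```

The uncentered correlation is indistinguishable from 1 for practical purposes, while centering at 1960 brings it down noticeably. Centering at the midpoint of the range (1970) would make it zero by symmetry, but the answer's point is that any meaningful value inside the range is usually enough.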

Solved – Does zero correlation always imply that the two variables are not related, even in a smaller sample

You're talking about the possibility here of a missing explanatory variable (in your scenario, one that interacts with porn usage in its effect on erectile dysfunction).

It's not necessary to have an interaction for a problem like this to arise; you can get it as long as the ignored variable is not evenly distributed across the variables you do have (see Simpson's paradox, especially the diagram which shows that correlation can even have entirely the wrong sign if you ignore such a variable, even if the direction of the relationship is the same within both groups!).

Clearly, if you ignore an important variable, that can make an important relationship look like no relationship, or vice versa (or worse, give the correlation the opposite sign to the actual conditional relationship). Or there may be another variable that causes both of the ones you observe to change together, even though they are not directly connected.
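A small simulation can show the sign reversal described under Simpson's paradox. The sketch below uses entirely synthetic data with a hypothetical two-level grouping variable: within each group the relationship is positive, but pooling the groups while ignoring the grouping yields a negative correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Group A: low x, high baseline y; positive slope within the group.
x_a = rng.uniform(0, 4, size=n)
y_a = 10 + x_a + rng.normal(scale=0.5, size=n)

# Group B: high x, low baseline y; same positive slope within the group.
x_b = rng.uniform(6, 10, size=n)
y_b = (x_b - 6) + rng.normal(scale=0.5, size=n)

r_a = np.corrcoef(x_a, y_a)[0, 1]
r_b = np.corrcoef(x_b, y_b)[0, 1]

# Pooled correlation, ignoring the grouping variable.
r_pooled = np.corrcoef(np.concatenate([x_a, x_b]),
                       np.concatenate([y_a, y_b]))[0, 1]

print(f"group A: r = {r_a:.2f}")
print(f"group B: r = {r_b:.2f}")
print(f"pooled:  r = {r_pooled:.2f}")
```

The reversal arises only because the omitted grouping variable is unevenly distributed over x, which matches the condition stated above.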

There's also the issue that it's possible to have dependence that is not linear* -- two things may be strongly related but uncorrelated (e.g. what if erectile dysfunction is highest if you watch no porn and if you watch a lot of porn, and lowest in between? That could show up as zero correlation, but would not imply that you are safe with all levels of porn activity).

*(See the bottom row of the first diagram there. Edit: the image also appears in pzsolt's answer.)
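The "strongly related but uncorrelated" case is easy to reproduce. In the minimal sketch below, y is a deterministic U-shaped function of x on a symmetric range, which is an idealised stand-in for the "highest at both extremes, lowest in between" scenario; the variable names are illustrative.

```python
import numpy as np

# x symmetric around zero; y depends on x exactly, with a U shape.
x = np.linspace(-1, 1, 201)
y = x**2

# Pearson correlation between x and y is zero up to floating-point error.
r_xy = np.corrcoef(x, y)[0, 1]

# The dependence shows up once the right transformation is used.
r_abs = np.corrcoef(np.abs(x), y)[0, 1]

print(f"corr(x, y)   = {r_xy:.2e}")
print(f"corr(|x|, y) = {r_abs:.3f}")
```

Here y is completely determined by x, yet the linear correlation carries no trace of it, so a zero coefficient on its own says nothing about whether the variables are related.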

There are a variety of other ways in which you might have what looks like zero correlation not implying "no relationship" --- or conversely, there are a number of situations where there appears to be a relationship but there's really no causal connection at all.

There's much more that could be said -- for example, I haven't even touched on spurious regression yet (though it's related to some of the things I have mentioned).
