Non-parametric measure of strength of association between an ordinal and a continuous random variable

Tags: association-measure, correlation, nonparametric, ordinal-data

I'm posting the problem here as I received it.

I have two random variables. One is continuous (Y); the other is discrete and will be treated as ordinal (X). Below is the plot I received together with the query.

[Image: plot of Y against X, not reproduced here]

The person who sent me the data wants to measure the strength of the association between X and Y. I'm looking for ideas that do not come front-loaded with assumptions about the process that generated the data. Note that this is not about finding a non-parametric way to test the strength of the relationship (as with the bootstrap) but about finding a non-parametric way to measure it.

On the other hand, efficiency is not an issue, since there are a lot of data points.

Best Answer

By definition, an ordinal scale is a gauge on which the true distances between the notches 1 2 3 4 are unknown. It is like looking at a ruler while under the influence of drugs or alcohol: the true distances could be anything. The notches could be evenly spaced, or unevenly spaced in any way, as long as their order is preserved. We cannot compute a statistic such as a correlation unless we decide on the distances and fix them.

One line of reasoning is as follows. Since our measuring scale, the gauge, is distorted in an unknown monotonic way, we cannot trust the data values themselves; only the order of their magnitudes is trustworthy. So, without racking our brains any further, we declare the order to be the value. That is, we replace the observed distribution with a uniform one, the ranks. After that we may compute an association coefficient, say Pearson $r$; computed on ranks, it is Spearman $\rho$, as we know. Pearson $r$ measures the strength of linear association. Ranking the variables is a trick to linearize that portion of the monotonic relationship that is due to the marginal distributions not having been uniform initially. Thus, Spearman $\rho$ measures the part of the monotonicity in the relationship that turns into linearity once the marginal distributions are made uniform. In the OP's question, only one of the two variables is ordinal (the other is continuous), so in general there is no need to rank both variables. One may rank just the ordinal one and then compute $r$.
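The two variants described above (rank both variables, or rank only the ordinal one) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the small data set are made up for the example, and ties are given average ranks, which is one common convention rather than something the answer specifies.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson_r(a, b):
    """Plain Pearson correlation (undefined if either input is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)

def spearman_rho(x, y):
    """Pearson r computed on the ranks of both variables."""
    return pearson_r(average_ranks(x), average_ranks(y))

def r_rank_ordinal_only(x, y):
    """Rank only the ordinal X; keep the continuous Y on its own scale."""
    return pearson_r(average_ranks(x), y)

# Hypothetical data: X is ordinal with four levels, Y is continuous.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [0.3, 1.1, 0.9, 2.4, 2.0, 3.9, 3.5, 8.0]
print("Spearman rho:", spearman_rho(x, y))
print("r with only X ranked:", r_rank_ordinal_only(x, y))
```

The second coefficient stays sensitive to the shape of Y's distribution (for example, its outliers), which is the price of not ranking Y.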

Another approach, an alternative to ranking (uniforming), is optimal scaling of the ordinal variable. Optimal scaling is an iterative procedure whose goal is to find the distances on the ordinal scale - i.e. a monotonic transformation of it - for which the linear $r$ between the variables is as large as possible. While the ranking approach rests on the premise "the true scale is the one on which the data are uniformly distributed", the optimal scaling approach rests on the premise "the true scale is the one on which the data have maximal linear $r$". Optimal scaling can be done with categorical regression (CATREG). However, categorical regression requires the other input variable to be discrete as well (not necessarily ordinal), so if it is continuous with many unique values, you will have to bin it arbitrarily.
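For the special case of one ordinal X and one continuous Y, the idea can be sketched without binning Y. This is not CATREG; it swaps in isotonic regression of Y on X (pool-adjacent-violators on the per-level means), which, as far as I can tell, gives the non-decreasing quantification of X with the largest linear $r$ whenever the fitted scores are not all equal. The function names and data are hypothetical, and the sketch assumes an increasing relationship (negate Y to handle a decreasing one).

```python
from math import sqrt

def pearson_r(a, b):
    """Plain Pearson correlation (undefined if either input is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)

def monotone_scores(x, y):
    """Map each ordinal level to a score by isotonic regression of Y on X.

    Assumption: the relationship is non-decreasing. Adjacent levels whose
    mean Y would break the order are pooled and share one score.
    """
    blocks = []  # each block: [sum of y, count, levels pooled into it]
    for level in sorted(set(x)):
        ys = [v for u, v in zip(x, y) if u == level]
        blocks.append([sum(ys), len(ys), [level]])
        # Pool backwards while the previous block's mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n, levels = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] += levels
    return {level: s / n for s, n, levels in blocks for level in levels}

# Hypothetical data: the mean of Y dips at level 3, so levels 2 and 3 get pooled.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [0.3, 1.1, 2.0, 3.9, 0.9, 2.4, 3.5, 8.0]

scores = monotone_scores(x, y)
x_scaled = [scores[level] for level in x]
print("scores per level:", scores)
print("r with raw codes 1..4:", pearson_r(x, y))
print("r with optimal scores:", pearson_r(x_scaled, y))
```

The resulting $r$ is optimistic by construction, since the scores are tuned to the same data; with many data points and few levels, as in the question, that bias should be small.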

There are other approaches as well. But in every case we transform the ordinal scale monotonically "so as to..." (meet some assumption or some goal), because the ordinal scale is distorted for us in an unknown way. A radically different decision would be to "sober up" first and decide that the scale is either not distorted (i.e. it is interval), or distorted in a known way (it has unequal but known intervals), or is nominal.

Some asymmetric approaches include ordinal regression of the ordinal variable on the other (interval/continuous) one, or linear regression of the latter on the ordinal one, with the predictor entered as polynomial contrasts (that is, as $b_1X + b_2X^2 + b_3X^3 + \dots$). The weakness of these approaches is that they are asymmetric: one variable is dependent, the other is independent.
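As an illustration of the polynomial-contrast regression: if the polynomial is taken to its full degree (k - 1 for an X with k levels), the fitted values are simply the per-level means of Y, so the model's $R^2$ equals the correlation ratio $\eta^2$. The sketch below computes that saturated case only; the function name and data are made up, and stopping at a lower degree would give a smaller or equal $R^2$.

```python
def correlation_ratio(x, y):
    """eta^2: between-level sum of squares of Y divided by its total sum of squares.

    Equals R^2 of regressing Y on the full set of polynomial contrasts of X.
    Assumes Y is not constant.
    """
    n = len(y)
    grand_mean = sum(y) / n
    groups = {}
    for level, value in zip(x, y):
        groups.setdefault(level, []).append(value)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_total = sum((v - grand_mean) ** 2 for v in y)
    return ss_between / ss_total

# Hypothetical data: X is ordinal with four levels, Y is continuous.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [0.3, 1.1, 0.9, 2.4, 2.0, 3.9, 3.5, 8.0]
print("eta squared:", correlation_ratio(x, y))
```

Note that the saturated model ignores the order of the levels altogether, which amounts to treating X as nominal; the order only matters once the polynomial is truncated.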
