I'm trying to find a test to establish the correlation between the value of a substance measured inside the skin (continuous, e.g. 0,65 or 1,15) and the bacterial load found on the skin (ordinal: Negative, load 1, load 2, load 3, load 4).
Which test is most suited to do this?
Solved – Correlation between ordinal and continuous data
continuous data, correlation, ordinal-data
Related Solutions
I am a little confused; your title says "correlation" but your post refers to t-tests. A t-test is a test of central location; more specifically, it asks whether the mean of one set of data differs from the mean of another. Correlation, on the other hand, shows the relationship between two variables. There are a variety of correlation measures; it seems that point-biserial correlation is appropriate in your case.
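As a rough illustration of the point-biserial correlation mentioned above: it is simply Pearson's r computed with the binary variable coded 0/1. The sketch below uses made-up cholesterol values and a hypothetical 0/1 group indicator; the names `pearson` and `point_biserial` are mine, not from any library.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def point_biserial(binary, values):
    """Textbook form: (M1 - M0) / s_n * sqrt(p * q), with s_n the population SD."""
    n = len(values)
    ones = [v for b, v in zip(binary, values) if b == 1]
    zeros = [v for b, v in zip(binary, values) if b == 0]
    mean_all = sum(values) / n
    s_n = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    p = len(ones) / n
    q = 1 - p
    return (sum(ones) / len(ones) - sum(zeros) / len(zeros)) / s_n * sqrt(p * q)

# Hypothetical data: group membership (0/1) and a continuous measurement.
group = [0, 0, 0, 0, 1, 1, 1, 1]
cholesterol = [180, 190, 200, 210, 220, 230, 240, 250]

r_pearson = pearson(group, cholesterol)
r_pb = point_biserial(group, cholesterol)
print(round(r_pearson, 4), round(r_pb, 4))
```

Both routes give the same number, which is the sense in which point-biserial is "just" a Pearson correlation with one dichotomous variable.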
You are correct that a t-test assumes normality; however, the tests of normality are likely to give significant results even for trivial non-normalities with an N of 4000. T-tests are fairly robust to modest deviations from normality if the variances of the two sets of data are roughly equal and the sample sizes roughly equal. But a nonparametric test is more robust to outliers and most of them have power almost as high as the t-test, even if the distributions are normal.
However, in your example, you use "cholesterol" as being risky or not-risky. This is almost certainly a bad idea. Dichotomizing a continuous variable invokes magical thinking. It says that, at some point, cholesterol goes from "not risky" to "risky". Suppose you used 200 as your cutoff - then you are saying that someone with cholesterol of 201 is just like someone with 400, and someone with 199 is just like someone with 100. This does not make sense.
Nominal vs Interval
The most classic "correlation" measure between a nominal and an interval ("numeric") variable is Eta, also called correlation ratio, and equal to the root R-square of the one-way ANOVA (with p-value = that of the ANOVA). Eta can be seen as a symmetric association measure, like correlation, because Eta of ANOVA (with the nominal as independent, numeric as dependent) is equal to Pillai's trace of multivariate regression (with the numeric as independent, set of dummy variables corresponding to the nominal as dependent).
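To make the definition concrete, here is a minimal sketch of Eta as the square root of SS_between / SS_total from a one-way ANOVA. The data are invented and `correlation_ratio` is a hypothetical helper name; it does not compute the ANOVA p-value.

```python
from math import sqrt

def correlation_ratio(groups, values):
    """Eta: sqrt(SS_between / SS_total), i.e. the root R-square of a one-way ANOVA."""
    n = len(values)
    grand_mean = sum(values) / n
    # Collect the numeric values belonging to each nominal category.
    by_group = {}
    for g, v in zip(groups, values):
        by_group.setdefault(g, []).append(v)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return sqrt(ss_between / ss_total)

# Hypothetical data: three nominal categories and one numeric measurement.
groups = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

eta = correlation_ratio(groups, values)
print(round(eta, 4))
```

Note that the function divides by SS_total, so it will fail if every numeric value is identical; a real analysis would guard against that case.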
A more subtle measure is the intraclass correlation coefficient (ICC). Whereas Eta captures only the difference between groups (defined by the nominal variable) with respect to the numeric variable, ICC simultaneously also measures the coordination or agreement between numeric values inside groups; in other words, ICC (particularly the original unbiased "pairing" ICC version) stays on the level of values while Eta operates on the level of statistics (group means vs group variances).
Nominal vs Ordinal
The question of a "correlation" measure between a nominal and an ordinal variable is less straightforward. The reason for the difficulty is that the ordinal scale is, by its nature, more "mystic" or "twisted" than interval or nominal scales. No wonder that statistical analyses designed specifically for ordinal data are relatively poorly developed so far.
One way might be to convert your ordinal data into ranks and then compute Eta as if the ranks were interval data. The p-value of such Eta = that of Kruskal-Wallis analysis. This approach seems warranted due to the same reasoning as why Spearman rho is used to correlate two ordinal variables. That logic is "when you don't know the interval widths on the scale, cut the Gordian knot by linearizing any possible monotonicity: go rank the data".
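The "rank, then compute Eta" idea can be sketched as follows. The link to Kruskal-Wallis is that the tie-corrected H statistic equals (N - 1) times Eta-squared computed on the ranks. The data and the helper names `average_ranks` and `eta_squared` are assumptions for illustration only.

```python
from math import sqrt

def average_ranks(xs):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def eta_squared(groups, values):
    """SS_between / SS_total for a numeric variable split by a nominal one."""
    grand_mean = sum(values) / len(values)
    by_group = {}
    for g, v in zip(groups, values):
        by_group.setdefault(g, []).append(v)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return ss_between / ss_total

# Hypothetical data: a nominal group and an ordinal score coded 0..4.
groups = ["x"] * 4 + ["y"] * 4 + ["z"] * 4
scores = [0, 1, 1, 2, 1, 2, 3, 3, 2, 3, 4, 4]

ranks = average_ranks(scores)
eta2 = eta_squared(groups, ranks)
eta = sqrt(eta2)
h = (len(scores) - 1) * eta2  # Kruskal-Wallis H, tie-corrected form
print(round(eta, 4), round(h, 4))
```

For the p-value, H is conventionally referred to a chi-square distribution with (number of groups - 1) degrees of freedom, which is an approximation for small samples.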
Another approach (possibly more rigorous and flexible) would be to use ordinal logistic regression with the ordinal variable as the DV and the nominal one as the IV. The square root of Nagelkerke’s pseudo R-square (with the regression's p-value) is another correlation measure for you. Note that you can experiment with various link functions in ordinal regression. This association is, however, not symmetric: the nominal is assumed independent.
Yet another approach might be to find a monotonic transformation of the ordinal data into interval data (instead of the ranking described two paragraphs above) that would maximize R (i.e. Eta). This is categorical regression (= linear regression with optimal scaling).
Still another approach is to build a classification tree, such as CHAID, with the ordinal variable as predictor. This procedure bins together adjacent ordered categories that do not distinguish among categories of the nominal predictand (hence it is the opposite of the previous approach). You could then rely on chi-square-based association measures (such as Cramér's V), as if you were correlating nominal vs nominal variables.
And @Michael in his comment suggests yet one more way - a special coefficient called Freeman's Theta.
So, we have arrived so far at these options: (1) Rank, then compute Eta; (2) Use ordinal regression; (3) Use categorical regression ("optimally" transforming the ordinal variable into interval); (4) Use a classification tree ("optimally" reducing the number of ordered categories); (5) Use Freeman's Theta.
Best Answer
My suggestion is to use Spearman's rank-order correlation. The continuous variable is re-expressed as ranks (each observation gets its rank relative to the rest of the observations in the sample), so that it becomes comparable to the ranks of the ordinal variable. However, make sure to code the ordinal variable numerically and in the correct order, for example 0, 1, 2, 3, 4 for Negative, load 1, load 2, load 3, load 4, because all the variables used must be numeric.
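A minimal sketch of this suggestion, assuming the load categories are coded 0 to 4 in order. The measurements below are invented, and `LOAD_CODES`, `average_ranks`, `pearson` and `spearman_rho` are illustrative names; Spearman's rho is computed as Pearson's r on the average ranks, which handles the many ties an ordinal variable produces.

```python
from math import sqrt

# Assumed coding of the ordinal categories; only the order matters for Spearman.
LOAD_CODES = {"Negative": 0, "load 1": 1, "load 2": 2, "load 3": 3, "load 4": 4}

def average_ranks(xs):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical measurements: substance level and bacterial load per subject.
substance = [0.65, 1.15, 0.80, 1.40, 0.95, 1.60, 0.70, 1.25]
load = ["Negative", "load 2", "load 1", "load 3",
        "load 1", "load 4", "Negative", "load 2"]
load_coded = [LOAD_CODES[label] for label in load]

rho = spearman_rho(substance, load_coded)
print(round(rho, 4))
```

Because only ranks enter the calculation, any order-preserving coding (0 to 4, 1 to 5, and so on) gives the same rho. In practice a library routine such as `scipy.stats.spearmanr` would also return a p-value, which this sketch does not.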