Solved – Correlation between dichotomous and continuous variable

association-measure, categorical data, continuous data, kolmogorov-smirnov test, normal distribution

I am trying to find the correlation between a dichotomous and a continuous variable.

From my groundwork on this, I found that I have to use an independent-samples t-test, whose precondition is that the variable is normally distributed.

I performed a Kolmogorov-Smirnov test of normality and found that the continuous variable (about 4,000 data points) is non-normal and skewed.

I ran the Kolmogorov-Smirnov test on the variable as a whole. Should I instead split it into groups and test each one? That is, if I have risk level (0 = Not risky, 1 = Risky) and cholesterol levels, should I:

  • Divide them into two groups, like

    Risk level = 0 (Cholesterol level) -> Apply KS
    Risk level = 1 (Cholesterol level) -> Apply KS
    
  • Take them together and apply the test? (I performed it on the whole dataset only.)

After that, what test should I do if it is still non-normal?

EDIT:
The above scenario was just a description I tried to provide for my problem. I have a dataset which contains more than 1000 variables and about 4000 samples. The variables are either continuous or categorical. My task is to predict a dichotomous variable from these variables (maybe by coming up with a logistic regression model). So I thought the initial investigation would involve finding the correlation between a dichotomous and a continuous variable.

I was trying to see how the variables are distributed and hence turned to the t-test. Here I found normality to be an issue. The Kolmogorov-Smirnov test gave a significance value of 0.00 for most of these variables.

Should I assume normality here? The skewness and kurtosis of these variables also show that the data is skewed (>0) in almost all cases.

As per the note given below I will investigate the point-biserial correlation further. But about the distribution of variables I am still unsure.

Best Answer

I am a little confused; your title says "correlation" but your post refers to t-tests. A t-test is a test of central location. More specifically, it asks: is the mean of one set of data different from the mean of another set? Correlation, on the other hand, shows the relationship between two variables. There are a variety of correlation measures; it seems that point-biserial correlation is appropriate in your case.
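To make the link between the two ideas concrete, here is a minimal sketch using SciPy. The data are simulated and purely illustrative (the variable names `risk` and `chol` and the lognormal parameters are assumptions, not your data). It shows that the point-biserial correlation is just Pearson's r with the dichotomous variable coded 0/1, and that it maps directly onto the pooled-variance t statistic.

```python
import numpy as np
from scipy import stats

# Simulated, illustrative data (assumption): a 0/1 risk flag and a
# right-skewed "cholesterol" value that is shifted upward when risk == 1.
rng = np.random.default_rng(0)
n = 4000
risk = rng.integers(0, 2, size=n)
chol = rng.lognormal(mean=5.2, sigma=0.25, size=n) + 15 * risk

# Point-biserial correlation between the dichotomous and continuous variable.
r_pb, p_pb = stats.pointbiserialr(risk, chol)

# Same number as Pearson's r computed on the 0/1 coding.
r_pearson, _ = stats.pearsonr(risk, chol)

# Pooled-variance (equal_var=True) two-sample t-test on the same data.
t_stat, p_t = stats.ttest_ind(chol[risk == 1], chol[risk == 0])

# The t statistic can be recovered from r: t = r * sqrt((n - 2) / (1 - r^2)).
t_from_r = r_pb * np.sqrt((n - 2) / (1 - r_pb**2))

print(f"point-biserial r = {r_pb:.4f}, Pearson r = {r_pearson:.4f}")
print(f"t from t-test = {t_stat:.4f}, t from r = {t_from_r:.4f}")
```

So the significance test for the point-biserial correlation and the equal-variance t-test answer the same question; the correlation just reports it as a strength of association.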

You are correct that a t-test assumes normality; however, tests of normality are likely to give significant results even for trivial non-normalities with an N of 4000. T-tests are fairly robust to modest deviations from normality if the variances of the two sets of data are roughly equal and the sample sizes are roughly equal. But a nonparametric test is more robust to outliers, and most of them have power almost as high as the t-test, even if the distributions are normal.
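As a rough sketch of what that comparison might look like in practice, the snippet below runs a t-test and a nonparametric alternative side by side. Everything about the data is an assumption for illustration (two simulated right-skewed groups of 2,000 each); I use Welch's version of the t-test, which does not assume equal variances, and the Mann-Whitney U test as the nonparametric choice. Note that any look at the distribution is done within each group, since that is what the t-test's assumption refers to.

```python
import numpy as np
from scipy import stats

# Simulated, illustrative data (assumption): two right-skewed groups.
rng = np.random.default_rng(1)
group0 = rng.lognormal(mean=5.2, sigma=0.3, size=2000)  # "not risky"
group1 = rng.lognormal(mean=5.3, sigma=0.3, size=2000)  # "risky"

# Describe the shape within each group rather than on the pooled data.
for name, g in (("group0", group0), ("group1", group1)):
    print(f"{name}: n={g.size}, mean={g.mean():.1f}, "
          f"sd={g.std(ddof=1):.1f}, skew={stats.skew(g):.2f}")

# Welch's t-test: compares means without assuming equal variances.
welch = stats.ttest_ind(group1, group0, equal_var=False)

# Mann-Whitney U: rank-based, so much less sensitive to outliers and skew.
mwu = stats.mannwhitneyu(group1, group0, alternative="two-sided")

print(f"Welch t = {welch.statistic:.2f}, p = {welch.pvalue:.3g}")
print(f"Mann-Whitney U = {mwu.statistic:.0f}, p = {mwu.pvalue:.3g}")
```

With samples this large the two tests will usually agree; when they disagree, it is worth looking at outliers and at whether you care about a difference in means or a more general shift between the groups.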

However, in your example, you classify cholesterol as "risky" or "not risky". This is almost certainly a bad idea. Dichotomizing a continuous variable invokes magical thinking. It says that, at some point, cholesterol goes from "not risky" to "risky". Suppose you used 200 as your cutoff: then you are saying that someone with cholesterol of 201 is just like someone with 400, and someone with 199 is just like someone with 100. This does not make sense.
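A small sketch of what the cutoff does, using the same numbers. The first part is just the arithmetic of the example; the second part is a simulation under assumed values (a hypothetical outcome `y` that depends linearly on cholesterol), showing that the dichotomized version carries less information about the outcome than the original values.

```python
import numpy as np

cutoff = 200

# The four people from the example: the cutoff erases the within-group gaps.
chol_example = np.array([100, 199, 201, 400])
risky_example = (chol_example > cutoff).astype(int)
print(dict(zip(chol_example.tolist(), risky_example.tolist())))
# {100: 0, 199: 0, 201: 1, 400: 1}

# Simulated, illustrative data (assumption): a hypothetical outcome y that
# increases linearly with cholesterol, plus noise.
rng = np.random.default_rng(2)
n = 4000
chol = rng.normal(loc=200, scale=40, size=n)
y = 0.02 * chol + rng.normal(loc=0, scale=1, size=n)

# Correlation with the outcome, before and after dichotomizing at the cutoff.
r_continuous = np.corrcoef(chol, y)[0, 1]
r_dichotomized = np.corrcoef((chol > cutoff).astype(float), y)[0, 1]

print(f"r using continuous cholesterol:   {r_continuous:.3f}")
print(f"r using dichotomized cholesterol: {r_dichotomized:.3f}")
```

The dichotomized predictor shows a weaker association because 199 and 100 are treated as identical, as are 201 and 400.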