I have two variables: age and weight. I have a collection of documents in which these two variables may or may not appear. My dataset is two lists, one for each variable, with values 0 or 1 indicating whether the variable is present in each document. I have to calculate the correlation between them. In some documents neither variable is present, which makes both values in that row 0.
My question is: should I discard these rows before calculating the correlation?
A similar question is answered here, but I don't see how it applies to my case, because my zeros are not missing data.
My sample dataset:
age = [1, 0, 1, 1, 0, 0, 0, 1]
weight = [0, 0, 1, 1, 0, 0, 1, 1]
Best Answer
It seems eight documents may not be enough to settle whether there is an association. You could display your data in a $2\times 2$ table with columns for A = 0 or 1 and rows for B = 0 or 1, where A and B stand for your two variables.
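As a rough illustration, the table can be tallied directly from the two lists in the question. The sketch below is plain Python (standard library only) rather than R, and the names `counts`, `table`, and `phi` are illustrative, not from the original answer. It also computes the phi coefficient, which for two 0/1 variables equals the Pearson correlation.

```python
from collections import Counter
from math import sqrt

# Data copied from the question.
age    = [1, 0, 1, 1, 0, 0, 0, 1]
weight = [0, 0, 1, 1, 0, 0, 1, 1]

# Count how many documents fall in each (age, weight) cell.
counts = Counter(zip(age, weight))

# Assumed layout: table[a][w] = documents with age == a and weight == w.
table = [[counts[(a, w)] for w in (0, 1)] for a in (0, 1)]

n00, n01 = table[0]
n10, n11 = table[1]

# Phi coefficient of the 2x2 table (same value as Pearson's r on the 0/1 lists).
row0, row1 = n00 + n01, n10 + n11
col0, col1 = n00 + n10, n01 + n11
phi = (n11 * n00 - n10 * n01) / sqrt(row0 * row1 * col0 * col1)

print(table)  # [[3, 1], [1, 3]]
print(phi)    # 0.5
```

Three of the eight documents land in the (0, 0) cell, so the rows the question asks about make up a large share of this small table.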
Trying to do Pearson's chi-squared test on `TBL` does not lead to a reliable P-value. The warning message is shown because one or more of the expected counts (all `2`s) are smaller than $5$.
R can use simulation (with parameter `sim=T`, which R matches to `simulate.p.value=TRUE`) to get a more accurate P-value, but the P-value remains above 5%.
Moreover, there is not enough data to get a significant answer (casting doubt on independence) using Fisher's Exact Test.
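These figures can be checked by hand. The following is a minimal sketch in plain Python (standard library only), not the R calls used in the answer. It assumes the $2\times 2$ counts tallied from the question's lists, and the function names are hypothetical. It computes the expected counts, the Pearson chi-squared statistic without continuity correction, and a two-sided Fisher exact P-value. R's `chisq.test` applies Yates' correction to $2\times 2$ tables by default, so its reported values will differ from the uncorrected ones here.

```python
from math import comb, erfc, sqrt

# Observed counts tallied from the question's lists (assumed layout:
# rows = age 0/1, columns = weight 0/1).
observed = [[3, 1], [1, 3]]

def expected_counts(obs):
    """Expected counts under independence: row total * column total / n."""
    rows = [sum(r) for r in obs]
    cols = [sum(c) for c in zip(*obs)]
    n = sum(rows)
    return [[r * c / n for c in cols] for r in rows]

def chi2_stat(obs):
    """Pearson chi-squared statistic, no continuity correction."""
    exp = expected_counts(obs)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(obs, exp)
               for o, e in zip(o_row, e_row))

def chi2_pvalue_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2))

def fisher_two_sided(obs):
    """Two-sided Fisher exact P-value: sum the probabilities of all tables
    with the same margins that are no more likely than the observed one."""
    (a, b), (c, d) = obs
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k.
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

stat = chi2_stat(observed)
print(expected_counts(observed))   # every expected count is 2.0
print(stat, chi2_pvalue_df1(stat))
print(fisher_two_sided(observed))
```

With these counts the uncorrected chi-squared P-value is about 0.16 and the Fisher P-value is about 0.49, both above 5%, in line with the conclusion that eight documents are too few to reject independence.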
Note: With more documents (exhibiting association), you might be able to reject the null hypothesis of independence.