Correlation – Is It Wrong to Include Rows with All Zero Values in Correlation Calculation?

correlation, mathematical-statistics

I have two variables: age and weight. I have a collection of documents in which each of these variables may or may not appear. My dataset is two lists, one per variable, with a value of 1 if the variable is present in a document and 0 if it is not. I have to calculate the correlation between them. In some documents neither variable is present, which makes the row 0 for both.

My question: should I discard these rows before calculating the correlation?

A similar question is answered here, but I don't follow how it applies to my case, because my zeros are not missing data.

My sample dataset:

age = [1, 0, 1, 1, 0, 0, 0, 1]
weight = [0, 0, 1, 1, 0, 0, 1, 1]

Best Answer

With only eight documents, there may not be enough data to settle whether there is an association. You could display your data in a $2\times 2$ table, with columns for one indicator (A = 0 or 1) and rows for the other (B = 0 or 1).

TBL = rbind(c(3,1), c(1,3))
TBL
     [,1] [,2]
[1,]    3    1
[2,]    1    3
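
As a rough sketch, the same counts can be obtained directly from the two lists in the question, along with the correlation itself. The variable names `tab`, `keep`, `phi_all`, and `phi_dropped` are mine, introduced only for illustration, and the comparison is descriptive rather than a test.

```r
# Raw presence/absence indicators copied from the question
age    <- c(1, 0, 1, 1, 0, 0, 0, 1)
weight <- c(0, 0, 1, 1, 0, 0, 1, 1)

# Cross-tabulate: rows are age (0/1), columns are weight (0/1)
tab <- table(age, weight)
print(tab)

# Pearson correlation of two 0/1 vectors (the phi coefficient)
phi_all <- cor(age, weight)

# Same calculation after dropping documents where both are absent
keep <- !(age == 0 & weight == 0)
phi_dropped <- cor(age[keep], weight[keep])

print(c(all_rows = phi_all, without_00_rows = phi_dropped))
```

With these eight documents the correlation is 0.5 using all rows and -0.25 after dropping the three (0, 0) rows, which suggests those rows carry information about the association rather than being neutral.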

Trying to do Pearson's chi-squared test on TBL does not lead to a reliable P-value.

chisq.test(TBL)

        Pearson's Chi-squared test 
        with Yates' continuity correction

data:  TBL
X-squared = 0.5, df = 1, p-value = 0.4795

Warning message:
In chisq.test(TBL) : 
 Chi-squared approximation may be incorrect

The warning message is shown because one or more of the expected counts (here all 2s) are smaller than $5$.

chisq.test(TBL)$exp
     [,1] [,2]
[1,]    2    2
[2,]    2    2
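
For reference, here is a small sketch of where those expected counts and the statistic of 0.5 come from, computed by hand. The names `expected`, `x2_plain`, and `x2_yates` are illustrative and not part of `chisq.test`.

```r
TBL <- rbind(c(3, 1), c(1, 3))

# Expected count in each cell = (row total * column total) / grand total
expected <- outer(rowSums(TBL), colSums(TBL)) / sum(TBL)
print(expected)

# Pearson statistic without the continuity correction
x2_plain <- sum((TBL - expected)^2 / expected)

# Yates' correction: shrink each |observed - expected| by 0.5 before squaring
# (every |observed - expected| is 1 here, so the full 0.5 is subtracted)
x2_yates <- sum((abs(TBL - expected) - 0.5)^2 / expected)

print(c(plain = x2_plain, yates = x2_yates))
```

The corrected value, 0.5, is the one reported by `chisq.test(TBL)` above; the uncorrected value is 2.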

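R can use simulation (parameter sim=T, short for simulate.p.value=TRUE) to get a more accurate P-value, but the P-value remains above 5%. The statistic is now reported as 2 rather than 0.5 because no continuity correction is applied when the P-value is simulated.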

chisq.test(TBL, sim=T)

        Pearson's Chi-squared test 
        with simulated p-value 
        (based on 2000 replicates)

data:  TBL
X-squared = 2, df = NA, p-value = 0.4738
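
To give an idea of what the simulation does, here is a minimal sketch that draws random tables with the same row and column totals using `r2dtable` and compares their Pearson statistics with the observed one. The seed and the names `pearson`, `sims`, `sim_stats`, and `p_sim` are my own choices, so the result will differ slightly from the output above.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
TBL <- rbind(c(3, 1), c(1, 3))
B <- 2000    # number of replicates, as in the output above

expected <- outer(rowSums(TBL), colSums(TBL)) / sum(TBL)
pearson  <- function(x) sum((x - expected)^2 / expected)
observed <- pearson(TBL)

# Draw B random 2x2 tables with the same row and column totals
sims <- r2dtable(B, rowSums(TBL), colSums(TBL))
sim_stats <- vapply(sims, pearson, numeric(1))

# Share of simulated statistics at least as large as the observed one,
# counting the observed table itself as one replicate
p_sim <- (1 + sum(sim_stats >= observed - 1e-9)) / (B + 1)
print(p_sim)
```

The small tolerance in the comparison is only there to guard against floating-point ties.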

Moreover, with so little data, Fisher's Exact Test also fails to give a significant result (one that would cast doubt on independence).

fisher.test(TBL)

        Fisher's Exact Test for Count Data

data:  TBL
p-value = 0.4857
alternative hypothesis: 
 true odds ratio is not equal to 1
95 percent confidence interval:
    0.2117329 621.9337505
sample estimates:
 odds ratio 
   6.408309 
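
The exact P-value can also be reproduced by hand from the hypergeometric distribution. This is only a sketch of the two-sided calculation for a 2x2 table; the names `k_possible`, `probs`, `p_obs`, and `p_exact` are illustrative.

```r
TBL <- rbind(c(3, 1), c(1, 3))
r1 <- sum(TBL[1, ])   # first row total
r2 <- sum(TBL[2, ])   # second row total
c1 <- sum(TBL[, 1])   # first column total

# With the margins fixed, the top-left cell follows a hypergeometric law
k_possible <- max(0, c1 - r2):min(r1, c1)
probs <- dhyper(k_possible, r1, r2, c1)

# Two-sided P-value: total probability of all tables that are
# no more likely than the observed one (small tolerance for ties)
p_obs   <- dhyper(TBL[1, 1], r1, r2, c1)
p_exact <- sum(probs[probs <= p_obs * (1 + 1e-7)])
print(p_exact)
```

Here the five possible tables have probabilities 1/70, 16/70, 36/70, 16/70, and 1/70, so the P-value is 34/70, about 0.4857.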

Note: With more documents exhibiting a similar association, you might be able to reject the null hypothesis of independence. For example, with 14 documents:

TBL.1 = rbind(c(6,1), c(1,6))
TBL.1
     [,1] [,2]
[1,]    6    1
[2,]    1    6

chisq.test(TBL.1, sim=T)

        Pearson's Chi-squared test 
        with simulated p-value 
        (based on 2000 replicates)

data:  TBL.1
X-squared = 7.1429, df = NA, p-value = 0.02999

fisher.test(TBL.1)$p.value
[1] 0.02913753