# Correlation – Is It Wrong to Include Rows with All Zero Values in Correlation Calculation?

correlation, mathematical-statistics

I have two variables: age and weight. I have a collection of documents in which each of these variables may or may not appear. My dataset is two lists, one per variable, with values 0 or 1 indicating whether the variable is present in a document. I have to calculate the correlation between them. In some documents neither variable is present, which makes both values in that row 0.

My question is: should I discard these rows before calculating the correlation?

A similar question is answered here, but I don't see how it applies to my case, because my zeros are not missing data.

My sample dataset:

age = [1, 0, 1, 1, 0, 0, 0, 1]
weight = [0, 0, 1, 1, 0, 0, 1, 1]


Eight documents may not be enough to settle whether there is an association. You could display your data in a $$2\times 2$$ table, with columns for one variable (A = 0 or 1) and rows for the other (B = 0 or 1).
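As a rough illustration of where such a table comes from, here is a minimal Python sketch that counts the (age, weight) pairs from the question's lists. The helper names `cross_tab` and `phi` are hypothetical, and putting age on the rows and weight on the columns is an assumption (this particular table is symmetric, so the orientation does not change it). For 0/1 data, Pearson's correlation equals the phi coefficient of the table, so the sketch computes that as well.

```python
from math import sqrt

# Sample data from the question: 1 = variable present in the document, 0 = absent.
age = [1, 0, 1, 1, 0, 0, 0, 1]
weight = [0, 0, 1, 1, 0, 0, 1, 1]

def cross_tab(x, y):
    """Count (x, y) pairs into a 2x2 table; rows index x, columns index y (assumed layout)."""
    table = [[0, 0], [0, 0]]
    for xi, yi in zip(x, y):
        table[xi][yi] += 1
    return table

def phi(table):
    """Phi coefficient of a 2x2 table, which equals Pearson's r for two 0/1 variables."""
    (a, b), (c, d) = table
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

table = cross_tab(age, weight)
print(table)       # 2x2 counts, all eight documents included
print(phi(table))  # correlation of the two 0/1 lists
```

The three documents in which both variables are absent form the top-left cell of this table, so dropping them would change the table and any statistic computed from it.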

TBL = rbind(c(3,1), c(1,3))
TBL
     [,1] [,2]
[1,]    3    1
[2,]    1    3


Trying to do Pearson's chi-squared test on TBL does not lead to a reliable P-value.

chisq.test(TBL)

Pearson's Chi-squared test
with Yates' continuity correction

data:  TBL
X-squared = 0.5, df = 1, p-value = 0.4795

Warning message:
In chisq.test(TBL) :
Chi-squared approximation may be incorrect


The warning message is shown because one or more of the expected counts (here, all 2s) are smaller than $$5$$.
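To make the expected counts concrete: under independence, each expected cell count is (row total × column total) / n. The Python sketch below is only an illustration of that arithmetic, with hypothetical helper names `expected_counts` and `chi_squared`; it is not how R implements `chisq.test`. With the Yates correction it should reproduce the statistic 0.5 from the output above.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared(table, yates=False):
    """Pearson chi-squared statistic, optionally with Yates' continuity correction."""
    expected = expected_counts(table)
    stat = 0.0
    for obs_row, exp_row in zip(table, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                # Shrink each |observed - expected| by 0.5, never below zero.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / e
    return stat

tbl = [[3, 1], [1, 3]]
print(expected_counts(tbl))          # every cell is below 5, hence the warning
print(chi_squared(tbl, yates=True))  # statistic with continuity correction
print(chi_squared(tbl))              # uncorrected statistic
```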

chisq.test(TBL)$exp
     [,1] [,2]
[1,]    2    2
[2,]    2    2

R can use simulation (with parameter sim=T, which partially matches simulate.p.value=TRUE) to get a more accurate P-value, but the P-value remains above 5%.

chisq.test(TBL, sim=T)

Pearson's Chi-squared test with simulated p-value
(based on 2000 replicates)

data:  TBL
X-squared = 2, df = NA, p-value = 0.4738

Moreover, there is not enough data to get a significant result (one casting doubt on independence) using Fisher's Exact Test.

fisher.test(TBL)

Fisher's Exact Test for Count Data

data:  TBL
p-value = 0.4857
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
   0.2117329 621.9337505
sample estimates:
odds ratio
  6.408309

Note: With more documents (exhibiting association), you might be able to reject the null hypothesis of independence.

TBL.1 = rbind(c(6,1), c(1,6))
TBL.1
     [,1] [,2]
[1,]    6    1
[2,]    1    6

chisq.test(TBL.1, sim=T)

Pearson's Chi-squared test with simulated p-value
(based on 2000 replicates)

data:  TBL.1
X-squared = 7.1429, df = NA, p-value = 0.02999

fisher.test(TBL.1, sim=T)$p.val
[1] 0.02913753
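For readers who want to see where the Fisher P-values come from, here is a minimal Python sketch of the two-sided exact test for a $$2\times 2$$ table. It assumes the usual definition: with the margins fixed, sum the hypergeometric probabilities of all tables that are no more likely than the observed one. The function name `fisher_exact_two_sided` is hypothetical, and this is a sketch rather than R's implementation.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher exact P-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)  # number of tables with these margins

    def weight(k):
        # Unnormalised hypergeometric weight of a table whose top-left cell is k.
        return comb(row1, k) * comb(row2, col1 - k)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    observed = weight(a)
    # Integer weights allow an exact "no more likely than observed" comparison.
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / total

print(fisher_exact_two_sided([[3, 1], [1, 3]]))  # the eight-document table
print(fisher_exact_two_sided([[6, 1], [1, 6]]))  # the larger hypothetical table
```

Worked by hand, the first table gives 34/70 and the second 100/3432, which agree with the P-values 0.4857 and 0.02913753 reported by `fisher.test`.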