I have two variables: age and weight. I have a collection of documents in which these two variables may or may not appear. My dataset is two lists, one for each variable, with values 0 or 1 to indicate whether the variable is present in the document. I have to calculate the correlation between them. In some documents neither variable is present, which makes the row values 0 for both.

My question is: should I discard these rows before calculating the correlation?

A similar question is answered here, but I don't follow how it applies to my case, because my zeros are not missing data.

My sample dataset:

```
age = [1, 0, 1, 1, 0, 0, 0, 1]
weight = [0, 0, 1, 1, 0, 0, 1, 1]
```

## Best Answer

It seems eight documents may not be enough to settle whether there is an association. You could display your data in a $2\times 2$ table with columns for A = 0 or 1 and rows for B = 0 or 1, where A and B are your two variables.
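As a rough illustration, the table can be built from the two lists in the question. This is only a sketch in plain Python. It assumes A is `age` (columns) and B is `weight` (rows), and the names `counts` and `table` are made up for this example.

```python
from collections import Counter

age = [1, 0, 1, 1, 0, 0, 0, 1]
weight = [0, 0, 1, 1, 0, 0, 1, 1]

# Count how many documents fall into each (age, weight) combination.
counts = Counter(zip(age, weight))

# Assumed layout: rows indexed by weight (B), columns indexed by age (A).
table = [[counts[(a, b)] for a in (0, 1)] for b in (0, 1)]

for b, row in enumerate(table):
    print(f"weight={b}: {row}")
```

Three documents mention neither variable, three mention both, and one mentions each variable alone, so every row and column total is 4.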

Trying to do Pearson's chi-squared test on `TBL` does not lead to a reliable P-value. R's warning message is shown because one or more of the expected counts (all `2`s) are smaller than $5.$ R can use simulation (with parameter `sim=T`) to get a more accurate P-value, but the P-value remains above 5%. Moreover, there is not enough data to get a significant answer (casting doubt on independence) using Fisher's Exact Test.
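The numbers behind these statements can be checked by hand. The sketch below is not the R session from this answer. It recomputes the expected counts, an uncorrected Pearson statistic with its 1-degree-of-freedom P-value, and a two-sided Fisher exact P-value using only the Python standard library. The helper `fisher_exact_two_sided` is a hypothetical name written for this example. Note that R's `chisq.test` applies a continuity correction to $2\times 2$ tables by default, so its statistic will differ from the uncorrected one here.

```python
from math import comb, erfc, sqrt

table = [[3, 1], [1, 3]]  # 2x2 counts for the sample data

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
n = sum(row_tot)

# Expected counts under independence: row total * column total / n.
expected = [[r * c / n for c in col_tot] for r in row_tot]

# Pearson's chi-squared statistic, without continuity correction.
chi2 = sum(
    (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# Upper-tail P-value for 1 degree of freedom: P(Z^2 > chi2).
p_chi2 = erfc(sqrt(chi2 / 2))


def fisher_exact_two_sided(t):
    """Sum the probabilities of all tables with the same margins
    that are no more likely than the observed table."""
    r1, r2 = sum(t[0]), sum(t[1])
    c1 = t[0][0] + t[1][0]
    denom = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(t[0][0])
    support = range(max(0, c1 - r2), min(r1, c1) + 1)
    return sum(prob(x) for x in support if prob(x) <= p_obs * (1 + 1e-7))


p_fisher = fisher_exact_two_sided(table)

print("expected counts:", expected)
print(f"chi-squared = {chi2:.3f}, P = {p_chi2:.4f}")
print(f"Fisher exact P = {p_fisher:.4f}")
```

For this table every expected count is 2, the uncorrected statistic is 2, and Fisher's two-sided P-value is 34/70, about 0.49, so neither test comes close to the 5% level.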

Note: With more documents (exhibiting association), you might be able to reject the null hypothesis of independence.
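To get a feel for how much more data that might take, here is a purely hypothetical sketch: it multiplies the observed counts by a factor `k`, pretending that a larger collection showed exactly the same proportions, and recomputes a two-sided Fisher exact P-value. The scaled tables are invented for illustration, not real data, and `fisher_p` is a helper written for this example.

```python
from math import comb


def fisher_p(table):
    """Two-sided Fisher exact P-value for a 2x2 table of counts."""
    r1, r2 = sum(table[0]), sum(table[1])
    c1 = table[0][0] + table[1][0]
    denom = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(table[0][0])
    support = range(max(0, c1 - r2), min(r1, c1) + 1)
    return sum(prob(x) for x in support if prob(x) <= p_obs * (1 + 1e-7))


observed = [[3, 1], [1, 3]]  # counts from the eight sample documents

# Hypothetical: k times as many documents with the same proportions.
p_values = {}
for k in (1, 2, 5, 10):
    scaled = [[k * cell for cell in row] for row in observed]
    n_docs = 8 * k
    p_values[n_docs] = fisher_p(scaled)
    print(f"{n_docs:3d} documents: P = {p_values[n_docs]:.4f}")
```

Under this assumption the P-value is still above 5% at 16 documents but falls below it by 40. Real data would not scale this neatly, so treat the numbers as a rough guide only.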