Solved – Correlate two variables, with many (0,0) values

correlation

Suppose you have a large but finite collection of tweets. You want to know whether talking about football tends to correlate with talking about basketball. You can generate a table for a few hundred users with x's of "NFL" mentions, and y's of "NBA" mentions for each user. Now consider the case where over half of them are (0,0). I actually have such tables for many word pairs: some graphs look like a messy y=mx, some look as if bounded by y=1/mx, some are one quadrant of a shotgun blast.

Q: is there any mathematically sound way of describing the statistics, the correlations, when so many values are (0,0)?

Intuitively speaking, I've run into two problems:

1) Using a simple linear correlation function in a spreadsheet, I seem to get similar correlation (r^2) values whether "I can tell" it's a shotgun or it's a y=1/x bounded system (i.e., exclusivity). I'd like a measure that distinguishes between exclusivity and no relation at all.

2) Sometimes I've generated graphs which look like y=1/x, which supports a case of exclusivity (such as sheep vs. goats) that I already believe to be true. Other times, however, for very similar concepts (such as "football" vs. "NFL"), I see the same graph shape, which implies exclusivity. That seems illogical, unless I've somehow discovered distinct populations that use different words to describe a similar interest. I'm wondering if my intuitive response to these exclusivity graphs is ignoring hundreds of points squished near the origin: the (1,1)'s.

I hope for a statistical operation that would take my gut feel out of this analysis. Thanks

Best Answer

Since there are so many zeros, have you considered ignoring the counts and just looking at conditional probabilities, i.e. the probability of a user mentioning NFL given they have NBA mentions?

$$ P(User_{NFL} | User_{NBA}) = \frac{P(User_{NFL} \cap User_{NBA})}{P(User_{NBA})} $$
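As a rough sketch of how that estimate could be computed from the per-user table: treat each user as "mentions NFL" or "mentions NBA" if the count is above zero, then divide. The function name and the sample data below are made up for illustration, and the "count > 0" threshold is an assumption you may want to change.

```python
def conditional_probability(pairs):
    """Estimate P(NFL | NBA) from per-user (nfl_mentions, nba_mentions) counts.

    A user counts as "mentioning" a term if the count is above zero
    (an assumed threshold). Returns NaN if nobody mentions NBA.
    """
    nba_users = [(x, y) for x, y in pairs if y > 0]
    if not nba_users:
        return float("nan")
    both = sum(1 for x, _ in nba_users if x > 0)
    return both / len(nba_users)


# Hypothetical data: six users at (0,0) plus four active users.
users = [(0, 0)] * 6 + [(3, 1), (2, 0), (0, 4), (1, 2)]
p_nfl_given_nba = conditional_probability(users)
```

Note that the (0,0) users drop out of this estimate entirely, since they are in neither the numerator nor the denominator.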

Depending on what you want to show, try looking at one of these metrics:

$$ \begin{align} \text{allConfidence}(A,B) &= \min \big\{P(A|B), P(B|A)\big\}\\ \text{maxConfidence}(A,B) &= \max \big\{P(A|B), P(B|A)\big\}\\ \text{Kulczynski}(A,B) &= \frac{1}{2}\big(P(A|B) + P(B|A) \big) \end{align} $$

I think maxConfidence might be what you're looking for but you can try all three and see what you get.
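If it helps, here is a minimal sketch of the three measures, computed from user counts rather than probabilities (the total number of users cancels out of each ratio). The function name and the example counts are hypothetical, not taken from your data.

```python
def association_measures(n_a, n_b, n_both):
    """Return allConfidence, maxConfidence and Kulczynski from user counts.

    n_a:    users who mention A at least once
    n_b:    users who mention B at least once
    n_both: users who mention both A and B
    """
    if n_a == 0 or n_b == 0:
        raise ValueError("each term needs at least one user mentioning it")
    p_b_given_a = n_both / n_a
    p_a_given_b = n_both / n_b
    return {
        "allConfidence": min(p_a_given_b, p_b_given_a),
        "maxConfidence": max(p_a_given_b, p_b_given_a),
        "Kulczynski": 0.5 * (p_a_given_b + p_b_given_a),
    }


# Hypothetical counts: a nearly exclusive pair and a heavily overlapping pair.
exclusive = association_measures(n_a=40, n_b=50, n_both=1)
overlapping = association_measures(n_a=40, n_b=50, n_both=30)
```

For a nearly exclusive pair all three values sit close to 0. For an unrelated pair, P(A|B) should be roughly the base rate P(A), so comparing the conditional probabilities against the base rates is one way to tell "exclusive" apart from "no relation".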
