Chi-squared test for detecting trending terms

chi-squared-test, text-mining, trend

I'm trying to find bursty (trending) terms in a text stream. The stream has two frames: an expected frame and an observed frame.

For each frame, I tokenize the documents (tweets, blog posts, etc.) into terms (words), so I have two term-frequency lists:

For Expected Frame:

  • term a, 2500
  • term b, 2000
  • term c, 1500
  • term m, 23
  • term x, 11

For Observed Frame:

  • term a, 2600
  • term b, 1900
  • term m, 1500
  • term x, 15

Now I need a method that flags 'term m' as a bursty term. I think a chi-squared test can tell which terms are bursty, in other words, which terms have changed significantly.

As you can see, I'm not interested in detecting whether the observed frame as a whole has changed, but whether each individual term has.

My question is: does a chi-squared test fit this problem, and what other methods are there?

If I use chi-squared, what is the critical value for 'term m', and how do I calculate it? Is "chi-value = power(1500 - 23, 2) / 23" OK?

Best Answer

A chi-squared test will only tell you that the observed frame differs from the expected one, not which term (or terms) is responsible. You have to calculate the chi-squared statistic on the whole data set at once. The statistic itself is the sum, over all terms, of the squared difference between each observed count and its expected count, scaled by the expected count:

$\sum{\frac{(observed-expected)^2}{expected}}$

Under the null hypothesis that the observed counts were generated from the probabilities behind the expected counts, this statistic has a chi-squared distribution with 23 degrees of freedom (the number of terms minus one, taking the terms to run from a to x, i.e. 24 terms). If the statistic exceeds the critical value for your hypothesis test (35.2 at a significance level of 0.05), you reject the null hypothesis that the observed figures were all generated as expected.
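As a rough illustration, the statistic can be computed on the five terms actually listed in the question. This is only a sketch under stated assumptions: 'term c', which is missing from the observed frame, is treated as a count of 0, and the expected counts are rescaled so that both frames have the same total (the usual requirement for a goodness-of-fit test). With five terms there are 4 degrees of freedom rather than the 23 quoted above, so the sketch uses 9.488, the standard table value for 4 degrees of freedom at the 0.05 level. The function and variable names are made up for the example.

```python
# Term counts copied from the question.
expected_counts = {"a": 2500, "b": 2000, "c": 1500, "m": 23, "x": 11}
observed_counts = {"a": 2600, "b": 1900, "m": 1500, "x": 15}  # "c" is absent


def chi_squared_statistic(expected, observed):
    """Sum of (observed - expected)^2 / expected over all expected terms.

    Assumptions (not from the original answer): a term missing from
    `observed` counts as 0, and expected counts are rescaled so that
    both frames sum to the same total.
    """
    total_expected = sum(expected.values())
    total_observed = sum(observed.get(term, 0) for term in expected)
    scale = total_observed / total_expected

    statistic = 0.0
    for term, count in expected.items():
        e = count * scale            # expected count on the observed scale
        o = observed.get(term, 0)    # observed count, 0 if the term is absent
        statistic += (o - e) ** 2 / e
    return statistic


statistic = chi_squared_statistic(expected_counts, observed_counts)
degrees_of_freedom = len(expected_counts) - 1

# Table value of the chi-squared distribution for 4 degrees of freedom
# at the 0.05 significance level.
CRITICAL_VALUE_DF4_ALPHA_05 = 9.488

reject_null = statistic > CRITICAL_VALUE_DF4_ALPHA_05
print(f"chi-squared = {statistic:.1f}, df = {degrees_of_freedom}, "
      f"reject null: {reject_null}")
```

The test rejects the null hypothesis here, but, as the answer says, that only establishes that the frame as a whole has changed, not which term caused it.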

Given your particular question, a good follow-up to a chi-squared test would be to calculate and plot the "Pearson residuals":

$\frac{(observed-expected)}{\sqrt{expected}}$

This should show which of the 24 terms contribute most to the chi-squared statistic and push it past the critical value, which gives you an indicator of which terms are bursty. If there is only one bursty term and the rest are basically in line with expectations, this will be very obvious from a dot chart of the residuals.
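The residuals can be sketched the same way on the five terms from the question. The assumptions are the same as for any goodness-of-fit calculation on this data: the missing 'term c' is treated as an observed count of 0, and expected counts are rescaled to the observed total. The cutoff of 3.0 is an illustrative rule of thumb chosen for this example, not something the answer prescribes, and the names are hypothetical.

```python
import math

# Term counts copied from the question.
expected_counts = {"a": 2500, "b": 2000, "c": 1500, "m": 23, "x": 11}
observed_counts = {"a": 2600, "b": 1900, "m": 1500, "x": 15}  # "c" is absent


def pearson_residuals(expected, observed):
    """Return {term: (observed - expected) / sqrt(expected)}.

    Assumptions: missing observed terms count as 0, and expected counts
    are rescaled so both frames have the same total.
    """
    total_expected = sum(expected.values())
    total_observed = sum(observed.get(term, 0) for term in expected)
    scale = total_observed / total_expected

    residuals = {}
    for term, count in expected.items():
        e = count * scale
        o = observed.get(term, 0)
        residuals[term] = (o - e) / math.sqrt(e)
    return residuals


residuals = pearson_residuals(expected_counts, observed_counts)

# Largest positive residual first: these are the terms that grew the most
# relative to what the expected frame predicts.
ranked = sorted(residuals.items(), key=lambda item: item[1], reverse=True)
for term, residual in ranked:
    print(f"term {term}: {residual:+.2f}")

# Illustrative cutoff only; pick a threshold that suits your data.
BURSTY_THRESHOLD = 3.0
bursty_terms = [term for term, residual in ranked if residual > BURSTY_THRESHOLD]
print("bursty terms:", bursty_terms)
```

Only positive residuals are flagged because a bursty term is one that appears more often than expected; a large negative residual, such as the one for 'term c', marks a term that has faded instead.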
