Solved – Does one need to adjust for document length (in terms of pages) in topic modeling

natural-language, text-mining, topic-models

I am wondering whether one needs to normalize or weight a topic
model by document length (page count).

I am estimating a topic model on social science (JSTOR) articles that
vary in length from 5 to 200 pages. I want to analyse a specific topic,
namely the degree to which social science articles deal with economic
topics.

I can see that a similar question was raised back in 2011, but as far as I can interpret the discussion, no clear recommendation emerged:

https://lists.cs.princeton.edu/pipermail/topic-models/2011-February/001171.html

My intuition about this question is somewhat split.

On the one hand, it seems logical that one needs to weight by document
length, since a longer document (say 200 pages) has more room to refer to
a specific topic (in my case "economic") than a shorter one (say 5 pages).
This is reflected, for example, in the document-term matrix, where economic
terms (e.g. markets, business, and industry) will have much higher
frequencies in the row for the 200-page document than in the row for the
5-page document. Moreover, the 200-page document will affect the overall term distribution: its counts will dominate the corpus-wide frequency of each and every term in the document-term matrix.
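A toy sketch of what I mean, using the tm package with made-up documents rather than my actual data:

```r
library(tm)

# One artificially long and one short "document" about similar content
docs <- c(paste(rep("markets business industry policy welfare", 60), collapse = " "),
          "markets business policy welfare health")

dtm_toy <- DocumentTermMatrix(Corpus(VectorSource(docs)))
raw     <- as.matrix(dtm_toy)

raw                          # raw counts: the long document dominates every term
prop.table(raw, margin = 1)  # row-wise proportions: the two rows look similar again
```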

On the other hand, the per-document topic proportions seem to adjust for
the fact that the sample contains longer and shorter documents. Even if
the raw term counts are high for longer documents and low for shorter
ones, the relative frequencies (the proportions of the various terms) in
the longer documents are comparable with those in the shorter ones. For
example, the shorter document might have 10 tokens assigned to economic
topics out of 30 tokens in total, giving an economic topic proportion of
10/30, whereas the longer document might have 100 tokens assigned to
economic topics out of 3000 tokens in total (all topics), a proportion of
100/3000.

Accordingly, even though the shorter document has fewer economic tokens
than the longer one, it is estimated to be more economic than the longer
document (10/30 ≈ 0.33 versus 100/3000 ≈ 0.03).

I am not sure what to conclude from this: can I trust LDA results that
are not adjusted for page length? I am using the topicmodels package in R.
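For reference, this is roughly the check I have in mind; `dtm`, the number of topics, and the index of the "economic" topic are placeholders, not my actual settings:

```r
library(topicmodels)

# dtm: document-term matrix built from the JSTOR articles (placeholder name)
lda_fit <- LDA(dtm, k = 20, method = "Gibbs", control = list(seed = 1234))

theta <- posterior(lda_fit)$topics  # documents x topics; each row sums to 1
summary(rowSums(theta))             # confirms the per-document normalization

# Inspect top terms to decide which topic is the "economic" one,
# then read off its per-document share (independent of document length).
terms(lda_fit, 10)
econ_topic <- 3                     # placeholder index
econ_share <- theta[, econ_topic]
head(sort(econ_share, decreasing = TRUE))
```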

Many thanks in advance for your input.

Best Answer

I haven't used topic models much, but I can say that if you apply the usual clustering methods to un-normalized document-term matrices (even when the dimensionality of the data is reduced with LSA), you will see that longer articles tend to cluster together simply because they contain more words.

So take a look at some of your topics and check whether the documents inside them make sense. Also, try calculating the average document length per topic and see whether the phenomenon I mention occurs, for example along the lines of the sketch below.
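A rough sketch of that check, assuming the fitted model from your topicmodels workflow is called `lda_fit` and the document-term matrix `dtm`:

```r
library(slam)        # row_sums() works directly on the sparse DocumentTermMatrix

doc_len  <- row_sums(dtm)        # document length in tokens
dominant <- topics(lda_fit, 1)   # most probable topic for each document

# Average document length per dominant topic: if a few topics collect all the
# long articles, document length is probably driving the topic assignments.
sort(tapply(doc_len, dominant, mean), decreasing = TRUE)
```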

Then you can repeat the same analysis on unit-normalized data and see whether the results make more sense.
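A sketch of that normalized comparison, here with plain k-means on the dense matrix rather than LDA itself (LDA operates on raw counts internally), so treat it only as a diagnostic:

```r
mat    <- as.matrix(dtm)                            # dense copy; fine for a few thousand docs
mat_l2 <- sweep(mat, 1, sqrt(rowSums(mat^2)), "/")  # scale each row to unit (L2) length

set.seed(1)
km_raw  <- kmeans(mat,    centers = 10, nstart = 5)
km_norm <- kmeans(mat_l2, centers = 10, nstart = 5)

# Average document length per cluster, before and after normalization:
tapply(rowSums(mat), km_raw$cluster,  mean)
tapply(rowSums(mat), km_norm$cluster, mean)
```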