Solved – How does LDA handle duplicate words as part of topic analysis?

latent-dirichlet-allocation python

I'm a PhD student doing research into topic analysis of electronic communication. I'm currently looking at using LDA in Python to take a body of text, remove stop words, stem, and so on, and then generate a number of topic clusters based on the number of available non-stopped words. I'm curious how LDA handles duplicate words in a text corpus.
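Roughly, the pipeline I have in mind looks like the sketch below (simplified, not my exact code; the stop list and the raw_lines input are placeholders — in practice I use an English stop-word list plus a few user names, and the corpus is the chat log further down):

    # Simplified sketch of the preprocessing + LDA pipeline (gensim + NLTK assumed).
    from nltk.stem import PorterStemmer
    from gensim import corpora, models

    raw_lines = [
        "will work on one integrated solution for hoary",
        "i can also upload it so ubuntu users will help test",
    ]  # placeholder input
    stop = {'the', 'a', 'on', 'for', 'i', 'it', 'so', 'of'}  # placeholder stop list

    stemmer = PorterStemmer()

    def preprocess(line):
        # lowercase, drop stop words, stem the rest
        return [stemmer.stem(tok) for tok in line.lower().split() if tok not in stop]

    docs = [preprocess(line) for line in raw_lines]
    dictionary = corpora.Dictionary(docs)            # maps each unique stem to an id
    bow = [dictionary.doc2bow(doc) for doc in docs]  # (word_id, count) pairs per doc
    lda = models.LdaModel(bow, num_topics=4, id2word=dictionary)
    print(lda.print_topics())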

So more concretely:
1. I start with 299 words.
2. After removing the stop words (English) and a few user names, I have 160 words left.
3. Prior to the LDA analysis, I check how many of the 160 words are unique (see the check below). The answer is 111.
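For step 3, the check is just a token/type count over the preprocessed documents (docs as in the sketch above):

    # Count total tokens vs. unique words (types) after stop-word removal.
    tokens = [w for doc in docs for w in doc]
    print(len(tokens))       # total tokens (160 in my data)
    print(len(set(tokens)))  # unique words (111 in my data)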

I guess my query is as follows: Is it the case that LDA conducts its analysis over the 111 unique words rather than the 160, or does it actually compute probabilities for all instances of every word? If the latter, how can one tell which instance of which word belongs to a specific topic cluster?

In the LDA output below, "will" appears in both topics. How do I know which instance of "will" the output refers to? It appears four times in the list of non-stopped words below.

LDA Output

[(2, '0.048*"will" + 0.048*"now" + 0.048*"build" + 0.026*"someth" + 0.026*"said" + 0.026*"12" + 0.026*"hour" + 0.026*"16" + 0.026*"list" + 0.026*"hoari" + 0.026*"solut" + 0.026*"integr" + 0.026*"40" + 0.026*"approx" + 0.026*"mani" + 0.026*"interest" + 0.026*"tester" + 0.026*"don" + 0.026*"nv" + 0.026*"morn"'), (3, '0.136*"test" + 0.069*"can" + 0.047*"ati" + 0.047*"need" + 0.047*"fix" + 0.025*"will" + 0.025*"also" + 0.025*"user" + 0.025*"help" + 0.025*"ubuntu" + 0.025*"i386" + 0.025*"binari" + 0.025*"avail" + 0.025*"well" + 0.025*"want" + 0.025*"today" + 0.025*"ye" + 0.025*"brb" + 0.025*"upload" + 0.025*"much"')]

List of non-stopped words:

['ahoy']
['hey']
['morning']
['ati', 'flrdkjdjds', 'driver']
['skip', 'kernel', 'part']
['nvidia', 'one']
['skip', 'kernel', 'part']
['change', 'driver', 'names']
['now', 'let', 's', 'keep', 'separate', 'hackish']
['ok']
['will', 'work', 'one', 'integrated', 'solution', 'hoary']
['figured', 'probably', 'merged', 'one', 'script', 'without', 'much', 'trouble']
['whatever', 'easiest']
['happy', 'provide', 'script', 'x', 'directly']
['final']
['feel', 'happy', 'merge', 'just', 'go', 'ahead']
['need', 'fix', 'x', 'today']
['issues', 'remain', 'next', 'upload']
['bunches']
['said', 'something', '12', '16', 'hours', 'list']
['small', 'things', 'need', 'tested']
['need', 'tested']
['many', 'interested', 'testers', 'now']
['merge', 'already', 'work', 'yesterday']
['things', 'kbd', 'debian']
['bunch', 'small', 'bug', 'fixes']
['blind', 'uploads']
['want', 'test', 'much', 'can']
['course']
['x', 'doesn', 't', 'take', 'like', '5', 'minutes', 'build']
['can', 'also', 'upload', 'ubuntu', 'users', 'will', 'help', 'test']
['s', 'just', 'small', 'detail']
['2', 'major', 'changes', 'tested', 'already']
['nv', 'ati']
['tested', 'ati']
['minor', 'bug', 'fixes', 'first', 'need', 'make']
['test', 'fixes']
['can', 'test', 'ati', 'well', 'i386', 'binaries', 'available']
['don', 't', 'can', 'build']
['will', 'take', 'approx', '40', 'minutes']
['will', 'around']
['probably', 'yes']
['planning', 'stay', 'late', 'tonight']
['ok']
['building', 'now']
['brb']

Thanks in advance
Jonathan

Best Answer

LDA is basically doing a kind of matrix decomposition of the word-document matrix. Every row in that matrix stands for one unique word, every column for one document, and each cell holds how often that word occurs in that document. So to answer your question: LDA analyses only the 111 unique words; repeated occurrences of a word are folded into its count, so the model never distinguishes individual instances of "will".
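You can see this in the bag-of-words representation itself. A toy illustration with gensim (assuming that is the library behind your output; exact ids may vary by version):

    # Toy example: repeated words collapse into a single vocabulary entry.
    from gensim import corpora

    doc = ['will', 'test', 'will', 'build', 'will']
    dictionary = corpora.Dictionary([doc])
    print(dictionary.token2id)      # e.g. {'build': 0, 'test': 1, 'will': 2}
    print(dictionary.doc2bow(doc))  # e.g. [(0, 1), (1, 1), (2, 3)] -- 'will' has count 3

So the 0.048*"will" in your output is the probability of the word type "will" under topic 2, not of any particular occurrence.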

On another note: LDA needs quite a high number of texts to work with. If you only have 160 words in total across your text(s), I doubt you will find any useful topics with that procedure.