I would like to create a topic distribution for a document.
The model I am currently trying to implement works as follows: for each sentence in the document, I get a topic assignment with a score, e.g. "the 1st sentence is about Microsoft with a relevance score of 0.4". Repeating this for each sentence leaves me with topic/score pairs like the following:
1st sentence: microsoft, score 0.4
2nd sentence: apple, score 0.1
3rd sentence: android, score 0.5
…
Now I would like to convert these scores into a single probability distribution that represents the whole document. Is there a known technique for doing this? If so, what is the best way to do it?
Note: I know this is very naive topic modelling, but at the moment I am only interested in combining the scores into a probability distribution.
Best Answer
Are the original scores already probabilities? If so, the obvious choice would be Bayes' rule.
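To make this concrete, here is a minimal sketch of a Bayes-style combination. It assumes (unlike the single-topic-per-sentence data in the question) that each sentence yields a full probability vector over a shared topic set, and that sentences are conditionally independent given the topic; the topic names and numbers are invented for illustration.

```python
import numpy as np

# Hypothetical per-sentence topic probability vectors (each row sums to 1).
# Assumed topic order: [microsoft, apple, android] -- illustrative only.
sentence_probs = np.array([
    [0.70, 0.20, 0.10],
    [0.25, 0.60, 0.15],
    [0.10, 0.10, 0.80],
])

# Naive-Bayes-style combination under a uniform prior: multiply each
# topic's probabilities across sentences, then renormalize.
# Work in log space to avoid underflow on long documents.
log_joint = np.log(sentence_probs).sum(axis=0)
log_joint -= log_joint.max()      # stabilize before exponentiating
doc_dist = np.exp(log_joint)
doc_dist /= doc_dist.sum()

print(doc_dist)  # a proper distribution over the three topics
```

The independence assumption is of course doubtful for sentences in one document, but this is the standard starting point when the inputs really are probabilities.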
Otherwise, you might want to look at:
H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek: Interpreting and Unifying Outlier Scores.
Proceedings of the 11th SIAM International Conference on Data Mining (SDM11), Mesa, AZ.
http://siam.omnibooksonline.com/2011datamining/data/papers/018.pdf
While they focus on outlier scores, the techniques should carry over to other domains, too. They also do some ensemble work there, and that is probably another body of literature worth searching for references, because essentially what you are doing is building an ensemble.
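Before reaching for any of that, a crude baseline may already be enough: if the scores can be treated as non-negative weights, sum the score mass per topic and normalize so the totals sum to 1. A minimal sketch using the numbers from the question:

```python
from collections import defaultdict

# One (topic, score) pair per sentence, as in the question.
sentence_topics = [
    ("microsoft", 0.4),
    ("apple", 0.1),
    ("android", 0.5),
]

# Accumulate score mass per topic (topics repeating across
# sentences simply add up), then normalize to a distribution.
totals = defaultdict(float)
for topic, score in sentence_topics:
    totals[topic] += score

z = sum(totals.values())
doc_dist = {topic: mass / z for topic, mass in totals.items()}
print(doc_dist)
```

This gives relative topic weights, not calibrated probabilities; whether that is acceptable depends on what the resulting scores mean in your application, which is exactly the question the paper above addresses.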