Solved – A single document as input to LDA

machine learning, natural language, text mining, topic-models

Topic modelling is usually applied to a collection of documents, which forms the input. But what if I have only a single document and want to see its underlying topics? I have heard that in such cases you can break the document up by paragraphs, but why is that needed? Does that mean latent Dirichlet allocation (LDA) cannot be used, or is not supposed to be used, with a single document as the input?

Best Answer

You can use a sentence splitter to split your document into sentences and treat each sentence as a separate document. I have never used this approach myself, but sentence splitters are available in, for example, the openNLP package for R, as well as in Python and RapidMiner.
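The reason for splitting is that LDA infers topics from how words co-occur across many documents; a single document yields only one document-topic distribution, so there is nothing to contrast. As a rough sketch of the splitting step in Python, assuming a naive regex splitter is good enough (a real sentence tokenizer would handle abbreviations and similar cases better), the helper names `split_sentences` and `chunk` and the sample text below are made up for illustration:

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk(sentences, size):
    """Group consecutive sentences into pseudo-documents of `size` sentences."""
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


# Hypothetical single document.
document = (
    "Topic models need many documents. A single text gives one sample. "
    "Splitting helps! Does it always work? Not necessarily."
)

sentences = split_sentences(document)
pseudo_docs = chunk(sentences, 2)  # each pseudo-document holds up to 2 sentences

for doc in pseudo_docs:
    print(doc)
```

The resulting `pseudo_docs` list can then be passed to whatever LDA implementation you use in place of a real corpus. Single sentences are often very short, so grouping a few sentences (or using paragraphs) per pseudo-document may give the model more co-occurrence information to work with.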

You could also train a topic model on a corpus with clearly defined topics, then apply the same model to your single document and see how its topic structure turns out.
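To make the second idea concrete, here is a minimal, self-contained sketch in Python. It is not any particular library's API: it assumes a small collapsed Gibbs sampler for training and a simple "fold-in" step for the new document (topic-word counts held fixed). The function names, the hyperparameters `alpha` and `beta`, and the tiny corpus are all hypothetical; in practice you would more likely use an existing LDA implementation.

```python
import random
import re


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def conditional(doc_counts, word, topic_word, topic_total, vocab_size, alpha, beta):
    """Unnormalised P(topic | everything else) for one word occurrence."""
    return [
        (doc_counts[t] + alpha)
        * (topic_word[t].get(word, 0) + beta)
        / (topic_total[t] + vocab_size * beta)
        for t in range(len(topic_word))
    ]


def train_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling; returns (topic_word, topic_total, vocab)."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    topic_word = [dict() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                     # total words per topic
    doc_topic = [[0] * n_topics for _ in docs]       # topic counts per document
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assign[d]):
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]                     # remove current assignment
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                weights = conditional(doc_topic[d], w, topic_word, topic_total,
                                      len(vocab), alpha, beta)
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k                     # add the resampled assignment
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return topic_word, topic_total, vocab


def infer_topics(doc, model, n_iter=100, alpha=0.1, seed=0, beta=0.01):
    """Fold-in: resample topics for a new document with the trained counts fixed."""
    topic_word, topic_total, vocab = model
    n_topics = len(topic_word)
    rng = random.Random(seed)
    doc = [w for w in doc if w in vocab]             # ignore words unseen in training
    assign = [rng.randrange(n_topics) for _ in doc]
    counts = [assign.count(t) for t in range(n_topics)]
    for _ in range(n_iter):
        for i, w in enumerate(doc):
            counts[assign[i]] -= 1
            weights = conditional(counts, w, topic_word, topic_total,
                                  len(vocab), alpha, beta)
            assign[i] = rng.choices(range(n_topics), weights=weights)[0]
            counts[assign[i]] += 1
    total = len(doc) + n_topics * alpha
    return [(c + alpha) / total for c in counts]     # smoothed topic proportions


# Hypothetical training corpus with two fairly distinct themes.
corpus = [
    "The team won the match after a late goal.",
    "The coach praised the players and the goal keeper.",
    "Bake the bread and stir the soup slowly.",
    "The recipe needs flour, butter and fresh bread.",
]
model = train_lda([tokenize(t) for t in corpus], n_topics=2)

single_document = "The players scored a goal, then the coach bought bread."
theta = infer_topics(tokenize(single_document), model)
print(theta)  # topic proportions for the single document
```

The proportions in `theta` are only meaningful relative to the topics learned from the training corpus, so this approach works best when that corpus covers the subject matter of your document.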