Solved – Generating Random Sentences from Language Models

distributions, machine learning, natural language, probability, sampling

I have several $N$-gram language models trained on a corpus. I want to show that increasing $N$ really leads to an improvement in modeling the training set.

I don't understand how to generate random sentences from them.

Please explain mathematically, in terms of probability, and if possible point me toward an implementation. I assume that sampling from a learnt distribution/model is a well-studied problem, but I can't find any resources, just a huge number of demonstrations of randomly generated data.

I know that once I start generating words, I can stop as soon as a stop token such as ? or . is generated.

I have the same problem when generating random sentences with character-level RNNs trained on the same corpus.

Thanks.

Best Answer

First, figure out a way to start, which means having a way to randomly generate the first $N-1$ words (for example, seeding the context with start-of-sentence tokens).

Then, at each step of the generation:

  • Pick a candidate word, uniformly at random.
  • Form an $N$-gram out of the candidate word and the last $N-1$ words.
  • Look up the conditional probability of that particular $N$-gram.
  • Generate a uniform random number between 0 and 1. If that number is smaller than the probability of your $N$-gram, "accept" the new word. Otherwise, reject it and pick a new candidate.
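The steps above amount to rejection sampling from $P(w_t \mid w_{t-N+1}, \dots, w_{t-1})$. A minimal Python sketch follows; `cond_prob` and `vocab` are placeholders for however your model actually stores its estimates, and the `"<s>"` start token is an assumed convention, not something your corpus necessarily uses:

```python
import random

def sample_next_word(context, cond_prob, vocab, rng=random):
    """Rejection-sample the next word given the last N-1 words (context).

    cond_prob(word, context) must return P(word | context), e.g. a lookup
    into your N-gram counts; vocab is the list of all words. Both are
    assumptions about how your model is stored.
    """
    while True:
        candidate = rng.choice(vocab)      # pick a word uniformly at random
        p = cond_prob(candidate, context)  # probability of the resulting N-gram
        if rng.random() < p:               # accept with probability p,
            return candidate               # otherwise loop and try again

def generate_sentence(start_context, cond_prob, vocab,
                      stop_tokens=frozenset({".", "?", "!"})):
    """Generate words until a stop token appears."""
    words = list(start_context)  # the initial N-1 words
    n_minus_1 = len(start_context)
    while words[-1] not in stop_tokens:
        context = tuple(words[-n_minus_1:])
        words.append(sample_next_word(context, cond_prob, vocab))
    return " ".join(words)
```

For example, with a toy bigram model ($N = 2$, so the context is a single word) whose conditional probabilities are stored in a dict of dicts, `generate_sentence(("<s>",), cond_prob, vocab)` walks the chain until it hits a stop token. Note that rejection sampling is simple but can be slow for large vocabularies, since most uniformly drawn candidates get rejected.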

Some details left as an exercise ;)
