Solved – Generating Random Sentences from Language Models

distributions, machine learning, natural language, probability, sampling

I have several $N$-gram language models trained on a corpus. I want to show that increasing $N$ really leads to an improvement in modeling the training set.

I don't understand how to generate random sentences from them.

Please explain mathematically, in terms of probability, and if possible point me toward an implementation. I assume that sampling from a learnt distribution/model is a well-studied problem, but I can't find any resources, just a huge number of demonstrations of randomly generated data.

I know that once I start generating words, I can stop as soon as a stop token such as ? or . is generated.

I have the same problem when generating random sentences with character-level RNNs trained on the same corpus.

Thanks.

Best Answer

First, figure out a way to start, which means having a way to randomly generate the first $N-1$ words (for example, seeding the context with start-of-sentence tokens).

Then, at each step of the generation:

  • Pick a candidate word, uniformly at random.
  • Form an $N$-gram out of the candidate word and the last $N-1$ words.
  • Look up the conditional probability of that particular $N$-gram.
  • Generate a uniform random number between 0 and 1. If that number is smaller than the probability of your $N$-gram, "accept" the new word. Otherwise, reject it and pick a new candidate.
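The steps above amount to rejection sampling from $P(w_t \mid w_{t-N+1}, \dots, w_{t-1})$. A minimal Python sketch follows; `cond_prob` and `vocab` are placeholders for however your model actually stores its estimates, and the `"<s>"` start token is an assumed convention, not something your corpus necessarily uses:

```python
import random

def sample_next_word(context, cond_prob, vocab, rng=random):
    """Rejection-sample the next word given the last N-1 words (context).

    cond_prob(word, context) must return P(word | context), e.g. a lookup
    into your N-gram counts; vocab is the list of all words. Both are
    assumptions about how your model is stored.
    """
    while True:
        candidate = rng.choice(vocab)      # pick a word uniformly at random
        p = cond_prob(candidate, context)  # probability of the resulting N-gram
        if rng.random() < p:               # accept with probability p,
            return candidate               # otherwise loop and try again

def generate_sentence(start_context, cond_prob, vocab,
                      stop_tokens=frozenset({".", "?", "!"})):
    """Generate words until a stop token appears."""
    words = list(start_context)  # the initial N-1 words
    n_minus_1 = len(start_context)
    while words[-1] not in stop_tokens:
        context = tuple(words[-n_minus_1:])
        words.append(sample_next_word(context, cond_prob, vocab))
    return " ".join(words)
```

For example, with a toy bigram model ($N = 2$, so the context is a single word) whose conditional probabilities are stored in a dict of dicts, `generate_sentence(("<s>",), cond_prob, vocab)` walks the chain until it hits a stop token. Note that rejection sampling is simple but can be slow for large vocabularies, since most uniformly drawn candidates get rejected.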

Some details left as an exercise ;)
