Solved – Vowpal Wabbit LDA

latent-dirichlet-allocation, topic-models, vowpal-wabbit

I am trying to use Vowpal Wabbit to do Latent Dirichlet Allocation (LDA) on a corpus. I am running into a few issues with the output.

To test it, I used a file with just 3 lines (3 documents, in the VW input format):

| now let fit a topic model on this dataset
| now let a good model on this dataset
| this is a document about sports

I ran VW LDA in the following manner:

vw --lda 2 --lda_D 3 --readable_model lda.model.txt -k --passes 10 \
   --cache_file doc_tokens.cache -d 1.txt -p prediction.dat --lda_rho 0.1

The command runs fine and generates two output files, prediction.dat and lda.model.txt. My questions are:

  1. Apart from the first column, both output files contain sequences of floating-point numbers, like:

    262130 0.100008 0.100009
    262131 0.100013 0.100021
    262132 0.100008 0.100010
    262133 0.100018 0.100008
    262134 0.100005 0.100008
    262135 0.100010 0.100007
    262136 0.100008 0.100026
    262137 0.100005 0.100012
    262138 0.100008 0.100014
    262139 0.100005 0.100018
    262140 0.100006 0.100007
    262141 0.100006 0.100006
    262142 0.100019 0.100023
    262143 0.100019 0.100007
    

    I thought passing --readable_model would output the words representing the topics. Am I doing something wrong?

  2. No matter how many documents I give it, the output file (lda.model.txt) has 262143 rows. Why is that?

Best Answer

There are 2 columns of floating-point numbers because you specified 2 topics in your LDA model with the number immediately after --lda; each column holds one topic's weights.

The first column is a numeric feature-hash index, and the number of rows is independent of the input size because of the feature hashing that Vowpal Wabbit does. The --help text for --readable_model says "Output human-readable final regressor with numeric features", so numeric indices rather than words are by design, even if that stretches the meaning of "human-readable" (see UX.SE for more discussion on that topic). You can change the number of rows with the -b option, which sets the number of bits in the hash (example given here: "-b 16: We expect to see at most 2^16 unique words."). The default is -b 18, giving hash indices 0 through 2^18 - 1 = 262143.
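For instance, a sketch of the same run with a 16-bit hash space (all other flags as in your original command) would cut the readable model down to indices 0 through 65535:

vw --lda 2 --lda_D 3 -b 16 --readable_model lda.model.txt -k --passes 10 \
   --cache_file doc_tokens.cache -d 1.txt -p prediction.dat --lda_rho 0.1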

If you convert terms to numbers using an external dictionary, so that your input file has integers in place of words, VW will conveniently use those integers as the hash values directly, without requiring --audit or --invert_hash.
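As a rough sketch (the word-to-id mapping here is hypothetical, assigned by hand in order of first appearance: now=0, let=1, fit=2, a=3, topic=4, model=5, on=6, this=7, dataset=8, good=9, is=10, document=11, about=12, sports=13), your test corpus could be re-encoded like this:

| 0 1 2 3 4 5 6 7 8
| 0 1 3 9 5 6 7 8
| 7 10 3 11 12 13

Row i of lda.model.txt then corresponds directly to dictionary id i, so you can map the per-topic weights back to words with a simple lookup in your dictionary.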
