Solved – Vowpal Wabbit LDA

latent-dirichlet-allocation, topic-models, vowpal-wabbit

I am trying to use Vowpal Wabbit to do Latent Dirichlet Allocation (LDA) on a corpus. I am running into a few issues with the output.

To test it, I used a file with just 3 lines (3 documents, in the VW input format):

| now let fit a topic model on this dataset
| now let a good model on this dataset
| this is a document about sports

I ran VW LDA in the following manner:

vw --lda 2 --lda_D 3 --readable_model lda.model.txt -k --passes 10 \
   --cache_file doc_tokens.cache -d 1.txt -p prediction.dat --lda_rho 0.1

The command runs fine and generates two output files, prediction.dat and lda.model.txt. My questions are:

  1. Apart from the first column, both output files contain sequences of floating-point numbers, like:

    262130 0.100008 0.100009
    262131 0.100013 0.100021
    262132 0.100008 0.100010
    262133 0.100018 0.100008
    262134 0.100005 0.100008
    262135 0.100010 0.100007
    262136 0.100008 0.100026
    262137 0.100005 0.100012
    262138 0.100008 0.100014
    262139 0.100005 0.100018
    262140 0.100006 0.100007
    262141 0.100006 0.100006
    262142 0.100019 0.100023
    262143 0.100019 0.100007
    

    I thought passing --readable_model would output the words representing the topics. Am I doing something wrong?

  2. No matter how many documents I give it, the output file (lda.model.txt) has 262143 rows. Why is that?

Best Answer

There are 2 columns of floating-point numbers because you specified 2 topics in your LDA model with the number immediately after --lda; each column holds one topic's weights.

The first column is a numeric feature-hash index, and the number of rows is independent of the input size because of the feature hashing that Vowpal Wabbit does. The --help text for --readable_model says "Output human-readable final regressor with numeric features", so numeric indices rather than words are by design, even if that stretches the meaning of "human-readable" (see UX.SE for more discussion on that topic). You can change the number of rows with the -b option, which sets the number of bits in the hash (example given here: "-b 16: We expect to see at most 2^16 unique words."). The default is -b 18, giving hash indices 0 through 2^18 - 1 = 262143.
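For instance, a sketch of the same run with a 16-bit hash space (all other flags as in your original command) would cut the readable model down to indices 0 through 65535:

vw --lda 2 --lda_D 3 -b 16 --readable_model lda.model.txt -k --passes 10 \
   --cache_file doc_tokens.cache -d 1.txt -p prediction.dat --lda_rho 0.1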

If you convert terms to numbers using an external dictionary, so that your input file has integers in place of words, VW will conveniently use those integers as the hash values directly, without requiring --audit or --invert_hash.
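As a rough sketch (the word-to-id mapping here is hypothetical, assigned by hand in order of first appearance: now=0, let=1, fit=2, a=3, topic=4, model=5, on=6, this=7, dataset=8, good=9, is=10, document=11, about=12, sports=13), your test corpus could be re-encoded like this:

| 0 1 2 3 4 5 6 7 8
| 0 1 3 9 5 6 7 8
| 7 10 3 11 12 13

Row i of lda.model.txt then corresponds directly to dictionary id i, so you can map the per-topic weights back to words with a simple lookup in your dictionary.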
