I am trying to use Vowpal Wabbit to do Latent Dirichlet Allocation (LDA) on a corpus. I am running into a few issues regarding the output.
To test it, I was using a file with just 3 lines (3 documents as per the VW input format):
| now let fit a topic model on this dataset
| now let a good model on this dataset
| this is a document about sports
I ran VW LDA in the following manner:
vw --lda 2 --lda_D 3 --readable_model lda.model.txt -k --passes 10 --cache_file doc_tokens.cache -d 1.txt -p prediction.dat --lda_rho 0.1
The command runs fine and generates two output files, `prediction.dat` and `lda.model.txt`. My questions about them are:
- Except for the first column, both output files contain sequences of floating-point numbers, like
262130 0.100008 0.100009 262131 0.100013 0.100021 262132 0.100008 0.100010 262133 0.100018 0.100008 262134 0.100005 0.100008 262135 0.100010 0.100007 262136 0.100008 0.100026 262137 0.100005 0.100012 262138 0.100008 0.100014 262139 0.100005 0.100018 262140 0.100006 0.100007 262141 0.100006 0.100006 262142 0.100019 0.100023 262143 0.100019 0.100007
I thought giving `--readable_model` would output the strings representing the topics. Am I doing something wrong?
- No matter how many documents I give, the output file (`lda.model.txt`) ends at row 262143. Why is it doing that?
Best Answer
There are 2 columns of floating-point numbers because you asked for 2 topics with `--lda 2`; each column holds the weights for one topic.
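As a concrete illustration of that layout, here is a minimal sketch that pulls the per-topic weight columns out of such a file (`parse_lda_model` is a hypothetical helper, not part of VW; it assumes each data row looks like `<feature_id> <weight_topic0> <weight_topic1> ...`, as in the output shown in the question):

```python
# Hypothetical sketch: collect per-topic weights from a VW
# --readable_model LDA file. Each data row is assumed to be
# "<feature_id> <weight_topic0> <weight_topic1> ..."
def parse_lda_model(lines, num_topics=2):
    topics = [{} for _ in range(num_topics)]  # one {feature_id: weight} dict per topic
    for line in lines:
        parts = line.split()
        # skip header/preamble lines that don't start with an integer feature id
        if len(parts) != num_topics + 1 or not parts[0].isdigit():
            continue
        fid = int(parts[0])
        for k in range(num_topics):
            topics[k][fid] = float(parts[k + 1])
    return topics
```

You would call it with something like `with open("lda.model.txt") as f: topics = parse_lda_model(f)` and then sort each dict by weight to see the heaviest hash buckets per topic.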
The first column is a feature (hash) index, and it spans the full hash table independent of input size because of the feature hashing that Vowpal Wabbit does. The `--help` text for `--readable_model arg` says "Output human-readable final regressor with numeric features", so that is by design, even though it might not pass all tests for "human-readable" (see UX.SE for more discussion on that topic). You can change the number of rows with the `-b` option (example given here: "`-b 16`: We expect to see at most 2^16 unique words."). The default is `-b 18`; hash indices run from 0 to 2^18 - 1 = 262143, which is why the table always ends at row 262143. If you convert terms to numbers using an external dictionary, so that your input file has integers in place of words, VW will conveniently use those integers as the hash values directly, without requiring `--audit` or `--invert_hash`.
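The dictionary trick in that last sentence can be sketched as follows (`docs_to_vw_lines` is a hypothetical helper; the `| token token ...` input format and the integer-as-hash behavior are as described above — just keep the ids below 2^b):

```python
# Hypothetical sketch: replace each word with a stable integer id and
# emit VW-format lines, so VW uses the integer itself as the hash bucket
# and the rows of the readable model stay interpretable via the vocab.
def docs_to_vw_lines(docs):
    vocab = {}  # word -> integer id (1-based; keep the vocabulary below 2^b)
    lines = []
    for doc in docs:
        ids = [vocab.setdefault(w, len(vocab) + 1) for w in doc.split()]
        lines.append("| " + " ".join(str(i) for i in ids))
    return lines, vocab
```

Writing `lines` to a file and feeding it to `vw --lda ...` then lets you map each model row back to a word by inverting `vocab`.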