Solved – the meaning of laplace, eps and threshold in naiveBayes from the R e1071 package

machine learning, naive bayes, r, text mining

I am using NaiveBayes for text classification: I am interested in tagging a text (like a blog post). What I am finding is that I normally get results in which one tag has a probability of 0.9999 of being applied to a text, while the next closest tag that could also apply has a probability of maybe 0.000000001.

I then started playing with the "laplace" argument when training the NaiveBayes model, and also with "threshold" and "eps" when predicting, and found that they affect the results a lot… Unfortunately the package documentation does not help me much (maybe due to my lack of stats knowledge?).

The question is, of course: what would be reasonable values for "eps" and "threshold"?

As an example, here is a prediction done with the default values of "eps" and "threshold" (eps = 0 and threshold = 0.001):

                    TAG   PROBABILITY
11          small group  1.000000e+00
9       party/nightlife  2.409428e-22
14                urban  9.573928e-30
4  family friendly/kids  2.428296e-32
2               couples  2.152852e-33
10   rural/country side  8.579935e-55
1             adventure  3.086100e-68
12               sports  1.443652e-93
13             transfer 1.405512e-111
5        food and drink 1.588900e-125
7                nature 1.729492e-127
3              cultural 1.142188e-177
8      outdoor/open air 8.541728e-247
6            historical 9.396091e-252

As you can see, one tag has P=1 and the next ones are really far away from it. But if I now predict using "eps=1" and "threshold=0.1", I get this:

                    TAG  PROBABILITY
11          small group 0.5830973522
6            historical 0.2032655484
14                urban 0.0525206899
13             transfer 0.0503435678
4  family friendly/kids 0.0438961333
9       party/nightlife 0.0159622303
8      outdoor/open air 0.0155044089
3              cultural 0.0110031289
7                nature 0.0082023324
5        food and drink 0.0049013938
12               sports 0.0041011662
10   rural/country side 0.0033009387
1             adventure 0.0031008818
2               couples 0.0008002276

And, for example, if I use "eps=1" and "threshold=0.5", I get this:

                    TAG PROBABILITY
8      outdoor/open air 0.234848485
3              cultural 0.166666667
7                nature 0.124242424
6            historical 0.116666667
5        food and drink 0.074242424
12               sports 0.062121212
13             transfer 0.053030303
10   rural/country side 0.050000000
1             adventure 0.046969697
14                urban 0.024242424
4  family friendly/kids 0.016666667
2               couples 0.012121212
11          small group 0.012121212
9       party/nightlife 0.006060606

Best Answer

It's been a while, so you've probably gotten your answer! But here's my understanding, in case it helps someone else who's looking. I'm also a newbie to e1071 and am having trouble determining appropriate values for the following arguments. But at the very least, here is how to use them:

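  1. threshold is the value that replaces any probability less than or equal to whatever you've set for eps.
  2. eps is the cutoff: probabilities above it are kept "as is", and anything less than or equal to it is replaced by your threshold value. With the default eps = 0, only probabilities of exactly zero are replaced.
  3. laplace is the Laplace smoothing constant used when training on categorical predictors. As far as I can tell, it is added to the count of every class/level combination (not only the ones that never occur), so no combination ends up with a probability of zero. The default of 0 disables it. I always include laplace = 1, mostly because I haven't discovered the nuances of when not to use it!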
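
To make the three arguments more concrete, here is a minimal base-R sketch of what they appear to do. It is not the e1071 source: the helper names clamp_probs and smoothed_table, and the toy data, are made up for illustration, and the sketch assumes that values less than or equal to eps are replaced by threshold and that laplace is added to every count before normalising.

```r
# Hypothetical helper: the replacement rule applied at prediction time.
# Assumption: any probability <= eps is overwritten with threshold.
clamp_probs <- function(prob, threshold = 0.001, eps = 0) {
  prob[prob <= eps] <- threshold
  prob
}

# Hypothetical helper: Laplace smoothing for one categorical predictor.
# `laplace` is added to every class/level count, then each class row
# is renormalised so that it still sums to 1.
smoothed_table <- function(x, y, laplace = 0) {
  tab <- table(y, x)
  (tab + laplace) / (rowSums(tab) + laplace * ncol(tab))
}

# Toy data (invented): does the word "beach" occur in a post?
tag   <- factor(c("nature", "nature", "nature", "urban", "urban"))
beach <- factor(c("yes", "yes", "no", "no", "no"), levels = c("no", "yes"))

raw      <- smoothed_table(beach, tag, laplace = 0)
smoothed <- smoothed_table(beach, tag, laplace = 1)

print(raw)       # P(beach = yes | urban) is exactly 0 without smoothing
print(smoothed)  # every cell is strictly positive with laplace = 1

# With the defaults, only exact zeros are replaced.
clamped_default <- clamp_probs(c(0, 0.0005, 0.2, 0.7))

# With eps = 1, every probability is <= eps, so all become the threshold.
clamped_all <- clamp_probs(c(0, 0.0005, 0.2, 0.7), threshold = 0.5, eps = 1)

print(clamped_default)
print(clamped_all)
```

If e1071 behaves like this sketch, then eps = 1 replaces every categorical conditional probability with the same value, so the predictors stop distinguishing between tags. That would explain why the predictions in the question flatten out so much once eps is raised, but this reading is an inference from the sketch, not something the package documentation states. Under the same assumption, leaving eps at 0 and relying on laplace = 1 (or a small threshold) only touches the zero probabilities and keeps the rest of the model intact.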