I am using Naive Bayes for text classification: I want to tag a text (such as a blog post). What I usually find is that one tag gets a probability of 0.9999 of applying to a text, while the next closest tag gets a probability of something like 0.000000001.
I then started playing with the "laplace" argument when training the Naive Bayes model, and with the "threshold" and "eps" arguments when predicting, and found that they affect the results a lot. Unfortunately the package documentation does not help me much (maybe due to my lack of stats knowledge?).
The question is, of course: what would be reasonable values for "eps" and "threshold"?
Here is an example of a prediction made with the default values of "eps" and "threshold" (eps = 0 and threshold = 0.001):
TAG PROBABILITY
11 small group 1.000000e+00
9 party/nightlife 2.409428e-22
14 urban 9.573928e-30
4 family friendly/kids 2.428296e-32
2 couples 2.152852e-33
10 rural/country side 8.579935e-55
1 adventure 3.086100e-68
12 sports 1.443652e-93
13 transfer 1.405512e-111
5 food and drink 1.588900e-125
7 nature 1.729492e-127
3 cultural 1.142188e-177
8 outdoor/open air 8.541728e-247
6 historical 9.396091e-252
As you can see, one tag has P = 1 and the next ones are really far away from it. But if I now predict using "eps=1" and "threshold=0.1", I get this:
TAG PROBABILITY
11 small group 0.5830973522
6 historical 0.2032655484
14 urban 0.0525206899
13 transfer 0.0503435678
4 family friendly/kids 0.0438961333
9 party/nightlife 0.0159622303
8 outdoor/open air 0.0155044089
3 cultural 0.0110031289
7 nature 0.0082023324
5 food and drink 0.0049013938
12 sports 0.0041011662
10 rural/country side 0.0033009387
1 adventure 0.0031008818
2 couples 0.0008002276
And if, for example, I use "eps=1" and "threshold=0.5", I get this:
TAG PROBABILITY
8 outdoor/open air 0.234848485
3 cultural 0.166666667
7 nature 0.124242424
6 historical 0.116666667
5 food and drink 0.074242424
12 sports 0.062121212
13 transfer 0.053030303
10 rural/country side 0.050000000
1 adventure 0.046969697
14 urban 0.024242424
4 family friendly/kids 0.016666667
2 couples 0.012121212
11 small group 0.012121212
9 party/nightlife 0.006060606
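To make the three outputs above easier to interpret, here is a minimal Python sketch of a naive Bayes posterior. It is not the e1071 code. The function `posterior`, the toy tags "A" and "B", and all the numbers are hypothetical, and the rule "any conditional probability at or below eps is replaced by threshold" is an assumption based on how the package documentation describes those two arguments. The sketch shows two things: multiplying many per-word probabilities pushes the winning tag to essentially 1, and setting eps=1 replaces every conditional probability, so only the class priors remain.

```python
import math


def posterior(priors, likelihoods, eps=0.0, threshold=0.001):
    """Posterior over tags for one document (illustrative sketch only).

    priors:      dict tag -> prior probability
    likelihoods: dict tag -> list of per-feature conditional probabilities
    Assumed rule: any conditional probability <= eps is replaced by threshold.
    """
    log_scores = {}
    for tag, prior in priors.items():
        probs = [threshold if p <= eps else p for p in likelihoods[tag]]
        # Sum logs instead of multiplying probabilities to avoid underflow.
        log_scores[tag] = math.log(prior) + sum(math.log(p) for p in probs)

    # Normalise in a numerically stable way (subtract the largest log score).
    top = max(log_scores.values())
    weights = {tag: math.exp(s - top) for tag, s in log_scores.items()}
    total = sum(weights.values())
    return {tag: w / total for tag, w in weights.items()}


# Hypothetical toy data: 2 tags, 200 words, each word only slightly
# more likely under tag "A" than under tag "B".
priors = {"A": 0.6, "B": 0.4}
likelihoods = {"A": [0.30] * 200, "B": [0.25] * 200}

# Defaults: small per-word differences compound into a huge gap.
default = posterior(priors, likelihoods)

# eps=1: every probability is <= eps, so all are replaced by the same
# threshold and the evidence cancels out; the posterior equals the prior.
flat = posterior(priors, likelihoods, eps=1.0, threshold=0.5)

print(default)
print(flat)
```

With the defaults, tag "A" ends up at practically 1 even though each word favours it only slightly, which is the same pattern as in the first table. With eps=1 the result collapses to the priors. That may be what happened in the last table: every value there is a multiple of 1/660 and they sum to 1, which would be consistent with plain class frequencies in a training set of 660 texts.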
Best Answer
It's been a while, so you've probably gotten your answer! But here's my understanding, in case it helps someone else who's looking. I'm also a newbie to e1071 and have trouble determining appropriate values for these arguments myself. But at the very least, here is how to use them: