Is WEKA’s logistic regression tool slow? If so, are there faster alternatives?

java logistic regression weka

I have to run logistic regression several thousand times, within a Java/Scala program. A general spec for one
of the problems is:

~300 continuous attributes

Fewer than ~10,000 positives, plus twice as many negatives, picked at random

The attributes are the same for all problems, and the positives/negatives are always drawn from the same pool of about 100,000.

I'm finding that naively plugging this into WEKA takes a few minutes per problem, which is unsuitable given the number of tasks which need to be performed. Is there a faster library I can use, or is there a way to parallelize these problems, or something else?

Best Answer

The fastest I know of is Vowpal Wabbit by John Langford and his teams, first at Yahoo! and then at Microsoft. The implementation tweaks and tricks in that code are exceptional.

It is implemented in C++, so one option is to call it as an external tool, since it accepts data from files on disk or from stdin. A more interesting option is to use it as a network service, since it provides a network interface for data input. The first is probably the easiest, while the second could be far more efficient (no reading from or writing to disk) but takes more programming effort if you haven't done something similar before. In either case, you can first export some data in its format and check whether its speed meets your expectations.
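To make the "export some data in its format" step concrete, here is a minimal Java sketch, not a definitive implementation. It writes one example per line in Vowpal Wabbit's plain-text input format (`label | name:value ...`) and builds a command line for training with logistic loss. The class name `VwLines`, the method names, and the use of attribute indices as feature names are my own assumptions for illustration. It also assumes a `vw` executable is on the PATH and that the attribute values are finite numbers.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of WEKA or Vowpal Wabbit) for exporting data to VW.
final class VwLines {
    private VwLines() {}

    /**
     * Formats one example as a line of VW text input:
     * "<label> | <index>:<value> <index>:<value> ...".
     * With --loss_function logistic, VW expects labels of -1 or 1.
     */
    static String toVwLine(boolean positive, double[] features) {
        StringBuilder sb = new StringBuilder(positive ? "1 |" : "-1 |");
        for (int i = 0; i < features.length; i++) {
            if (features[i] == 0.0) {
                continue; // VW input is sparse, so zero-valued features can be omitted
            }
            // The attribute index is used as the feature name (an assumption of this sketch).
            sb.append(' ').append(i).append(':').append(features[i]);
        }
        return sb.toString();
    }

    /**
     * Builds a command that trains on dataPath and saves the model to modelPath.
     * -d names the input data file, -f names the file the final model is written to.
     */
    static List<String> trainCommand(String dataPath, String modelPath) {
        return Arrays.asList(
                "vw", "-d", dataPath, "--loss_function", "logistic", "-f", modelPath);
    }
}
```

After writing the lines to a file such as `train.vw` (a placeholder name), the command could be run with something like `new ProcessBuilder(VwLines.trainCommand("train.vw", "model.vw")).inheritIO().start().waitFor()`. Timing one such run on a single exported problem should show quickly whether the speed is acceptable before investing in the network interface.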