Solved – Apriori versus Jaccard Coeff for recommender-system

aprioriassociation-rulesjaccard-similaritymachine learningrecommender-system

I developed a market basket program for a retailer. I used the pymining library association rules function. I assume this uses the Apriori algorithm. I got 340,000 rules for an input of 17,000 invoices each with a list of corresponding items.

My colleague developed a R program using Jaccard coefficient for the same set of 17,000 invoice items. The R program gave a set of 9000 similarity coefficients for combinations of 2 items. The store has 350 active items. This algorithm assumes that if 2 items were bought in the same basket, they were 'similar'. This had to be done because there is no rating available in this store.

I then displayed both our results in a common grid. The columns were Items in basket, items recommended by pymining, support, confidence, items recommended by Jaccard coefficient.

So for example, if I had items [1, 2, 3] in the basket, and [4, 5] in pymining recommended, I picked the items with highest similarity for [1,2,3] from the R output, and got say [4,7,8].

What I find is that there is very little correspondence between the recommendations of the market basket and the Jaccard coefficient.

  • Is this expected behavior?
  • Can we expect the outputs of these 2 algorithms to give similar outputs?

Best Answer

I would not expect these algorithms to give the same result.

The Jaccard index is the ratio of co-ocurrences to non-co-occurrences. That is, it's the number of baskets in which, e.g., items 4 and 5 appear together, divided by the number of baskets in which they do not appear together. The Jaccard index is only ever defined between two items.

This Apriori frequent itemsets algorithm gives different results because it's looking for something different. The Jaccard index is a unitless, relative index. An association rule, to quote the presentation, "is a pattern that states when $X$ occurs, $Y$ occurs with certain probability." But $X$ and $Y$ aren't just individual items: they're subsets of the itemset, meaning that association rules can involve more than two items. Moreover, an association rule is a specific kind of itemset that minimally satisfies particular criteria for minimum probability of occurrence, including two that are set as parameters for running the algorithm.

The biggest difference is the fact that the Jaccard index only ever compares two items at a time. The association rules generated by the Apriori algorithm compare itemsets that can be arbitrarily large (up to the size of the encompassing itemset). Again, I would not expect to get similar results from these two techniques.

edit: just adding a link to the paper from the comments for completeness: http://www.cs.uic.edu/~liub/teach/cs583-fall-05/CS583-association-rules.ppt

Related Question