Solved – Market Basket Analysis: comparing rules between two models

association-rules

Given two independent MBA models 1 and 2 (each model is a set of rules with calculated support, confidence and lift metrics) that were generated on subsets of a large population of transactions, how can rules be compared effectively between the models? In particular, how can the biggest movers (up and down) from one model to the other be detected?

So far, my approach has been as follows:

set thresholds for support, confidence, and lift (e.g. for a given rule $A$: $\mathrm{support}(A) > 0.001$, $\mathrm{confidence}(A) > 0.1$, $\mathrm{lift}(A) > 10$) and filter out rules below these thresholds from both models;
for each rule $A$, compare its log-lifts between models 1 and 2: $$ \left| \log(\mathrm{lift}_1(A)) - \log(\mathrm{lift}_2(A)) \right| $$ and select the rules with the largest difference.

The goal is to identify the set of rules that change the most between the two models. I am looking for both validation of the given approach and better alternatives – thank you!

Best Answer

Association analysis isn't a technique I really use very often so hopefully someone else can chime in, but I think the approach you're using makes sense. The real question is: how do you want to define "difference" between your models? If your main concern is lift, then I believe your approach makes sense. You might also want to try measuring the difference in confidence and support as well, so you could have three sets of rules with large "differences" to explore (different lift, different support, and different confidence).

It also might be worthwhile to develop a metric that combines the differences in lift, confidence, and support. If you value one of these statistics more than the others (looks like you're mainly interested in lift) you could up/down weight particular statistics in your metric, i.e. take a weighted average. If you're going to combine these statistics, you should consider rescaling your values based on the max value each statistic achieves in a given model or across both models. You'll really have to play around a bit to decide what works best for you.
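As a rough illustration of such a combined metric, here is a minimal Python sketch. The rule names, metric values and weights are made up for the example, and rescaling by the maximum across both models is only one of the options mentioned above.

```python
STATS = ("support", "confidence", "lift")

def max_scales(model_1, model_2):
    """Largest value of each statistic across both models, used for rescaling."""
    return {
        s: max(r[s] for m in (model_1, model_2) for r in m.values())
        for s in STATS
    }

def combined_change(r1, r2, scales, weights):
    """Weighted average of the rescaled absolute differences for one rule."""
    total_weight = sum(weights[s] for s in STATS)
    weighted = sum(weights[s] * abs(r1[s] - r2[s]) / scales[s] for s in STATS)
    return weighted / total_weight

# Hypothetical rules and metrics, purely for illustration.
model_1 = {
    "bread -> butter": {"support": 0.02, "confidence": 0.40, "lift": 12.0},
    "beer -> chips": {"support": 0.01, "confidence": 0.20, "lift": 15.0},
}
model_2 = {
    "bread -> butter": {"support": 0.02, "confidence": 0.40, "lift": 12.0},
    "beer -> chips": {"support": 0.04, "confidence": 0.50, "lift": 30.0},
}

# Assumed weights: lift counts twice as much as support or confidence.
weights = {"support": 1.0, "confidence": 1.0, "lift": 2.0}
scales = max_scales(model_1, model_2)

scores = {
    rule: combined_change(model_1[rule], model_2[rule], scales, weights)
    for rule in model_1.keys() & model_2.keys()
}
ranked = sorted(scores, key=scores.get, reverse=True)  # biggest change first
```

Changing the weights changes the ranking, so it is worth trying a few settings before settling on one.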

One problem I see with your approach is that you are filtering out rules below a threshold from your models: it's very likely that some rules will appear above your thresholds in one model but not in the other. As a consequence, rules that are the most different between your two models probably won't appear in your calculations at all. Perhaps one rule has very high support, confidence and lift in one model but negligible support, confidence and lift in the other. Theoretically, this is precisely the kind of rule you are trying to target, but you won't be able to calculate your difference metric at all (if I understand your stated process correctly).

Here's how I would recommend modifying your procedure:

Instead of removing rules below a threshold from the rulesets for both models, retain all the rules in each ruleset, but for the purposes of your calculation only consider rules that are above your given thresholds in at least one of your models. This way, you will target your most powerful rules, but still be able to calculate the difference between your two models in the case that some rules have very low support, confidence, or lift in one model but not the other. Alternatively, at the very least you should ensure that you calculate the appropriate statistics for the rules that appear in each model above your threshold (i.e. trim down the ruleset for each model, but then for all rules that appear in model 1 but not model 2, calculate the statistics for those rules in model 2, and vice versa).
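A minimal Python sketch of this modified procedure might look as follows. It assumes each model is available as an unfiltered mapping from rule to metrics; the rule names and values are invented, and the `eps` floor for a lift of zero is my own assumption to keep the logarithm defined.

```python
import math

# Thresholds taken from the example in the question.
THRESHOLDS = {"support": 0.001, "confidence": 0.1, "lift": 10.0}

def passes(metrics):
    """True if a rule clears every threshold in one model."""
    return all(metrics[stat] > limit for stat, limit in THRESHOLDS.items())

def biggest_movers(model_1, model_2, eps=1e-9):
    """Rank rules by |log-lift change|, keeping rules that pass in either model.

    The returned change is signed: positive means the lift went up from
    model 1 to model 2, negative means it went down.
    """
    movers = {}
    for rule in model_1.keys() & model_2.keys():
        m1, m2 = model_1[rule], model_2[rule]
        if passes(m1) or passes(m2):
            lift_1 = max(m1["lift"], eps)  # assumed floor, avoids log(0)
            lift_2 = max(m2["lift"], eps)
            movers[rule] = math.log(lift_2) - math.log(lift_1)
    return sorted(movers.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical unfiltered rulesets, purely for illustration.
model_1 = {
    "tea -> scone": {"support": 0.002, "confidence": 0.30, "lift": 20.0},
    "milk -> bread": {"support": 0.005, "confidence": 0.20, "lift": 12.0},
    "gum -> soap": {"support": 0.0001, "confidence": 0.01, "lift": 1.0},
}
model_2 = {
    "tea -> scone": {"support": 0.0001, "confidence": 0.02, "lift": 1.5},
    "milk -> bread": {"support": 0.005, "confidence": 0.20, "lift": 11.0},
    "gum -> soap": {"support": 0.0002, "confidence": 0.02, "lift": 1.2},
}

ranked = biggest_movers(model_1, model_2)
```

Here "tea -> scone" clears the thresholds only in model 1, so filtering both models first would have dropped it, even though it is the largest mover.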

It's possible that this is already your approach and I misunderstood the description you gave of your process, in which case my only suggested modification is to consider metrics that take the other statistics of interest into account. I felt it was worth pointing out in case your approach was overly naive.

Related Solutions

Solved – Market Basket Analysis using Clustering to discover new product combinations

k-means or clustering won't get you anywhere.

Frequent itemset mining is most appropriate for this data type.

Yes, it will discover combos you have been offering before. But the solution is simple: clean your data.

Option 1) remove known combos

Option 2) treat known combos as a single item (i.e. customer bought combo-1, not burger and fries separately)

Option 3) ignore frequent patterns / association rules that you already use(d).
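Option 2 could be sketched in Python roughly as below. The combo definition and transactions are hypothetical, and the sketch assumes the data does not already distinguish a combo purchase from buying the same items separately, so both cases get merged.

```python
def merge_known_combos(transaction, combos):
    """Replace the items of each known combo with a single combo item."""
    items = set(transaction)
    for name, parts in combos.items():
        if parts <= items:  # every item of the combo is in the basket
            items = (items - parts) | {name}
    return items

# Hypothetical known combo and transactions, purely for illustration.
combos = {"combo-1": frozenset({"burger", "fries"})}
transactions = [
    {"burger", "fries", "cola"},
    {"burger", "salad"},
]

cleaned = [merge_known_combos(t, combos) for t in transactions]
```

The cleaned transactions can then be passed to whichever frequent itemset miner is in use.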

The ability to discover the combos that you had just demonstrates that it worked! Did you get anything remotely useful from k-means?!?

Market Basket Analysis – Interpretation of Mirrored Association Rules

The existing answer explains how the table is calculated. If you are still confused, one way to look at it is to start with the number of people who bought things.

Say 100 people visited the cafe, and 36 bought coffee, 18 bought pie, and 8 bought both. Then this is how the numbers in your table are calculated, using the formulas given by b-r-oleary:

Rule (A -> C)	P(A)	P(C)	P(A,C)	P(C|A)	P(A,C)/(P(A)P(C))	P(A,C)-P(A)P(C)	(1-P(C))/(1-P(C|A))
coffee -> pie	36/100	18/100	8/100	8/36	100 x 8/(18x36)	8/100 - (36/100)(18/100)	(1-18/100)/(1-8/36)
pie -> coffee	18/100	36/100	8/100	8/18	100 x 8/(18x36)	8/100 - (18/100)(36/100)	(1-36/100)/(1-8/18)

Out of 18 people who bought pie, 8 also bought coffee, so the confidence is 8/18. But out of 36 people who bought coffee, only 8 also bought pie, so the confidence is 8/36.

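The numbers which aren't necessarily equal between the two rows are P(A), P(C), the confidence P(C|A) and the last column; P(A,C) and the two columns after the confidence are symmetric in A and C, so they always match. This is just a consequence of how they are defined. The names "support", "lift" etc. are just names, which hopefully hint at how the numbers should be interpreted.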
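The same arithmetic can be checked with a short Python sketch using exact fractions. The function and key names are my own; "leverage" and "conviction" are the usual names for the last two columns, which is an assumption about the table since it does not label them.

```python
from fractions import Fraction

def rule_metrics(n_total, n_a, n_c, n_both):
    """Metrics for the rule A -> C, computed from raw counts."""
    p_a = Fraction(n_a, n_total)
    p_c = Fraction(n_c, n_total)
    p_ac = Fraction(n_both, n_total)
    confidence = p_ac / p_a  # P(C|A)
    return {
        "support": p_ac,
        "confidence": confidence,
        "lift": p_ac / (p_a * p_c),
        "leverage": p_ac - p_a * p_c,
        "conviction": (1 - p_c) / (1 - confidence),
    }

# 100 visitors: 36 bought coffee, 18 bought pie, 8 bought both.
coffee_to_pie = rule_metrics(100, n_a=36, n_c=18, n_both=8)
pie_to_coffee = rule_metrics(100, n_a=18, n_c=36, n_both=8)

# Symmetric metrics match; confidence and conviction depend on direction.
symmetric = [
    k for k in coffee_to_pie if coffee_to_pie[k] == pie_to_coffee[k]
]
```

Swapping the antecedent and consequent leaves support, lift and leverage unchanged, while confidence and conviction change.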
