Solved – Market Basket Analysis: comparing rules between two models

association-rules

Given two independent MBA models 1 and 2 (each model is a set of rules with calculated support, confidence and lift metrics) that were generated on subsets of a large population of transactions, how can I effectively compare rules between the models? In particular, how can I detect the biggest movers (up and down) from one model to the other?

So far, my approach has been as follows:

  • set thresholds for support, confidence, and lift (e.g. given rule $ A: support(A) > 0.001, confidence(A) > 0.1, lift(A) > 10 $) and filter out rules below these thresholds from both models;
  • for each rule $ A $ compare its log-lifts between models 1 and 2: $$ \left| \log(lift_1(A)) - \log(lift_2(A)) \right| $$ and select the rules with the largest difference.
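As a rough illustration of the second step, here is a minimal Python sketch of the log-lift comparison. The function name `biggest_movers`, the dictionary layout, and the example lifts are all hypothetical; it assumes each model's lifts are available as a mapping from rule to lift, and it keeps the sign of the change so that movers up and down can be told apart.

```python
import math

def biggest_movers(lift_1, lift_2, top_n=3):
    """Rank rules found in both models by the size of their log-lift change."""
    shared = lift_1.keys() & lift_2.keys()
    # Signed change: positive means the lift went up from model 1 to model 2.
    changes = {
        rule: math.log(lift_2[rule]) - math.log(lift_1[rule])
        for rule in shared
    }
    # Sort by absolute change, largest first.
    ranked = sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_n]

# Hypothetical lifts, keyed by (antecedent, consequent).
lift_1 = {("bread", "butter"): 12.0, ("beer", "chips"): 20.0, ("tea", "milk"): 15.0}
lift_2 = {("bread", "butter"): 30.0, ("beer", "chips"): 10.0, ("tea", "milk"): 15.5}

movers = biggest_movers(lift_1, lift_2)
for rule, change in movers:
    print(rule, round(change, 3))
```

Note that only rules present in both mappings can be scored this way, since the log-lift difference is undefined for a rule that is missing from one model.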

The goal is to identify the set of rules that change the most between the two models. I am looking for both validation of this approach and better alternatives. Thank you!

Best Answer

Association analysis isn't a technique I use very often, so hopefully someone else can chime in, but I think the approach you're using makes sense. The real question is: how do you want to define "difference" between your models? If your main concern is lift, then your approach is reasonable. You might also want to try measuring the difference in confidence and support, so you could have three sets of rules with large "differences" to explore (different lift, different support, and different confidence).

It also might be worthwhile to develop a metric that combines the differences in lift, confidence, and support. If you value one of these statistics more than the others (looks like you're mainly interested in lift) you could up/down weight particular statistics in your metric, i.e. take a weighted average. If you're going to combine these statistics, you should consider rescaling your values based on the max value each statistic achieves in a given model or across both models. You'll really have to play around a bit to decide what works best for you.
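To make the combined metric concrete, here is a minimal sketch of one way it could look. It assumes each model is a dictionary mapping a rule to its `support`, `confidence` and `lift`; the function name `combined_difference`, the weights, and the example numbers are hypothetical. Each statistic is rescaled by its maximum across both models (assumed to be positive) and the rescaled absolute differences are combined as a weighted average.

```python
def combined_difference(model_1, model_2, weights):
    """Weighted average of rescaled absolute differences, per shared rule."""
    shared = model_1.keys() & model_2.keys()
    # Rescale each statistic by the largest value it reaches in either model.
    max_value = {
        stat: max(m[rule][stat] for m in (model_1, model_2) for rule in m)
        for stat in weights
    }
    total_weight = sum(weights.values())
    scores = {}
    for rule in shared:
        weighted_sum = sum(
            weight * abs(model_1[rule][stat] - model_2[rule][stat]) / max_value[stat]
            for stat, weight in weights.items()
        )
        scores[rule] = weighted_sum / total_weight
    return scores

# Hypothetical rule statistics for two models.
model_1 = {
    "A": {"support": 0.02, "confidence": 0.5, "lift": 10.0},
    "B": {"support": 0.01, "confidence": 0.2, "lift": 20.0},
}
model_2 = {
    "A": {"support": 0.01, "confidence": 0.4, "lift": 10.0},
    "B": {"support": 0.01, "confidence": 0.2, "lift": 40.0},
}
# Lift is weighted twice as heavily as the other two statistics.
weights = {"support": 1.0, "confidence": 1.0, "lift": 2.0}

scores = combined_difference(model_1, model_2, weights)
```

With these made-up numbers, rule B scores higher than rule A because its only change is in lift, which carries the largest weight. Different weights or a different rescaling would change the ranking, which is why some experimentation is needed.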

One problem I see with your approach is that you are filtering out rules below a threshold from both models: it's very likely that some rules will clear your thresholds in one model but not in the other. As a consequence, the rules that differ the most between your two models probably won't appear in your calculations at all. Perhaps one rule has very high support, confidence and lift in one model but negligible support, confidence and lift in the other. In principle, this is precisely the kind of rule you are trying to target, but you won't be able to calculate your difference metric for it at all (if I understand your stated process correctly).

Here's how I would recommend modifying your procedure:

Instead of removing rules below a threshold from the rulesets for both models, retain all the rules in each ruleset, but for the purposes of your calculation only consider rules that are above your given thresholds in at least one of your models. This way, you will target your most powerful rules but still be able to calculate the difference between your two models when a rule has very low support, confidence, or lift in one model but not the other. Alternatively, at the very least you should ensure that you calculate the appropriate statistics for the rules that appear above your thresholds in either model (i.e. trim down the ruleset for each model, but then for all rules that appear in model 1 but not in model 2, calculate the statistics for those rules in model 2, and vice versa).
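The modified filter could be sketched as follows. This is only an illustration under assumptions: each model is a dictionary from rule to its statistics, the thresholds are the example values from the question, and the names `passes` and `candidate_rules` are hypothetical. It keeps every rule that clears all thresholds in at least one model and reports which of those rules still need their statistics computed for the other model.

```python
# Example thresholds taken from the question.
THRESHOLDS = {"support": 0.001, "confidence": 0.1, "lift": 10}

def passes(stats, thresholds=THRESHOLDS):
    """True if a rule's statistics exceed every threshold."""
    return all(stats[name] > limit for name, limit in thresholds.items())

def candidate_rules(model_1, model_2, thresholds=THRESHOLDS):
    """Rules above the thresholds in at least one model, plus the gaps to fill."""
    strong_1 = {rule for rule, stats in model_1.items() if passes(stats, thresholds)}
    strong_2 = {rule for rule, stats in model_2.items() if passes(stats, thresholds)}
    candidates = strong_1 | strong_2
    # Candidates with no statistics in the other model: these would have to be
    # recomputed from that model's transactions before any comparison.
    missing_in_1 = candidates - set(model_1)
    missing_in_2 = candidates - set(model_2)
    return candidates, missing_in_1, missing_in_2

# Hypothetical rulesets: X is strong only in model 1, Y is weak in both,
# Z is strong in model 2 and absent from model 1.
model_1 = {
    "X": {"support": 0.01, "confidence": 0.5, "lift": 30.0},
    "Y": {"support": 0.0005, "confidence": 0.05, "lift": 2.0},
}
model_2 = {
    "X": {"support": 0.0005, "confidence": 0.05, "lift": 2.0},
    "Y": {"support": 0.0004, "confidence": 0.04, "lift": 1.5},
    "Z": {"support": 0.02, "confidence": 0.6, "lift": 25.0},
}

candidates, missing_in_1, missing_in_2 = candidate_rules(model_1, model_2)
```

Here rule X is kept even though it falls below the thresholds in model 2, and rule Z is flagged as needing its model 1 statistics computed, which is exactly the case a filter applied to both models would silently drop.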

It's possible that this is already your approach and I misunderstood your description of the process, in which case my only suggested modification is to consider metrics that take the other statistics of interest into account. I felt it was worth pointing out in case the filtering step was dropping these rules.
