Association analysis isn't a technique I use very often, so hopefully someone else can chime in, but I think the approach you're using makes sense. The real question is: how do you want to define "difference" between your models? If your main concern is lift, then I believe your approach makes sense. You might also want to try measuring the difference in confidence and support, so that you have three sets of rules with large "differences" to explore (different lift, different support, and different confidence).
It also might be worthwhile to develop a metric that combines the differences in lift, confidence, and support. If you value one of these statistics more than the others (looks like you're mainly interested in lift) you could up/down weight particular statistics in your metric, i.e. take a weighted average. If you're going to combine these statistics, you should consider rescaling your values based on the max value each statistic achieves in a given model or across both models. You'll really have to play around a bit to decide what works best for you.
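To make the weighted-average idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function name `combined_difference`, the dict layout of the rule statistics, the choice to rescale by the maximum across both models, and the example weights. It only scores rules that have statistics in both models.

```python
def combined_difference(stats_a, stats_b, weights):
    """Weighted average of rescaled absolute differences per rule.

    stats_a / stats_b: dict mapping rule -> {"support", "confidence", "lift"}
    weights: dict mapping statistic name -> weight (hypothetical values)
    """
    metrics = list(weights)

    # Rescale each statistic by the largest value it takes across both models,
    # so that lift (unbounded) does not swamp support and confidence (<= 1).
    max_val = {}
    for m in metrics:
        values = [s[m] for s in stats_a.values()] + [s[m] for s in stats_b.values()]
        max_val[m] = max(values, default=0.0) or 1.0  # avoid dividing by zero

    total_weight = sum(weights.values())
    scores = {}
    for rule in stats_a.keys() & stats_b.keys():
        weighted = sum(
            weights[m] * abs(stats_a[rule][m] - stats_b[rule][m]) / max_val[m]
            for m in metrics
        )
        scores[rule] = weighted / total_weight
    return scores


# Made-up numbers, purely to show the call.
model_a = {
    "bread -> butter": {"support": 0.2, "confidence": 0.5, "lift": 2.0},
    "milk -> eggs": {"support": 0.1, "confidence": 0.4, "lift": 1.0},
}
model_b = {
    "bread -> butter": {"support": 0.1, "confidence": 0.5, "lift": 4.0},
    "milk -> eggs": {"support": 0.1, "confidence": 0.4, "lift": 1.0},
}
weights = {"support": 1.0, "confidence": 1.0, "lift": 2.0}  # lift counts double

scores = combined_difference(model_a, model_b, weights)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Rescaling by the per-model maximum instead would only change how `max_val` is computed; which one works better is something you'd have to try on your own data.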
One problem I see with your approach is that you are filtering out rules below a threshold from your models: it's very likely that some rules will appear above your thresholds in one model but not in the other. As a consequence, the rules that differ most between your two models probably won't appear in your calculations at all. Perhaps one rule has very high support, confidence and lift in one model but negligible support, confidence and lift in the other. Theoretically, this is precisely the kind of rule you are trying to target, but you won't be able to calculate your difference metric for it at all (if I understand your stated process correctly).
Here's how I would recommend modifying your procedure:
Instead of removing rules below a threshold from the rulesets for both models, retain all the rules in each ruleset, but for the purposes of your calculation only consider rules that are above your given thresholds in at least one of your models. This way, you will target your most powerful rules, but still be able to calculate the difference between your two models in the case that some rules have very low support, confidence, or lift in one model but not the other. Alternatively, at the very least you should ensure that you calculate the appropriate statistics for the rules that appear in each model above your threshold (i.e. trim down the ruleset for each model, but then for all rules that appear in model A but not B, calculate the statistics for those rules in model B).
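A rough sketch of that modified procedure, assuming you still have the raw transactions behind each model so the statistics of any rule can be recomputed directly. The names `rule_stats` and `compare_rules`, the threshold defaults, and the toy baskets are all made up for illustration and are not from any particular library.

```python
def rule_stats(transactions, antecedent, consequent):
    """Support, confidence and lift of antecedent -> consequent.

    transactions: non-empty list of sets of items.
    """
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


def compare_rules(rules, transactions_a, transactions_b,
                  min_support=0.1, min_confidence=0.5, min_lift=1.0):
    """Keep rules that clear the thresholds in AT LEAST ONE model,
    but return their statistics for BOTH models."""
    def passes(s):
        return (s["support"] >= min_support
                and s["confidence"] >= min_confidence
                and s["lift"] >= min_lift)

    kept = {}
    for antecedent, consequent in rules:
        stats_a = rule_stats(transactions_a, antecedent, consequent)
        stats_b = rule_stats(transactions_b, antecedent, consequent)
        if passes(stats_a) or passes(stats_b):
            kept[(antecedent, consequent)] = (stats_a, stats_b)
    return kept


# Toy data: each rule is a pair of frozensets (antecedent, consequent).
transactions_a = [{"bread", "butter"}, {"bread", "butter"}, {"bread"}, {"milk"}]
transactions_b = [{"bread"}, {"butter"}, {"milk"}, {"milk", "eggs"}]
bread_butter = (frozenset({"bread"}), frozenset({"butter"}))
milk_eggs = (frozenset({"milk"}), frozenset({"eggs"}))
bread_milk = (frozenset({"bread"}), frozenset({"milk"}))

kept = compare_rules([bread_butter, milk_eggs, bread_milk],
                     transactions_a, transactions_b)
```

Here `bread -> butter` is strong in A and absent in B, yet it stays in `kept` with zeros recorded for B, so a difference can still be computed for it.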
It's quite possible that this is already your approach and I misunderstood your description of the process, in which case my only suggested modification is to consider metrics that take the other statistics of interest into account. I felt it was worth pointing out in case your approach was overly naive.
As I understand it, your problem is that the new document contains words that have not been seen in any of the previous documents. As a result, if you build a tfidf matrix using the words in all the previous documents plus the new document, you end up with a matrix of higher dimension than the one you previously had for the other documents.
I'd suggest updating the tfidf matrices for the previous documents using the new words. The new columns will be all zero for the previous documents, as they don't have the new words. You can just keep doing this in an online manner, and then you can measure distances using these revised matrices. Definitely use sparse matrices, as this'll get big.
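Here is a small pure-Python sketch of the online idea, storing each document as a sparse row (a dict of column index to count) so that columns for words a document never contained are implicitly zero. The class `OnlineTfidf`, its methods, and the smoothed IDF formula are my own choices for illustration; in practice you would more likely use a sparse matrix library. Note that this sketch recomputes the weights on demand, because document frequencies shift as documents arrive.

```python
import math
from collections import Counter


class OnlineTfidf:
    """Hypothetical helper: vocabulary grows as documents arrive."""

    def __init__(self):
        self.vocab = {}            # term -> column index
        self.doc_counts = []       # one sparse row per document: {column: count}
        self.doc_freq = Counter()  # column -> number of documents containing it

    def add_document(self, tokens):
        counts = Counter()
        for tok in tokens:
            # Unseen words get a fresh column; old rows are left untouched,
            # which is the same as giving them a zero in that column.
            col = self.vocab.setdefault(tok, len(self.vocab))
            counts[col] += 1
        self.doc_counts.append(dict(counts))
        for col in counts:
            self.doc_freq[col] += 1
        return len(self.doc_counts) - 1  # id of the new document

    def tfidf_row(self, doc_id):
        n = len(self.doc_counts)
        # Smoothed IDF, an assumption: ln((1 + n) / (1 + df)) + 1
        return {
            col: tf * (math.log((1 + n) / (1 + self.doc_freq[col])) + 1.0)
            for col, tf in self.doc_counts[doc_id].items()
        }

    def cosine_similarity(self, i, j):
        a, b = self.tfidf_row(i), self.tfidf_row(j)
        dot = sum(w * b[col] for col, w in a.items() if col in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


model = OnlineTfidf()
d0 = model.add_document(["apple", "banana"])
d1 = model.add_document(["apple", "cherry"])  # "cherry" is a new word
similarity = model.cosine_similarity(d0, d1)
```

After the second document is added, the vocabulary has three columns, and the first document simply has no entry for "cherry".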
Best Answer
You can consider frequent itemsets to be a specific form of clustering designed for market basket data. On such data, it is much more meaningful than what you would get with a traditional partitioning algorithm like k-means. K-means needs to put every item into a cluster, and you need to know the number of clusters beforehand. Frequent itemset mining can cope with items that are rarely (or never) sold, and for which you do not have enough data to assign them in any meaningful way. That is why you use frequent itemset mining and not clustering.
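To illustrate the point about rare items, here is a tiny brute-force sketch (deliberately not Apriori or FP-growth, just counting every small combination). The function name `frequent_itemsets`, the `max_size` cap, and the toy baskets are assumptions for illustration. An item that is barely ever sold never reaches the support threshold, so it is simply left out rather than being forced into a group.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset tuple: support} for itemsets with support >= min_support.

    Brute force: enumerates all combinations up to max_size per basket,
    so it is only suitable for small examples.
    """
    n = len(transactions)
    counts = Counter()
    for basket in transactions:
        items = sorted(basket)  # sorted so each itemset has one canonical tuple
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {
        itemset: count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"caviar"},  # sold once, too rare to say anything about
]
frequent = frequent_itemsets(baskets, min_support=0.5)
```

With these baskets, "caviar" does not show up in `frequent` at all, whereas k-means would still have to assign it to some cluster.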