I have a large set of transactions, each containing a set of goods, and I want to do market basket analysis (affinity analysis) using Apriori. However, unlike traditional supervised machine learning algorithms such as linear regression, random forests, or gradient boosting, there does not appear to be a corresponding methodology for algorithms like Apriori where you split the data into a train and test set, train on the training set, and cross-validate on the test set. How do you know that your model is truly good? Are there other metrics that can be used to ensure you are not overfitting the model and that it has no bias?
Solved – How to validate the association rules or results obtained from Market Basket Analysis? Train-test methodology
Tags: apriori, cross-validation, r
Related Solutions
If you perform feature selection on all of the data and then cross-validate, the test data in each fold of the cross-validation procedure was also used to choose the features, and this is what biases the performance analysis.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down as heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and write down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that as our prediction. If we then cross-validate, we will get an expected error rate slightly lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5 as the target data is simply random. If you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
The key idea is that cross-validation is a way of estimating the generalization performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).
HTH
Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases to match your example. The output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently in each fold of the cross-validation. This suggests that the bias can be quite severe, depending on the nature of the learning task.
NF    = 56;     % number of candidate features
NC    = 259;    % number of cases
NFOLD = 10;     % number of cross-validation folds
NMC   = 1e+4;   % number of Monte-Carlo replications
% perform Monte-Carlo simulation of biased estimator
% (feature selection performed once, on all of the data)
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1)  >= 0;   % random binary targets
   x = randn(NC,NF) >= 0;   % random binary features
   % perform feature selection on ALL of the data
   err = mean(repmat(y,1,NF) ~= x);
   [err,idx] = min(err);
   % perform cross-validation (the feature has already been chosen
   % using every fold, including the test data)
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate); drawnow;   % progress plot
end
erate = sort(erate);
fprintf(1, ' Biased estimator: erate = %f (%f - %f)\n', ...
        mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
% perform Monte-Carlo simulation of unbiased estimator
% (feature selection repeated independently within each fold)
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1)  >= 0;
   x = randn(NC,NF) >= 0;
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      % perform feature selection using only this fold's training data
      err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate); drawnow;   % progress plot
end
erate = sort(erate);
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', ...
        mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
k-means, or clustering in general, won't get you anywhere here.
Frequent itemset mining is most appropriate for this data type.
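To make that concrete, here is a minimal MATLAB sketch of the first two levels of frequent itemset mining on a binary transaction matrix (one row per transaction, one column per item). The toy matrix T and the minsup threshold are hypothetical, and a real analysis would use a proper Apriori implementation, but the support counting and candidate pruning follow the same idea:

% Toy binary transaction matrix: rows = transactions, columns = items
% (hypothetical data for illustration)
%            burger fries  soda  salad
T = logical([  1      1      1     0
               1      1      0     0
               0      0      1     1
               1      1      1     0
               0      1      1     0 ]);
minsup = 0.4;   % assumed minimum support threshold

% level 1: frequent single items
frequent1 = find(mean(T) >= minsup);

% level 2: frequent pairs, built only from frequent single items
% (the Apriori pruning step: a frequent pair must have frequent members)
for a = frequent1
   for b = frequent1(frequent1 > a)
      sup2 = mean(T(:,a) & T(:,b));   % support of the pair {a,b}
      if sup2 >= minsup
         fprintf('items {%d,%d}: support %.2f\n', a, b, sup2);
      end
   end
end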
Yes, it will discover combos you have been offering before. But the solution is simple: clean your data.
Option 1) remove known combos from the transactions
Option 2) treat known combos as a single item, i.e. the customer bought combo-1, not burger and fries separately (see the sketch after this list)
Option 3) ignore frequent patterns / association rules that you already use or have used
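For Option 2, a hedged sketch, continuing with the binary transaction matrix T from the sketch above (the column indices for the combo members are hypothetical):

% suppose columns 1 and 2 of T are burger and fries, sold together as combo-1
combo  = T(:,1) & T(:,2);      % transactions containing the complete combo
Tclean = T;
Tclean(combo, [1 2]) = false;  % drop the members where the combo occurred
Tclean = [Tclean combo];       % append combo-1 as a new item column

Transactions containing only one of the members are left alone; only complete combos are collapsed into the single combo item.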
The fact that it rediscovers the combos you already had just demonstrates that it worked! Did you get anything remotely useful from k-means?
Best Answer
Market basket analysis traditionally isn't predictive; it's inferential. It looks at the past to determine what items were bought together, and it makes the assumption that the trends of the past will continue.
Regarding the reliability of the estimates obtained from market basket analysis, it boils down to sample size: the number of base items and co-occurrences that you have.
In theory, one could conduct statistical tests of significance (or construct confidence intervals) around all estimates of support, confidence, and lift to determine if the relationship is real.
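For instance, here is a sketch of one such test in MATLAB: a chi-squared test of independence on the 2x2 co-occurrence table of an antecedent item A and a consequent item B. The counts are hypothetical, and chi2cdf assumes the Statistics Toolbox; lift well above 1 together with a small p-value suggests the co-occurrence is unlikely to be chance:

% Hypothetical counts from N transactions
N   = 10000;   % total transactions
nA  = 1200;    % transactions containing A
nB  = 900;     % transactions containing B
nAB = 250;     % transactions containing both A and B

support    = nAB / N;
confidence = nAB / nA;
lift       = (nAB/N) / ((nA/N) * (nB/N));

% chi-squared test of independence on the 2x2 contingency table
observed = [nAB,     nA-nAB;
            nB-nAB,  N-nA-nB+nAB];
expected = sum(observed,2) * sum(observed,1) / N;   % row total * col total / N
chi2 = sum((observed(:)-expected(:)).^2 ./ expected(:));
p    = 1 - chi2cdf(chi2, 1);   % 1 degree of freedom for a 2x2 table
fprintf('lift = %.2f, chi2 = %.1f, p = %.2g\n', lift, chi2, p);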
In practice, to make sure that enough data is available, focus is usually put on the itemsets that have both high support and high lift (not just high lift). Intuitively, this gives you the relationships that are the most likely to be significant, even without statistical testing.
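A small sketch of that filtering step, with hypothetical thresholds and a made-up pairs matrix (columns: item A, item B, support, lift):

% Hypothetical mined pairs: [itemA itemB support lift]
pairs = [ 1  2  0.12  2.3
          3  4  0.01  5.0    % high lift but rare: unreliable
          2  5  0.08  1.8
          1  5  0.25  1.1 ];
minsup  = 0.05;   % assumed minimum support
minlift = 1.5;    % assumed minimum lift

keep   = pairs(:,3) >= minsup & pairs(:,4) >= minlift;
ranked = sortrows(pairs(keep,:), -4);   % surviving pairs, by descending lift
disp(ranked);

Note how the rare pair with lift 5.0 is discarded: with support of only 0.01, its lift estimate rests on a handful of co-occurrences and is too noisy to act on.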