Data Similarity – How to Determine How Similar Groups of Data Are to Each Other

distributionsprobabilitysubset

I have a pool of 15,000 unique* items. From that set, there are roughly 3200 sets of 60 items, selected non-randomly (in fact, for the sake of the framing of the question, let's say that each item is intentionally chosen for a specific reason). I have the list of all items that are in at least one set; said list is about 5500 items long, and also includes the total number of sets each item is in. The most popular item is in roughly 60% of the sets, and the list goes all the way down to items that only appear in 1 set.

Additionally, there are similarities that exist between the different sets that manifest themselves in the form of tendencies. For instance (with random values given), 30% of the sets are similar according to one "tendency group", and tend to include items #1, #50, #2006, etc. more often than other tendency groups; 20% might be similar to each other in a different way, and tend to include items #65, #700, #5000, etc. more often than other tendency groups. Thus, there's not exactly one set of "expected values" (which would make this question similar to this one), but rather upwards of a dozen groups of "somewhat similar values." I looked also at this question, but, as someone with a limited grasp of the subject, it doesn't seem like it's exactly what I'm looking for.

How do I compare each set to each other in such a way that I'm able to compare each item from a given set and see how it compares to other items that appear in similar sets?

Specifically, if given a specific set, if I were to replace one item, which item would I take out, and what would I replace it with, to make the set more similar (less unique) compared to other sets. Extra points if the method is capable of making the set more similar to one specific (or perhaps simply being selective towards one) tendency group (or more).

In typing out this question, I've become aware of the Jaccard index, but that seems like it's more suited for comparing two sets of data to each other, not looking at an individual item across multiple sets and looking for similarities in that respect, but I could very much be wrong.

*I say unique; even though they may differ in one aspect, some items functionally or even have exactly the same characteristics except for their name, but I'm not sure how much that matters.

Edit1: the unique items are Magic: the Gathering cards, and the sets represent individual decks. Does that help?

Edit2: Since posting this question, I have become aware of the existence of "market basket" analysis, but, as a laymen in this field, I have no idea where to even begin to conduct said analysis.

Best Answer

It is unclear from your question exactly what your objective is for this analysis, but if I understand correctly, you would like to be able to quantify the similarity (or "distance") between two MtG decks, so that you can alter a deck to make it more or less similar to other decks in the sample. As I will show below, this exercise is essentially equivalent to asking how we can measure "similarity" or "distance" between categorical random variables falling over some fixed set of categories.


Suppose you have $n$ cards and $m$ decks and let $r_{i,j}$ represent the count of card $i$ in deck $j$.$^\dagger$ You can hold all of these count values in an $n \times m$ matrix which is the contingency table for the data:

$$\mathbf{R} = \begin{bmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,m} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n,1} & r_{n,2} & \cdots & r_{n,m} \\ \end{bmatrix}.$$

Each column $\mathbf{r}_{j}$ represents a single deck, giving the count values for each card in the deck (most of which will be zeros). One useful way to think of a deck is as a categorical random variable giving a random card from that deck. Let $X_j$ be a random card from deck $j$, which has distribution:

$$X_j \sim \text{Categorical}(\boldsymbol{p}_j) \quad \quad \quad \boldsymbol{p}_j = \frac{\mathbf{r}_j}{r_{\bullet j}}.$$

Since each deck is represented by a categorical random variable, we can employ statistical measures of association between categorical random variables. For example, the chi-squared measure of association between decks $j$ and $j'$ is obtained by looking at the $n \times 2$ contingency table for only those two decks:

$$\chi^2 = \sum_{i=1}^n \Bigg[ \frac{(r_{\bullet \bullet} r_{i,j} - r_{i \bullet} r_{\bullet j})^2}{r_{\bullet \bullet} r_{i \bullet}r_{\bullet j}} + \frac{(r_{\bullet \bullet} r_{i,j'} - r_{i \bullet} r_{\bullet j'})^2}{r_{\bullet \bullet} r_{i \bullet} r_{\bullet j'}} \Bigg].$$

This is a standard measure of association for categorical random variables, so it will be easy for others to recognise, and its properties are well-known. It can be extended to get a more general measure of the similarity between a set of decks, by enlarging the contingency table to include the relevant set of decks. In order to alter a deck to make it more similar to another individual deck, or set of decks, you would remove a card that is not present in any/many of those decks, and replace it with a card that brings its count of that card more into line with those other decks. This could be done mathematically, by looking at the change in the chi-squared statistic for each possible change of cards (perhaps narrowing it down to sensible choices first, to make this computationally feasible).

The subject of measuring association between categorical variables is a large field, and I cannot give a full exposition here. The above chi-squared statistic is one example of a measure of association for categorical variables. More broadly, I would recommend that you compile your data into a contingency table of the above form, and select an appropriate measure of association for categorical variables, based on consideration of the available statistics in the statistical literature.


$^\dagger$ To simplify the problem, you might consider ignoring the basic land cards, in which case $n$ represents the number of cards that are not basic lands. These remaining cards have a deck limit of four under the rules, which limits the counts to the possible values $r_{i,j} = 0,1,...,4$.