Data Similarity – How to Determine How Similar Groups of Data Are to Each Other

distributionsprobabilitysubset

I have a pool of 15,000 unique* items. From that set, there are roughly 3200 sets of 60 items, selected non-randomly (in fact, for the sake of the framing of the question, let's say that each item is intentionally chosen for a specific reason). I have the list of all items that are in at least one set; said list is about 5500 items long, and also includes the total number of sets each item is in. The most popular item is in roughly 60% of the sets, and the list goes all the way down to items that only appear in 1 set.

Additionally, there are similarities that exist between the different sets that manifest themselves in the form of tendencies. For instance (with random values given), 30% of the sets are similar according to one "tendency group", and tend to include items #1, #50, #2006, etc. more often than other tendency groups; 20% might be similar to each other in a different way, and tend to include items #65, #700, #5000, etc. more often than other tendency groups. Thus, there's not exactly one set of "expected values" (which would make this question similar to this one), but rather upwards of a dozen groups of "somewhat similar values." I looked also at this question, but, as someone with a limited grasp of the subject, it doesn't seem like it's exactly what I'm looking for.

How do I compare each set to each other in such a way that I'm able to compare each item from a given set and see how it compares to other items that appear in similar sets?

Specifically, if given a specific set, if I were to replace one item, which item would I take out, and what would I replace it with, to make the set more similar (less unique) compared to other sets. Extra points if the method is capable of making the set more similar to one specific (or perhaps simply being selective towards one) tendency group (or more).

In typing out this question, I've become aware of the Jaccard index, but that seems like it's more suited for comparing two sets of data to each other, not looking at an individual item across multiple sets and looking for similarities in that respect, but I could very much be wrong.

*I say unique; even though they may differ in one aspect, some items functionally or even have exactly the same characteristics except for their name, but I'm not sure how much that matters.

Edit1: the unique items are Magic: the Gathering cards, and the sets represent individual decks. Does that help?

Edit2: Since posting this question, I have become aware of the existence of "market basket" analysis, but, as a laymen in this field, I have no idea where to even begin to conduct said analysis.

Best Answer

It is unclear from your question exactly what your objective is for this analysis, but if I understand correctly, you would like to be able to quantify the similarity (or "distance") between two MtG decks, so that you can alter a deck to make it more or less similar to other decks in the sample. As I will show below, this exercise is essentially equivalent to asking how we can measure "similarity" or "distance" between categorical random variables falling over some fixed set of categories.

Suppose you have $n$ cards and $m$ decks and let $r_{i,j}$ represent the count of card $i$ in deck $j$.$^\dagger$ You can hold all of these count values in an $n \times m$ matrix which is the contingency table for the data:

$$\mathbf{R} = \begin{bmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,m} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n,1} & r_{n,2} & \cdots & r_{n,m} \\ \end{bmatrix}.$$

Each column $\mathbf{r}_{j}$ represents a single deck, giving the count values for each card in the deck (most of which will be zeros). One useful way to think of a deck is as a categorical random variable giving a random card from that deck. Let $X_j$ be a random card from deck $j$, which has distribution:

$$X_j \sim \text{Categorical}(\boldsymbol{p}_j) \quad \quad \quad \boldsymbol{p}_j = \frac{\mathbf{r}_j}{r_{\bullet j}}.$$

Since each deck is represented by a categorical random variable, we can employ statistical measures of association between categorical random variables. For example, the chi-squared measure of association between decks $j$ and $j'$ is obtained by looking at the $n \times 2$ contingency table for only those two decks:

$$\chi^2 = \sum_{i=1}^n \Bigg[ \frac{(r_{\bullet \bullet} r_{i,j} - r_{i \bullet} r_{\bullet j})^2}{r_{\bullet \bullet} r_{i \bullet}r_{\bullet j}} + \frac{(r_{\bullet \bullet} r_{i,j'} - r_{i \bullet} r_{\bullet j'})^2}{r_{\bullet \bullet} r_{i \bullet} r_{\bullet j'}} \Bigg].$$

This is a standard measure of association for categorical random variables, so it will be easy for others to recognise, and its properties are well-known. It can be extended to get a more general measure of the similarity between a set of decks, by enlarging the contingency table to include the relevant set of decks. In order to alter a deck to make it more similar to another individual deck, or set of decks, you would remove a card that is not present in any/many of those decks, and replace it with a card that brings its count of that card more into line with those other decks. This could be done mathematically, by looking at the change in the chi-squared statistic for each possible change of cards (perhaps narrowing it down to sensible choices first, to make this computationally feasible).

The subject of measuring association between categorical variables is a large field, and I cannot give a full exposition here. The above chi-squared statistic is one example of a measure of association for categorical variables. More broadly, I would recommend that you compile your data into a contingency table of the above form, and select an appropriate measure of association for categorical variables, based on consideration of the available statistics in the statistical literature.

$^\dagger$ To simplify the problem, you might consider ignoring the basic land cards, in which case $n$ represents the number of cards that are not basic lands. These remaining cards have a deck limit of four under the rules, which limits the counts to the possible values $r_{i,j} = 0,1,...,4$.

Related Solutions

Solved – How to compare or normalize two sets of probabilities

Great problem!!

And the long + short version is a very good way of describing the problem!! (+1)

I would use condtional probabilities, which may require some computation on the fly. Following up on your example, you would normalize the triplets by the total probability of 'one of t' i.e. the sum of the probabilities of all the triplets that start by 'one of t' do the same for the singlets, ie the sum of all the probabilities of the singlets that start with 't'. This should scale your probabilities nicely and in a meaningful way.

So, to sum up, the updated probability of a word starting with $abc$ (say) is:

$$ P^*(word|abc) = \frac{P(word)}{ \sum_{w \in A} P(w) },$$

where $A$ is the set of words that start with the letters $abc$.

Solved – Probability of intersection from multiple sampling of the same population

Compute the chances recursively.

Let $p_s(x)$ be the probability that exactly $x$ values, $0 \le x \le k$, are selected in all $s\ge 1$ independent draws of $k$ items (without replacement) from a population of $n \ge k \gt 0$ members. (Let's hold $n$ and $k$ fixed for the duration of the analysis so they don't have to be mentioned explicitly.)

Let $p_s(x\mid y)$ be the probability that if exactly $y$ values are selected in the first $s-1$ draws, then $x \le y$ of them are selected in the last draw. Then because there are $\binom{y}{x}$ subsets of $x$ elements of those $y$ elements, and $\binom{n-y}{k-x}$ subsets of the remaining $k-x$ elements are separately selected out of the other $n-y$ members of the population,

$$p_s(x\mid y) = \frac{\binom{y}{x}\binom{n-y}{k-x}}{ \binom{n}{k}}.$$

The law of total probability asserts

$$p_s(x) = \sum_{y=x}^k p_s(x\mid y) p_{s-1}(y).$$

For $s=1$, it's a certainty that $x=k$: this is the starting distribution.

The total computation needed to obtain the full distribution up through $s$ repetitions is $O(k^2 s)$. Not only is that reasonably quick, the algorithm is easy. One pitfall awaiting the unwary programmer is that these probabilities can become extremely small and underflow floating-point calculations. The following R implementation avoids this by computing the values of $\log(p_s(x))$ in columns $1, 2, \ldots, s$ of an array.

lp <- function(s, n, k) {
  P <- matrix(NA, nrow=k+1, ncol=s, dimnames=list(0:k, 1:s))
  P[, 1] <- c(rep(-Inf, k), 0)
  for (u in 2:s) 
    for (i in 0:k) {
      q <- P[i:k+1, u-1] + lchoose(i:k, i) + lchoose(n-(i:k), k-i) - lchoose(n, k)
      q.0 <- max(q, na.rm=TRUE)
      P[i+1, u] <- q.0 + log(sum(exp(q - q.0)))
    }
  return(P)
}
p <- function(...) zapsmall(exp(lp(...)))

The answer to the question is obtained by letting $s=5,$ $n=10000=10^4$, and $k=100=10^2$. The output is a $101\times 5$ array, but most of the numbers are so small we may focus on very small $x$. Here are the first four rows corresponding to $x=0,1,2,3$:

p(5, 1e4, 1e2)[1:4, ]

The output is

  1         2         3      4        5
0 0 0.3641945 0.9900484 0.9999 0.999999
1 0 0.3715891 0.0099034 0.0001 0.000001
2 0 0.1857756 0.0000481 0.0000 0.000000
3 0 0.0606681 0.0000002 0.0000 0.000000

Values of $x$ label the rows while values of $s$ label the columns. Column 5 shows the chance that one element appears in all five samples is minuscule (about one in a million) and there's essentially no chance that two or more elements appear in all five samples.

If you would like to see just how small these chances are, look at their logarithms. Base 10 is convenient and we don't need many digits:

u <- lp(5, 1e4, 1e2)[, 5]
signif(-u[-1] / log(10), 3)

The output tells us how many zeros there are after the decimal point:

    1     2     3     4     5     6     7     8     9    10  ...   97    98    99   100 
  6.0  12.3  18.8  25.5  32.3  39.2  46.2  53.2  60.4  67.6 ... 917.0 933.0 949.0 967.0

Numbers in the top row are values of $x$. For instance, the chance of exactly three values showing up in all five samples is found by computing exp(u[4]), giving $0.000\,000\,000\,000\,000\,000\,1434419\ldots$ and indeed this has $18$ zeros before the first significant digit. As a check, the last value $967.0$ is a rounded version of $967.26$. $\binom{10000}{100}^{-4}$ (which counts the chances that the first sample reappears in the next four samples) equals $10^{-967.26}.$

Best Answer

Related Solutions

Solved – How to compare or normalize two sets of probabilities

Solved – Probability of intersection from multiple sampling of the same population

Related Question