[Math] k-means and random sampling

Tags: clustering, random variables, sampling

I have recently been self-studying data clustering algorithms to solve a problem at work. Sometimes my data set is too big to run k-means on directly, so I take a shortcut: (1) randomly sample, without replacement, from the data set; (2) run k-means on the random sample; (3) assign each point of the original data to the closest centroid found in step (2).
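In case it helps, here is a minimal sketch of what I do (using NumPy and scikit-learn; `X`, `k`, and the fraction `f` are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

def subsample_kmeans(X, k, f, seed=None):
    """(1) sample a fraction f of the rows without replacement,
    (2) run k-means on the subsample,
    (3) assign every original point to its nearest learned centroid."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=int(f * n), replace=False)  # step (1)
    km = KMeans(n_clusters=k, n_init=10).fit(X[idx])     # step (2)
    labels = km.predict(X)                               # step (3)
    return km.cluster_centers_, labels
```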

I am wondering whether there is any performance guarantee, even a probabilistic one, for the above approach. For example, can we say something like:

(a) If we sample a fraction $f$ of the original data, then we sample all clusters with high probability (probability that depends on $f$);

(b) If we sample a fraction $f$ of the original data, then the resulting within-cluster distances are at most a fraction $f$ of the within-cluster distances of the original data.

It seems intuitive that the random sample is usually representative of my original data, so it seems okay to just perform k-means on the sample. Of course, with some low probability I can get unlucky and draw a bad sample, and I would like to quantify this "low probability".

The closest theoretical result I can find is Theorem 1 of this article (screenshot below). But that result assumes the random sample is drawn with replacement (i.e., that the $X_i$'s are independent). Can a similar (hopefully better) result be shown for sampling without replacement?

[Screenshot of Theorem 1 from the linked article]

Intuitively, sampling without replacement should do better: clusters that have not yet been sampled have relatively higher probability of being drawn than those that have. So sampling without replacement is less likely to miss a cluster than sampling with replacement. Even if no results can be shown, does my intuition make sense?

Thanks in advance for any guidance.

Best Answer

Firstly, I want to mention that what you're doing is quite reasonable. Usually it is called Minibatch $K$-means.
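For reference, here is a usage sketch of scikit-learn's `MiniBatchKMeans`, which repeatedly draws small random batches and updates the centroids incrementally (the parameter values below are just illustrative):

```python
from sklearn.cluster import MiniBatchKMeans

# Mini-batch k-means: instead of clustering one fixed subsample, it
# repeatedly draws small batches and nudges the centroids toward them.
mbk = MiniBatchKMeans(n_clusters=10, batch_size=1024, n_init=3)
labels = mbk.fit_predict(X)  # X: the full dataset
```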

As for the theoretical guarantee, it will not carry over exactly when you sample without replacement. The reason is that doing so violates the i.i.d. assumption (in particular, the independence of the samples), so your sample will be biased.

Intuitively, sampling without replacement should do better: clusters that have not yet been sampled have relatively higher probability of being drawn than those that have. So sampling without replacement is less likely to miss a cluster than sampling with replacement. Even if no results can be shown, does my intuition make sense?

This is true, but when you do hit the "unlikely" clusters more often, it is precisely because of the bias you have introduced. The distribution of your minibatch is therefore no longer guaranteed to fairly represent the true data.

Example: imagine your dataset has three points: the first two belong to cluster $A$ and the last to cluster $B$, and you draw two of them. Suppose the first draw came from cluster $A$. With replacement, the probability that the second draw hits $B$ is $1/3$ (the true proportion of $B$), whereas without replacement it is $1/2$. So the smaller cluster is over-represented, and any statistics are essentially being computed for a different population (albeit a nearby one).
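If you want to check the numbers yourself, here is a quick (purely illustrative) simulation of this toy example:

```python
import random

random.seed(0)
data = ["A", "A", "B"]        # two points in cluster A, one in cluster B
trials = 200_000

def p_second_hits_B(with_replacement):
    """Estimate P(second draw is in B | first draw is in A)."""
    hits = count = 0
    for _ in range(trials):
        pool = list(data)
        if random.choice(pool) != "A":
            continue                  # condition on the first draw being from A
        if not with_replacement:
            pool.remove("A")          # without replacement: remove the drawn point
        count += 1
        hits += random.choice(pool) == "B"
    return hits / count

print(p_second_hits_B(True))   # ~0.33  (with replacement)
print(p_second_hits_B(False))  # ~0.50  (without replacement)
```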

In any case, let me end by saying that it probably does not matter at all. You have a very large dataset, so even if the sample is a little biased, it will be a very minor effect.

What I mean is the following. When sampling with replacement, each point has probability $1/N$ of being chosen on every draw; without replacement, after $s$ draws, a previously unchosen point has probability $1/(N-s)$. This is a perturbation in the discrete probability density of size $$ \delta = \frac{1}{N} - \frac{1}{N-s}=\frac{N-s - N}{N(N-s)} = \frac{-s}{N^2 - Ns} \in O(N^{-2}),$$ which shrinks quadratically in $N$ for fixed $s$. Alternatively, for a small fixed sampling fraction $\epsilon$, i.e. $s=\epsilon N$, we get $$ |\delta| = \frac{\epsilon}{1-\epsilon}\frac{1}{N}, $$ so even relatively small $N$ is fine; e.g. $N=10^4, \epsilon=10^{-2}$ gives $|\delta| \approx 10^{-6} $. Thus, for a given point, the relative difference in its probability of being chosen is $ |\delta|/N^{-1}=N|\delta| \approx 1\% $.
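To make the arithmetic concrete, the numbers above can be reproduced in a few lines (using the same $N$ and $\epsilon$ as in the example):

```python
N, eps = 10_000, 0.01
s = int(eps * N)                 # number of points drawn so far

delta = 1 / N - 1 / (N - s)      # change in a point's per-draw selection probability
print(abs(delta))                # ~1.0e-06
print(eps / (1 - eps) / N)       # same value via |delta| = eps/(1-eps) * 1/N
print(N * abs(delta))            # relative change |delta| / (1/N) ~ 1%
```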

The independence assumption in the theorem is to avoid bizarre sampling schemes (like, "draw the first sample, then only choose samples smaller than it after that"). In your case, the difference is sufficiently minimal that I expect the same results to hold with only minor changes.

Nevertheless, in some cases it is desirable to bias the sampling distribution. This is common in the field of rare event sampling, and it is also what importance sampling does. Biasing the sample actually makes sense in your case, to avoid missing any clusters. Importance sampling is indeed generally used to reduce the variance caused by sampling, which is intuitively a good thing. However, importance sampling requires a correction to avoid biasing the estimate through the non-uniform sampling procedure. Perhaps such a correction could be introduced here. But $K$-means is not really doing density estimation, so I don't immediately see how. Perhaps introducing a weighting factor into the $K$-means error function would work (i.e., reweighting each sampled point by the inverse of its inclusion probability, so that deliberately over-sampled points are downweighted), or something like that.
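For what it's worth, here is one speculative way that might look in code: if each sampled point's inclusion probability were known (here the array `p_sample`, an assumption), you could weight it by the inverse of that probability; scikit-learn's `KMeans` accepts per-sample weights.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_on_biased_sample(X_sample, p_sample, k):
    """Speculative importance-style correction: weight each sampled point
    by the inverse of the probability with which it was drawn, so that
    deliberately over-sampled regions do not dominate the k-means objective."""
    weights = 1.0 / np.asarray(p_sample, dtype=float)
    return KMeans(n_clusters=k, n_init=10).fit(X_sample, sample_weight=weights)
```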
