First, I want to mention that what you're doing is quite reasonable; it is usually called Minibatch $K$-means.
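If you're in Python, here is a minimal sketch of that approach using scikit-learn's `MiniBatchKMeans` with explicit minibatches fed through `partial_fit` (the data, cluster count, and sampling loop below are placeholders, not anything from your setup):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 16))   # stand-in for your large dataset

mbk = MiniBatchKMeans(n_clusters=10, batch_size=1024, random_state=0)
for _ in range(200):
    # Sampling *with* replacement keeps each minibatch an IID draw from the data.
    batch_idx = rng.choice(len(X), size=1024, replace=True)
    mbk.partial_fit(X[batch_idx])
```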
As for the theoretical guarantee, it will not hold exactly when you do not sample with replacement. The reason is that sampling without replacement violates the IID assumption (in particular, the independence of samples), so your sample will be biased.
Intuitively, sampling without replacement should do better, because clusters that have not been sampled yet will have a relatively higher probability of being drawn than clusters that have already been sampled. So a random sample without replacement is less likely to miss a cluster than one drawn with replacement. Even if no results can be shown, does my intuition make sense?
This is true, but the statistical fact is that, when you hit the "unlikely" clusters, it is precisely because of the bias you've introduced. So the distribution of your minibatch is no longer guaranteed to fairly represent the true data.
Example: imagine your dataset has 3 samples: the first two belong to cluster $A$ and the last to cluster $B$, and you draw two of them. Assume you drew one from cluster $A$ first. With replacement, the probability of hitting $B$ is 33% (which is the true value), whereas without replacement it is 50%. So the smaller cluster is over-represented, and any statistics are essentially being computed for a different population (albeit a nearby one).
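A quick simulation of this toy example (nothing here is from your setup; it just checks the 33% vs. 50% figures):

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array(["A", "A", "B"])
with_repl, without_repl = [], []

for _ in range(100_000):
    first = rng.integers(3)
    if labels[first] != "A":          # condition on the first draw being from cluster A
        continue
    with_repl.append(labels[rng.integers(3)] == "B")           # with replacement
    rest = [i for i in range(3) if i != first]
    without_repl.append(labels[rest[rng.integers(2)]] == "B")   # without replacement

print(np.mean(with_repl), np.mean(without_repl))   # ~0.33 vs. ~0.50
```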
In any case, let me end by saying that it probably does not matter at all. You have a very large dataset, so even if the sample is a little biased, the effect will be very minor. What I mean is that, for large $N$, each sample has probability $1/N$ of being chosen when sampling with replacement, whereas without replacement, after $s$ samples have been drawn, a previously unchosen sample has probability $1/(N-s)$ of being drawn next. This is a perturbation in the discrete probability density of size $$ \delta = \frac{1}{N} - \frac{1}{N-s}=\frac{N-s - N}{N(N-s)} = \frac{-s}{N^2 - Ns} \in O(N^{-2})$$
which shrinks quadratically in $N$. Alternatively, for small $s=\epsilon N$, we get
$$ |\delta| = \frac{\epsilon}{1-\epsilon}\frac{1}{N} $$
so even relatively small $N$ is fine, e.g. $N=10^4, \epsilon=10^{-2}$ gives
$|\delta| \approx 10^{-6} $. Thus, for a given point, we are looking at a relative difference in the probability of being chosen of $ |\delta|/N^{-1}=N|\delta| \approx 1\% $.
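A quick numeric check of that arithmetic:

```python
N, eps = 10**4, 10**-2
s = int(eps * N)
delta = abs(1 / N - 1 / (N - s))
print(delta)      # ~1.01e-06
print(N * delta)  # ~0.0101, i.e. roughly a 1% relative change per point
```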
The independence assumption in the theorem is to avoid bizarre sampling schemes (like, "draw the first sample, then only choose samples smaller than it after that"). In your case, the difference is sufficiently minimal that I expect the same results to hold with only minor changes.
Nevertheless, in some cases it is desirable to bias the sampling distribution. This is common in the field of rare event sampling, and it is also common in importance sampling. This actually makes sense in your case, to avoid missing any clusters. Importance sampling is indeed generally used for reducing the variance caused by sampling, which is intuitively a good thing.
However, the difference is that importance sampling requires a correction factor to undo the bias introduced by sampling from a different distribution. Perhaps that could be introduced here. But $K$-means is not really interested in density estimation, so I don't immediately see how to do this. Perhaps introducing a weighting factor into the $K$-means error function would work (i.e., downweight errors involving the artificially over-sampled points), or something like that.
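One hedged sketch of that weighting idea, using the `sample_weight` argument that scikit-learn's `MiniBatchKMeans.partial_fit` accepts (the inverse-probability weights are my assumption about how such a correction could look, not an established recipe):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def weighted_minibatch_step(mbk, X, sample_prob, batch_size, rng):
    """Draw a deliberately biased minibatch and correct with importance weights.

    sample_prob[i] is the (possibly non-uniform) probability of drawing point i;
    weighting each drawn point by (1/N) / sample_prob[i] undoes the bias in expectation.
    """
    idx = rng.choice(len(X), size=batch_size, replace=True, p=sample_prob)
    weights = 1.0 / (sample_prob[idx] * len(X))
    mbk.partial_fit(X[idx], sample_weight=weights)
    return mbk
```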
I would like to describe how well an algorithm is clustering my data.
This is itself a common problem in clustering, so you could use some of the metrics already developed.
The idea is to ensure clusters are cohesive (intra-cluster distances are small) and distinctive (inter-cluster differences are large).
Personally I've used the Davies-Bouldin Index and the Silhouette index before.
Check out this thread and this one for more measures.
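For instance, both of those indices are available in scikit-learn (the data and clustering below are placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=2_000, centers=5, random_state=0)   # placeholder data
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))       # higher is better, in [-1, 1]
print(davies_bouldin_score(X, labels))   # lower is better, >= 0
```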
Take, for example, a two-dimensional case. Basically I am most interested in the area of the intersections seen in the second graph. Unfortunately it is not that simple to create an $n$-dimensional "figure" without introducing noise. (Or is there a way?)
Since you have only a point set rather than some kind of $n$-dimensional polygon, it seems natural to measure the overlap in terms of the empirical probability distribution implied by the point set, rather than literally the volumetric overlap of the point sets.
(This is what your Bhattacharyya coefficient idea does).
Nevertheless, if you want to try to literally measure the area of overlap geometrically, we can try the following. The idea is to measure the overlap of the convex hulls of the clusters. Since high-dimensional convex hull computation is non-trivial and also non-visualizable, let's do dimensionality reduction first, specifically with random projections.
Basically, the overlap would be $$ V(C_1,C_2) = \frac{1}{R}\sum_{r=1}^{R} \text{AreaOfOverlap}(\text{ConvexHull}(R_r(C_1)), \text{ConvexHull}(R_r(C_2))) $$
where $R_r$ is the $r$th random projection. Note this is stochastic, but geometric rather than probabilistic. Assuming $R_r$ maps to 2D, we can use a standard convex hull calculator. To compute the overlap area, a simple way is to "grid" the plane, and then "rasterize" the convex hulls to create two binary images (like pixelized projection silhouettes), then simply count the number of pixels (bins/grid cells) they share in common. Another way is to decompose each hull into a set of triangles (i.e., obtain a triangle mesh per hull), and then compute the overlapping area between each pair of triangles.
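Here is a rough sketch of the rasterization variant (the function name, grid size, and number of projections are mine; it relies on the fact that the Delaunay triangulation of a point set fills exactly its convex hull):

```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.random_projection import GaussianRandomProjection

def hull_overlap(c1, c2, n_projections=10, grid=200, seed=0):
    """Average 2D convex-hull overlap area of two clusters under random projections.
    c1, c2: (n_i, d) arrays of points belonging to each cluster."""
    areas = []
    for r in range(n_projections):
        proj = GaussianRandomProjection(n_components=2, random_state=seed + r)
        proj.fit(np.vstack([c1, c2]))
        p1, p2 = proj.transform(c1), proj.transform(c2)

        # find_simplex >= 0 tests whether a grid point lies inside each hull.
        d1, d2 = Delaunay(p1), Delaunay(p2)

        # Shared bounding box, rasterized into grid x grid cells.
        lo = np.minimum(p1.min(axis=0), p2.min(axis=0))
        hi = np.maximum(p1.max(axis=0), p2.max(axis=0))
        xs, ys = np.meshgrid(np.linspace(lo[0], hi[0], grid),
                             np.linspace(lo[1], hi[1], grid))
        pts = np.column_stack([xs.ravel(), ys.ravel()])

        inside_both = (d1.find_simplex(pts) >= 0) & (d2.find_simplex(pts) >= 0)
        cell_area = (hi[0] - lo[0]) * (hi[1] - lo[1]) / grid**2
        areas.append(inside_both.sum() * cell_area)
    return float(np.mean(areas))
```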
Another approach would be to look at the distributions of the points, which are assumed to be normally distributed. We can fit a multivariate normal distribution (covering all dimensions) to each cluster. Could I use the Bhattacharyya distance for multivariate distributions to calculate the similarity and take that as a measure? I don't know if this is even a good approximation of the "overlap".
This is a totally reasonable thing to do. Once you have fitted multivariate normals to each cluster, there are many distances/similarity metrics you could use, some of which can be written analytically: Bhattacharyya coefficient, Frechet distance, KL or Jensen-Shannon divergence, Hellinger distance, etc... A distance can always be made a similarity via e.g. $T_1(d) = 1/(1+d)$ or $T_2(d)=\exp(-d)$.
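For example, the Bhattacharyya distance between two fitted Gaussians has a closed form, so it can be computed directly from the per-cluster means and covariances (the function name is mine):

```python
import numpy as np

def bhattacharyya_gaussians(c1, c2):
    """Closed-form Bhattacharyya distance between Gaussians fitted to two clusters.
    c1, c2: (n_i, d) arrays of points in each cluster."""
    mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)
    S1, S2 = np.cov(c1, rowvar=False), np.cov(c2, rowvar=False)
    S = 0.5 * (S1 + S2)
    diff = mu1 - mu2

    # D_B = (1/8)(mu1-mu2)^T S^{-1} (mu1-mu2) + (1/2) ln( det S / sqrt(det S1 det S2) )
    term1 = 0.125 * diff @ np.linalg.solve(S, diff)
    logdet_S = np.linalg.slogdet(S)[1]
    logdet_S1 = np.linalg.slogdet(S1)[1]
    logdet_S2 = np.linalg.slogdet(S2)[1]
    term2 = 0.5 * (logdet_S - 0.5 * (logdet_S1 + logdet_S2))

    d_b = term1 + term2
    return d_b, np.exp(-d_b)   # distance and the corresponding coefficient/similarity
```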
The main problem here is how well the cluster is actually fit by a multivariate normal. For instance, consider a very curved cluster (e.g., one that looks like a crescent moon). The normal may struggle to represent it well, in which case the overlap will be over- or underestimated. Indeed, for clustering algorithms in general, such as $k$-means, it is always a concern whether the method is biased towards clusters of a particular shape (e.g., see here).
Of course we could also look at the points instead of the distributions. For this approach I have no idea yet.
Well, firstly, any of the probabilistic measures just above can be computed on the empirical distributions without fitting a normal distribution. For instance, the Bhattacharyya coefficient (from which the distance follows via $D_B = -\ln BC$) can be written $$ BC(p,q) = \mathbb{E}_{x\sim p}\sqrt{\frac{q(x)}{p(x)}} \approx \frac{1}{|C_p|}\sum_{x\in C_p}\sqrt{\frac{\widehat{q}(x)}{\widehat{p}(x)}} $$
where $p$ is the probability distribution associated to cluster $C_p$ and $\widehat{p}$ is the empirical density function estimated via e.g. kernel density estimation.
The benefit of this approach is that you no longer need to assume normality; the downside is that density estimation is hard, especially in high dimensions (though you can always reduce dimensionality with e.g. PCA first).
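A minimal sketch of that estimator, assuming Gaussian kernel density estimates stand in for the true cluster densities (the function name and the numerical floor are mine):

```python
import numpy as np
from scipy.stats import gaussian_kde

def bhattacharyya_empirical(cp, cq):
    """Estimate the Bhattacharyya coefficient BC(p, q) from two point clouds
    using KDEs in place of the true cluster densities; D_B = -ln(BC)."""
    p_hat = gaussian_kde(cp.T)   # scipy expects shape (d, n)
    q_hat = gaussian_kde(cq.T)
    # Monte-Carlo estimate: average sqrt(q_hat/p_hat) over the points of C_p.
    ratio = q_hat(cp.T) / np.maximum(p_hat(cp.T), 1e-300)
    bc = np.mean(np.sqrt(ratio))
    return bc, -np.log(np.maximum(bc, 1e-300))
```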
Another way to compute a distance between probability distributions is via the Wasserstein distance $W$ (also called the Earth Mover's distance). This can be computed directly on the point set, thus avoiding density estimation. I think this one is very intuitive since it behaves well (smoothly) as the clusters move apart, which some $f$-divergences will not. There are also well-studied algorithms for computing it.
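For completeness, a sketch of the point-set computation using the third-party POT library (`pip install pot`); the library choice and function name are assumptions on my part, not something the question requires:

```python
import numpy as np
import ot   # POT: Python Optimal Transport
from scipy.spatial.distance import cdist

def wasserstein_point_clouds(c1, c2):
    """2-Wasserstein distance between the empirical distributions of two
    clusters, computed directly on the point sets (no density estimation)."""
    a = np.full(len(c1), 1.0 / len(c1))      # uniform weights on cluster 1
    b = np.full(len(c2), 1.0 / len(c2))      # uniform weights on cluster 2
    M = cdist(c1, c2, metric="sqeuclidean")  # pairwise transport costs
    return np.sqrt(ot.emd2(a, b, M))         # optimal transport cost -> W_2
```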
Best Answer
The cluster centroid is the mean of all data points assigned to that cluster. The variable idx will tell you which cluster each data point was assigned to. Based on this, you can compute the mean of all points in cluster $i$ after removing the point that you are going to move to cluster $j$. Similarly, you can compute the mean of all points in cluster $j$ after adding in the new point. These are your updated centroids.
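A small numpy sketch of that incremental update (variable names mirror the description, with `idx` holding the cluster assignments; this assumes cluster $i$ has more than one point so its mean stays defined):

```python
import numpy as np

def move_point_update_centroids(X, idx, centroids, p, j):
    """Update the two affected centroids when point p moves from its
    current cluster i to cluster j, using the incremental mean formulas.
    X: (n, d) data, idx: (n,) assignments, centroids: (k, d)."""
    i = idx[p]
    n_i = np.sum(idx == i)
    n_j = np.sum(idx == j)
    # Remove the point from cluster i's mean, add it to cluster j's mean.
    centroids[i] = (n_i * centroids[i] - X[p]) / (n_i - 1)
    centroids[j] = (n_j * centroids[j] + X[p]) / (n_j + 1)
    idx[p] = j
    return idx, centroids
```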