Uniform sampling

self-learning, statistics

I came across the term "uniform subsampling" in this paper (page 5). I tried to work out what it means but had no luck.

Here is the extract from the paper:

A naive approach to constructing coresets is based on uniform
subsampling of the data. For the sake of simplicity, consider a data
set $\mathcal{X}$ with constant weights $\mu_{\mathcal{X}}(x) = \frac{1}{|\mathcal{X}|}$. For any query $Q \in \mathcal{Q}$, the
cost function in (2.1) may be rewritten as
$$
\operatorname{cost}(\mathcal{X}, Q)=\sum_{x \in \mathcal{X}} \frac{1}{|\mathcal{X}|} f_{Q}(x)=\mathbb{E}_{x}\left[f_{Q}(x)\right]
$$

where $x \in \mathcal{X}$ is drawn uniformly at random.
Let the set $C$ consist of $m$ points sampled uniformly at random from $\mathcal{X}$
and set $\mu_C(x) = \frac{1}{m}$.

The closest thing I came across is the answer from here. The definition is:

If a sample is selected from a population which has been grouped into
strata, in such a way that the number of units selected from each
stratum is proportional to the total number of units in that stratum,
the sample is said to have been selected with a uniform sampling
fraction.

I would really appreciate it if someone can explain this concept to me.

Best Answer

If you look at Equation 2.3 of that paper, I believe they are describing uniform subsampling: you just draw samples from a data set, with every point equally likely to be picked. This non-parametrically approximates the underlying distribution, and it is closely related to what is often called bootstrapping (which resamples with replacement). In this case (Eq. 2.3), they're using it to approximate the expected value.

See the wiki entry: https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
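
Spelled out with the notation from the extract in the question, and assuming the cost of the subsample $C$ is defined by the same weighted sum as for $\mathcal{X}$ (the extract suggests this but does not show it), the uniform subsample turns the cost into a plain sample mean:

$$
\operatorname{cost}(C, Q)=\sum_{x \in C} \mu_{C}(x) f_{Q}(x)=\frac{1}{m} \sum_{x \in C} f_{Q}(x) \approx \mathbb{E}_{x}\left[f_{Q}(x)\right]=\operatorname{cost}(\mathcal{X}, Q)
$$

Each of the $m$ points is drawn uniformly from $\mathcal{X}$, so every term $f_{Q}(x)$ has expectation $\mathbb{E}_{x}\left[f_{Q}(x)\right]$, and the average over $C$ is an unbiased estimate of the full cost.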

In Python, you can get a uniform sample easily using libraries like NumPy:

from numpy import arange
from numpy.random import uniform, randint

num_data = 100

# Get 10 samples in the range 0 to num_data
# (floats in [0, num_data), truncated to integer indices)
p = uniform(size=(10,)) * num_data
print(p.astype(int))

# Alternatively, you can just uniformly sample
# random integers, which is probably easier
# (high is exclusive, so the indices are 0..num_data-1)
p = randint(low=0, high=num_data, size=(10,))
print(p)

# Now if you have data, you just index it with the sampled positions
pretend_this_is_data = arange(100)
uniform_data_sample = pretend_this_is_data[p]
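
To tie this back to the extract in the question, here is a minimal sketch of estimating the cost from a uniform subsample. The data set, the query `q`, and the per-point cost `f_q` (squared distance to `q`) are all made up for illustration and are not taken from the paper; the subsample is drawn with replacement, which is one possible reading of "sampled uniformly at random".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set X: 10,000 one-dimensional points
data = rng.normal(loc=3.0, scale=2.0, size=10_000)


def f_q(x, q=0.0):
    # Hypothetical per-point cost: squared distance to the query q
    return (x - q) ** 2


# Full cost: every point has weight 1/|X|, so this is just the mean
full_cost = f_q(data).mean()

# Uniform subsample C of m points (drawn with replacement here),
# each carrying weight 1/m
m = 500
idx = rng.integers(low=0, high=len(data), size=m)
subsample = data[idx]
subsample_cost = f_q(subsample).mean()

print(full_cost, subsample_cost)
```

The two printed numbers should be close, and the gap should shrink on average as `m` grows. Passing `replace=False` to `rng.choice(len(data), size=m, replace=False)` instead would give a subsample without repeated points.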