Sample Independence – Are Unpaired Samples Always Generated by Independent Random Variables?

Tags: independence, sample, terminology

I am confused by the terminology because the term "independent samples" is usually used as a synonym for "unpaired samples". Does this fact mean that unpaired samples are always generated by independent random variables (assuming we have a probability model for our data)?

In other words, let's assume that we have a probability model for our data, that is, we have two random variables: $X \sim F_X$, $Y \sim F_Y$. These two random variables correspond to two populations. Next, we consider two i.i.d. random samples $(X_1, \ldots, X_n) \overset{\text{iid}}{\sim} F_X$, $(Y_1, \ldots, Y_m) \overset{\text{iid}}{\sim} F_Y$, where no element of $(X_1,\ldots,X_n)$ can be "matched" with any element of $(Y_1, \ldots, Y_m)$. We call such samples unpaired samples (of random variables).
Is it true that $X_i$ and $Y_j$ are independent random variables, $\forall i,j$? Or, equivalently, is it true that $X$ and $Y$ are independent random variables?


P.S. It is well known that paired samples (often called dependent samples) can be generated either by dependent or by independent jointly distributed random variables $X$ and $Y$ (see examples 1 and 2 below). But my question above is about unpaired samples.

Example 1. We have $n$ random people and measure the same characteristic (for example, body weight) at two different moments in time; this gives us $n$ pairs of numbers $(x_1,y_1), \ldots, (x_n,y_n)$. Here we can treat these numbers as a realization of two paired samples (of random variables) $(X_1,\ldots,X_n) \overset{\text{iid}}{\sim} F_X$ and $(Y_1,\ldots,Y_n)\overset{\text{iid}}{\sim} F_Y$, which were generated by two dependent jointly distributed random variables $X$ and $Y$.
Example 2. We have $n$ random people and measure two totally different, unrelated characteristics (for example, year of birth and gender); this gives us $n$ pairs of numbers $(x_1,y_1),\ldots,(x_n,y_n)$. Here we can treat these numbers as a realization of two paired samples (of random variables) $(X_1,\ldots,X_n) \overset{\text{iid}}{\sim} F_X$ and $(Y_1,\ldots,Y_n) \overset{\text{iid}}{\sim} F_Y$, which were generated by two independent jointly distributed random variables $X$ and $Y$.

If we have any doubts, we can apply a test of independence (e.g. the chi-squared test, for categorical data) to find out whether the paired samples $(X_1,\ldots,X_n)$ and $(Y_1,\ldots,Y_n)$ were generated by dependent or independent jointly distributed random variables $X$ and $Y$.
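For categorical paired data like Example 2, this check can be sketched with `scipy.stats.chi2_contingency`; the data below are simulated purely for illustration, and all names and sizes are arbitrary choices.

```python
# Hedged sketch: chi-squared test of independence on paired categorical data.
# The two variables are generated independently here, so a large p-value is expected.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 500

x = rng.integers(0, 3, size=n)   # first characteristic, 3 categories
y = rng.integers(0, 2, size=n)   # second characteristic, 2 categories, independent of x

# Build the contingency table of paired observations (x_i, y_i).
table = np.zeros((3, 2), dtype=int)
for xi, yi in zip(x, y):
    table[xi, yi] += 1

chi2, p, dof, expected = chi2_contingency(table)
# A small p-value would be evidence that X and Y are dependent.
```

The test compares the observed cell counts with the counts expected under independence; it says nothing about *unpaired* samples, because without pairing there is no contingency table to build.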

Best Answer

I am confused by the terminology because the term "independent samples" is usually used as a synonym for "unpaired samples". Does this fact mean that unpaired samples are always generated by independent random variables (assuming we have a probability model for our data)?

No, 'unpaired data' is not always independent.

The answer below first gives an interpretation of how 'unpaired' relates to independence. After that, it gives two examples of how two samples can still be dependent even when there is no pairing.

Unpairing data

Yes, a set of paired data does lose its dependency when you discard the pairing (i.e. shuffle the labels).

The example below shows what happens when we remove the pairing of two correlated variables.

See that point at the top of the left graph, around $x,y = 2,2.5$: if the data are unpaired, then the x-coordinate matched with this $y = 2.5$ could be anything drawn from the distribution of $x$ values.

[Figure: intuitive graph of the paired and unpaired data]
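The unpairing effect can be simulated directly. This is a minimal sketch (coefficients, sample size, and seed are arbitrary): correlated pairs lose their correlation once one sample is shuffled.

```python
# Sketch: destroying the pairing of correlated data removes the sample correlation.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

x = rng.standard_normal(n)
y = 0.8 * x + 0.6 * rng.standard_normal(n)   # each y_i is correlated with its paired x_i

r_paired = np.corrcoef(x, y)[0, 1]           # close to 0.8 by construction

y_shuffled = rng.permutation(y)              # break the pairing
r_unpaired = np.corrcoef(x, y_shuffled)[0, 1]  # close to 0
```

The marginal distributions of the two samples are untouched by the shuffle; only the pairwise association disappears.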


Dependency without pairing

Independence between samples occurs when the outcomes of the two variables are unrelated: the probability distribution $f_Y(y)$ does not depend on the $X_i$ and, vice versa, the probability distribution $f_X(x)$ does not depend on the $Y_j$.

Gathering samples pairwise, such that the variables might have some relation, is one practical setting in which a dependency can arise. Because the two measurements come from the same unit (e.g. the same time, person, or place), the probability distribution of one element of a pair can depend on the value of the other element of the pair.

But there are other ways in which the sample $X$ could influence the density $f_Y(y)$ that are not pairwise relationships (or, more generally, relationships among tuples of more than two).

For instance, the parameters in $f_Y(y)$ could depend on $\sum X_i$.

Example: consider two i.i.d. random samples of size $n$,

$$\begin{array}{rcl} (X_1, \ldots, X_n) & \overset{\text{iid}}{\sim} & N(0, 1) \\ (Y_1, \ldots, Y_n) & \overset{\text{iid}}{\sim} & N(\mu,\sigma^2) \end{array}\\ \text{with $\mu = \frac{1}{n}\sum_{i=1}^{n}{X_i}$ and $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}{(X_i-\mu)^2}$} $$
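A small simulation makes this dependence visible. The following is a hedged sketch of the construction above (sample size, number of repetitions, and seed are arbitrary): over many repetitions, the mean of the $Y$ sample tracks the mean of the $X$ sample, even though no $Y_j$ is paired with any particular $X_i$.

```python
# Sketch: Y is drawn with parameters computed from the X sample,
# so the two samples are dependent without any pairing.
import numpy as np

rng = np.random.default_rng(1)
n = 100

def draw_samples():
    x = rng.standard_normal(n)         # X_i ~ N(0, 1)
    mu = x.mean()                      # mu = (1/n) sum X_i
    sigma = x.std()                    # sqrt of (1/n) sum (X_i - mu)^2
    y = rng.normal(mu, sigma, size=n)  # Y_j ~ N(mu, sigma^2), iid given the X sample
    return x, y

# Correlation between the two sample means across many repetitions.
xbars, ybars = zip(*[(x.mean(), y.mean())
                     for x, y in (draw_samples() for _ in range(2000))])
r = np.corrcoef(xbars, ybars)[0, 1]
# r is clearly positive, revealing the dependence between the samples.
```

Conditionally on the $X$ sample, the $Y_j$ are still i.i.d., so the samples look perfectly ordinary in isolation; the dependence only shows up when both samples are considered jointly.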

Not explicitly paired, but still related

It might also be that you have two variables that are not explicitly paired, and are not stated as 'paired data', but that become dependent when combined with additional metadata. For example, recordings of cloudiness and recordings of rainfall from two different datasets can be 'paired' based on date and time.

I admit that this point is a bit semantic. But it is meant to warn against taking data from different datasets, e.g. Twitter messages from Donald Trump or Elon Musk and daily stock-exchange positions, and assuming there is no dependency just because there is no explicit pairing (the pairing is not obvious since the data have different dimensions, but the samples can still be related in ways more complex than pairing).
