Solved – the difference between random variable and random sample

mathematical-statistics, random-variable, sample, terminology

These two expressions confused me a lot when I was learning statistics.
It seems to me that they are totally different things.

A random sample is obtained by randomly drawing observations from a population, whereas a random variable is a function that maps the set of all possible outcomes of an experiment to the real numbers.

However, say I draw some samples $X_1$, $X_2$, $X_3$ with $X_i \sim N(\mu,\sigma^2)$, where $\mu$ and $\sigma$ are unknown. Are $X_1$, $X_2$, $X_3$ random samples or random variables?

Best Answer

A random variable, $X:\Omega \rightarrow \mathbb R$, is a function from the sample space to the real line. It is a deterministic mapping, which can be as simple as writing down the number a die lands on in the random experiment of rolling a die. The experiment is random in that we don't control many of the physical factors determining its outcome; however, as soon as the die lands, the random variable maps the outcome in the physical world to a number.
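The die example can be sketched in a few lines of Python. This is only an illustration of the separation the paragraph describes: the randomness lives in which outcome occurs, while `X` itself is an ordinary deterministic function (here the simplest possible one, the identity map).

```python
import random

# Sample space of the experiment: the six faces of a die.
sample_space = [1, 2, 3, 4, 5, 6]

def X(outcome):
    """Random variable: a deterministic map from a physical outcome to a real number."""
    return float(outcome)

# The experiment is random...
omega = random.choice(sample_space)
# ...but the mapping applied to its outcome is not.
value = X(omega)
print(value)
```

Running it twice may print different numbers, but only because `omega` differs; `X` applied to the same outcome always gives the same value.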

Another example would be measuring the heights of a sample of eighth graders, perhaps to infer population parameters (such as the mean and variance). Each boy or girl would be the outcome of a random experiment, much like tossing a coin. Once a subject is selected, the actual mapping to a number in inches or centimeters is not subject to randomness, despite the name "random variable."

A group of such experiments would constitute a sample: "In statistics, a simple random sample is a subset of individuals (a sample) chosen from a larger set (a population)." This definition is intuitive, but it leaves the term "population" implicit. An attempt at closing this gap is made in this paper, which points out that "the term 'population' as a noun should refer to the sample space, not the random variable as is the case in many textbooks."

A random sample is a collection of $n$ independent and identically distributed (i.i.d.) random variables $X_1, X_2, X_3,\dots, X_n$, in which $X_i$ is the function $X(\cdot)$ applied to the outcome $\omega$ of the $i$-th experiment: $x_i = X_i(\omega)$. Although sampling without replacement doesn't strictly fulfill the independence requirement, this point is usually overlooked when sampling from a large population, in favor of computational expediency.
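The replacement point can be made concrete with the standard library. A minimal sketch, with an arbitrary placeholder population of 1000 labeled units: `random.choices` samples with replacement (independent draws), while `random.sample` samples without replacement (each draw shrinks the remaining pool, so the draws are dependent, though negligibly so when the population is large relative to $n$).

```python
import random

population = list(range(1000))  # placeholder population of 1000 units

# With replacement: i.i.d. draws -- each pick leaves the population unchanged.
with_repl = random.choices(population, k=5)

# Without replacement: dependent draws -- each pick removes a unit,
# which is why duplicates are impossible here but possible above.
without_repl = random.sample(population, k=5)

print(with_repl)
print(without_repl)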

The $n$-tuple $(x_1,x_2,x_3,\dots,x_n)$ is a particular realization of the random variables, which in the case proposed in the question would be drawn from identically distributed $X_i \sim N(\mu,\sigma^2)$ random variables. So in the OP, the process of "drawing some samples" would result in individual realizations of this collection of random variables.
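The distinction between the collection of random variables and one of its realizations can be sketched as follows. The $\mu$ and $\sigma$ in the question are unknown; the values below are illustrative placeholders only.

```python
import random

# Placeholder parameters -- unknown in the question, chosen here for illustration.
mu, sigma = 10.0, 2.0

# One realization (x_1, x_2, x_3) of the collection (X_1, X_2, X_3),
# each X_i ~ N(mu, sigma^2), drawn independently.
x = [random.gauss(mu, sigma) for _ in range(3)]
print(x)
```

Each run of the script is one act of "drawing some samples": the random variables $(X_1, X_2, X_3)$ stay the same, but the realized triple `x` changes from run to run.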

Random variables are the object of mathematical laws, such as the LLN or the CLT. The distribution of the random variable dictates the feasibility of inference from random samples. For example, any given realization will always have a mean and a standard deviation, being an $n$-tuple of real numbers, yet the generating random variables may not have finite moments (e.g., a Pareto distribution with a small shape parameter), compromising statistical inference about the population characteristics.
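This last point can be illustrated with a short simulation, assuming the standard library's `random.gauss` and `random.paretovariate` (the latter draws from a Pareto with minimum value 1; with shape $\alpha = 1$ the distribution has no finite mean). Every simulated sample still has a perfectly well-defined sample mean, but only the normal sample means have a population mean to converge to.

```python
import random

random.seed(42)  # fixed seed so repeated runs are reproducible

def sample_mean(draw, n):
    """Mean of n realizations produced by the zero-argument function `draw`."""
    return sum(draw() for _ in range(n)) / n

# Five sample means of size 1000 from each distribution.
normal_means = [sample_mean(lambda: random.gauss(0, 1), 1000) for _ in range(5)]
pareto_means = [sample_mean(lambda: random.paretovariate(1.0), 1000) for _ in range(5)]

# The normal means cluster near 0 (LLN applies); the Pareto means are
# erratic and occasionally huge, because no finite population mean exists.
print(normal_means)
print(pareto_means)
```

Note that both lists are just real numbers, so nothing in the arithmetic warns you: the failure is a property of the generating random variables, not of any particular realization.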