Solved – What is the difference between a sample and an outcome? (plus events and observations)

definition, sample, terminology

I was sure that these are the same thing, but I do not see the difference when reading about the probability mass function:

Suppose that $X: S \to A$ ($A \subseteq \mathbb{R}$) is a discrete random variable
defined on a sample space $S$. Then the probability mass function $f_X: A \to [0, 1]$ for $X$ is defined as

$$f_X(x) = \Pr(X = x) = \Pr(\{s \in S: X(s) = x\})$$

Thinking of probability as mass helps to avoid mistakes since the
physical mass is conserved as is the total probability for all
hypothetical outcomes $x$: $\sum_{x\in A} f_X(x) = 1$

Well, I just do not get this

The sample space of an experiment is the set of all
possible outcomes of that experiment

The sample set (the set of samples) is the same as the set of outcomes, so we could also call it the outcome set. I just do not get why they are denoted by two different letters, $A$ and $S$, above if they are the same thing, and why the function maps samples to outcomes if they are the same thing.

Is an outcome the value of a sample? What is the value of the sample, then? Functions map values to values. I need a value (of the sample) so that the function (aka the random variable) can produce the value of the outcome.


Edit1 I would like to thank whoever started to elaborate on my question. It is not answered, however, and I cannot accept the answer unless the following is resolved.

I did not get how events/observations are different from samples/outcomes. Also, I see that you identify samples with any type, whereas outcomes are values (you probably meant numeric types, since any type also has values and variables). We programmers write functions in terms of arguments/variables. For readability and reliability, we constrain every variable to some type. The variables/arguments are placeholders that take concrete values at runtime. The types simply constrain the domain of the variables (a type is the domain, i.e. the set of available values, for that variable/argument). I think it is roughly the same in math; you just skip the name of the variable when defining a function like $f: \mathrm{Int} \rightarrow \mathrm{Real}$. So, by 'anything', you mean 'a value of any, not necessarily numerical, type'.

It remains unclear, however, what the point of introducing the domain $S$ is (that is, of separating samples and outcomes). You take heads/tails and convert them into a number. Why not sample real numbers right away (identify $S$ with $A$; why heads/tails, and how does introducing $S$ improve our understanding of the probability mass function)? Or why not make the conversion chain even longer by introducing, for instance, $S_{interm}$, so that you could sample and use a series of random variables to turn the sample into $A$, e.g. $S \xrightarrow{randomvar_1} S_{interm} \xrightarrow{randomvar_2} A$? Why not say that I have 3 pennies and 1 dime and sample $A=\{1,10\}$ right away? Why play around with heads/tails instead?

It is also not clear why 'random variables' appear in the $S \to A$ stage rather than in the sampling that produces the values of $S$. Does it mean that the generation of heads/tails is deterministic, whereas mapping them to the real domain, $\{1,10\}$, is random?

Back to sample = outcome. I see that Wikipedia similarly says

A random variable is a real-valued function defined on a set of
possible outcomes, the sample space Ω.

So how can a random variable map samples to outcomes if the outcomes are the domain rather than the range of the random variable? I think all this confusion deserves clarification.

Best Answer

The word "sample" causes at least two different instances of confusion.

A (what the OP asks about)

The tag "Sample" here on CV starts with "A sample is a subset of a population". Every element that can appear in any possible subset of a population must be a possible event; hence the set of all possible events can be called the "Sample Space" (the "Population Subsets Space"), because it is from that space that the elements of any population subset come.

Where does that leave us regarding the relation with the concept "outcomes"?

The population and its subsets do not consist of the numerical values that the elements of these subsets may take: these numerical values are assigned by the random variable that we have defined according to our needs.

To consider the trivial example, a series of coin flips can be thought of as a population of heads and tails. We define a real-valued random variable by, say, linking "Heads" with the number $5$ and "Tails" with the number $17$. So the Sample Space will be "{Heads, Tails}", which will be the domain of the random variable, while the "outcome space", its range, will be $\{5,17\}$.

In other words, it is not necessary that "the function maps values to values" as the OP states. It can map anything to values.

And strictly speaking, a "sample" of, say, size $3$ will be a set like "{Heads, Heads, Tails}", and not the set $\{5,5,17\}$. This latter set is produced by a specific random variable. Obviously, we could use another random variable and obtain a different numerical representation for the same sample.
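The distinction drawn above can be sketched in Python. This is only an illustration: the fair-coin probabilities, the second random variable `Y`, and the helper `pmf` are assumptions introduced here, not part of the example itself.

```python
from fractions import Fraction

# Sample space: the elements are not numbers.
sample_space = ["Heads", "Tails"]

# Assumption (not stated in the answer): the coin is fair.
prob = {"Heads": Fraction(1, 2), "Tails": Fraction(1, 2)}


def X(s):
    """The random variable from the answer: Heads -> 5, Tails -> 17."""
    return {"Heads": 5, "Tails": 17}[s]


def Y(s):
    """A second, hypothetical random variable on the same sample space."""
    return {"Heads": 0, "Tails": 1}[s]


def pmf(rv, x):
    """f(x) = Pr({s in S : rv(s) = x}), summed over the sample space."""
    return sum((prob[s] for s in sample_space if rv(s) == x), Fraction(0))


# The range ("outcome space") of X is obtained by applying X to the sample space.
range_X = {X(s) for s in sample_space}

# A sample of size 3 from the population, and its realizations under X and Y.
sample = ["Heads", "Heads", "Tails"]
realizations_X = [X(s) for s in sample]
realizations_Y = [Y(s) for s in sample]

print(range_X, realizations_X, realizations_Y)
print(pmf(X, 5), sum(pmf(X, x) for x in range_X))
```

The same sample yields two different numerical lists depending on which random variable is applied, while the probabilities themselves are attached to the elements of the sample space.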

In all, the Sample Space can be non-numerical, while the "set of outcomes" of a real-valued random variable should be real-valued. To each realized sample from a population we can map infinitely many numerical sets. It is no accident that the latter are properly called "a sample of realizations of a random variable", and not just "a sample from a population".

Assume now that we have a coin that reads "$1$" on one side and "$2$" on the other. So the Sample Space here has a numerical nature. Still, we can define a random variable by mapping $1$ to $5$ and $2$ to $17$. Here too, the Sample Space $\{1,2\}$ will be different from the "Outcome Space" $\{5,17\}$.

Our sample of size $3$ (understood as a subset of the population) will here be the set $\{1,1,2\}$, while the "sample of realizations of the (specific) random variable" will be $\{5,5,17\}$.
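For this numerical coin, the probability mass function from the question can be worked out term by term. Assuming (the example does not say so) that the coin is fair, i.e. $\Pr(\{1\}) = \Pr(\{2\}) = 1/2$, and writing $S = \{1,2\}$ and $A = \{5,17\}$:

$$f_X(5) = \Pr(\{s \in S: X(s) = 5\}) = \Pr(\{1\}) = \tfrac{1}{2}, \qquad f_X(17) = \Pr(\{s \in S: X(s) = 17\}) = \Pr(\{2\}) = \tfrac{1}{2}$$

so that $\sum_{x\in A} f_X(x) = 1$. The probabilities are attached to the elements of $S$; the random variable only carries them over to $A$.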

B: Sample and Observation
In fields like medicine or biology, when we say "let's take a sample of blood", we mean "let's take blood once". If we wanted to put this in general statistical terminology, we would have one observation... because in general statistical terminology a "sample" is a set containing usually more than one observation (although it can contain only one).

So when somebody from these fields says "I have available $n$ samples", he may just mean, in general terminology, "I have available $n$ observations" or "I have available one sample of $n$ observations". But someone who is used to the more standard terminology will understand the expression "I have available $n$ samples" as "I have available $n$ sets each containing $m$ observations", and usually $m\geq 1$. One can find this sort of confused communication in various posts here on CV.

ADDENDUM
Responding to the OP's edit in the question:
"Why not sample real numbers right away"? Because the world is not made of numbers. Actual data collection that describes the world is in many cases of a qualitative nature. So, "separating samples and outcomes" follows the nature of things. Moreover, the act of mapping them to numerical values is a separate step, and as I have already mentioned, it is not a unique mapping. So it requires decisions to be made. And whenever decisions are involved, they had better be clear and transparent so that they can be judged, assessed, and criticized. These "decisions" are, to begin with, the choice of the random variable we will use.

"Heads and Tails" exist irrespective of whether we want to study them. The "random variable" is a mathematical concept/tool which we project onto the real-world data in order to analyze and study them. So, samples, they exist. Random variables, they transform samples into something that we can handle using quantitative methods.

As to whether "samples are deterministic", nobody has ever decisively settled whether there exists anything inherently stochastic in nature, or whether all our stochastic approaches are just a reflection of our ignorance and/or of the limits of our measuring devices.