Solved – What is the relation between a random variable and a sample (or dataset) in machine learning?

machine learning, random variable, sample, terminology

I'm having trouble with the machine learning vocabulary, especially with the concept of random variables.

Given a sample $X$ (with features $x_1, x_2, \dots, x_n$) that you train your algorithm on (or predict on), what is the random variable? Is it $X$? Or is it any of its features $x_1, x_2, \dots, x_n$?

Quoting The Deep Learning book (by Ian Goodfellow):

A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lowercase letter in plain typeface

Whereas the Wikipedia article on random variables gives this definition:

A random variable is a measurable function from a set of possible outcomes to a measurable space E

Then we also have the definition of a multivariate random variable:

A multivariate random variable is a column vector (or its transpose, which is a row vector) whose components are scalar-valued random variables on the same probability space as each other.

Should a sample in fact be considered a multivariate random variable?

Best Answer

A dataset is a sample of the population. A dataset is not a random variable: a random variable has a precise mathematical definition, namely a (measurable) function (you can ignore the "measurable" adjective for now). You can simply think of a random variable as a map from the outcomes of, e.g., an experiment to real numbers.
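To make the "map from outcomes to real numbers" idea concrete, here is a minimal sketch using only Python's standard library. The two-dice experiment and the names `sample_space` and `total` are illustrative assumptions, not something from the question.

```python
import random

# Sample space: every ordered outcome of rolling two six-sided dice.
sample_space = [(a, b) for a in range(1, 7) for b in range(1, 7)]


def total(outcome):
    """A random variable: an ordinary function from an outcome to a number."""
    first, second = outcome
    return first + second


# The randomness lies in which outcome occurs, not in the function itself.
rng = random.Random(0)
outcome = rng.choice(sample_space)

# Applying the random variable to the observed outcome gives one realization.
realization = total(outcome)
print(outcome, "->", realization)
```

Note that `total` itself is fully deterministic; only the choice of `outcome` is random.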

A dataset (or a sample) contains $N$ "realisations" (or "outcomes") of one or more random variables, where $N$ is the size of the dataset. More precisely, each feature of the dataset is (usually) associated with one random variable: when you sample from the population (i.e., you get one row of the dataset), each of the random variables is "realized", that is, you obtain one of the concrete values that the associated random variable can take.

A dataset can thus be considered the realization(s) of a random variable, if there is only one feature, or of multiple random variables (or, equivalently, of a multi-dimensional or multi-variate random variable), if there are multiple features.

In your example, if $X=(x_1, x_2, \dots, x_M)$ is one row of your dataset, then $X$ can be associated with the "realization" of $M$ random variables: $x_1, x_2, \dots, x_M$ are the realizations of these $M$ random variables (which we can denote by $X_1, X_2, \dots, X_M$), that is, they are the outcomes of these random variables. If you have more than one row, then $X$ is actually a matrix, where $x_i$ is a column vector of size $N$ (i.e. the size of the dataset) which contains $N$ realizations of the random variable associated with $x_i$ (which can be denoted by $X_i$).
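The row and column views described above can be sketched in a few lines of Python. The three distributions chosen for $X_1, X_2, X_3$ and the helper name `draw_row` are hypothetical, picked only for illustration; a real population would have its own (usually unknown) distributions.

```python
import random

rng = random.Random(0)

N, M = 5, 3  # N rows (dataset size), M features (random variables)


def draw_row(rng):
    """Draw one joint realization (x_1, x_2, x_3) of the random variables.

    Assumed, purely for illustration:
    X_1 is Gaussian, X_2 is uniform on [0, 1], X_3 is a die roll.
    """
    return [rng.gauss(0.0, 1.0), rng.uniform(0.0, 1.0), float(rng.randint(1, 6))]


# The dataset is an N x M table of realizations, stored as a list of rows.
dataset = [draw_row(rng) for _ in range(N)]

# One row: a single realization of the multivariate variable (X_1, ..., X_M).
row = dataset[0]

# One column: N realizations of a single random variable, here X_2.
column = [r[1] for r in dataset]

print(len(dataset), len(row), len(column))
```

Here `row` has one entry per feature and `column` has one entry per sample, which matches the two ways of reading $X$ in the answer.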

In conclusion, the term "feature" is (usually) equivalent to the term "random variable". A random variable (or feature) can have several realisations (or outcomes).
