Solved – In layman’s terms what is the difference between a model and a distribution

The definitions given on Wikipedia are arguably a bit cryptic to those unfamiliar with higher mathematics/statistics:

In mathematical terms, a statistical model is usually thought of as a
pair ($S, \mathcal{P}$), where $S$ is the set of possible
observations, i.e. the sample space, and $\mathcal{P}$ is a set of
probability distributions on $S$.

In probability and statistics, a probability distribution assigns
a probability to each measurable subset of the possible outcomes of a
random experiment, survey, or procedure of statistical inference.
Examples are found whose sample space is non-numerical, where the
distribution would be a categorical distribution.

I am a high school student who is very interested in this field as a hobby, and I am currently struggling with the difference between a statistical model and a probability distribution.

My current, and very rudimentary, understanding is this:

statistical models are mathematical attempts to approximate measured distributions
probability distributions are measured descriptions from experiments that assign probabilities to each possible outcome of a random event

The confusion is further compounded by the tendency in the literature to use the words "distribution" and "model" interchangeably, or at least in very similar situations (e.g. binomial distribution vs. binomial model).

Can someone verify/correct my definitions, and perhaps offer a more formalized (albeit still in simple English) approach to these concepts?

Best Answer

A probability distribution is a mathematical function that describes a random variable. A little more precisely, it is a function that assigns probabilities to numbers (the values the random variable can take), and its output has to agree with the axioms of probability.

A statistical model is an abstract, idealized description of some phenomenon in mathematical terms, using probability distributions. Quoting Wasserman (2013):

A statistical model $\mathfrak{F}$ is a set of distributions (or densities or regression functions). A parametric model is a set $\mathfrak{F}$ that can be parameterized by a finite number of parameters. [...]

In general, a parametric model takes the form

$$ \mathfrak{F} = \{ f (x; \theta) : \theta \in \Theta \} $$

where $\theta$ is an unknown parameter (or vector of parameters) that can take values in the parameter space $\Theta$. If $\theta$ is a vector but we are only interested in one component of $\theta$, we call the remaining parameters nuisance parameters. A nonparametric model is a set $\mathfrak{F}$ that cannot be parameterized by a finite number of parameters.

In many cases we use distributions as models. For example, you can use the binomial distribution as a model of the number of heads in a series of coin throws. In that case we assume that this distribution describes, in a simplified way, the actual outcomes. This does not mean that it is the only way to describe such a phenomenon, nor that the binomial distribution can be used only for this purpose. A model can use one or more distributions, and Bayesian models also specify prior distributions.
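To make the coin example concrete, here is a minimal sketch in plain Python. It treats "the binomial distribution" as one fixed choice of $n$ and $p$, and "the binomial model" as the whole family of such distributions indexed by $p$. The function names, the value of `observed_heads`, and the grid of candidate values are assumptions made up for illustration, not part of any library.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n throws when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10  # number of coin throws

# One distribution: both n and p are fixed, so every outcome has a known probability.
fair_coin = [binomial_pmf(k, n, 0.5) for k in range(n + 1)]

# A model: a family of distributions, one member for every candidate value of p.
def binomial_model(n, p):
    return [binomial_pmf(k, n, p) for k in range(n + 1)]

# Using the model: pick the family member under which the data are most probable.
observed_heads = 7  # hypothetical data, invented for this example
candidates = [i / 100 for i in range(101)]  # coarse grid over the parameter space
p_hat = max(candidates, key=lambda p: binomial_pmf(observed_heads, n, p))

print(sum(fair_coin))  # probabilities of a single distribution add up to 1
print(p_hat)           # the candidate p that best explains 7 heads in 10 throws
```

The grid search is only there to keep the sketch short; it stands in for whatever estimation method you would actually use.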

More formally, this is discussed by McCullagh (2002):

According to currently accepted theories [Cox and Hinkley (1974), Chapter 1; Lehmann (1983), Chapter 1; Barndorff-Nielsen and Cox (1994), Section 1.1; Bernardo and Smith (1994), Chapter 4] a statistical model is a set of probability distributions on the sample space $\mathcal{S}$. A parameterized statistical model is a parameter set $\Theta$ together with a function $P : \Theta \rightarrow \mathcal{P} (\mathcal{S})$, which assigns to each parameter point $\theta \in \Theta$ a probability distribution $P\theta$ on $\mathcal{S}$. Here $\mathcal{P}(\mathcal{S})$ is the set of all probability distributions on $\mathcal{S}$. In much of the following, it is important to distinguish between the model as a function $P : \Theta \rightarrow \mathcal{P} (\mathcal{S})$, and the associated set of distributions $P\Theta \subset \mathcal{P} (\mathcal{S})$.

So statistical models use probability distributions to describe data. Parametric models are additionally described in terms of a finite set of parameters.
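The distinction McCullagh draws between the model as a function $P : \Theta \rightarrow \mathcal{P}(\mathcal{S})$ and the model as a set of distributions can be illustrated with a small sketch. It assumes the sample space $\mathcal{S} = \{0, 1\}$ and represents a distribution as a Python dict; the two parameterizations (`model_a`, `model_b`) are names invented here for illustration.

```python
import math

def bernoulli(p):
    """A distribution on the sample space {0, 1}, stored as outcome -> probability."""
    return {0: 1 - p, 1: p}

# Parameterization A: theta is the success probability, Theta = (0, 1).
def model_a(theta):
    return bernoulli(theta)

# Parameterization B: theta is the log-odds, Theta = the whole real line.
def model_b(theta):
    return bernoulli(1 / (1 + math.exp(-theta)))

# Different functions (different parameter spaces and mappings) ...
print(model_a(0.5))
print(model_b(0.0))

# ... that reach the same distributions: log-odds log(4) corresponds to p = 0.8.
print(model_a(0.8))
print(model_b(math.log(4)))
```

Both functions sweep out the same set of Bernoulli distributions, so as sets they are the same model, while as functions from parameters to distributions they differ.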

This does not mean that all statistical methods need probability distributions. For example, linear regression is often described in terms of a normality assumption, but in fact it is fairly robust to departures from normality; the assumption that the errors are normal is needed for confidence intervals and hypothesis testing. So regression works without such an assumption, but to have a fully specified statistical model we need to describe it in terms of random variables, and therefore we need probability distributions. I mention this because you will often hear people say that they used a regression model for their data. In most such cases they mean that they describe the data in terms of a linear relation between target values and predictors using some parameters, rather than insisting on conditional normality.
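As a rough illustration of the regression point, the sketch below fits a straight line by ordinary least squares to simulated data whose errors are uniform rather than normal. The true intercept and slope (2 and 3), the error range, and the helper name `least_squares` are all assumptions chosen for this example.

```python
import random

random.seed(0)  # fixed seed so the simulated data are reproducible

# Simulated data: a linear relation plus errors that are uniform on [-1, 1],
# i.e. mean zero but clearly not normally distributed.
xs = [i / 10 for i in range(200)]
ys = [2.0 + 3.0 * x + random.uniform(-1.0, 1.0) for x in xs]

def least_squares(xs, ys):
    """Closed-form ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = least_squares(xs, ys)
print(intercept, slope)  # estimates of the assumed true values 2 and 3
```

Nothing in `least_squares` refers to a probability distribution: it only minimizes squared distances. A distributional assumption about the errors enters once you want to attach confidence intervals or tests to the estimates.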

McCullagh, P. (2002). What is a statistical model? The Annals of Statistics, 30(5), 1225-1267.

Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer.
