The definitions given on Wikipedia are arguably a bit cryptic to those unfamiliar with higher mathematics or statistics:
In mathematical terms, a statistical model is usually thought of as a
pair $(S, \mathcal{P})$, where $S$ is the set of possible
observations, i.e. the sample space, and $\mathcal{P}$ is a set of
probability distributions on $S$.

In probability and statistics, a probability distribution assigns
a probability to each measurable subset of the possible outcomes of a
random experiment, survey, or procedure of statistical inference.
Examples are found whose sample space is non-numerical, where the
distribution would be a categorical distribution.
I am a high school student very interested in this field as a hobby, and I am currently struggling with the difference between a statistical model and a probability distribution.
My current, and very rudimentary, understanding is this:
- statistical models are mathematical attempts to approximate measured distributions
- probability distributions are measured descriptions from experiments that assign probabilities to each possible outcome of a random event
The confusion is further compounded by the tendency in the literature to use the words "distribution" and "model" interchangeably, or at least in very similar situations (e.g. binomial distribution vs binomial model).
Can someone verify or correct my definitions, and perhaps offer a more formal (albeit still in simple English) approach to these concepts?
Best Answer
A probability distribution is a mathematical function that describes a random variable. A little more precisely, it is a function that assigns probabilities to numbers, and its output has to agree with the axioms of probability.
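To make "agrees with the axioms of probability" concrete, here is a minimal sketch. The fair six-sided die, the `pmf` dictionary and the `prob` helper are all assumptions made up for illustration; they are not part of the definitions quoted above.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in outcomes}  # probability of each single outcome

def prob(event):
    """Probability of an event (a collection of distinct outcomes) under pmf."""
    return sum((pmf[x] for x in event), Fraction(0))

# The three axioms, checked on this example:
assert all(p >= 0 for p in pmf.values())           # non-negativity
assert prob(outcomes) == 1                         # the whole sample space has probability 1
even, odd = {2, 4, 6}, {1, 3, 5}
assert prob(even | odd) == prob(even) + prob(odd)  # additivity for disjoint events

print(prob({5, 6}))  # 1/3
```

Nothing here was measured in an experiment: the distribution is a mathematical object in its own right, and whether it describes a real die is a separate question.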
A statistical model is an abstract, idealized description of some phenomenon in mathematical terms, using probability distributions. Quoting Wasserman (2013):
In many cases we use distributions as models (you can check this example). You can use the binomial distribution as a model of the number of heads in a series of coin throws. In that case we assume that this distribution describes, in a simplified way, the actual outcomes. This does not mean that it is the only way to describe such a phenomenon, nor that the binomial distribution can be used only for this purpose. A model can use one or more distributions, and Bayesian models also specify prior distributions.
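The coin example can be sketched in a few lines. This is only an illustration under stated assumptions: `binomial_pmf` is a helper written here, and the count of 7 heads in 10 throws is made-up data. With `p` fixed you have one binomial distribution; leaving `p` free gives a whole family of candidate distributions, which is the binomial model, and the data are used to pick one member of it.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10  # number of coin throws

# A single distribution: the success probability is fixed at 0.5.
fair = [binomial_pmf(k, n, 0.5) for k in range(n + 1)]

# A model: the family {Binomial(n, p) : 0 <= p <= 1}.
# Hypothetical data: 7 heads were observed in the 10 throws.
observed_heads = 7
p_hat = observed_heads / n  # maximum-likelihood estimate of p

# The member of the family selected by the data.
fitted = [binomial_pmf(k, n, p_hat) for k in range(n + 1)]

print(p_hat)
print(round(sum(fair), 10), round(sum(fitted), 10))  # each is a distribution, so each sums to 1
```

This is one reason the two words appear in the same places: "binomial distribution" names one such list of probabilities, while "binomial model" names the assumption that some member of the family describes the coin.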
More formally this is discussed by McCullagh (2002):
So statistical models use probability distributions to describe data in their terms. Parametric models are also described in terms of a finite set of parameters.
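In the notation of the question's pair $(S, \mathcal{P})$, a parametric model can be written as follows. This is a sketch for illustration; the symbols $\theta$, $\Theta$, $k$, $n$ and $p$ are introduced here and do not come from the sources cited.

$$\mathcal{P} = \{ P_\theta : \theta \in \Theta \}, \qquad \Theta \subseteq \mathbb{R}^k .$$

For example, the binomial model for the number of heads in $n$ coin throws has $S = \{0, 1, \dots, n\}$ and

$$\mathcal{P} = \{ \mathrm{Binomial}(n, p) : p \in [0, 1] \},$$

so the single parameter $p$ indexes the whole set of candidate distributions, while each fixed value of $p$ gives one probability distribution on $S$.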
None of this means that all statistical methods need probability distributions. For example, linear regression is often described in terms of a normality assumption, but in fact it is fairly robust to departures from normality, and the assumption of normal errors is needed only for confidence intervals and hypothesis tests. So regression works without such an assumption, but to have a fully specified statistical model we need to describe it in terms of random variables, and for that we need probability distributions. I mention this because you often hear people say that they used a regression model for their data. In most such cases they mean that they describe the data in terms of a linear relation between target values and predictors using some parameters, rather than insisting on conditional normality.
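The regression point can be illustrated with a small sketch. The data and the `fit_line` helper below are made up for illustration. The least-squares fit is plain arithmetic on the data and never mentions a distribution. A distributional assumption enters only in the last step, and only as an assumption stated in the comment.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Step 1: a description of the data. No probability distribution is involved.
intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Step 2: a fully specified statistical model. Here we ADD the assumption
# that errors are independent Normal(0, sigma^2); sigma^2 is then estimated
# from the residuals. Confidence intervals and tests rely on this step.
sigma2_hat = sum(r ** 2 for r in residuals) / (len(xs) - 2)

print(intercept, slope, sigma2_hat)
```

Step 1 alone is what many people mean by "I used a regression model"; step 2 is what makes it a statistical model in the formal sense.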
McCullagh, P. (2002). What is a statistical model? Annals of Statistics, 1225-1267.
Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer.