A probability distribution is a mathematical function that describes a random variable. A little more precisely, it is a function that assigns probabilities to numbers, and its output has to agree with the axioms of probability.
A statistical model is an abstract, idealized description of some phenomenon in mathematical terms, using probability distributions. Quoting Wasserman (2013):
A statistical model $\mathfrak{F}$ is a set of distributions (or
densities or regression functions). A parametric model is a set
$\mathfrak{F}$ that can be parameterized by a finite number of
parameters. [...]
In general, a parametric model takes the form
$$ \mathfrak{F} = \{ f (x; \theta) : \theta \in \Theta \} $$
where $\theta$ is an unknown parameter (or vector of parameters) that
can take values in the parameter space $\Theta$. If $\theta$ is a
vector but we are only interested in one component of $\theta$, we
call the remaining parameters nuisance parameters. A nonparametric
model is a set $\mathfrak{F}$ that cannot be parameterized by a
finite number of parameters.
In many cases we use distributions as models. You can use the binomial distribution as a model for the counts of heads in a series of coin throws. In such a case we assume that this distribution describes, in a simplified way, the actual outcomes. This does not mean that it is the only way to describe such a phenomenon, nor that the binomial distribution can be used only for this purpose. A model can use one or more distributions, and Bayesian models additionally specify prior distributions.
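To make the coin-throw example concrete, here is a minimal sketch of the binomial pmf as a parametric model $f(x; \theta)$, where the single parameter $\theta = p$ is the probability of heads (the fair-coin value $p = 0.5$ below is just an assumption for illustration):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): the modeled probability of
    seeing exactly k heads in n throws when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The model for "number of heads in 10 throws of a fair coin":
# the parametric family {f(x; p) : p in [0, 1]}, evaluated at p = 0.5.
print(binom_pmf(5, 10, 0.5))  # 0.24609375
```

Choosing a different $p$ gives a different member of the same parametric family; fitting the model means picking the member that best describes the observed counts.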
More formally this is discussed by McCullagh (2002):
According to currently accepted theories [Cox and Hinkley (1974),
Chapter 1; Lehmann (1983), Chapter 1; Barndorff-Nielsen and Cox
(1994), Section 1.1; Bernardo and Smith (1994), Chapter 4] a
statistical model is a set of probability distributions on the sample
space $\mathcal{S}$. A parameterized statistical model is a parameter
$\Theta$ set together with a function $P : \Theta \rightarrow
\mathcal{P} (\mathcal{S})$, which assigns to each parameter point
$\theta \in \Theta$ a probability distribution $P_\theta$ on
$\mathcal{S}$. Here $\mathcal{P}(\mathcal{S})$ is the set of all
probability distributions on $\mathcal{S}$. In much of the following, it is
important to distinguish between the model as a function $ P : \Theta
\rightarrow \mathcal{P} (\mathcal{S}) $, and the associated set of
distributions $P_\Theta \subset \mathcal{P}(\mathcal{S})$.
So statistical models use probability distributions to describe data. Parametric models are additionally described in terms of a finite set of parameters.
This does not mean that all statistical methods need probability distributions. For example, linear regression is often described in terms of a normality assumption, but in fact it is fairly robust to departures from normality; we need the assumption of normally distributed errors only for confidence intervals and hypothesis tests. So for regression to work we don't need that assumption, but to have a fully specified statistical model we need to describe it in terms of random variables, and hence in terms of probability distributions. I mention this because you can often hear people say that they used a regression model for their data -- in most such cases they mean that they described the data in terms of a linear relation between the target values and the predictors, using some parameters, rather than insisting on conditional normality.
McCullagh, P. (2002). What is a statistical model? Annals of Statistics, 1225–1267.
Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer.
I hope the author of that text is a contributor to this site, because I am about to argue that they make a fundamental error, and I would like it if they were around to defend themselves.
And then what we want to do is build a predictive function so that if we get a new individual, we can predict whether they're going to respond to chemotherapy up here, as an orange person, or not respond down here as a green person.
This is subtly wrong; this is not what we want to do. Our goal in such a study is to develop a decision rule that will advise us on how to act when presented with a case. That is, our decision rule should tell us whether we should apply the therapy to a case. This is related, but not equivalent, to predicting whether they will respond, as I will elaborate on below.
The correct procedure for developing such a rule does involve prediction:
- Develop a model that predicts the probability that an individual will respond to treatment.
- Use the model, along with an understanding of the benefits and costs of treatment, to develop a decision rule that advises doctors on procedure.
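The two steps above can be sketched as follows. Everything here is hypothetical: `predict_p_respond` stands in for whatever fitted model step 1 produces, and the benefit and cost numbers are invented placeholders.

```python
def predict_p_respond(features):
    """Step 1: a stand-in for a fitted predictive model. In practice
    this would be e.g. a logistic regression's predicted probability;
    here it just returns a fixed hypothetical value."""
    return 0.3

def decide_treat(p_respond, benefit_if_respond, cost_of_treatment):
    """Step 2: treat when the expected benefit exceeds the cost.
    Benefits and costs live outside the model, so they can change
    (new prices, new legislation) without retraining anything."""
    return p_respond * benefit_if_respond > cost_of_treatment

p = predict_p_respond(features=None)
print(decide_treat(p, benefit_if_respond=10.0, cost_of_treatment=2.0))  # True: 0.3 * 10 > 2
```

Note that the same predicted probability of 0.3 would lead to "do not treat" under a different cost structure; only the rule changes, not the model.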
In many problems the benefits and costs change quickly in response to our understanding of the situation, or to outside influences like legislation or new technology. If we follow the above procedure, only the decision rule has to adapt to these changes; the modeled probabilities are invariant. They express only our underlying scientific knowledge about the treatment and its effects. This is a separation of concerns, which engineers have long known is a powerful tool for organizing work.
It is important that our model predicts probabilities. This is what allows us to incorporate information about the benefits and costs into our decision rule: we can calculate the expected value and cost of treatment for an individual, and balance them according to our goals. If instead we insist on the model telling us "responds" or "not responds", we give up our power to make nuanced decisions based on these benefits and costs, and cede our ability to adapt to an ever-changing landscape.
The author falls into this trap. In the picture of overlapping distributions, they argue that prediction is difficult because in the regions of large overlap the model cannot meaningfully make a binary yes-or-no call on "responds to treatment". But this is simply the truth about most situations we encounter in life, and it is exactly why it is important to base our reasoning on probabilities: probabilities quantify the degree of uncertainty we have in making a yes-or-no call. In the overlapping distributions there is no difficulty at all in assigning probabilities to "responds to treatment". It is only when we ignore this reality and attempt to say with certainty what will happen that issues arise. The author's difficulty is manufactured out of their own incorrect procedure.
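To illustrate, here is a sketch with two made-up overlapping class-conditional normal densities (the means, spread, and prior are all assumptions, not the author's data). Even deep in the overlap region, Bayes' rule yields a perfectly well-defined probability of "responds":

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma^2) at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def p_responds(x, prior=0.5):
    """P(responds | x) via Bayes' rule, with hypothetical
    class-conditional densities for the two groups."""
    f_r = normal_pdf(x, 1.0, 1.0)   # responders centered at +1
    f_n = normal_pdf(x, -1.0, 1.0)  # non-responders centered at -1
    return prior * f_r / (prior * f_r + (1 - prior) * f_n)

print(p_responds(0.0))  # 0.5: maximal overlap, honest uncertainty
print(p_responds(3.0))  # near 1: far from the overlap, near certainty
```

At $x = 0$ a forced yes/no call would be arbitrary, but the probability 0.5 is an accurate, usable statement about our uncertainty.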
Another issue that comes up is that prediction is slightly more challenging than inference.
This is not generally the opinion of most literature or of the wise people I have discussed these issues with. I wonder if the author is using some quirky definition of "prediction" and "inference".
To me, inference is using modeling to understand the true mechanisms that underlie a phenomenon. We want to be able to say things like "increasing the treatment drug by $x$ ccs will lead to an improvement in outcomes by $y$ amount". To do inference, we first need a model that describes the phenomenon well (the gold standard would be our ability to use the model to make predictions). We then use the shape of that model to distill understanding about what is going on.
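As a toy illustration of such an inferential statement (the dose and outcome numbers below are invented), fitting a line by least squares and reading off the slope gives exactly the "increase the dose by one cc, improve the outcome by the slope" interpretation:

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

doses = [0.0, 1.0, 2.0, 3.0]       # hypothetical ccs of the drug
outcomes = [1.0, 3.1, 4.9, 7.0]    # hypothetical outcome scores
print(ols_slope(doses, outcomes))  # ~2.0: one extra cc -> roughly 2 points of improvement
```

The inferential content is the slope itself, an interpretable quantity about the mechanism, not any particular prediction the fitted line makes.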
In prediction, we don't much care about the model being introspectable. If it is too complicated for us to understand, so be it, as long as its predictions are accurate. Prediction studies loosen some of the constraints we must meet to use a model for inference. The author seems to have it backwards.
A most excellent reference that is quite readable and really helped me clarify my thinking on this subject is Shmueli: To Explain or Predict.
Coming from a behavioural sciences background, I associate this terminology particularly with introductory statistics textbooks. In this context the distinction is that:
The important point is that any statistic, inferential or descriptive, is a function of the sample data. A parameter is a function of the population, where "population" here means the underlying data-generating process.
From this perspective the status of a given function of the data as a descriptive or inferential statistic depends on the purpose for which you are using it.
That said, some statistics are clearly more useful in describing relevant features of the data, and some are well suited to aiding inference.
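A small sketch of that dual role (the sample values are invented): the sample mean describes the data in hand, and the very same number, combined with a standard error, becomes an inferential estimate of the population mean.

```python
from math import sqrt
from statistics import mean, stdev

def describe_and_infer(sample):
    """The same function of the data serves both purposes:
    m summarizes this sample (descriptive use), while the
    confidence interval is a claim about the population
    (inferential use)."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))      # standard error of the mean
    ci95 = (m - 1.96 * se, m + 1.96 * se)       # approximate 95% CI
    return m, ci95

m, ci = describe_and_infer([4.1, 5.0, 5.6, 4.7, 5.2])
print(m, ci)
```

Nothing about the arithmetic changes between the two uses; only the question being asked of the number does.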
So from this perspective, the important things to understand are:
Thus, you could either define the distinction between descriptive and inferential statistics based on the intention of the researcher using them, or classify a given statistic based on how it is typically used.