Solved – What does “independent observations” mean

assumptions, independence, multilevel-analysis, probability, sampling

I'm trying to understand what the assumption of independent observations means. Some definitions are:

  1. "Two events are independent if and only if $P(a \cap b) = P(a) * P(b)$." (Statistical Terms Dictionary)
  2. "the occurrence of one event doesn't change the probability for another" (Wikipedia).
  3. "sampling of one observation does not affect the choice of the second observation" (David M. Lane).

An example of dependent observations that's often given is students nested within teachers as below. Let's assume that teachers influence students but students don't influence one another.

So how are these definitions violated for these data? Sampling [grade = 7] for [student = 1] does not affect the probability distribution for the grade that will be sampled next. (Or does it? And if so, then what does observation 1 predict regarding the next observation?)

Why would the observations be independent if I had measured gender instead of teacher_id? Don't they affect the observations in the same way?

teacher_id   student_id   grade
         1            1       7
         1            2       7
         1            3       6
         2            4       8
         2            5       8
         2            6       9

Best Answer

In probability theory, statistical independence (which is not the same as causal independence) is defined by your property (1), and properties (2) and (3) then follow as consequences$\dagger$. The events $\mathcal{A}$ and $\mathcal{B}$ are said to be statistically independent if and only if:

$$\mathbb{P}(\mathcal{A} \cap \mathcal{B}) = \mathbb{P}(\mathcal{A}) \cdot \mathbb{P}(\mathcal{B}) .$$

If $\mathbb{P}(\mathcal{B}) > 0$ then it follows that:

$$\mathbb{P}(\mathcal{A} |\mathcal{B}) = \frac{\mathbb{P}(\mathcal{A} \cap \mathcal{B})}{\mathbb{P}(\mathcal{B})} = \frac{\mathbb{P}(\mathcal{A}) \cdot \mathbb{P}(\mathcal{B})}{\mathbb{P}(\mathcal{B})} = \mathbb{P}(\mathcal{A}) .$$

This means that statistical independence implies that the occurrence of one event does not affect the probability of the other. Another way of saying this is that the occurrence of one event should not change your beliefs about the other. The concept of statistical independence is generally extended from events to random variables in a way that allows analogous statements to be made for random variables, including continuous random variables (which have zero probability of any particular outcome). Treatment of independence for random variables basically involves the same definitions applied to distribution functions.
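As an illustration of the defining product rule, here is a minimal simulation sketch (the dice and the particular events are purely illustrative choices, not anything from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Two independent fair dice (independence is built into the simulation).
die1 = rng.integers(1, 7, size=n)
die2 = rng.integers(1, 7, size=n)

# Illustrative events: A = "first die is even", B = "second die is at least 5".
A = (die1 % 2 == 0)
B = (die2 >= 5)

p_A = A.mean()           # close to 1/2
p_B = B.mean()           # close to 1/3
p_AB = (A & B).mean()    # close to 1/6

print(p_AB, p_A * p_B)   # the two numbers should nearly coincide
```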


It is crucial to understand that independence is a very strong property - if events are statistically independent then (by definition) we cannot learn about one from observing the other. For this reason, statistical models generally involve assumptions of conditional independence, given some underlying distribution or parameters. The exact conceptual framework depends on whether one is using Bayesian methods or classical methods. The former involves explicit dependence between observable values, while the latter involves a (complicated and subtle) implicit form of dependence. Understanding this issue properly requires a bit of understanding of classical versus Bayesian statistics.

Statistical models will often say they use an assumption that sequences of random variables are "independent and identically distributed (IID)". For example, you might have an observable sequence $X_1, X_2, X_3, ... \sim \text{IID N} (\mu, \sigma^2)$, which means that each observable random variable $X_i$ is normally distributed with mean $\mu$ and standard deviation $\sigma$. Each of the random variables in the sequence is "independent" of the others in the sense that its outcome does not change the stated distribution of the other values. In this kind of model we use the observed values of the sequence to estimate the parameters in the model, and we can then in turn predict unobserved values of the sequence. This necessarily involves using some observed values to learn about others.
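For example, a minimal sketch of that workflow (the sample size and parameter values are illustrative assumptions only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are the observed values of an IID N(mu, sigma^2) sequence.
mu_true, sigma_true = 5.0, 2.0
observed = rng.normal(mu_true, sigma_true, size=50)

# Use the observed values to estimate the parameters ...
mu_hat = observed.mean()
sigma_hat = observed.std(ddof=1)

# ... and then use those estimates to predict an unobserved value.
# A natural point prediction for the next value in the sequence is mu_hat.
print("point prediction for the next value:", mu_hat)
```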

Bayesian statistics: Everything is conceptually simple. Assume that $X_1, X_2, X_3, ... $ are conditionally IID given the parameters $\mu$ and $\sigma$, and treat those unknown parameters as random variables. Given any non-degenerate prior distribution for these parameters, the values in the observable sequence are (unconditionally) dependent, generally with positive correlation. Hence, it makes perfect sense that we use observed outcomes to predict later unobserved outcomes - they are conditionally independent, but unconditionally dependent.
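A small simulation sketch of this point (all prior and noise scales are assumed, illustrative values): if $\mu$ is drawn from a prior and two observations are then drawn conditionally IID given $\mu$, the two observations are unconditionally positively correlated.

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 200_000

# Illustrative prior and noise scales (assumed values).
m0, s0 = 5.0, 1.0    # prior: mu ~ N(m0, s0^2)
sigma = 2.0          # noise: X_i | mu ~ N(mu, sigma^2)

mu = rng.normal(m0, s0, size=reps)
x1 = rng.normal(mu, sigma)   # conditionally independent given mu ...
x2 = rng.normal(mu, sigma)

# ... but unconditionally correlated: corr = s0^2 / (s0^2 + sigma^2) = 0.2 here.
print(np.corrcoef(x1, x2)[0, 1])
```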

Classical statistics: This is quite complicated and subtle. Assume that $X_1, X_2, X_3, ... $ are IID given the parameters $\mu$ and $\sigma$, but treat those parameters as "unknown constants". Since the parameters are treated as constants, there is no clear difference between conditional and unconditional independence in this case. Nevertheless, we still use the observed values to estimate the parameters and make predictions of the unobserved values. Hence, we use the observed outcomes to predict later unobserved outcomes even though they are notionally "independent" of each other. This apparent incongruity is discussed in detail in O'Neill, B. (2009) Exchangeability, Correlation and Bayes' Effect. International Statistical Review 77(2), pp. 241 - 250.
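As a sketch of how classical prediction nevertheless uses the observed values (the data are simulated, and the interval is the standard normal-theory prediction interval):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
observed = rng.normal(5.0, 2.0, size=30)   # notionally IID N(mu, sigma^2)

n = observed.size
mu_hat = observed.mean()
s = observed.std(ddof=1)

# Classical 95% prediction interval for the next (unobserved) value:
# mu_hat +/- t_{0.975, n-1} * s * sqrt(1 + 1/n).
t = stats.t.ppf(0.975, df=n - 1)
half_width = t * s * np.sqrt(1 + 1 / n)
print(mu_hat - half_width, mu_hat + half_width)
```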


Applying this to your student grades data, you would probably model something like this by assuming that grade is conditionally independent given teacher_id. You would use the data to make inferences about the grading distribution for each teacher (which would not be assumed to be the same) and this would allow you to make predictions about the unknown grade of another student. Because the observed grades are used in that inference, they will affect your prediction of the unknown grade of another student. Replacing teacher_id with gender does not change this; in either case you have a variable that you might use as a predictor of grade.
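For instance, a minimal classical-style sketch with the data from the question, estimating each teacher's grading distribution by its group mean and using that as the prediction for a new student of that teacher:

```python
import numpy as np

teacher_id = np.array([1, 1, 1, 2, 2, 2])
grade      = np.array([7, 7, 6, 8, 8, 9])

# Estimate each teacher's grading distribution separately
# (here just the mean, since the distributions are not assumed equal).
for t in np.unique(teacher_id):
    mean_t = grade[teacher_id == t].mean()
    print(f"predicted grade for a new student of teacher {t}: {mean_t:.2f}")
```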

If you use Bayesian methods, you will have an explicit assumption of conditional independence and a prior distribution for the teachers' grade distributions, and this leads to unconditional (predictive) dependence of grades, allowing you to rationally use one grade in your prediction of another. If you use classical statistics, you will have an assumption of independence (based on parameters that are "unknown constants") and you will use classical statistical prediction methods that allow you to use one grade to predict another.
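A minimal Bayesian sketch of the same idea, using an assumed conjugate normal-normal model with a known within-teacher standard deviation (the prior and noise values are illustrative assumptions, not estimated from the data):

```python
import numpy as np

teacher_id = np.array([1, 1, 1, 2, 2, 2])
grade      = np.array([7, 7, 6, 8, 8, 9])

# Assumed model: grade | mu_t ~ N(mu_t, sigma^2), prior mu_t ~ N(m0, tau^2).
sigma, m0, tau = 1.0, 7.0, 1.0

for t in np.unique(teacher_id):
    y = grade[teacher_id == t]
    n = y.size
    # Standard conjugate posterior mean: a precision-weighted average of
    # the prior mean and the teacher's observed mean.
    post_mean = (n / sigma**2 * y.mean() + m0 / tau**2) / (n / sigma**2 + 1 / tau**2)
    # The predictive mean for a new student of this teacher is post_mean, so the
    # observed grades change the prediction: unconditional (predictive) dependence.
    print(f"predictive mean grade for a new student of teacher {t}: {post_mean:.2f}")
```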


$\dagger$ There are some foundational presentations of probability theory that define independence via the conditional probability statement and then give the joint probability statement as a consequence. This is less common.
