Regression – Is the IID Assumption Necessary in Linear Regression?

machine-learning, probability, regression

In linear or logistic regression, we have the following setup (adapted from *Foundations of Machine Learning*):

As in all supervised learning problems, the learner $\mathcal{A}$ receives a labeled dataset $\mathcal{S}$
containing $N$ i.i.d. samples $\left(\mathbf{x}^{(n)}, y^{(n)}\right)$ drawn from $\mathbb{P}_{\mathcal{D}}$:

$$
\mathcal{S} = \left\{\left(\mathbf{x}^{(1)}, y^{(1)}\right), \left(\mathbf{x}^{(2)}, y^{(2)}\right), \ldots, \left(\mathbf{x}^{(N)}, y^{(N)}\right)\right\} \subset \mathbb{R}^{D} \times \mathcal{Y} \quad \overset{\small{\text{i.i.d.}}}{\sim} \quad \mathbb{P}_{\mathcal{D}}\left(\mathcal{X}, \mathcal{Y} ; \boldsymbol{\beta}\right)
$$
$$
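To make the sampling setup concrete, here is a minimal sketch that draws $N$ i.i.d. pairs. The linear-Gaussian data-generating process and the parameter values are purely illustrative assumptions, since the excerpt does not fix a particular $\mathbb{P}_{\mathcal{D}}$:

```python
import numpy as np

rng = np.random.default_rng(42)
N, D = 100, 3
beta = np.array([1.0, -2.0, 0.5])  # hypothetical "true" parameters

# Each pair (x, y) is an independent draw from the same joint distribution:
# x ~ P(X), then y | x ~ N(x^T beta, 0.1^2). Together the N draws form S.
X = rng.normal(size=(N, D))
y = X @ beta + rng.normal(scale=0.1, size=N)
```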


I am used to the i.i.d. assumption in machine learning, but in the case of conditional maximum likelihood I have the following question.

To use maximum likelihood for linear/logistic regression, we need the $y^{(n)} \mid \mathbf{x}^{(n)}$ to be independent across samples; in other words, the labels are conditionally independent given their inputs. The question is: do we need the strong i.i.d. assumption mentioned above in order to invoke MLE?
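To make the question precise, the conditional likelihood I have in mind factorizes, under that conditional-independence assumption, as

$$
L(\boldsymbol{\beta}) = \prod_{n=1}^{N} p\left(y^{(n)} \,\middle|\, \mathbf{x}^{(n)} ; \boldsymbol{\beta}\right),
$$

which constrains only how the labels behave given the inputs, and says nothing about how the $\mathbf{x}^{(n)}$ themselves were drawn.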

Best Answer

If you are asking about the i.i.d. assumption in machine learning in general, that is already answered in the On the importance of the i.i.d. assumption in statistical learning thread.

As for maximum likelihood, notice that the likelihood function is often written as

$$ \prod_{i=1}^N p(x_i | \theta) $$

where $p(x_i | \theta)$ is the probability density or mass function of the point $x_i$, parameterized by $\theta$. We multiply because we are making an independence assumption; otherwise the joint distribution would not factor into a product of the individual distributions. Moreover, the $p(\cdot | \theta)$ are all the same, so the samples are "identically distributed", and together these two assumptions are the i.i.d. assumption. This does not mean that every likelihood function assumes independence, but that is often the case. The identical-distributions assumption is not strictly necessary either: for example, in a mixture model (e.g. clustering) you assume that individual samples come from different component distributions that together form a mixture.
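As a minimal sketch of how the independence assumption is used in practice (assuming, purely for illustration, univariate Gaussian data), note that it is exactly what turns the product of densities into a sum of log-densities, which is what we maximize numerically:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data, assumed i.i.d. N(mu, sigma^2) purely for illustration.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=200)

def neg_log_likelihood(params, data):
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    # Independence turns prod_i p(x_i | theta) into sum_i log p(x_i | theta).
    log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (data - mu) ** 2 / (2 * sigma**2)
    return -log_p.sum()

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # should roughly recover (2.0, 1.5)
```

Without independence, the objective would need the joint density of all 200 points, which does not decompose into per-sample terms like this.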

Notice that with maximum likelihood we are making such assumptions directly. If you are fitting a decision tree or $k$NN, you are not maximizing any likelihood and the algorithm does not explicitly assume any probability distribution, so you are not explicitly making such an assumption either. It is still the case, however, that you are assuming your data is "all alike" (a kind of i.i.d. or exchangeability assumption): for example, you wouldn't mix data from completely different domains (say, ice-cream sales, sizes of brain tumors, and speeds of Formula 1 cars) and expect the model to return reasonable predictions.

As for logistic regression, that is discussed in the Is there i.i.d. assumption on logistic regression? thread.

It may sound like a tautology, but the assumptions you make need to hold: if your model assumes that the samples are independent, then you need the independence assumption.
