Solved – Likelihood in Linear Regression

Tags: likelihood, linear, linear model, probability, regression

I am trying to understand how people derive the likelihood for simple linear regression. Let's say that we just have one feature $x$ and the outcome $y$. I do not doubt the expression with the normal density itself, and I also do not doubt that one can factor the product into simpler factors due to independence. What I doubt is how people derive this expression. There seems to be a whole zoo of (partially incorrect) assumptions about the input, and almost everywhere the critical step (namely, how to derive the product of normal densities), where one actually has to use the correct assumptions, is left out 🙁

What I think is natural to assume is the following (a generative sketch follows the list): We are given a fixed training set $(x_i, y_i)_{i=1,2,…,n}$ and assume that

  1. the pairs $(x_i, y_i)$ in the fixed training set of length $n$ come from random variables $(X_i, Y_i)$ that are iid
  2. $Y_i = \beta_0 X_i + \epsilon_i$
  3. the $\epsilon_i$ are one-dimensional iid random variables, each distributed as $\mathcal{N}(0, \sigma^2)$ with $\sigma$ known (in order to simplify). (Maybe one should assume something about the conditional density $f_{\epsilon_i|X_i}$ here? People seem to be uncertain what to actually assume here…)
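
For concreteness, here is a minimal sketch of the data-generating process these assumptions describe. The input distribution is my arbitrary choice; the assumptions above say nothing about it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta0, sigma = 50, 2.0, 1.0

X = rng.uniform(-2, 2, size=n)        # distribution of X_i is NOT pinned down by the assumptions
eps = rng.normal(0.0, sigma, size=n)  # iid N(0, sigma^2), drawn independently of X (assumption 3)
Y = beta0 * X + eps                   # assumption 2
```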

Let $Y = (Y_1, …, Y_n)$ and let $X = (X_1, …, X_n)$. Now the goal is to determine the conditional density $f_{Y|X} = \frac{f_{(Y,X)}}{f_X}$. Clearly,
$$f_{Y|X} = \prod_{i=1}^n f_{Y_i|X_i}$$
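
Spelled out, this factorization uses only assumption 1 (independence of the pairs):

$$f_{Y\mid X}(y\mid x)=\frac{f_{(Y,X)}(y,x)}{f_X(x)}=\frac{\prod_{i=1}^n f_{(Y_i,X_i)}(y_i,x_i)}{\prod_{i=1}^n f_{X_i}(x_i)}=\prod_{i=1}^n f_{Y_i\mid X_i}(y_i\mid x_i)$$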

Question:

How to proceed from here?

I do not see how the assumptions give information about $f_{(Y_i, X_i)}$ or about $f_{X_i}$, so I simply cannot compute the quantity $f_{Y_i|X_i} = \frac{f_{(Y_i, X_i)}}{f_{X_i}}$. Also, some people might think that $Y_i = \beta_0 X_i + \epsilon_i$ with $\epsilon_i$ normally distributed (or $\epsilon_i|X_i$ normally distributed) means that $Y_i|X_i$ is also normally distributed, but…

There is a statement for normally distributed random variables, but it goes like this: if $X$ is normally distributed, $A$ is a fixed matrix, and $b$ is a fixed vector, then $AX+b$ is normally distributed again. In the case above, $b$ would be $\beta_0 X_i$, which is not constant.

Other sources seem to assume right away that $Y_i|X_i$ is normally distributed. This seems to be a weird assumption… how should we ever be able to test that on a real dataset?

Regards + thanks,

FW

Best Answer

The key assumption needed to derive $f_{Y_i|X_i}$ is that the noise is independent of the input, that is, $\epsilon_i$ is independent of $X_i$. You don't need to know or assume anything about the distribution of $X_i$.

You start with:

$$f_{Y_i|X_i}(y\mid x)=p(Y_i=y\mid X_i=x)=p(\beta_0x+\epsilon_i=y\mid X_i=x)=p(\epsilon_i=y-\beta_0x\mid X_i=x)$$

Now the independence assumption is used: since $\epsilon_i$ is independent of $X_i$, its density given a value of $X_i$ is simply its marginal density:

$$p(\epsilon_i=y-\beta_0x\mid X_i=x)=p(\epsilon_i=y-\beta_0x)=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(y-\beta_0x)^2/(2\sigma^2)}$$
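
In code, this conditional density turns the likelihood into a sum of Gaussian log-densities of the residuals. A minimal sketch (the function name and the default known $\sigma$ are my choices):

```python
import numpy as np

def log_likelihood(beta0, x, y, sigma=1.0):
    """Log of prod_i f_{Y_i|X_i}(y_i | x_i) for the model
    Y_i = beta0*X_i + eps_i, eps_i ~ N(0, sigma^2) independent of X_i,
    with sigma assumed known."""
    resid = y - beta0 * x  # equals eps_i under the model
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))
```

Maximizing this in $\beta_0$ is the same as minimizing the squared residuals, which for this through-the-origin model gives $\hat\beta_0=\sum_i x_iy_i/\sum_i x_i^2$.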

You could alternatively say that the distribution of the noise conditional on $X_i$ is normal with mean 0 and constant variance, whatever the value of $X_i$. This is what really matters. But this is strictly equivalent to the usual pair of assumptions (a small numerical illustration follows the list):

  • $\epsilon_i$ is independent of $X_i$
  • $\epsilon_i$ is normally distributed (with mean 0 and variance $\sigma^2$)
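
A quick way to see the equivalence numerically: draw the noise independently of a deliberately non-normal input and check that the residuals have the same mean and spread in every slice of $x$. A sketch with arbitrary parameter choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0, sigma = 100_000, 2.0, 0.5

x = rng.exponential(1.0, size=n)      # non-normal input: the result does not care
eps = rng.normal(0.0, sigma, size=n)  # independent of x
y = beta0 * x + eps

resid = y - beta0 * x
slices = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))  # 4 slices of x
for s in range(4):
    r = resid[slices == s]
    print(s, round(r.mean(), 3), round(r.std(), 3))  # ~0.0 and ~0.5 in every slice
```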