Why is it that when calculating the sum of squared residuals in matrix form we use $e'e$ (where $e'$ is the transpose of $e$) instead of $e^2$?
When working in matrix form, why is the sum of squared residuals given by $e'e$?
least squares, matrix, multiple regression, regression
Related Solutions
"When wanting to obtain numerical estimates based on a likelihood, is it ever wrong to just minimize the sum of the residuals squared." --- Almost always. If the parameter appears in the likelihood function in a very particular way, then ML corresponds to least squares.
In particular, consider the simple case of a single location parameter, $\mu$.
For us to get least squares, we need to use as our estimate, $\hat{\mu}^\text{LS}$, the value of $\mu$ that minimizes $\sum_i (y_i-\mu)^2$.
So when it comes to maximizing likelihood where the data has density $f$, note that with independent observations, we want to find $\hat{\mu}^\text{ML}$, the value of $\mu$ that maximizes $\prod_i f(y_i;\mu)$. Let $g=\log f$. Then that's the same as maximizing $\sum_i g(y_i;\mu)$.
That's the same as minimizing $c-k \sum_i g(y_i;\mu)$ for any convenient positive $k$ and any convenient real $c$.
So as long as $c-k \log f(y;\mu)=(y-\mu)^2$ for some convenient $c$ and $k$, ML will be least squares.
Consequently $f(y;\mu)=e^{-\frac{1}{k}(y-\mu)^2+\frac{c}{k}}$ for some $k$ and $c$.
The normal density with mean $\mu$ and some given variance $\sigma^2$ is of this form for a suitable choice of constants: $k$ is a function of $\sigma^2$, and $c/k$ is also a function of $\sigma$ that serves to normalize it to a density.
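Spelling out the matching for the normal case:

$$ f(y;\mu)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(y-\mu)^2}{2\sigma^2}} = e^{-\frac{1}{k}(y-\mu)^2+\frac{c}{k}}, \qquad k=2\sigma^2, \quad \frac{c}{k}=-\log\!\left(\sigma\sqrt{2\pi}\right). $$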
So we see, in that simple case at least, that the least squares estimate can be had by finding the ML estimate for a normal location parameter. Many more complicated situations (including regression) work in essentially identical fashion -- to get least squares to be ML, start with estimating location parameters for Gaussian distributed variables.
So if you pick something else for $f$, the MLE for the location parameter doesn't come out to be least squares.
As for the numerical difference: if you're comparing the sum of squares as a function of $\mu$ and $-2\log\mathcal{L}(\mu)$ for the univariate normal, then while their argmins coincide, the values of the two functions at the argmin will generally differ numerically because of the $k$ and $c$ above, which depend on the variance and the sample size.
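As a quick numerical sketch (simulated data with a known, fixed $\sigma$; all numbers here are made up), the two objectives have the same argmin on a grid of $\mu$ values but different minimum values:

```python
import numpy as np

# Simulated data with a known, fixed sigma (illustrative values only).
rng = np.random.default_rng(0)
sigma = 2.0
y = rng.normal(loc=5.0, scale=sigma, size=50)

# Evaluate both objectives on a grid of candidate mu values.
mus = np.linspace(3.0, 7.0, 2001)
sse = np.array([np.sum((y - m) ** 2) for m in mus])
neg2ll = sse / sigma**2 + len(y) * np.log(2 * np.pi * sigma**2)  # -2 log L(mu)

print(mus[np.argmin(sse)], mus[np.argmin(neg2ll)])  # same argmin (the grid point nearest ybar)
print(sse.min(), neg2ll.min())                      # different values at that argmin
```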
"Consider any likelihood where the likelihood can be written as a function of the residuals squared. Then, numerically speaking from an optimization standpoint, what is the difference between maximizing the likelihood and minimizing the sum of squares?"
If by 'a function of the residuals squared' you mean some $\ell\big((y_i-\mu)^2\big)$ other than a straight $\sum_i (y_i-\mu)^2$, then all sorts of possibilities exist.
In comments, whuber mentions $\sum_i \sqrt{(y_i-\mu)^2} = \sum_i |y_i-\mu|$, which is a function of the squared residuals that is not least squares; of course there are infinitely many other such functions that are not least squares, some of which correspond to ML estimators (minimizing $\sum_i |y_i-\mu|$, for instance, is ML for a Laplace location parameter).
Consider the location-scale family of $t_\nu$-distributions, for example. For simplicity, take the scale and $\nu$ to be fixed.
These also have likelihoods which are functions of the squared residuals, but least squares is not ML for them.
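A rough sketch of this (simulated data with fixed scale and $\nu$, plus a few artificial outliers so the difference is visible; nothing here comes from the original post):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import t

# Simulated t_3 data with fixed scale, contaminated by a few gross outliers.
nu, scale = 3, 1.0
y = t.rvs(df=nu, loc=0.0, scale=scale, size=200, random_state=np.random.default_rng(1))
y[:5] += 20.0

# Negative log-likelihood for the location parameter mu (scale and nu held fixed).
def neg_loglik(mu):
    return -np.sum(t.logpdf(y, df=nu, loc=mu, scale=scale))

mle = minimize_scalar(neg_loglik, bounds=(-10.0, 30.0), method="bounded").x
print("least squares estimate (sample mean):", y.mean())
print("t-distribution MLE for the location: ", mle)
```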
After some investigation, I think I found a small (but crucial!) imprecision in your post.
The first formula you wrote, $\operatorname{var}(\varepsilon) = \sigma^2 (I - H)$, is actually not totally exact. The formula should be $\operatorname{var}(\hat \varepsilon) = \sigma^2 (I - H)$, where $\hat\varepsilon = Y - X\hat\beta$ and $\hat\beta = (X^TX)^{-1}X^TY$ is the OLS estimator. Thus $\hat\sigma^2(I - H)$ is an estimator of the variance of the estimated residuals associated with the OLS estimator. This formula does not require full independence of the $\varepsilon_i$, only that they are uncorrelated and all have the same variance $\sigma^2$.

But this is not what you want! You want an estimate of the variance of the true residuals (the $\varepsilon_i$ themselves), not of the estimated residuals under OLS. The OLS estimator corresponds to the maximum likelihood estimator under the hypothesis that the residuals are i.i.d. and normal. The estimated residuals can thus be very poor estimates of the true residuals if these hypotheses are not met, and their covariance matrix can be very different from the covariance matrix of the true residuals.
The second formula you wrote does correspond to the covariance matrix of the $\varepsilon_i$ under the hypothesis that they follow an AR(1) process.
Estimating the covariance matrix of the residuals of a linear regression without any assumption cannot easily be done: you would have more unknowns than data points... So you need to specify some form for the covariance matrix of the residuals. Supposing that they follow an AR(1) process (if this is relevant) is one way of doing so. You can also assume that they have a stationary, parametrized autocorrelation function, estimate its parameters, and use those to deduce the covariance matrix.
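As a small sketch of the two covariance structures discussed above (the design matrix, $\sigma^2$, and $\rho$ below are made up, and the AR(1) form shown is one common parameterization with marginal variance $\sigma^2$; it may differ in detail from the second formula in the original post):

```python
import numpy as np

# Sketch: the covariance of the OLS residuals, var(e_hat) = sigma^2 (I - H),
# versus an AR(1) covariance for the errors themselves. X, sigma^2 and rho
# below are made up for illustration.
n, p = 8, 2
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
sigma2 = 1.5
cov_resid = sigma2 * (np.eye(n) - H)      # covariance of the *estimated* residuals

# One common AR(1) parameterization: marginal variance sigma^2,
# correlation rho^{|i-j|} between errors i and j.
rho = 0.6
i, j = np.indices((n, n))
cov_ar1 = sigma2 * rho ** np.abs(i - j)   # covariance of the AR(1) errors
```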
Best Answer
In matrix notation, the residuals are typically written as an $n \times 1$ column vector $\mathbf{e}$, where $n$ is the number of observations.
$$ \mathbf{e} = \left[ \begin{array}{c} e_1 \\ e_2 \\ \vdots \\ e_n \end{array} \right]$$
Then:
$$ \mathbf{e}'\mathbf{e} = \left[ \begin{array}{cccc} e_1 & e_2 & \ldots & e_n \end{array} \right] \left[ \begin{array}{c} e_1 \\ e_2 \\ \vdots \\ e_n \end{array} \right] = \sum_i e_i^2$$
In contrast, $\mathbf{e}\mathbf{e}$ is rather sloppy (arguably downright wrong) since you can't multiply an $n$ by 1 matrix by another $n$ by 1 matrix.
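A quick numpy illustration of the shape arithmetic (purely a sketch):

```python
import numpy as np

# Residual vector as an n x 1 column, matching the notation above.
e = np.array([[1.0], [-2.0], [3.0]])

# e'e: a (1 x n) times (n x 1) product, i.e. a 1 x 1 matrix holding the sum of squares.
sse = e.T @ e
print(sse[0, 0])        # 14.0 == 1^2 + (-2)^2 + 3^2

# "e e" is not a conformable matrix product: (n x 1) times (n x 1) fails.
try:
    e @ e
except ValueError as err:
    print("shapes do not conform:", err)

# Elementwise squaring, e**2, returns another n x 1 vector rather than a scalar,
# which is why it is not the sum of squared residuals.
print(e ** 2)
```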
Notation note: I personally like the engineering convention that bold lowercase letters denote vectors and normal lowercase letters denote scalars; this notation reduces confusion about what's a vector and what's a scalar.