You're confusing apples with oranges. That's ok, because they are both delicious.

Maximum likelihood estimation is about *what* you minimize, gradient descent is about *how* you minimize it.

Why not MLE for linear regression?

In fact, linear regression *is* solved with maximum likelihood estimation. The standard "minimize the sum of squared errors" method is exactly mathematically equivalent to maximum likelihood estimation using a conditional normal distribution.

Why not gradient descent for logistic regression?

You can totally solve logistic regression by minimizing the likelihood function *using* gradient descent. It's a great exercise in fact, and I'd recommend everyone do it at least once.

Gradient descent is not the standard method though. That prize goes to iteratively re-weighted least squares / Newton's method, which is an enhancement to gradient descent that takes into account the second derivative as well. This method just turns out to have much better properties than gradient descent, but is trickier to understand and implement.

**The linear regression model**

$Y = X\beta + \epsilon$, where $\epsilon \sim N(0,I\sigma^2)$

$Y \in \mathbb{R}^{n}$, $X \in \mathbb{R}^{n \times p}$ and $\beta \in \mathbb{R}^{p}$

Note that our model error (residual) is ${\bf \epsilon = Y - X\beta}$. Our goal is to find a vector of $\beta$s that minimize the $L_2$ norm squared of this error.

**Least Squares**

Given data $(x_1,y_1),...,(x_n,y_n)$ where each $x_{i}$ is $p$ dimensional, we seek to find:

$$\widehat{\beta}_{LS} = {\underset \beta {\text{argmin}}} ||{\bf \epsilon}||^2 = {\underset \beta {\text{argmin}}} ||{\bf Y - X\beta}||^2 = {\underset \beta {\text{argmin}}} \sum_{i=1}^{n} ( y_i - x_{i}\beta)^2 $$

**Maximum Likelihood**

Using the model above, we can set up the likelihood of the data given the parameters $\beta$ as:

$$L(Y|X,\beta) = \prod_{i=1}^{n} f(y_i|x_i,\beta) $$

where $f(y_i|x_i,\beta)$ is the pdf of a normal distribution with mean 0 and variance $\sigma^2$. Plugging it in:

$$L(Y|X,\beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - x_i\beta)^2}{2\sigma^2}}$$

Now generally when dealing with likelihoods its mathematically easier to take the log before continuing (products become sums, exponentials go away), so let's do that.

$$\log L(Y|X,\beta) = \sum_{i=1}^{n} \log(\frac{1}{\sqrt{2\pi\sigma^2}}) -\frac{(y_i - x_i\beta)^2}{2\sigma^2}$$

Since we want the maximum likelihood estimate, we want to find the maximum of the equation above, with respect to $\beta$. The first term doesn't impact our estimate of $\beta$, so we can ignore it:

$$ \widehat{\beta}_{MLE} = {\underset \beta {\text{argmax}}} \sum_{i=1}^{n} -\frac{(y_i - x_i\beta)^2}{2\sigma^2}$$

Note that the denominator is a constant with respect to $\beta$. Finally, notice that there is a negative sign in front of the sum. So finding the maximum of a negative number is like finding the minimum of it without the negative. In other words:

$$ \widehat{\beta}_{MLE} = {\underset \beta {\text{argmin}}} \sum_{i=1}^{n} (y_i - x_i\beta)^2 = \widehat{\beta}_{LS}$$

Recall that for this to work, we had to make certain model assumptions (normality of error terms, 0 mean, constant variance). This makes least squares equivalent to MLE under certain conditions. See here and here for more discussion.

For completeness, note that the solution can be written as:

$${\bf \beta = (X^TX)^{-1}X^Ty} $$

## Best Answer

Linear regression is, as the name suggests, a linear model. This enables us to use linear algebra to find its parameters, this is called ordinary least squares. We cannot use OLS for generalized linear models like logistic regression, because they are non-linear. GLMs are defined in terms of a linear predictor

$$ \eta = \boldsymbol{X} \beta $$

that is passed through the link function $g$ to obtain the prediction

$$ E(Y\,|\,\boldsymbol{X} ) = \mu = g^{-1}(\eta) $$

Because the link function is non-linear, we cannot use linear algebra to find the parameters, but we need an optimization algorithm. Since generalized linear models are defined in terms of conditional distributions, we can fit them using maximum likelihood. On another hand, you can fit non-linear curves to the data by minimizing squared error, but because of the non-linearity, to do it you would use an optimization algorithm as well.