Unless the closed form solution is extremely expensive to compute, it generally is the way to go when it is available. However,

For most nonlinear regression problems there is no closed form solution.

Even in linear regression (one of the few cases where a closed form solution is available), it may be impractical to use the formula. The following example shows one way in which this can happen.

For linear regression on a model of the form $y=X\beta$, where $X$ is a matrix with full column rank, the least squares solution,

$\hat{\beta} = \arg \min \| X \beta -y \|_{2}$

is given by

$\hat{\beta}=(X^{T}X)^{-1}X^{T}y$

Now, imagine that $X$ is a very large but sparse matrix. e.g. $X$ might have 100,000 columns and 1,000,000 rows, but only 0.001% of the entries in $X$ are nonzero. There are specialized data structures for storing only the nonzero entries of such sparse matrices.

Also imagine that we're unlucky, and $X^{T}X$ is a fairly dense matrix with a much higher percentage of nonzero entries. Storing a dense 100,000 by 100,000 element $X^{T}X$ matrix would then require $1 \times 10^{10}$ floating point numbers (at 8 bytes per number, this comes to 80 gigabytes.) This would be impractical to store on anything but a supercomputer. Furthermore, the inverse of this matrix (or more commonly a Cholesky factor) would also tend to have mostly nonzero entries.

However, there are iterative methods for solving the least squares problem that require no more storage than $X$, $y$, and $\hat{\beta}$ and never explicitly form the matrix product $X^{T}X$.

In this situation, using an iterative method is much more computationally efficient than using the closed form solution to the least squares problem.

This example might seem absurdly large. However, large sparse least squares problems of this size are routinely solved by iterative methods on desktop computers in seismic tomography research.

Assuming that `mus[d]`

is $\mu_j$, `j.sigma`

is $\Sigma_j$, and `G(x)/M(x)`

indeed computes the posterior probability of component $j$ given the data $x$,
$$p(j \mid x) = \frac{\rho_j \mathcal{N}_x(\mu_j, \Sigma_j)}{\sum_k \rho_k \mathcal{N}_x(\mu_k, \Sigma_k)},$$
the gradient itself seems correct to me. But here are some things that I noticed that might help you to find your problem:

- I would expect the access to the mean, the covariance, and the calculation of the posterior to all involve either
`j`

or `d`

, whichever variable represents the component for which you want to compute the gradient in your code. If you tell us what `j`

and `d`

stand for, we might be able to tell you more.
- If
`G(x)/M(x)`

accesses `j.Sigma`

to compute the posterior, your code might not compute what you think it does. It might be better to first compute *all* the gradients of *all* the parameters, and then perform the update.
Stochastic gradient descent is usually not the first choice to optimize mixtures of Gaussians. Most often, expectation maximization (EM) is used (see, for example, Bishop, 2007). Even if you don't use EM, you might want to consider BFGS or L-BFGS (implemented in `scipy.optimize`

) before using SGD. And even if you stick to SGD, you should consider using multiple data points ("batches") at a time to estimate the gradient, or at least including a momentum term. Briefly looking at Toscano and McMurray's paper, my guess is that they chose to use SGD because they were interested in modeling speech acquisition in a biologically more plausible way, rather than obtaining the best possible fit, and doing this online (that is, one data point at a time). If you don't need this, my advice would be to use EM.

(I just realized you specifically asked for *online* learning, so the only viable option for you might be to add the momentum term to speed things up a bit.)

The way you chose to compute the gradient is quite inefficient, which will further slow down learning. You might not have seen reasonable results because it takes forever before the algorithm converges to something interesting. Here is a slightly better way to calculate the gradient:

```
sigmaInv = inv(j.sigma)
dSigma = G(x)/M(x) * 0.5 * (-sigmaInv + numpy.sum(sigmaInv.dot(x - mus[d]) * x))
```

There are still ways to further improve the computation of the gradient. For example, we still get a valid ascent (although not a steepest ascent) direction if we multiple the gradient by a positive definite matrix (such as $\Sigma_j$, which would simplify the gradient a bit). It might also work better if we used a different parametrization of the covariance, such as Cholesky factors, and computed the gradients of those instead.

## Best Answer

The main reason why gradient descent is used for linear regression is the computational complexity: it's computationally cheaper (faster) to find the solution using the gradient descent in some cases.

The formula which you wrote looks very simple, even computationally, because it only works for univariate case, i.e. when you have only one variable. In the multivariate case, when you have many variables, the formulae is slightly more complicated on paper and requires

muchmore calculations when you implement it in software: $$\beta=(X'X)^{-1}X'Y$$ Here, you need to calculate the matrix $X'X$ then invert it (see note below). It's an expensive calculation. For your reference, the (design) matrix X has K+1 columns where K is the number of predictors and N rows of observations. In a machine learning algorithm you can end up with K>1000 and N>1,000,000. The $X'X$ matrix itself takes a little while to calculate, then you have to invert $K\times K$ matrix - this is expensive.So, the gradient descent allows to save a lot of time on calculations. Moreover, the way it's done allows for a trivial parallelization, i.e. distributing the calculations across multiple processors or machines. The linear algebra solution can also be parallelized but it's more complicated and still expensive.

Additionally, there are versions of gradient descent when you keep only a piece of your data in memory, lowering the requirements for computer memory. Overall, for extra large problems it's more efficient than linear algebra solution.

This becomes even more important as the dimensionality increases, when you have thousands of variables like in machine learning.

Remark. I was surprised by how much attention is given to the gradient descent in Ng's lectures. He spends nontrivial amount of time talking about it, maybe 20% of entire course. To me it's just an implementation detail, it's how exactly you find the optimum. The key is in formulating the optimization problem, and how exactly you find it is nonessential. I wouldn't worry about it too much. Leave it to computer science people, and focus on what's important to you as a statistician.Having said this I must qualify by saying that it is indeed

important to understandthecomputational complexity and numerical stabilityof the solution algorithms. I still don't think you must know the details of implementation and code of the algorithms. It's not the best use of your time as a statistician usually.Note 1. I wrote that you have to invert the matrix for didactic purposes and it's not how usually you solve the equation. In practice, the linear algebra problems are solved by using some kind of factorization such as QR, where you don't directly invert the matrix but do some other mathematically equivalent manipulations to get an answer. You do this because matrix inversion is an expensive and numerically unstable operation in many cases.This brings up another little advantageof the gradient descent algorithm as a side effect: it works even when the design matrix has collinearity issues. The usual linear algebra path would blow up and gradient descent will keep going even for collinear predictors.