[Math] Machine Learning, why not use matrix multiplication instead of gradient descent

Tags: gradient-descent, machine-learning, matrices, regression

If we want to minimize our cost function for a given set of data, why do we use gradient descent, repeatedly updating our guess until we find a minimizing value of theta, when we can just solve for theta directly with matrix multiplication via the equation:

$$a = (M^T M)^{-1} M^T y$$

where ${}^T$ denotes the transpose and $a$ is the column vector of $\theta$ values.

Best Answer

The quick answer is that for optimization problems over big matrices, matrix inversion can be very computationally expensive, so avoiding inversion becomes an important time-saving and space-saving measure: forming $M^T M$ costs $O(np^2)$ for $n$ samples and $p$ features, and inverting it costs roughly $O(p^3)$, whereas a single gradient-descent step costs only $O(np)$. For a sufficiently small set of data, you might be better off directly computing $(M^T M)^{-1}$.
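As a rough illustration (not from the original answer), here is a minimal NumPy sketch comparing the closed-form solution of the normal equation with plain gradient descent on a small synthetic least-squares problem; the data, learning rate, and iteration count are all arbitrary choices for demonstration:

```python
import numpy as np

# Synthetic regression data: y = M @ theta + noise (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
M = rng.normal(size=(100, 3))            # design matrix: 100 samples, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = M @ true_theta + 0.01 * rng.normal(size=100)

# Closed form: theta = (M^T M)^{-1} M^T y.
# Solving the linear system avoids forming the inverse explicitly,
# which is cheaper and numerically safer than np.linalg.inv.
theta_closed = np.linalg.solve(M.T @ M, M.T @ y)

# Gradient descent on the cost J(theta) = ||M @ theta - y||^2 / (2n):
# the gradient is M^T (M @ theta - y) / n.
theta_gd = np.zeros(3)
lr = 0.1
for _ in range(2000):
    grad = M.T @ (M @ theta_gd - y) / len(y)
    theta_gd -= lr * grad

print(theta_closed)
print(theta_gd)  # should agree with the closed-form solution
```

Both routes recover essentially the same $\theta$ here; the trade-off only bites when the feature dimension $p$ is large, where the $O(p^3)$ solve dominates and iterative methods become attractive.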
