Solved – Proof of correctness of normal equation

machine learning, multiple regression, optimization, regression

I was taking an online course and saw linear regression being solved by gradient descent. The intuition behind why that method would work seemed plausible.

I tried to understand the normal equation, that is, why setting the individual partial derivatives to 0 and solving the resulting equations gives the optimal values of theta, but that didn't ring a bell.

  1. Why does setting the partial derivatives to zero and solving the equations give the optimal value of theta?
    I also went through the following link. The part about
    The minimum is determined by calculating the partial derivatives of S(β1,β2) with respect to β1 and β2 and setting them to zero

    is still dicey to me.
    Why would it work? Why would it give the values of β1 and β2 that minimize the cost function?
  2. If we have multiple local optima, would the normal equation method give the global optimum? If so, why?

Best Answer

This is really just a matter of using standard results in multivariate calculus. In the ordinary least-squares (OLS) problem, the objective function being minimised is:

$$F(\boldsymbol{\beta}) = ||\mathbf{y} - \mathbf{x} \boldsymbol{\beta} ||^2.$$

The first and second derivatives (gradient and Hessian) are:

$$\begin{equation} \begin{aligned} \nabla F(\boldsymbol{\beta}) &= 2 [ (\mathbf{x}^\text{T} \mathbf{x}) \boldsymbol{\beta} - (\mathbf{x}^\text{T} \mathbf{y}) ], \\[6pt] \nabla^2 F(\boldsymbol{\beta}) &= 2 (\mathbf{x}^\text{T} \mathbf{x}). \\[6pt] \end{aligned} \end{equation}$$
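
These expressions come from expanding the squared norm and differentiating term by term, using the standard matrix-calculus identities $\nabla_{\boldsymbol{\beta}} (\boldsymbol{\beta}^\text{T} \mathbf{a}) = \mathbf{a}$ and $\nabla_{\boldsymbol{\beta}} (\boldsymbol{\beta}^\text{T} \mathbf{A} \boldsymbol{\beta}) = 2 \mathbf{A} \boldsymbol{\beta}$ for symmetric $\mathbf{A}$:

$$F(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{x} \boldsymbol{\beta})^\text{T} (\mathbf{y} - \mathbf{x} \boldsymbol{\beta}) = \mathbf{y}^\text{T} \mathbf{y} - 2 \boldsymbol{\beta}^\text{T} (\mathbf{x}^\text{T} \mathbf{y}) + \boldsymbol{\beta}^\text{T} (\mathbf{x}^\text{T} \mathbf{x}) \boldsymbol{\beta}.$$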

The Gramian matrix $\mathbf{x}^\text{T} \mathbf{x}$ is non-negative definite, since $\mathbf{v}^\text{T} (\mathbf{x}^\text{T} \mathbf{x}) \mathbf{v} = ||\mathbf{x} \mathbf{v}||^2 \geq 0$ for every vector $\mathbf{v}$, with equality only when $\mathbf{x} \mathbf{v} = \mathbf{0}$. So long as the columns of the design matrix $\mathbf{x}$ are linearly independent (which is generally the case in regression problems), this can only happen for $\mathbf{v} = \mathbf{0}$, so the matrix is positive definite. In the latter case the objective function $F$ is strictly convex, which means that it has a global minimum at its unique critical point, which is found by solving:

$$\nabla F(\hat{\boldsymbol{\beta}}) = \mathbf{0} \quad \quad \implies \quad \quad \hat{\boldsymbol{\beta}} = (\mathbf{x}^\text{T} \mathbf{x})^{-1} (\mathbf{x}^\text{T} \mathbf{y}).$$
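
As a quick numerical sanity check (a minimal sketch using NumPy with made-up synthetic data, not anything from the question), the closed-form normal-equation solution agrees with a generic least-squares solver:

```python
import numpy as np

# Illustrative synthetic data (an assumption for this sketch, not from the question).
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal-equation solution beta_hat = (X'X)^{-1} X'y, solved as a linear
# system rather than by forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's generic least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True: both minimise ||y - X*beta||^2
```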

In the least-squares problem, linear independence of the columns of the design matrix $\mathbf{x}$ ensures that we do not have multiple optima. When there is linear dependence in the design matrix, there are excess variables that give "ridges" of minimising values, and the parameters are not identified. This is usually dealt with by removing columns of the design matrix until linear independence is obtained.
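
To illustrate the linearly dependent case (again an invented example, assuming NumPy): making one column a scalar multiple of another makes $\mathbf{x}^\text{T} \mathbf{x}$ singular, and distinct coefficient vectors then produce exactly the same fitted values.

```python
import numpy as np

# Invented rank-deficient design matrix, purely for illustration.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, 2 * x1])  # third column is a multiple of the second
y = 3.0 + 1.5 * x1 + rng.normal(scale=0.1, size=n)

# X'X is singular, so (X'X)^{-1} does not exist and the normal equations
# have infinitely many solutions.
print(np.linalg.matrix_rank(X.T @ X))  # 2, not 3

# Any split of the total effect 1.5*x1 across the two dependent columns fits equally well.
b1 = np.array([3.0, 1.5, 0.0])
b2 = np.array([3.0, 0.5, 0.5])
print(np.allclose(X @ b1, X @ b2))  # True: a "ridge" of equally good coefficient vectors
```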