Optimization – Usefulness of Linear Regression Convexity Without Closed Form Solution

mathematical-statistics, matrix-inverse, optimization, regression

The optimisation problem in linear regression, $f(\beta) = ||y-X\beta||^2$, is convex (it is a quadratic function), and when $(X^TX)$ is invertible we have a unique solution given by the closed form $\beta = (X^T X)^{-1} X^Ty$. However, how is convexity useful in cases where there is no closed form solution? When there are, for instance, infinitely many solutions, the fact that local minima are also global minima does not seem to be of much help.

Best Answer

When $(X^TX)$ is not invertible there is not one solution but infinitely many: an affine subspace of them. But they are still closed form solutions in a way: they are the solutions of the linear system $(X^TX)\beta=X^Ty$, and solving this system is, I think, not fundamentally more complicated than inverting a matrix.
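To make this concrete, here is a minimal NumPy sketch (the toy data and variable names are mine, not from the question): when a column of $X$ is duplicated so that $X^TX$ is singular, `np.linalg.lstsq` and the pseudoinverse still return a solution of the normal equations, namely the minimum-norm member of the affine subspace of solutions.

```python
import numpy as np

# Toy data (not from the question): duplicate a column so X^T X is singular.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X = np.hstack([X, X[:, [0]]])   # 4th column repeats the 1st -> rank 3, not 4
y = rng.normal(size=20)

# The normal equations (X^T X) beta = X^T y still have solutions;
# lstsq and the pseudoinverse return the minimum-norm one.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
beta_pinv = np.linalg.pinv(X) @ y

print(np.allclose(beta_lstsq, beta_pinv))          # same minimum-norm solution
print(np.allclose(X @ beta_lstsq, X @ beta_pinv))  # same fitted values
```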

It is also possible to solve it with a convex optimization algorithm. But then there is no single minimum, since the function reaches its constant minimal value on an affine subspace: imagine a straight line as the bottom of a valley (a straight river). The function is not strictly convex, only convex. I think most convex optimization algorithms will still converge to one of the solutions, finishing at some point in that affine subspace.
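As an illustration (again a toy sketch, with the step size and iteration count chosen arbitrarily), plain gradient descent started from two different points lands on two different minimizers, but both attain the same minimal objective value:

```python
import numpy as np

def gd_least_squares(X, y, beta0, n_iter=5000):
    """Plain gradient descent on f(beta) = 0.5 * ||y - X beta||^2."""
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X).max()   # safe constant step size
    beta = beta0.astype(float).copy()
    for _ in range(n_iter):
        beta -= lr * (X.T @ (X @ beta - y))        # gradient of 0.5*||y - X beta||^2
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
X = np.hstack([X, X[:, [0]]])    # duplicated column -> singular X^T X again
y = rng.normal(size=20)

b1 = gd_least_squares(X, y, np.zeros(4))
b2 = gd_least_squares(X, y, rng.normal(size=4))

print(np.allclose(b1, b2))                        # False: two different minimizers
print(np.isclose(np.sum((y - X @ b1) ** 2),
                 np.sum((y - X @ b2) ** 2)))      # True: same minimal objective value
```

The two runs differ only in their component along the null space of $X$, which gradient descent never moves: each run simply finishes at whatever point of the affine subspace of minimizers is reachable from its starting point.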

So the case where the matrix is not invertible is not so special in terms of matrix method versus convex optimization algorithm. It is just that linear regression admits an especially simple solution via matrices or linear systems, whereas general problems do not, and there you have to find the minimum with an iterative method: an optimization algorithm.

Note that there are cases where, in spite of its apparent simplicity, inverting the matrix has a much higher algorithmic complexity than a reasonably precise convex optimization algorithm: just inverting the $p \times p$ matrix $X^TX$ already costs on the order of $p^3$ operations for $p$ features. This is usually when there are lots of features (hundreds to millions). That's why people use convex optimization methods even for linear regression.
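For example, an iterative Krylov solver such as SciPy's `lsqr` only needs matrix-vector products with $X$ and $X^T$, so it never forms or inverts $X^TX$; a rough sketch with an arbitrarily chosen wide problem:

```python
import numpy as np
from scipy.sparse.linalg import lsqr

# Hypothetical wide problem: far more features than observations.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2000))
y = rng.normal(size=500)

# LSQR is an iterative Krylov method: each iteration only needs the
# products X @ v and X.T @ u, so the 2000 x 2000 matrix X^T X is
# never formed or inverted.
beta = lsqr(X, y, atol=1e-8, btol=1e-8)[0]
print(beta.shape)                      # (2000,)
print(np.linalg.norm(y - X @ beta))    # residual of the least-squares fit
```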

Finally, when the matrix is not invertible, it is usually because there are not enough data to estimate $\beta$ with realistic precision, and the solution is extremely over-fitted. People will then use ridge regularization. The solution is $\beta=(X^TX+\lambda I)^{-1}X^Ty$, and the matrix $(X^TX+\lambda I)$ is always invertible for $\lambda>0$.
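A minimal sketch of that ridge solution, with an arbitrary $\lambda$ (the system is solved with `np.linalg.solve` rather than an explicit inverse, which is numerically preferable but computes the same thing):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))
X = np.hstack([X, X[:, [0]]])      # duplicated column -> X^T X singular
y = rng.normal(size=20)

lam = 0.1                          # arbitrary regularization strength
p = X.shape[1]

# (X^T X + lambda I) is invertible for any lambda > 0, so the ridge
# solution is unique even though X^T X alone is singular.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```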