[Math] Gradient descent on a non-convex function works. How?

gradient-descent, non-convex-optimization, numerical-optimization, optimization, svd

For the Netflix Prize competition on recommendations, one method used stochastic gradient descent, popularized by Simon Funk, who used it to approximately solve an SVD. The math is explained better here on pg. 152. A rating is predicted by

$$
\hat{r}_{ui} = \mu + b_i + b_u + q_i^Tp_u
$$

The regularized squared error is

$$
\min_{b*,q*,p*} \sum_{u,i} (r_{ui} - \mu - b_i - b_u - q_i^T p_u)^2 +
\lambda_4 (b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2)
$$

Taking partial derivatives with respect to each variable, and scaling by a constant for the update rule, we get equations such as

$$
b_u \leftarrow b_u + \gamma (e_{ui} - \lambda_4 \cdot b_u)
$$
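
To make the updates concrete, here is a minimal SGD sketch in Python/NumPy for the model above. Only the prediction rule, the objective, and the $b_u$ update come from the text; the remaining updates for $b_i$, $q_i$ and $p_u$ follow the same gradient pattern, and all names (`n_users`, `n_items`, `n_factors`, the initialization) are my own illustrative choices, not Funk's or Koren's code.

```python
import numpy as np

def sgd_funk_svd(ratings, n_users, n_items, n_factors=10,
                 gamma=0.005, lam4=0.02, n_epochs=20, seed=0):
    """Minimize the regularized squared error above with SGD.

    ratings : list of (u, i, r_ui) triples for the observed entries only.
    gamma   : learning rate; lam4 : regularization constant (lambda_4).
    """
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])               # global mean
    b_u = np.zeros(n_users)                                # user biases
    b_i = np.zeros(n_items)                                # item biases
    p = 0.1 * rng.standard_normal((n_users, n_factors))    # user factors p_u
    q = 0.1 * rng.standard_normal((n_items, n_factors))    # item factors q_i

    for _ in range(n_epochs):
        for idx in rng.permutation(len(ratings)):          # stochastic: one rating at a time
            u, i, r = ratings[idx]
            e = r - (mu + b_i[i] + b_u[u] + q[i] @ p[u])    # e_ui = r_ui - r_hat_ui
            b_u[u] += gamma * (e - lam4 * b_u[u])           # the update shown above
            b_i[i] += gamma * (e - lam4 * b_i[i])
            q_old = q[i].copy()
            q[i] += gamma * (e * p[u] - lam4 * q[i])
            p[u] += gamma * (e * q_old - lam4 * p[u])
    return mu, b_u, b_i, p, q
```

Each observed rating contributes one small step along a noisy negative gradient, which is exactly the non-convex setting the question is about.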

[The rest is skipped]. My question is this: the paper here (Wayback Machine) states that the squared error function above is non-convex because of the $q_i^T p_u$ term, where both variables are unknown, which is of course correct. Then how do we guarantee that the stochastic gradient descent scheme described in all these papers will find a global minimum? I read somewhere that SGD can be used to solve non-convex functions; I'd like to hear some details on how. The SGD SVD method described in the links works well in practice.

Also, according to the Koren IEEE paper, alternating least squares can be applied to solve the SVD problem described above. This method alternates between holding one or the other variable constant, thereby creating convex problems in the process. I wonder if SGD, as it works through dimensions and data points one by one, also creates convex sub-problems in a similar way as it proceeds.
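
For illustration, here is a rough sketch (my own, with the bias terms dropped for brevity) of one ALS half-step: with the item factors held fixed, each user vector is the solution of a small ridge regression, which is a convex sub-problem, and the item half-step is symmetric.

```python
import numpy as np

def als_half_step(obs_by_row, fixed, lam, n_factors):
    """One ALS half-step: re-solve every row of the free factor matrix
    while `fixed` (the other factor matrix) is held constant.

    obs_by_row[k] : list of (other_index, rating) pairs observed for row k.
    """
    out = np.zeros((len(obs_by_row), n_factors))
    reg = lam * np.eye(n_factors)
    for k, obs in enumerate(obs_by_row):
        if not obs:
            continue
        F = np.array([fixed[j] for j, _ in obs])   # relevant rows of the fixed factors
        r = np.array([val for _, val in obs])
        # ridge regression (F^T F + lam*I) x = F^T r : a convex problem per row
        out[k] = np.linalg.solve(F.T @ F + reg, F.T @ r)
    return out

# alternate until convergence:
#   P = als_half_step(ratings_by_user, Q, lam, k)
#   Q = als_half_step(ratings_by_item, P, lam, k)
```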

Maybe the answer is simply this presentation by LeCun: stochastic gradient descent on non-convex functions has no theoretical guarantees, but this doesn't mean we should not use it. "When Empirical Evidence suggests a fact for which we don't have theoretical guarantees, it just means the theory is inadequate."

Best Answer

I was able to reach Brandyn Webb (aka S. Funk) and he clarified some of my questions. The ideas that led to SGD SVD can be found in this paper

http://courses.cs.washington.edu/courses/cse528/09sp/sanger_pca_nn.pdf

where Sanger, building on the work of Oja

http://users.ics.aalto.fi/oja/Oja1982.pdf

describes finding the top N eigenvectors, with probability 1, through a method called the Generalized Hebbian Algorithm (GHA) implemented as a neural network. That is, training this particular network ends up doing PCA (very cool).
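
As a rough illustration of Sanger's rule (my own sketch, not code from the paper), the GHA update for a linear network looks like this; run over enough zero-mean samples with a small learning rate, the rows of `W` tend toward the top principal components.

```python
import numpy as np

def gha_step(W, x, eta=0.01):
    """One Generalized Hebbian Algorithm (Sanger) update.

    W : (n_components, n_features) weight matrix of the linear network.
    x : a single zero-mean input sample.
    The lower-triangular term subtracts the projections onto the
    components already being learned, which pushes the rows of W
    toward the orthogonal principal directions.
    """
    y = W @ x                                                  # network outputs
    W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)  # Sanger's rule
    return W
```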

Webb was researching NNs at the time (way before Netflix):

I was looking at a reciprocal arrangement between neurons in cortex and had worked out an implied learning algorithm which struck me as something that might have an analytical solution. [..] My original incremental formulation (not too different) from a few years before that was, rather than a gradient, just an expansion of matrix multiplication based on the matrix in question being the sum of outer products of individual sample vectors, and was, therefore, exactly the same as a common numeric approach to finding eigenvectors (just keep squaring the matrix until it settles to many copies of your first eigenvector [i.e. the power method]; then repeat with that one removed, and so on) -- for sufficiently small learning rate, anyway... The result is practically the same as the gradient derivation; it is my assumption that it has the same properties.

Later, Gorrell and Webb realized the formulation looked like it was solving an SVD problem (again before the Netflix Prize). For Netflix, Webb started with the SGD derivation; he mentioned that he "went for the direct gradient method on the spur of the moment because it was faster to derive that on a napkin than to find my old paper".

The question can be asked: "how much SVD is Funk SVD?" Some on the Net call it "Funk's heuristic approach to SVD" and say "in missing data form it's a factor model, not SVD". Another says "in the case of a partially observed matrix [..] the factorized solution U V' is not truly an SVD: there is no constraint that the matrices U and V be orthogonal".

Webb does not agree. To the first two he says:

I'd personally probably just call it incremental approximate SVD. Factor model is a broad category; it's useful to retain reference to the connection to SVD even if it's no longer SVD since it helps (down the road) to understand some of the properties.

and for the last:

Any regularization, whether implicit or explicit (implicit via not over-training) likely violates orthogonality, but if the underlying algorithm is run to convergence with adequately low learning rate and no regularization, it will settle on orthogonal matrices. Look to Sanger's GHA proof for this, but it's trivially understandable as reverse inhibition removing projections of existing vectors from the dataset which means the residual gradient necessarily slides into a subspace that lacks any projection on the other vectors (i.e., orthogonal).

Once you add regularization, or any of the non-linearities that the incremental approach allows you to conveniently insert, then all bets are off. However, assuming the primary gradient is still the majority influence, you can guarantee a degree of orthogonality, ergo the "approximate SVD" depiction.

Overall, we can say there are connections between GHA, the so-called "iterative power method", the gradient solution, and SVD; it is perhaps because of this that Webb could delve into a solution without being scared off by the non-convexity of the problem.

No matter what route was taken during the invention of this method, however, we can look at the error function for recommendations and see through its gradients that SGD can minimize it, which is borne out by empirical tests. The error function is non-convex, but, as Bottou and LeCun suggest, the absence of theory should not stop us from using a method. Moreover, people have started to look more closely at the theoretical guarantees of stochastic approaches, as seen here

http://arxiv.org/pdf/1308.3509

The paper above also talks about the connection between the power method and SGD, by the way.

Note: the power method finds a single eigenvalue/eigenvector pair, namely the eigenvector for the largest eigenvalue; however, it can be used to find all the other eigenvectors by "removing" the newly found eigenvector from the matrix A (through a process called deflation) and re-running the power method on A, which finds the second eigenvector, and so on.

http://www.maths.qmul.ac.uk/~wj/MTH5110/notes/MAS235_lecturenotes1.pdf

http://heim.ifi.uio.no/~tom/powerandqrslides.pdf
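
As a minimal illustration of that note (my own sketch, assuming a symmetric matrix A): power iteration finds the leading eigenpair, deflation subtracts it, and the next run finds the following one.

```python
import numpy as np

def top_eigenpairs(A, k, n_iter=1000, seed=0):
    """Power method with deflation for a symmetric matrix A.
    Returns the k largest-magnitude eigenvalues and their eigenvectors."""
    rng = np.random.default_rng(seed)
    A = A.astype(float).copy()
    eigvals, eigvecs = [], []
    for _ in range(k):
        v = rng.standard_normal(A.shape[0])
        for _ in range(n_iter):                  # power iteration
            v = A @ v
            v /= np.linalg.norm(v)
        lam = v @ A @ v                          # Rayleigh quotient
        eigvals.append(lam)
        eigvecs.append(v)
        A -= lam * np.outer(v, v)                # deflation: remove the found pair
    return np.array(eigvals), np.array(eigvecs)
```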
