Extending @Dikran Marsupial's answer....
Anna Choromanska and her colleagues in Yann LeCun's group at NYU address this in their AISTATS paper "The Loss Surfaces of Multilayer Networks" (Choromanska et al., 2015). Using random matrix theory, along with some experiments, they argue that:
For large-size networks, most local minima are equivalent and yield similar performance on a test set.
The probability of finding a "bad" (high value) local minimum is non-zero for small-size networks and decreases quickly with network size.
Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting.
[From page 2 of the paper]
On this view, there is not much reason to deploy heavyweight approaches for finding the global minimum. That time would be better spent trying out new network topologies, features, data sets, etc.
That said, lots of people have thought about augmenting or replacing SGD. For fairly small networks (by contemporary standards), these improved metaheuristics do seem to do something: Mavrovouniotis and Yang (2016) show that ant colony optimization plus backprop beats unmodified backprop on several benchmark data sets (albeit not by much). Rere et al. (2015) use simulated annealing to train a CNN and find that it initially performs better on the validation set. After 10 epochs, however, only a very small (and not-tested-for-significance) difference in performance remains. The faster per-epoch convergence is also offset by a dramatically larger amount of computation per epoch, so this is not an obvious win for simulated annealing.
It is possible that these heuristics simply do a better job of initializing the network, and once it has been pointed down the right path, any optimizer will do. Sutskever et al. (2013) from Geoff Hinton's group argue something like this in their 2013 ICML paper.
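To make the "heuristic search for a starting point, then hand off to a standard optimizer" idea concrete, here is a minimal, purely illustrative Python sketch. The toy loss function, perturbation scale, cooling schedule, and step sizes are all invented for illustration; none of them come from the papers cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, rugged "loss surface" standing in for a network's training loss.
def loss(w):
    return np.sum(w ** 2 + 2.0 * np.sin(3.0 * w))

def grad(w):
    return 2.0 * w + 6.0 * np.cos(3.0 * w)

# Stage 1: a crude simulated-annealing search, used only to pick a starting point.
cur = rng.uniform(-3, 3, size=10)
cur_loss = loss(cur)
best, best_loss = cur.copy(), cur_loss
temp = 1.0
for _ in range(2000):
    cand = cur + rng.normal(scale=0.5, size=cur.shape)
    cand_loss = loss(cand)
    # Metropolis rule: always accept improvements, sometimes accept worse moves.
    if cand_loss < cur_loss or rng.random() < np.exp((cur_loss - cand_loss) / temp):
        cur, cur_loss = cand, cand_loss
        if cur_loss < best_loss:
            best, best_loss = cur.copy(), cur_loss
    temp *= 0.995  # cool down slowly

# Stage 2: hand the promising starting point to plain gradient descent.
w = best.copy()
for _ in range(500):
    w -= 0.01 * grad(w)

print(best_loss, loss(w))  # loss at the heuristic start vs. after gradient fine-tuning
```

The point of the sketch is only the division of labor: the heuristic does a coarse global search, and a cheap local optimizer finishes the job from wherever the heuristic left off.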
In principle, both EM and standard optimization approaches can work for fitting mixture distributions. Like EM, general-purpose local optimization solvers will typically converge only to a local optimum, since the mixture likelihood is nonconvex. But a variety of optimization algorithms exist for seeking better solutions in the presence of multiple local optima. As far as I'm aware, which algorithm converges fastest will depend on the problem.
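For concreteness, here is a minimal numpy sketch of EM for a one-dimensional, two-component Gaussian mixture. The data, initial values, and iteration count are made up for illustration; note that each M-step uses closed-form updates, which is relevant to the next point.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two Gaussian clusters.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 700)])

# Initial guesses for mixture weights, means, and standard deviations.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(100):
    # E-step: posterior responsibility of each component for each point.
    dens = w * normal_pdf(x[:, None], mu, sigma)        # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form updates; weights, means, and variances stay valid.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(w, mu, sigma)
```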
One benefit of EM is that it naturally produces valid parameters for the mixture distribution on every iteration. In contrast, standard optimization algorithms would need constraints to be imposed. For example, say you're fitting a Gaussian mixture model. A standard nonlinear programming approach would require constraining covariance matrices to be positive semidefinite, and constraining mixture component weights to be nonnegative and sum to one.
To achieve good performance on high dimensional problems, a nonlinear programming solver typically needs to exploit the gradient. So, you'd have to either derive the gradient or compute it with automatic differentiation. Gradients are also needed for constraint functions if they don't have a standard form. Newton's method and related approaches (e.g. trust region methods) need the Hessian as well. Finite differencing or derivative-free methods could be used if the gradient is unavailable, but performance tends to scale poorly as the number of parameters increases. In contrast, EM doesn't require the gradient.
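As an illustration of both points (constraint handling and gradients), here is a hypothetical sketch that reparameterizes a two-component 1-D Gaussian mixture in unconstrained variables and uses automatic differentiation (the jax library) to obtain the gradient of the negative log-likelihood. The parameterization and variable names are my own choices, not from any particular reference.

```python
import jax
import jax.numpy as jnp

def nll(params, x):
    # Unconstrained parameters: [logit of weight, mu1, mu2, log sd1, log sd2].
    w1 = jax.nn.sigmoid(params[0])               # mixture weight in (0, 1)
    weights = jnp.stack([w1, 1.0 - w1])          # weights sum to one by construction
    mu = params[1:3]
    sigma = jnp.exp(params[3:5])                 # standard deviations stay positive
    dens = weights * jax.scipy.stats.norm.pdf(x[:, None], mu, sigma)
    return -jnp.sum(jnp.log(dens.sum(axis=1)))

grad_nll = jax.grad(nll)                         # gradient w.r.t. the 5 parameters

params0 = jnp.zeros(5)
x = jnp.array([-2.1, -1.9, 3.0, 3.2, 2.8])
print(nll(params0, x), grad_nll(params0, x))
```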
EM is conceptually intuitive, which is a great virtue. This often holds for standard optimization approaches as well. There are many implementation details, but the overall concept is simple. It's often possible to use standard optimization solvers that abstract these details away under the hood. In these cases, a user just has to supply the objective function, constraints, and gradients, and have enough working knowledge to select a solver that's well suited for the problem. But, specialized knowledge is certainly required if it gets to the point where the user has to think about or implement low-level details of the optimization algorithm.
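Continuing the sketch above, the solver call itself can be as simple as the following, using scipy.optimize.minimize as a stand-in for whatever solver suits the problem:

```python
import numpy as np
from scipy.optimize import minimize

# Wrap the jax functions so the solver sees plain floats and numpy arrays.
result = minimize(
    fun=lambda p: float(nll(jnp.asarray(p), x)),
    x0=np.zeros(5),
    jac=lambda p: np.asarray(grad_nll(jnp.asarray(p), x)),
    method="L-BFGS-B",
)
print(result.x)   # unconstrained parameters at the optimum found
```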
Another benefit of the EM algorithm is that it can be used in cases where some data values are missing.
Also of interest (including the comments):
Best Answer
If you do not carefully choose the range of the initial values for the weights, and if you do not control the range of the weight values during training, the vanishing gradient problem will occur, which is the main barrier to learning deep networks. Neural networks are trained using the gradient descent algorithm: $$w^{new} := w^{old} - \eta \frac{\partial L}{\partial w},$$ where $L$ is the loss of the network on the current training batch. Clearly, if $\frac{\partial L}{\partial w}$ is very small, learning will be very slow, since the change in $w$ will be very small. So, if the gradients vanish, learning will be extremely slow.
The reason for vanishing gradients is that during backpropagation, the gradients of early layers (layers near the input) are obtained by multiplying the gradients of later layers (layers near the output). So, for example, if the gradients of the later layers are less than one, their product shrinks toward zero very quickly.
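As a back-of-the-envelope Python illustration (the 30-layer depth is an invented example; 0.25 is the maximum value of the sigmoid derivative), multiplying many per-layer factors that are less than one drives the early-layer gradient, and hence the weight update, toward zero:

```python
import numpy as np

eta = 0.1                                   # learning rate
layer_factors = np.full(30, 0.25)           # sigmoid'(z) is at most 0.25 in each layer
early_layer_grad = np.prod(layer_factors)   # gradient reaching an early layer

print(early_layer_grad)                     # ~8.7e-19
print(eta * early_layer_grad)               # the weight update is essentially zero
```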
With these explanations in mind, here are answers to your questions: