Yes, you can say that the algorithm is converging: the objective value is improving on average across iterations, even though individual stochastic steps are noisy.
The trickiest part of stochastic gradient descent (SGD) is the learning rate. Common schedules are 1/t, 1/sqrt(t), and D/(G*sqrt(t)), where t is the iteration number, D is the diameter of your feasible set, and G is an upper bound on the norm of the gradients. You should experiment with these; you can even hold out part of your data to cross-validate the learning rate.
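As a rough sketch (not a recommendation), those schedules look like this in code; `D` and `G` are placeholders you would have to estimate for your own problem:

```python
import numpy as np

# Sketch of the three schedules above. D bounds the diameter of the
# feasible set, G bounds the gradient norms; both are problem-dependent
# constants, set to 1.0 here purely for illustration.
def lr_inv(t):
    return 1.0 / t

def lr_inv_sqrt(t):
    return 1.0 / np.sqrt(t)

def lr_diameter(t, D=1.0, G=1.0):
    return D / (G * np.sqrt(t))

for t in (1, 10, 100, 1000):
    print(t, lr_inv(t), lr_inv_sqrt(t), lr_diameter(t))
```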
Another thing you can try is mini-batch SGD. In this variant, instead of using a single data point to compute the gradient, you use a small batch of points, say 10 or 20. This way, the objective-versus-trial plot (the first plot) will be smoother and will look more like the second plot you have. A minimal sketch of the loop follows.
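Here is mini-batch SGD on a toy least-squares problem; the data, batch size of 20, and 1/sqrt(t) schedule are illustrative choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # toy features
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
batch_size = 20
for t in range(1, 2001):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)  # mean-squared-error gradient
    w -= (0.1 / np.sqrt(t)) * grad                    # 1/sqrt(t) schedule

print(w)  # should land close to [1, -2, 0.5, 3, 0]
```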
The applicability of batch or stochastic gradient descent really depends on the error manifold you expect.
Batch gradient descent computes the gradient using the whole dataset. This is great for convex or relatively smooth error manifolds, where we move fairly directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed learning rate, will eventually find the minimum located in its basin of attraction.
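For contrast with the mini-batch sketch above, here is a minimal batch version on the same kind of toy least-squares problem; note that every step touches the entire dataset, and the annealed rate here is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
for t in range(1, 501):
    grad = (2.0 / len(X)) * X.T @ (X @ w - y)  # gradient over the WHOLE dataset
    w -= (0.5 / np.sqrt(t)) * grad             # annealed learning rate

print(w)  # converges smoothly, no noise in the trajectory
```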
Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons that will be explained a bit later. SGD works well (well is relative; better than batch gradient descent, at least) for error manifolds that have lots of local maxima/minima. In this case, the somewhat noisier gradient calculated from the reduced number of samples tends to jerk the model out of local minima into a region that is hopefully more optimal. Single samples are really noisy, while minibatches average a little of the noise out, so the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch size is small enough to avoid some of the poor local minima, but large enough that it doesn't skip over the global minimum or better-performing local minima. (Incidentally, this assumes that the best minima have larger and deeper basins of attraction, and are therefore easier to fall into.)
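The averaging effect is easy to see numerically: the spread of minibatch gradient estimates around the full-dataset gradient shrinks roughly like 1/sqrt(batch size). A quick illustration on a toy problem (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=10000)
w = np.zeros(5)

full_grad = (2.0 / len(X)) * X.T @ (X @ w - y)  # full-dataset gradient

for b in (1, 10, 100, 1000):
    errs = []
    for _ in range(200):
        idx = rng.choice(len(X), size=b, replace=False)
        g = (2.0 / b) * X[idx].T @ (X[idx] @ w - y[idx])
        errs.append(np.linalg.norm(g - full_grad))
    # error drops by roughly sqrt(10) per tenfold batch-size increase
    print(f"batch size {b:4d}: mean gradient error {np.mean(errs):.3f}")
```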
One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Instead, each sample or batch of samples must be loaded, processed, its results stored, and so on. The minibatch in minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable.
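In code, that "load, process, store" pattern usually means streaming minibatches rather than vectorizing over the full array. The sketch below fakes the dataset in memory to stay self-contained; in practice `X` and `y` would be out-of-core arrays, for example `np.memmap` views onto files too large for RAM (the file names and shapes in the comment are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# In practice X and y would live on disk, e.g.
#   X = np.memmap("X.dat", dtype=np.float32, mode="r", shape=(n, d))
# so that indexing X[idx] reads only the rows we touch.
n, d = 100_000, 20
X = rng.normal(size=(n, d)).astype(np.float32)
y = (X @ rng.normal(size=d)).astype(np.float32)

w = np.zeros(d, dtype=np.float32)
for t in range(1, 1001):
    idx = rng.integers(0, n, size=256)  # only this slice is materialized
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / len(idx)) * Xb.T @ (Xb @ w - yb)
    w -= (0.1 / np.sqrt(t)) * grad
```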
Usually, this computational advantage is leveraged by performing many more iterations of SGD, taking many more steps than conventional batch gradient descent. This usually results in a model that is very close to the one that would be found via batch gradient descent, or better.
The way I like to think of how SGD works is to imagine that I have one point that represents my input distribution. My model is attempting to learn that input distribution. Surrounding it is a shaded area that represents the input distributions of all the possible minibatches I could sample. It's usually a fair assumption that these minibatch input distributions are close to the true input distribution. Batch gradient descent, at every step, takes the steepest route toward the true input distribution. SGD, on the other hand, picks a random point within the shaded area and takes the steepest route toward that point, choosing a new point at each iteration. The average of all these steps approximates the true input distribution, usually quite well.
Stochastic Gradient Descent is preceded by Stochastic Approximation, as first described by Robbins and Monro in their paper, *A Stochastic Approximation Method*. Kiefer and Wolfowitz subsequently published their paper, *Stochastic Estimation of the Maximum of a Regression Function*, which is more recognizable to people familiar with the ML variant of Stochastic Approximation (i.e., Stochastic Gradient Descent), as pointed out by Mark Stone in the comments. The 1960s saw plenty of research along that vein -- Dvoretzky, Powell, and Blum all published results that we take for granted today. It is a relatively minor leap to get from the Robbins and Monro method to the Kiefer and Wolfowitz method, and merely a reframing of the problem to then arrive at Stochastic Gradient Descent (for regression problems). The above papers are widely cited as the antecedents of Stochastic Gradient Descent, as mentioned in the review paper by Bottou, Curtis, and Nocedal, which provides a brief historical perspective from a machine learning point of view.
I believe that Kushner and Yin, in their book *Stochastic Approximation and Recursive Algorithms and Applications*, suggest that the notion had been used in control theory as far back as the 1940s, but I don't recall whether they had a citation for that or whether it was anecdotal, nor do I have access to their book to confirm it.