The applicability of batch or stochastic gradient descent really depends on the error manifold expected.
Batch gradient descent computes the gradient using the whole dataset. This is great for convex or relatively smooth error manifolds: the update moves fairly directly towards an optimum solution, either local or global. Additionally, batch gradient descent, given an annealed learning rate, will eventually find the minimum located in its basin of attraction.
Stochastic gradient descent (SGD) computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples, for reasons explained a bit later. SGD works well (not well exactly, but better than batch gradient descent) for error manifolds that have lots of local minima. In this case, the somewhat noisier gradient calculated from the reduced number of samples tends to jerk the model out of local minima into a region that is hopefully more optimal. Single samples are really noisy, while minibatches tend to average a little of the noise out, so the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch size is small enough to escape some of the poor local minima, but large enough that it doesn't escape the global minimum or the better-performing local minima. (Incidentally, this assumes that the best minima have a larger and deeper basin of attraction, and are therefore easier to fall into.)
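You can see the noise-averaging effect of minibatches directly in a quick simulation. This is a minimal sketch on a made-up linear-regression problem (all names and sizes are illustrative, not from the answer above): it measures how far the minibatch gradient strays from the full-batch gradient as the minibatch grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression setup; all names and sizes here are illustrative.
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # current model parameters

def gradient(Xb, yb, w):
    """Mean-squared-error gradient on a batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = gradient(X, y, w)

# Average distance between the minibatch gradient and the full-batch
# gradient: the noise shrinks roughly like 1/sqrt(batch size).
mean_err = {}
for batch_size in (1, 10, 100, 1000):
    errs = []
    for _ in range(200):
        idx = rng.choice(n, size=batch_size, replace=False)
        errs.append(np.linalg.norm(gradient(X[idx], y[idx], w) - full_grad))
    mean_err[batch_size] = float(np.mean(errs))

for b, e in mean_err.items():
    print(b, round(e, 3))
```

The printed errors fall as the minibatch grows, which is exactly the "reduced jerk" trade-off described above.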
One benefit of SGD is that it's computationally much faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient: each sample or batch of samples must be loaded, processed, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable.
Usually, this computational advantage is leveraged by performing many more iterations of SGD, making many more steps than conventional batch gradient descent. This usually results in a model that is very close to that which would be found via batch gradient descent, or better.
The way I like to think of how SGD works is to imagine that I have one point that represents my input distribution. My model is attempting to learn that input distribution. Surrounding the input distribution is a shaded area that represents the input distributions of all of the possible minibatches I could sample. It's usually a fair assumption that the minibatch input distributions are close to the true input distribution. Batch gradient descent, at all steps, takes the steepest route to reach the true input distribution. SGD, on the other hand, chooses a random point within the shaded area, and takes the steepest route towards this point. At each iteration, though, it chooses a new point. The average of all of these steps will approximate the true input distribution, usually quite well.
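This picture can be checked with a quick simulation: because the minibatch gradient is an unbiased estimate of the full-batch gradient, averaging many random minibatch steps recovers (approximately) the full-batch step. The problem below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative regression problem; the full-batch gradient step plays the
# role of "the steepest route to the true input distribution" above.
n, d = 5_000, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

def gradient(Xb, yb, w):
    """Mean-squared-error gradient on a batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full_step = -gradient(X, y, w)

# Each minibatch step aims at a random point in the "shaded area";
# their average concentrates on the full-batch step.
steps = []
for _ in range(2_000):
    idx = rng.choice(n, size=32, replace=False)
    steps.append(-gradient(X[idx], y[idx], w))
avg_step = np.mean(steps, axis=0)

rel_err = np.linalg.norm(avg_step - full_step) / np.linalg.norm(full_step)
print(rel_err)  # small: the average step approximates the batch step
```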
I think it is usually a matter of how easy it is to work out the gradient of the smooth part of the function and/or the proximal operator of the penalty.
Sometimes it is a lot simpler to find an exact solution of the problem for one single variable (or a block of variables) than it is to work it out for all variables simultaneously. Other times it is just too expensive to compute the full gradient compared to the individual derivatives. Also, the convergence rate of coordinate descent is the same as for ISTA, $1/k$, where $k$ is the number of iterations (FISTA improves this to $1/k^2$), but coordinate descent may sometimes perform better than both ISTA and FISTA in practice; see e.g. http://statweb.stanford.edu/~tibs/comparison.txt.
Such things will influence the choice of coordinate descent vs. ISTA/FISTA, for instance.
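To make the comparison concrete, here is a minimal sketch of both approaches on a small made-up lasso problem (all names and sizes are illustrative): ISTA takes a gradient step on the smooth part followed by the proximal operator of the $\ell_1$ penalty, while coordinate descent solves each one-variable subproblem exactly via soft-thresholding.

```python
import numpy as np

rng = np.random.default_rng(2)

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Small lasso problem:  min_w  0.5 * ||y - X w||^2 + lam * ||w||_1
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=n)
lam = 1.0

def lasso_obj(w):
    return 0.5 * np.sum((y - X @ w) ** 2) + lam * np.sum(np.abs(w))

# ISTA: gradient step on the smooth part, then the prox of the penalty.
L_lip = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
w_ista = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w_ista - y)
    w_ista = soft_threshold(w_ista - grad / L_lip, lam / L_lip)

# Coordinate descent: exact 1-D minimization for each coordinate in turn.
w_cd = np.zeros(d)
col_sq = np.sum(X ** 2, axis=0)
for _ in range(50):
    for j in range(d):
        r = y - X @ w_cd + X[:, j] * w_cd[j]   # residual excluding coordinate j
        w_cd[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]

print(lasso_obj(w_ista), lasso_obj(w_cd))
```

Both reach essentially the same objective value here; which one is cheaper per iteration depends on exactly the gradient-vs-individual-derivative costs discussed above.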
Your question mentions that a diagonal $P$ gives you a parameter-dependent learning rate, which is like normalizing your input, except it assumes that your input has diagonal covariance.
In general, using a preconditioning matrix $P$ is equivalent to normalizing $x$. Let the covariance of $x$ be $\Sigma = L L^T$; then $\tilde{x} = L^{-1}x$ is the normalized version of $x$.
$$ \Delta \tilde{x} = \nabla_{\tilde{x}} F(L\tilde{x}) = L^T \nabla_x F(x) $$
so
$$ \Delta x = L \Delta \tilde{x} = LL^{T} \nabla_{x} F(x) = \Sigma \nabla_{x} F(x) $$
Doing this makes your objective (more) isotropic in the parameter space. It's the same as a parameter-dependent learning rate, except that your axes don't necessarily line up with the coordinates.
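A quick numerical check of this equivalence (the quadratic objective, $\Sigma$, and the step size below are all made up for illustration): plain gradient descent on the normalized variable $\tilde{x}$ traces exactly the same trajectory as $\Sigma$-preconditioned descent on $x$.

```python
import numpy as np

# Anisotropic quadratic objective; Sigma plays the role of the covariance above.
Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])
L = np.linalg.cholesky(Sigma)            # Sigma = L L^T
A = np.linalg.inv(Sigma)                 # curvature of F, chosen so Sigma = H^{-1}
F_grad = lambda x: A @ x                 # gradient of F(x) = 0.5 * x^T A x

lr = 0.5
x = np.array([3.0, -2.0])                # raw iterate
x_t = np.linalg.solve(L, x)              # normalized iterate  x_tilde = L^{-1} x

for _ in range(10):
    x = x - lr * Sigma @ F_grad(x)           # preconditioned step on x
    x_t = x_t - lr * L.T @ F_grad(L @ x_t)   # plain gradient step on x_tilde

# The two trajectories agree: x == L @ x_tilde at every iteration.
print(np.allclose(x, L @ x_t))
```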
Here's an image where you can see a situation where you would need one learning rate on the line $y = x$, and another on the line $y=-x$, and how the transformation $L = ( \sigma_1 + \sigma_3 ) \operatorname{diag}(1, \sqrt{10})$ solves that problem.
Another way you could look at this is that Newton's method would give you an optimization step: $$ x_{n+1} = x_n - \gamma_n [Hf|_{x_n}]^{-1} \nabla f(x_n) $$ And approximating the Hessian as constant near the minimum with $P \approx Hf|_{x^\star} $ brings you closer to the fast convergence provided by Newton's method, without having to calculate the Hessian or make the more computationally expensive approximations of the Hessian that you would see in quasi-Newton methods.
Note that for a normal distribution, the Hessian of the log-loss is $ H = \Sigma^{-1} $, and these two perspectives are equivalent.
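For a concrete check of the Gaussian case (the numbers are illustrative): with $f(\mu) = \frac{1}{2}(\mu - m)^T \Sigma^{-1} (\mu - m)$, the Hessian is $\Sigma^{-1}$, so preconditioning with $P = \Sigma = H^{-1}$ reproduces a Newton step and lands on the minimizer at once.

```python
import numpy as np

# Gaussian log-loss in the mean parameter (illustrative 2-D example):
# f(mu) = 0.5 * (mu - m)^T Sigma_inv (mu - m),  so  H = Sigma^{-1}.
Sigma = np.array([[5.0, 2.0], [2.0, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
m = np.array([1.0, 2.0])
grad = lambda mu: Sigma_inv @ (mu - m)

mu = np.array([10.0, -10.0])

# Plain gradient step: progress is limited by the conditioning of Sigma_inv.
plain = mu - grad(mu)

# Preconditioned step with P = H^{-1} = Sigma: one exact Newton step.
precond = mu - Sigma @ grad(mu)

print(precond)  # lands exactly on the minimizer m
```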