I would suggest having some held-out data that forms a validation dataset. You can compute your loss function on the validation dataset periodically (it would probably be too expensive after each iteration, so after each epoch seems to make sense) and stop training once the validation loss has stabilized.
If you're in a purely online setting where you don't have any data ahead of time, I suppose you could compute the average loss over the examples in each epoch and wait for that average to converge, but of course that risks overfitting...
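To make the held-out idea concrete, here is a minimal sketch of per-epoch validation with early stopping; the toy data, learning rate, and patience threshold are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + 1 with noise, split into train and held-out validation.
X = rng.uniform(-1, 1, size=500)
y = 2 * X + 1 + rng.normal(0, 0.1, size=500)
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

w, b = 0.0, 0.0
lr = 0.05
best_val, patience, bad_epochs = np.inf, 3, 0

for epoch in range(200):
    # One SGD pass over the training data.
    for xi, yi in zip(X_train, y_train):
        err = (w * xi + b) - yi
        w -= lr * err * xi
        b -= lr * err

    # Validation loss once per epoch (cheaper than once per iteration).
    val_loss = np.mean((w * X_val + b - y_val) ** 2)
    if val_loss < best_val - 1e-6:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:  # validation loss has stabilized
        break
```

The patience counter is one common way to decide the loss has "stabilized": stop once it has failed to improve for a few consecutive epochs.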
It looks like Vowpal Wabbit (an online learning system that implements SGD amongst other optimizers) uses a technique called Progressive Cross-Validation which is similar to using a holdout set, but allows you to use more data while training the model, see:
http://hunch.net/~jl/projects/prediction_bounds/progressive_validation/coltfinal.pdf
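The core of progressive validation is simple: score each example *before* training on it, then fold that loss into a running average, so every example serves as both test and training data. A sketch with a toy learner (not VW's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stream of examples: y = 3x with noise.
xs = rng.uniform(-1, 1, size=2000)
ys = 3 * xs + rng.normal(0, 0.1, size=2000)

w = 0.0
lr = 0.1
progressive_sq_loss = 0.0

for x, y in zip(xs, ys):
    pred = w * x                         # predict BEFORE training on this example
    progressive_sq_loss += (pred - y) ** 2
    w -= lr * (pred - y) * x             # then do the SGD update on it

avg_progressive_loss = progressive_sq_loss / len(xs)
```

Every example contributes to the loss estimate while it was still unseen, which is what makes the bound in the paper above work out.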
Vowpal Wabbit has an interesting approach: it computes error metrics after each example, but prints the diagnostics with an exponential backoff, so at first you get frequent updates (to help diagnose early problems) and then less frequent updates as time goes on.
Vowpal Wabbit displays two error metrics, the average progressive loss overall, and the average progressive loss since the last time the diagnostics were printed. You can read some details about the VW diagnostics below:
https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial#vws-diagnostic-information
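A simplified sketch of that exponential-backoff reporting scheme (real VW output has more columns and different formatting; this only shows the two averages described above):

```python
def run_stream(losses):
    """Report (example count, average loss overall, average loss since
    last report) at example counts 1, 2, 4, 8, ..."""
    total, since_last, n_since = 0.0, 0.0, 0
    next_print = 1
    rows = []
    for t, loss in enumerate(losses, start=1):
        total += loss
        since_last += loss
        n_since += 1
        if t == next_print:
            rows.append((t, total / t, since_last / n_since))
            since_last, n_since = 0.0, 0
            next_print *= 2  # exponential backoff
    return rows

rows = run_stream([1.0] * 100)
for t, avg, avg_since in rows:
    print(t, avg, avg_since)
```

Comparing the "since last" column against the overall column is a quick way to see whether the model is still improving on recent examples.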
This is a good question that unfortunately went unanswered for a long time. It seems a partial answer was given here just a couple of months after you asked, which basically argues that correlation is useful when the outputs are very noisy, and MSE otherwise. First of all, let's look at the formulas for both.
$$MSE(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^n(y_i - \hat{y_i})^2$$
$$R(y, \hat{y}) = \frac{\sum_{i=1}^n (y_i - \bar{y})(\hat{y_i} - \hat{\bar{y}})}
{\sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2} \sqrt{\sum ^n _{i=1}(\hat{y_i} - \hat{\bar{y}})^2}} $$
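Plugging a small made-up example into both formulas (and checking the hand-rolled $R$ against `np.corrcoef`):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.5])

# MSE: mean squared distance from the diagonal y = y_hat.
mse = np.mean((y - y_hat) ** 2)

# Pearson correlation, written out term by term from the formula above.
yc = y - y.mean()
pc = y_hat - y_hat.mean()
r = (yc * pc).sum() / np.sqrt((yc ** 2).sum() * (pc ** 2).sum())
```
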
A few things to note: in the case of linear regression we know that $\hat{\bar{y}} = \bar{y}$ because the regressor is unbiased, so the formula simplifies a little, but in general we can't make this assumption about ML algorithms. More broadly, it is interesting to think of the scatter plot in $\mathbb{R}^2$ of the pairs $\{ (y_i, \hat{y_i}) \}$: correlation tells us how strong the linear relationship between the two is in this plot, while MSE tells us how far from the diagonal each point is. Looking at the counterexamples on the Wikipedia page for correlation, you can see there are many relationships between the two that won't be captured.
I think correlation generally tells us similar things to $R^2$, but with directionality, so correlation is somewhat more descriptive in that case. Under another interpretation, $R^2$ doesn't rely on a linearity assumption and merely tells us the percentage of the variation in $y$ that is explained by our model; in other words, it compares the model's predictions to the naive baseline of guessing the mean for every point. The formula for $R^2$ is:
$$R^2(y,\hat{y}) = 1 - \frac{\sum_{i=1}^n (y_i-\hat{y_i})^2}{\sum_{i=1}^n (y_i-\bar{y})^2}$$
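On the same sort of made-up example, this $R^2$ and the squared correlation $R^2 = R \cdot R$ need not agree for an arbitrary predictor (they coincide for OLS with an intercept, which is why the two uses of the symbol get conflated):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 4.5])

# R^2 as 1 - SS_res / SS_tot, straight from the formula above.
ss_res = ((y - y_hat) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot            # 0.85

# Squared Pearson correlation: a different number for this predictor.
r_squared = np.corrcoef(y, y_hat)[0, 1] ** 2
```
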
So how does $R$ compare to $R^2$? It turns out that $R$ is more robust to rescaling of one of its inputs: $R^2$ is homogeneous of degree 0 only in both inputs jointly, whereas $R$ is homogeneous of degree 0 in either input separately. It's a little less clear what this implies for machine learning, but it might mean that the model class of $\hat{y}$ can be a bit more flexible under correlation. That said, under some additional assumptions the two measures are equal, and you can read more about it here: http://www.win-vector.com/blog/2011/11/correlation-and-r-squared/.
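The homogeneity claim is easy to check numerically: rescaling $\hat{y}$ alone leaves $R$ untouched but can destroy $R^2$ (toy numbers again):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

def r2(y, p):
    return 1 - ((y - p) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# R is homogeneous of degree 0 in y_hat alone: scaling changes nothing.
r_before = np.corrcoef(y, y_hat)[0, 1]
r_after = np.corrcoef(y, 10 * y_hat)[0, 1]

# R^2 is not: the rescaled predictions are far from y, so it collapses.
r2_before = r2(y, y_hat)
r2_after = r2(y, 10 * y_hat)
```
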
Finally, a last important thing to note is that $R$ and $R^2$ do not measure goodness of fit around the $y=\hat{y}$ line. It is possible (although odd) for a predictor to be linearly shifted away from the $y=\hat{y}$ line and still have an $R^2$ of one, but its predictions would still be "bad". In this case MSE would be more informative than $R^2$ for finding the better predictor, though this is perhaps more of a pathological case than a real issue with using $R$ and $R^2$ as metrics.
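A concrete instance of that pathological case, with made-up numbers: a predictor shifted off the diagonal has perfect correlation but a large MSE.

```python
import numpy as np

y = np.arange(10.0)
y_hat = y + 5.0        # perfectly correlated with y, but shifted off y = y_hat

r = np.corrcoef(y, y_hat)[0, 1]    # correlation of 1
mse = np.mean((y - y_hat) ** 2)    # 25.0: MSE exposes the constant shift
```
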
Best Answer
I think "decoupled function" is meant in the sense of this preprint: https://arxiv.org/pdf/1805.08479.pdf where they discuss multivariate functions being "decoupled" into the sum of univariate functions.
If you have $N$ data points, your loss function should be written as the sum (or average) of $N$ univariate functions, which are functions only of the predicted value at that point. For example, sum of squared errors is decoupled, since the total loss is just the sum of the loss at each point.
This would be violated if you have some loss function which considers explicit dependence between the observations. The condition is sort of like assuming your data are IID - it's the IID assumption which allows us to work with decoupled loss functions.
For example, suppose you tried to build a model with this loss function: $$\sum_i\sum_j (\hat{y_i} - \hat{y_j})^2,$$ which minimizes the differences between all predictions and says nothing else about what they should be. In order to calculate the gradient of the loss with respect to $\hat{y_i}$, you'd also need to know the value of every $\hat{y_j}$. So per-example backprop isn't possible.
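To see the coupling concretely, compare the per-example gradients on some made-up predictions; for the coupled loss above, the analytic derivative works out to $\partial L/\partial \hat{y}_k = 4(n\hat{y}_k - \sum_j \hat{y}_j)$:

```python
import numpy as np

y = np.array([1.5, 2.5, 3.5])
y_hat = np.array([1.0, 2.0, 4.0])
n = len(y_hat)

# Decoupled loss (sum of squared errors): the gradient for example i
# depends only on example i.
grad_decoupled = 2 * (y_hat - y)

# Coupled loss sum_i sum_j (y_hat[i] - y_hat[j])**2: the gradient for
# example k drags in every other prediction via the sum term.
grad_coupled = 4 * (n * y_hat - y_hat.sum())
```

Each entry of `grad_decoupled` can be computed from a single example, while every entry of `grad_coupled` requires the whole batch of predictions at once.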