One way of deciding how to run variational MLE is to look at how the experts do it.
In Blei's LDA code (http://www.cs.princeton.edu/~blei/lda-c/lda-c-dist.tgz), within the "run_em" function, the "lda_inference" function (inside "doc_e_step") repeatedly maximizes with respect to each $q$ distribution until convergence. After the $q$'s converge, the algorithm maximizes with respect to the parameters in "lda_mle".
The justification for this order is that by maximizing with respect to the $q$'s until convergence you get a better estimate of the expectations of hidden variables (or marginalized parameters) needed to maximize with respect to the parameters.
In standard EM, of course, the expectations you are computing are exact - which is the main difference between standard and variational EM - so this is not a concern.
From the perspective of EM as a maximization algorithm over the function $F(q,\theta)$ (www.cs.toronto.edu/~radford/ftp/emk.pdf), or equivalently as maximization of the evidence lower bound, it is not clear that maximizing over the $q$'s until convergence is best in terms of computational efficiency: the algorithm will reach a local maximum no matter the order of the maximization steps.
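To make the trade-off concrete, here is a minimal Python sketch of both schedules. It is schematic, not Blei's actual C code; `init_q`, `update_q`, `update_params`, and `elbo` are hypothetical placeholders for the model-specific computations. Either setting performs coordinate ascent on the same objective $F(q,\theta)$, so both converge to a local maximum.

```python
def variational_em(data, theta, inner_iters=None, tol=1e-6, max_em_iters=100):
    """Schematic variational EM.

    inner_iters=None : run the E-step q-updates to convergence (lda-c style).
    inner_iters=k    : take only k coordinate-ascent steps on the q's before
                       updating theta; still ascends the same bound F(q, theta).
    """
    q = init_q(data)                                # hypothetical initializer
    old_bound = -float("inf")
    for _ in range(max_em_iters):
        # E-step: maximize F over q with theta held fixed.
        if inner_iters is None:
            prev = -float("inf")
            while True:
                q = update_q(data, q, theta)        # one pass over all q's
                bound = elbo(data, q, theta)
                if bound - prev < tol:              # inner loop converged
                    break
                prev = bound
        else:
            for _ in range(inner_iters):
                q = update_q(data, q, theta)
        # M-step: maximize F over theta with q held fixed.
        theta = update_params(data, q)
        bound = elbo(data, q, theta)
        if bound - old_bound < tol:                 # F never decreases
            break
        old_bound = bound
    return q, theta
```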
The problem of finding the hyperparameters is called evidence approximation. It is explained nicely in Bishop's book (page 166), and in great detail in this paper.
The idea is that your problem has the canonical form (the predictive distribution for a new sample),
$$
p(t|\mathbf{t}) = \int p(t|\mathbf{w},\beta)\, p(\mathbf{w}|\mathbf{t},\alpha,\beta)\, p(\alpha,\beta|\mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta
$$
where $\mathbf{t}$ is your training data, $\alpha,\beta$ are the hyperparameters (in Bishop's notation, the prior precision and the noise precision), and $\mathbf{w}$ are your weights.
First, computing this integral is expensive, maybe even intractable, and it carries an additional difficulty: the term $p(\alpha,\beta|\mathbf{t})$. This term tells us that we need to integrate over an ensemble of interpolators, one per hyperparameter setting. In practice this means that you would train each member of the ensemble, that is, evaluate each $p(\mathbf{t}|\alpha,\beta)$, and then, using Bayes' theorem,
$$
p(\alpha,\beta|\mathbf{t}) \propto p(\mathbf{t}|\alpha,\beta) p(\alpha,\beta)
$$
you could weight each member accordingly and finally sum (integrate) over all of them.
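To see what this ensemble sum looks like, here is a toy numerical sketch for Bayesian linear regression, where the log evidence $\ln p(\mathbf{t}|\alpha,\beta)$ has a closed form (Bishop, eq. 3.86). The grid, the flat prior, and the synthetic data are all assumptions made purely for illustration.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Closed-form log marginal likelihood ln p(t|alpha, beta) for
    Bayesian linear regression (Bishop, eq. 3.86)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi        # posterior precision
    m = beta * np.linalg.solve(A, Phi.T @ t)          # posterior mean of w
    E = 0.5 * beta * np.sum((t - Phi @ m) ** 2) + 0.5 * alpha * m @ m
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

# Toy "ensemble" over a hyperparameter grid, with a flat prior p(alpha, beta):
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
t = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

alphas, betas = np.logspace(-3, 2, 20), np.logspace(-1, 3, 20)
log_w = np.array([[log_evidence(Phi, t, a, b) for b in betas] for a in alphas])
w = np.exp(log_w - log_w.max())
w /= w.sum()                      # p(alpha, beta | t) evaluated on the grid
print(w.max())                    # most of the mass sits on a few grid points
```

The last line already hints at the peakedness assumption that the evidence framework makes next.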
The evidence framework assumes (the referred paper gives validity conditions for this assumption) that $p(\alpha,\beta|\mathbf{t})$ has a dominant peak at some values $\hat{\alpha},\hat{\beta}$. Under this assumption you replace the integral over the hyperparameters by a point estimate at the peak, namely,
$$
p(t|\mathbf{t}) \approx \int p(t|\mathbf{w},\hat{\beta})\, p(\mathbf{w}|\mathbf{t},\hat{\alpha},\hat{\beta})\, d\mathbf{w}
$$
If the prior is relatively flat, the problem of finding $\hat{\alpha}$ and $\hat{\beta}$ reduces to maximizing the marginal likelihood $p(\mathbf{t}|\alpha,\beta)$. In your case this marginal likelihood has a closed-form solution (it is also Gaussian).
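Bishop (section 3.5.2) gives a simple iterative re-estimation scheme for exactly this maximization. Below is a minimal sketch of it, reusing `numpy` and the `Phi`, `t` from the previous snippet; the starting values and iteration count are arbitrary.

```python
def maximize_evidence(Phi, t, alpha=1.0, beta=1.0, iters=100):
    """Re-estimate (alpha, beta) to maximize the evidence p(t|alpha, beta)
    (Bishop, eqs. 3.92 and 3.95)."""
    N, M = Phi.shape
    eig = np.linalg.eigvalsh(Phi.T @ Phi)          # eigenvalues of Phi^T Phi
    for _ in range(iters):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m = beta * np.linalg.solve(A, Phi.T @ t)
        # gamma = effective number of well-determined parameters
        gamma = np.sum(beta * eig / (alpha + beta * eig))
        alpha = gamma / (m @ m)
        beta = (N - gamma) / np.sum((t - Phi @ m) ** 2)
    return alpha, beta, m

alpha_hat, beta_hat, m_hat = maximize_evidence(Phi, t)
```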
P.S. In statistics this method is known as empirical Bayes. If you google for it, you will find plenty of references. I find this one really nice, since it works through easier problems in detail and carefully introduces all the necessary terms.
Best Answer
At least for (loopy) Belief Propagation (BP), the formula used to compute the partition function only holds at the convergence point. In the following I am talking about BP and the Bethe approximation, but similar statements hold for Generalized Belief Propagation (GBP) and the Kikuchi approximation; Expectation Propagation (EP) is a special case of GBP (see one of the references).
Reasoning
Yedidia et al. showed that the stationary points of the Bethe approximation to the free energy are exactly the fixed points of the BP algorithm.
The Bethe approximation is a variational approximation to the true free energy of the problem (negative logarithm of the partition function).
I suppose you use exactly this correspondence to calculate your likelihood. The formula is a sum of three terms involving the entropies of the variables, the local entropies of the factor nodes, and the local expectations of the factor nodes. If this is the case, then it is unclear how the value you get by applying this formula to a non-converged BP run relates to the true likelihood.
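For concreteness, that three-term expression is the Bethe free energy
$$
F_{\text{Bethe}} = \sum_a \sum_{\mathbf{x}_a} b_a(\mathbf{x}_a)\left[\ln b_a(\mathbf{x}_a) - \ln f_a(\mathbf{x}_a)\right] - \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i),
$$
with $\ln Z \approx -F_{\text{Bethe}}$. Here is a small Python sketch that evaluates it from beliefs, assuming dense arrays for the belief and factor tables (all names are illustrative):

```python
import numpy as np

def bethe_log_z(factor_beliefs, var_beliefs, factors, degrees):
    """Bethe approximation to ln Z, evaluated from BP beliefs.

    factor_beliefs : list of arrays b_a(x_a), one per factor
    var_beliefs    : list of arrays b_i(x_i), one per variable
    factors        : list of (positive) factor tables f_a(x_a)
    degrees        : list of ints d_i = number of factors touching variable i

    Only meaningful at a BP fixed point: before convergence the beliefs
    violate the consistency constraints of the variational problem.
    """
    eps = 1e-300                                   # guard against log(0)
    # local expectations and entropies of the factor nodes
    f_term = sum(np.sum(b * (np.log(b + eps) - np.log(f + eps)))
                 for b, f in zip(factor_beliefs, factors))
    # entropies of the variables, counted with multiplicity d_i - 1
    v_term = sum((d - 1) * np.sum(b * np.log(b + eps))
                 for b, d in zip(var_beliefs, degrees))
    return -(f_term - v_term)                      # ln Z ~= -F_Bethe
```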
As Yedidia et al. point out, in general the intermediate (non-converged) states of BP do not correspond to proper beliefs and are thus some kind of garbage. (Note added: those aren't proper beliefs even when converged; see the next paragraph.)
Addendum
I have a bit more experience now. Before convergence, some of the constraints of the variational problem are violated, so the current message values describe a "distribution" that is even less consistent than what the usual constraints guarantee (remember that for EP only the expected values have to match, which means we do not even need to be within the marginal polytope). This means the calculated log-likelihood is even less of a true lower bound on the true log-likelihood.
So you should really take the value at convergence.
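In code terms, this just means gating the free-energy evaluation behind a convergence check. A trivial sketch (reusing `numpy as np`); `bp_sweep` and `bethe_log_z_fn` are assumed callables, the latter forming beliefs from the final messages and applying the formula above:

```python
def run_bp_then_log_z(messages, bp_sweep, bethe_log_z_fn, tol=1e-8, max_sweeps=1000):
    """Only read off the Bethe ln Z once the messages have stopped changing."""
    for _ in range(max_sweeps):
        new = bp_sweep(messages)                   # one full update round
        change = max(np.max(np.abs(n - o)) for n, o in zip(new, messages))
        messages = new
        if change < tol:                           # fixed point reached
            return bethe_log_z_fn(messages)
    raise RuntimeError("BP did not converge; the Bethe ln Z is not meaningful")
```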
References
Koller, Daphne and Friedman, Nir, Probabilistic Graphical Models: Principles and Techniques
The seminal papers by Yedidia, Freeman and Weiss; "Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms" is a good overview
Welling, Max, Minka, Thomas P. and Teh, Yee Whye, "Structured Region Graphs: Morphing EP into GBP", 2012