This is indeed something often glossed over.
Some people are doing something a bit cheeky: holding out a proportion of the words in each document, and using the predictive probabilities of these held-out words given the document-topic mixtures as well as the topic-word mixtures. This is obviously not ideal, as it doesn't evaluate performance on any held-out documents.
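For concreteness, here's a minimal numpy sketch of that held-out-word evaluation. Everything is assumed to come from your own model fit: `theta` (document-topic mixtures), `phi` (topic-word mixtures) and `heldout` (the word IDs held out per document) are illustrative names, not any particular library's API.

```python
import numpy as np

# Assumed inputs (illustrative names, from your own fitted model):
#   theta:   (n_docs, n_topics) document-topic mixtures, rows sum to 1
#   phi:     (n_topics, n_vocab) topic-word mixtures, rows sum to 1
#   heldout: list of arrays; heldout[d] holds the word IDs held out from doc d

def heldout_word_perplexity(theta, phi, heldout):
    """Perplexity of held-out words under p(w | d) = sum_k theta[d, k] * phi[k, w]."""
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(heldout):
        word_probs = theta[d] @ phi[:, words]  # p(w | d) for each held-out word
        log_lik += np.log(word_probs).sum()
        n_words += len(words)
    return np.exp(-log_lik / n_words)  # lower is better
```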
To do it properly with held-out documents, as suggested, you do need to "integrate over the Dirichlet prior for all possible topic mixtures". http://people.cs.umass.edu/~wallach/talks/evaluation.pdf reviews a few methods for tackling this slightly unpleasant integral. I'm just about to try implementing this myself, in fact, so good luck!
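As a starting point, the crudest of the estimators reviewed there is plain Monte Carlo integration: sample topic mixtures from the Dirichlet prior and average the resulting document likelihoods. A sketch under those assumptions (`phi` and `alpha` come from your trained model; this is not Wallach's left-to-right algorithm, and the estimate is high-variance for long documents):

```python
import numpy as np
from scipy.special import logsumexp

def log_heldout_doc_likelihood(words, phi, alpha, n_samples=1000, seed=0):
    """Estimate log p(words | phi, alpha) for one held-out document by
    sampling theta ~ Dirichlet(alpha) from the prior and averaging."""
    rng = np.random.default_rng(seed)
    thetas = rng.dirichlet(alpha, size=n_samples)    # (n_samples, n_topics)
    log_word_probs = np.log(thetas @ phi[:, words])  # (n_samples, n_words)
    log_doc_probs = log_word_probs.sum(axis=1)       # log p(words | theta_s, phi)
    return logsumexp(log_doc_probs) - np.log(n_samples)
```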
In general, building an ARMA-GARCH model in a stepwise fashion based on diagnostics such as the ACF, PACF and the Ljung-Box test is problematic, because the latter do not have their standard null distributions when applied to returns whose conditional variance is nonconstant, or to squared returns whose conditional mean is nonconstant. Thus the following will not work exactly as you expect it to (but hopefully the distortion will not be too large and you could still trust the results to some extent):
I'm fitting an ARIMA-GARCH model to my hedge fund index daily log return series. I used the ACF, PACF, the Ljung-Box test and Archtest to check for autocorrelation and conditional heteroskedasticity. The ACF and PACF of the returns themselves don't show significant autocorrelation (though they do for the squared returns), which the Ljung-Box test with h=0 also suggests, so I exclude autocorrelation from the mean process. To double-check, I fit an ARIMA(1,0,1), and the coefficients of both the AR and MA terms are statistically insignificant. So I exclude them and go for a GARCH-only model.
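(For readers following along, a rough Python equivalent of this diagnostic workflow might look like the sketch below; `r` is assumed to be the vector of daily log returns and the lag choices are purely illustrative. Per the caveat above, the Ljung-Box p-values on raw and squared returns do not have their standard null distributions here, so read them loosely.)

```python
from statsmodels.stats.diagnostic import acorr_ljungbox, het_arch
from arch import arch_model

# r: 1-D array of daily log returns (assumed given)

# Ljung-Box on returns and squared returns (lag choice is illustrative)
print(acorr_ljungbox(r, lags=[10]))     # autocorrelation in the mean
print(acorr_ljungbox(r**2, lags=[10]))  # autocorrelation in squared returns

# Engle's ARCH-LM test for conditional heteroskedasticity
lm_stat, lm_pval, _, _ = het_arch(r, nlags=10)
print(f"ARCH-LM p-value: {lm_pval:.4f}")

# With no significant mean dynamics, fit a zero-mean GARCH(1,1)
res = arch_model(r, mean="Zero", vol="GARCH", p=1, q=1).fit(disp="off")
print(res.summary())
```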
Going forward, I did check other similar questions about GARCH lag selection (for example, here), and it seems that when the goal is prediction, it's better to choose the model with the lowest AIC rather than the lowest BIC. So I first compare the AICs and then further check using a likelihood ratio test.
AIC, BIC and the LR test all address different questions and serve different goals. You should not expect all of them to point in the same direction, and you should choose the appropriate one based on your modelling goal. If the goal is forecasting, AIC is the most relevant choice.
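If it helps, here is a hedged sketch of that comparison using Python's arch package (`r` is again the daily log return series; the zero-mean specification follows the discussion above). Note the boundary caveat in the comments:

```python
from scipy.stats import chi2
from arch import arch_model

# r: daily log returns (assumed given)
small = arch_model(r, mean="Zero", vol="GARCH", p=1, q=1).fit(disp="off")
big = arch_model(r, mean="Zero", vol="GARCH", p=2, q=2).fit(disp="off")

print(f"GARCH(1,1) AIC: {small.aic:.2f}   GARCH(2,2) AIC: {big.aic:.2f}")

# LR test: GARCH(1,1) is nested in GARCH(2,2). Caveat: the restricted
# parameters sit on the boundary of the parameter space (alpha_2 = beta_2 = 0),
# so the chi-squared null is only an approximation here.
lr = 2 * (big.loglikelihood - small.loglikelihood)
df = big.num_params - small.num_params
print(f"LR = {lr:.3f}, p-value = {chi2.sf(lr, df):.4f}")
```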
Regarding your Q1, experience in finance tells us that high-order GARCH models do not tend to beat low-order GARCH models. I would stick to a relatively parsimonious model unless I had reason to believe the time series is somehow special and unlike other financial time series. I do not see a sound theoretical reason to select a model that has a higher AIC than another (when there are not that many models being compared, as in your case), but experience in finance points to a different solution.
Regarding your Q2, see above.
Regarding your Q3, it does matter that you consider the full model. Considering only part of the model does not make sense. (You could construct examples where you choose really poor models over much better models only because you happen to look at part of the picture instead of the whole picture.)
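To make the "full model" point concrete: the AIC you compare should come from the joint likelihood of the mean and variance equations together, not from either piece in isolation. A minimal sketch, assuming the arch package and your return series `r` (the AR(1) mean is just an illustrative alternative, since arch does not support MA terms):

```python
from arch import arch_model

# r: daily log returns (assumed given)
candidates = {
    "zero mean + GARCH(1,1)": arch_model(r, mean="Zero", vol="GARCH", p=1, q=1),
    "AR(1) mean + GARCH(1,1)": arch_model(r, mean="AR", lags=1, vol="GARCH", p=1, q=1),
}
for name, am in candidates.items():
    res = am.fit(disp="off")
    # res.aic is based on the full joint likelihood (mean + variance together)
    print(f"{name}: AIC = {res.aic:.2f}")
```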
Best Answer
This has less to do with perplexity, and more to do with cross-validation and test perplexity specifically. Here's a fuller excerpt from the paper, emphasis mine:
That is, a lower perplexity indicates that the data are more likely. As referenced in your equation, the authors are calculating test-set perplexity. In other words, they're estimating how well their model generalizes by testing it on unseen data. Incidentally, this gives them a practical comparison with competing models whose parameter spaces could be vastly different.
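For reference, the standard definition of test-set perplexity (as in the original LDA paper, Blei et al., 2003, which your equation may be following) is

$$\mathrm{perplexity}(D_{\text{test}}) = \exp\left\{-\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right\},$$

i.e. the exponentiated negative average per-word log-likelihood over the $M$ test documents, where $N_d$ is the length of document $d$; lower values mean the held-out data are more probable under the model.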
It's worth noting that your intuition about higher log-likelihood (or lower perplexity) and overfitting would be right for a training set. As overfitting occurs, curves of training and test perplexity should resemble the learning-curve plots you're probably familiar with: training perplexity should continue decreasing but flatten out, while test perplexity should decrease and then increase in a parabolic sort of shape.
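If you want to see those curves for a topic model, a minimal sketch with gensim follows; `train_corpus`, `heldout_corpus` and `dictionary` are assumed to be your own bag-of-words corpora and Dictionary, and `np.exp2` matches gensim's own base-2 perplexity convention. Keep in mind that the bound-based estimate is a variational approximation, not the exact marginal likelihood discussed earlier.

```python
import numpy as np
from gensim.models import LdaModel

# train_corpus, heldout_corpus: bag-of-words corpora (assumed given)
# dictionary: gensim Dictionary mapping word IDs to tokens (assumed given)

for k in [5, 10, 20, 40, 80]:
    lda = LdaModel(corpus=train_corpus, id2word=dictionary,
                   num_topics=k, passes=10, random_state=0)
    # log_perplexity returns a per-word bound; gensim defines perplexity = 2**(-bound)
    train_ppl = np.exp2(-lda.log_perplexity(train_corpus))
    test_ppl = np.exp2(-lda.log_perplexity(heldout_corpus))
    print(f"k={k:3d}  train perplexity={train_ppl:9.1f}  test perplexity={test_ppl:9.1f}")
```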