I suggest determining both the ARMA and the GARCH parts simultaneously. If you determine the ARMA part first while temporarily ignoring GARCH, you will get inconsistent ARMA parameter estimates (unless the MA part is absent) and probably a suboptimal selection of autoregressive and moving-average lag orders, because the ACF and PACF confidence bounds are invalid in the presence of neglected GARCH-type residuals. Likewise, the Ljung-Box test does not have its usual null distribution under GARCH-type residuals, so you cannot rely on it to assess how well the ARMA model captures the patterns in the data.
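To see why the usual ACF bounds mislead here, consider a minimal simulation sketch (Python with NumPy; the GARCH(1,1) parameter values are purely illustrative). GARCH noise is serially uncorrelated, yet the sampling variability of its sample autocorrelations exceeds the iid benchmark of 1/n, so the standard ±1.96/√n bands are too narrow and flag spurious ARMA structure:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_garch11(n, omega=0.05, alpha=0.15, beta=0.8, burn=500):
    """Simulate a GARCH(1,1) series: serially uncorrelated but not iid."""
    z = rng.standard_normal(n + burn)
    e = np.empty(n + burn)
    h = omega / (1 - alpha - beta)  # start at the unconditional variance
    for t in range(n + burn):
        e[t] = np.sqrt(h) * z[t]
        h = omega + alpha * e[t] ** 2 + beta * h
    return e[burn:]

def acf1(x):
    """Lag-1 sample autocorrelation."""
    xc = x - x.mean()
    return np.dot(xc[1:], xc[:-1]) / np.dot(xc, xc)

n, reps = 1000, 500
r1 = np.array([acf1(simulate_garch11(n)) for _ in range(reps)])

# Under iid noise the lag-1 sample autocorrelation has sd approximately
# 1/sqrt(n); under GARCH it is inflated, so the usual +/-1.96/sqrt(n)
# bands reject too often even though the true autocorrelation is zero.
print(r1.std() * np.sqrt(n))  # typically well above 1 for these parameters
```

The inflation factor grows with the persistence (alpha + beta) of the volatility process, which is typically close to 1 for financial returns.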
These issues have been discussed in earlier posts here, here and to some extent also here.
In short, you should select models using AIC and/or out-of-sample fit criteria and view the rejected hypothesis as a suggestion to consider other types of models.
When using this class of time series models, researchers are usually interested in accurate prediction/forecasting. Since AIC measures how well a model predicts the data in-sample, it serves as a fair means of model selection in this case (you may also want to test how well the models fit out of sample; more on that below).
However, just because a particular model has the lowest AIC does not mean that it is correctly specified or that it approximates the true data-generating process well. It could be that all the models you proposed were poor choices, or that the true process the FTSE follows is so complex that practically every reasonable model will be rejected given enough data. AIC provides no information on this point, which is where hypothesis testing can come in.
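For concreteness, AIC is computed as 2k − 2·logL, where k is the number of estimated parameters and logL the maximized log-likelihood. A minimal sketch (the log-likelihood values and parameter counts below are hypothetical, not taken from any actual fit):

```python
def aic(loglik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * loglik

# Hypothetical maximized log-likelihoods from two candidate fits:
aic_small = aic(-1523.4, 5)  # e.g. an ARMA(1,1)-GARCH(1,1) with 5 parameters
aic_large = aic(-1519.8, 7)  # a richer specification with 7 parameters
print(aic_small, aic_large)
```

In this made-up example the richer model's likelihood gain outweighs its two extra parameters, so it attains the lower AIC; with a smaller gain the penalty would have reversed the ranking.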
Under the assumptions of a standard ARMA-GARCH model, the standardized residuals should be homoscedastic and, more generally, iid normal. Your hypothesis test suggests that your residuals are not homoscedastic and, in turn, that your ARMA-GARCH model may be misspecified. On this note, you may want to consider alternative specifications for the volatility process, including other variants of GARCH models, e.g. EGARCH, GJR-GARCH, TGARCH, AVGARCH, NGARCH, GARCH-M, etc., and/or stochastic volatility models. It is quite likely that one of these models will offer a lower AIC value and produce residuals for which homoscedasticity cannot be rejected.
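A common screen for remaining heteroscedasticity is a Ljung-Box test applied to the squared standardized residuals. A minimal sketch (Python with NumPy/SciPy; note that the chi-squared reference distribution is only approximate when the test is applied to residuals from a fitted model):

```python
import numpy as np
from scipy.stats import chi2

def ljung_box(x, lags=10):
    """Ljung-Box Q statistic and approximate p-value for no
    autocorrelation in x up to the given number of lags."""
    n = len(x)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    q = 0.0
    for k in range(1, lags + 1):
        r_k = np.dot(xc[k:], xc[:-k]) / denom
        q += r_k ** 2 / (n - k)
    q *= n * (n + 2)
    return q, chi2.sf(q, lags)

# Stand-in for standardized residuals; in practice these come from
# the fitted ARMA-GARCH model. Applying the test to their squares
# screens for remaining ARCH effects.
rng = np.random.default_rng(1)
z = rng.standard_normal(2000)
q, p = ljung_box(z ** 2, lags=10)
```

A small p-value here would indicate autocorrelation in the squared residuals, i.e. volatility clustering that the fitted GARCH specification has not captured.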
One important thing to note, though, is that no model will be perfect, especially for something like the FTSE 100. The true data-generating process driving a large financial index is impossibly complex, so pretty much every model you propose will be false. For this reason, it can be argued that any meaningful hypothesis you fail to reject reflects insufficient data or a lack of statistical power rather than evidence supporting one model over the others.
One way to partially resolve this dilemma is to use out-of-sample fit instead of, or in conjunction with, AIC. A simple example would be to fit the model using only the first 80% or 90% of the data and then use the resulting coefficient estimates to obtain a log-likelihood for the remaining 10%-20% of the data. The model with the highest out-of-sample log-likelihood would be preferred. If the ARMA-GARCH model is truly misspecified in a way that impairs its forecasting performance, out-of-sample fit will help expose it.
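The split-and-score mechanics can be sketched as follows (Python with NumPy; for simplicity the "model" here is just a Gaussian with estimated mean and variance — in practice you would plug in the ARMA-GARCH estimates and conditional variances instead):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_normal(1000)       # stand-in for the return series

split = int(0.8 * len(y))           # fit on the first 80% ...
train, test = y[:split], y[split:]

# "Fit": training-sample mean and variance; in practice these would
# be the coefficient estimates of each candidate ARMA-GARCH model.
mu, sig2 = train.mean(), train.var()

# ... and score the held-out 20% with the fitted parameters.
oos_loglik = -0.5 * np.sum(np.log(2 * np.pi * sig2)
                           + (test - mu) ** 2 / sig2)
print(oos_loglik)  # compare across candidate models; higher is better
```

Repeating this for each candidate model and ranking by the held-out log-likelihood gives the out-of-sample comparison described above.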
Best Answer
Given a sufficiently large sample, I do not see a reason to mitigate the sensitivity in the first place. Sensitivity is desirable. It allows the model to reflect regularities in the data, and this is what we use models for (not only GARCH but also more generally).
On the other hand, if your sample is small and the results vary a lot among similar model specifications, you may be heavily overfitting the data. Try some more parsimonious model specifications instead.
Also note that ARMA models with different lag orders and different coefficient values can nevertheless produce similar dynamics. You may therefore gain more insight by comparing the impulse-response functions (IRFs) of the ARMA models than by comparing their coefficient values directly.
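The IRF of an ARMA model is just its sequence of MA(∞) weights, obtainable by a standard recursion. A minimal sketch (Python with NumPy) showing two differently parameterized models with nearly identical impulse responses:

```python
import numpy as np

def arma_irf(ar, ma, horizon=20):
    """Impulse-response (MA-infinity) weights psi_j of an ARMA model:
    psi_0 = 1, psi_j = theta_j + sum_i phi_i * psi_{j-i}."""
    psi = np.zeros(horizon + 1)
    psi[0] = 1.0
    for j in range(1, horizon + 1):
        psi[j] = ma[j - 1] if j <= len(ma) else 0.0
        for i, phi in enumerate(ar, start=1):
            if j - i >= 0:
                psi[j] += phi * psi[j - i]
    return psi

# An AR(1) and a truncated MA(4) with geometrically decaying weights:
irf_a = arma_irf(ar=[0.5], ma=[])
irf_b = arma_irf(ar=[], ma=[0.5, 0.25, 0.125, 0.0625])
```

Despite having different lag orders and coefficients, the two models' impulse responses agree through lag 4, so over short horizons they imply essentially the same dynamics — exactly the kind of similarity that coefficient tables obscure.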
I do not think there is. The problem is not specific to GARCH models, however; it is common to a wide range of statistical models.
In summary, I would try different models, assess their assumptions (by running diagnostics on standardized residuals) and pick a model that offers a good trade-off between statistical adequacy (based on diagnostics) and parsimony (based on model complexity). AIC is one measure that can aid you in that.