My understanding is that AIC, DIC, and WAIC are all estimating the same thing: the expected out-of-sample deviance associated with a model. This is also the same thing that cross-validation estimates. In Gelman et al. (2013), they say this explicitly:
A natural way to estimate out-of-sample prediction error is cross-validation (see Vehtari and Lampinen, 2002, for a Bayesian perspective), but researchers have always sought alternative measures, as cross-validation requires repeated model fits and can run into trouble with sparse data. For practical reasons alone, there remains a place for simple bias corrections such as AIC (Akaike, 1973), DIC (Spiegelhalter, Best, Carlin, and van der Linde, 2002, van der Linde, 2005), and, more recently, WAIC (Watanabe, 2010), and all these can be viewed as approximations to different versions of cross-validation (Stone, 1977).
BIC estimates something different, which is related to minimum description length. Gelman et al. say:
BIC and its variants differ from the other information criteria considered here in being motivated not by an estimation of predictive fit but by the goal of approximating the marginal probability density of the data, p(y), under the model, which can be used to estimate relative posterior probabilities in a setting of discrete model comparison.
I don't know anything about the other information criteria you listed, unfortunately.
Can you use the AIC-like information criteria interchangeably? Opinions may differ, but given that AIC, DIC, WAIC, and cross-validation all estimate the same thing, then yes, they're more-or-less interchangeable. BIC is different, as noted above. I don't know about the others.
Why have more than one?
AIC works well when you have a maximum likelihood estimate and flat priors, but doesn't really have anything to say about other scenarios. The penalty is also too small when the number of parameters approaches the number of data points. AICc over-corrects for this, which can be good or bad depending on your perspective.
DIC uses a smaller penalty if parts of the model are heavily constrained by priors (e.g. in some multi-level models where variance components are estimated). This is good, since heavily constrained parameters don't really constitute a full degree of freedom. Unfortunately, the formulas usually used for DIC assume that the posterior is essentially Gaussian (i.e. that it is well-described by its mean), and so one can get strange results (e.g. negative penalties) in some situations.
WAIC uses the whole posterior density more effectively than DIC does, so Gelman et al. prefer it, although it can be a pain to calculate in some cases.
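To make "uses the whole posterior" a bit more concrete, here is a minimal sketch of a WAIC calculation. It assumes you already have pointwise log-likelihoods evaluated at each posterior draw (one row per draw, one column per data point). The function name `waic` and that input layout are my own choices for illustration. The two ingredients, the log pointwise predictive density (lppd) and the variance-based penalty, are the ones described in Gelman et al.

```python
import math
from statistics import variance


def _log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(log_lik):
    """Return (waic, lppd, p_waic) on the deviance scale.

    log_lik[s][i] is the log-likelihood of data point i under posterior
    draw s. At least two draws are needed for the variance term.
    """
    n_points = len(log_lik[0])
    # Regroup by data point: one list of per-draw values for each point.
    columns = [[draw[i] for draw in log_lik] for i in range(n_points)]
    # lppd: log of the posterior-averaged likelihood, summed over points.
    lppd = sum(_log_mean_exp(col) for col in columns)
    # Penalty: posterior variance of the log-likelihood, summed over points.
    p_waic = sum(variance(col) for col in columns)
    return -2.0 * (lppd - p_waic), lppd, p_waic
```

Because the penalty is a sum of per-point variances over the posterior draws, it cannot go negative, unlike the mean-based DIC penalty mentioned above. In practice you would probably use an established package rather than hand-rolling this.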
Cross-validation does not rely on any particular formula, but it can be computationally prohibitive for many models.
In my view the decision about which one of the AIC-like criteria to use depends entirely on these sorts of practical issues, rather than a mathematical proof that one will do better than the other.
References:
Gelman et al. Understanding predictive information criteria for Bayesian models. Available from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.295.3501&rep=rep1&type=pdf
The AIC is sensitive to the sample size used to train the models. At small sample sizes, "there is a substantial probability that AIC will select models that have too many parameters, i.e. that AIC will overfit". [1] The reference goes on to suggest AICc in this scenario, which introduces an extra penalty term for the number of parameters.
This answer by Artem Kaznatcheev suggests, based on Burnham and Anderson, using AICc whenever $n/K < 40$. Here $n$ signifies the number of samples and $K$ the number of model parameters. Your data has 234 rows available (listed on the webpage you linked). Since $234/40 \approx 5.85$, the ratio $n/K$ drops below 40 at roughly 6 parameters, beyond which you should consider AICc.
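The arithmetic behind that cutoff, together with the usual small-sample correction $\mathrm{AICc} = \mathrm{AIC} + \frac{2K(K+1)}{n-K-1}$, can be sketched as follows. The function names are made up for illustration, and the log-likelihood value in the usage line is a placeholder, not something computed from your data.

```python
def aic(log_lik, k):
    """AIC = 2k - 2 ln(L), with log_lik the maximized log-likelihood."""
    return 2 * k - 2 * log_lik


def aicc(log_lik, k, n):
    """AIC plus the small-sample correction 2k(k+1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


def should_use_aicc(n, k, ratio_cutoff=40.0):
    """Rule of thumb from the text: prefer AICc when n / k < 40."""
    return n / k < ratio_cutoff


n = 234  # number of rows mentioned in the text
for k in range(4, 9):
    print(k, round(n / k, 1), should_use_aicc(n, k))

# Placeholder log-likelihood, purely to show the size of the correction.
print(aicc(-100.0, k=6, n=n) - aic(-100.0, k=6))
```

With $n = 234$ the rule flips between $K = 5$ ($n/K = 46.8$) and $K = 6$ ($n/K = 39$). The correction term itself is small at that point (84/227, about 0.37) and grows quickly as $K$ approaches $n$.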
[1] https://en.m.wikipedia.org/wiki/Akaike_information_criterion#modification_for_small_sample_size
Best Answer
According to Wikipedia, the AIC can be written as follows: $$ 2k - 2 \ln(\mathcal L) $$ The BIC can be written as follows: $$ -2 \ln(\mathcal L) + k \ln(n) $$ Here $k$ is the number of parameters, $\mathcal L$ the maximized likelihood, and $n$ the sample size. So the difference is that the BIC penalty per parameter is $\ln(n)$ instead of 2: it grows with the size of the sample and exceeds the AIC penalty as soon as $n > e^2 \approx 7.4$. Because the BIC penalizes complex models more heavily, there are situations in which the AIC will hint that you should select a model that is too complex, while the BIC is still useful. If you do not want to penalize for the sample there
A quick explanation by Rob Hyndman can be found here: Is there any reason to prefer the AIC or BIC over the other? He writes:
Edit: One example can be found in time series analysis. In VAR models the AIC (as well as its corrected version, the AICc) often selects too many lags. Therefore one should primarily look at the BIC when choosing the number of lags of a VAR model. For further information you can read chapter 9.2 of Forecasting: Principles and Practice by Rob J. Hyndman and George Athanasopoulos.
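To see how the two formulas above can disagree, here is a small sketch. The log-likelihoods and parameter counts are invented purely for illustration (they do not come from any fitted VAR), and the function names are my own.

```python
import math


def aic(log_lik, k):
    """AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """BIC = k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_lik


n = 234  # hypothetical sample size

# Hypothetical fits: the larger model gains 8 log-likelihood units
# at the cost of 5 extra parameters.
small = {"log_lik": -120.0, "k": 3}
large = {"log_lik": -112.0, "k": 8}

for name, m in (("small", small), ("large", large)):
    print(name,
          round(aic(m["log_lik"], m["k"]), 2),
          round(bic(m["log_lik"], m["k"], n), 2))
```

With these made-up numbers the AIC is lower for the larger model, while the BIC is lower for the smaller one, because each extra parameter costs $\ln(234) \approx 5.46$ under BIC but only 2 under AIC. This is the same mechanism that makes BIC more conservative about adding lags.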