It seems that the k-fold cross-validation error is very sensitive to the choice of performance measure. It also carries an error of its own, because the training and validation sets are chosen randomly.
I think you've discovered the high variance of performance measures that are proportions of case counts, such as $\frac{\text{# correct predictions}}{\text{# test cases}}$. You are trying to estimate, e.g., the probability that your classifier returns a correct answer. From a statistics point of view, each test case is a Bernoulli trial, so the number of correct predictions follows a binomial distribution. You can calculate confidence intervals for binomial proportions and will find that they are very wide. This of course limits your ability to do model comparison.
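To see just how wide, here is a minimal sketch (my illustration, not part of the original answer) that computes a Wilson score confidence interval for an observed accuracy; the case counts are arbitrary placeholders:

```python
# Wilson score interval for a proportion such as accuracy;
# z = 1.96 gives an approximate 95% interval.
import math

def wilson_ci(n_correct, n_total, z=1.96):
    p_hat = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p_hat + z**2 / (2 * n_total)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n_total + z**2 / (4 * n_total**2))
    return center - half, center + half

# 80 of 100 test cases correct: the interval already spans ~16 points
print(wilson_ci(80, 100))  # approx. (0.71, 0.87)
# with only 25 test cases it is far wider
print(wilson_ci(20, 25))   # approx. (0.61, 0.91)
```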
With resampling validation schemes such as cross validation, you have an additional source of variation: the instability of your models (as you build $k$ surrogate models during each CV run).
Moreover, changing the number of folds gives me different optimal parameter values.
That is to be expected due to the variance. You may have an additional effect here: libSVM splits the data only once if you use its built-in cross validation for tuning. Due to the nature of SVMs, if you build the SVM on identical training data and slowly vary the parameters, you'll find that the support vectors (and consequently the accuracy) jump: as long as the SVM parameters are not too different, it will still choose the same support vectors. Only when the parameters change enough will a different set of support vectors suddenly result. So evaluating the SVM parameter grid with exactly the same cross validation splits may hide the variability that you see between different runs.
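As an illustration (my own sketch, using scikit-learn rather than libSVM, with an example dataset and placeholder parameter values), repeating the cross-validation split makes this run-to-run variability visible, whereas a single fixed split hides it:

```python
# Repeating the CV split shows variance that one fixed split hides.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
clf = SVC(C=1.0, gamma="scale")   # placeholder parameter values

accuracies = []
for seed in range(20):            # 20 different random splits
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    accuracies.append(cross_val_score(clf, X, y, cv=cv).mean())

print("mean CV accuracy: %.3f +/- %.3f (sd over 20 repeated splits)"
      % (np.mean(accuracies), np.std(accuracies)))
```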
IMHO the basic problem is that you do a grid search, which is an optimization that relies on a reasonably smooth behaviour of your target functional (accuracy or whatever else you use). Due to the high variance of your performance measurements, this assumption is violated. The "jumpy" dependence of the SVM model also violates this assumption.
Accuracy metrics for cross validation may be overly optimistic. Usually anything over a 2-fold cross-validation gives me 100% accuracy. Also, the error rate is discretized due to small sample size. Model selection will often give me the same error rate across all or most parameter values.
That is to be expected given the general problems of the approach.
However, it is usually possible to choose really extreme parameter values where the classifier breaks down. IMHO the parameter ranges where the SVM works well are important information.
In any case you absolutely need an external (double/nested) validation of the performance of the model you choose as 'best'.
I'd probably do a number of runs/repetitions/iterations of an outer cross validation or an outer out-of-bootstrap validation and give the distribution of
- hyperparameters for the "best" model
- reported performance of the tuning
- observed performance of outer validation
The difference between the last two is an indicator of overfitting (e.g. due to "skimming" the variance).
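Here is a minimal sketch of such a nested (double) setup, again with scikit-learn; the parameter grid and dataset are placeholders for illustration. The inner loop tunes, the outer loop estimates the performance of the whole tuning procedure:

```python
# Nested CV: inner loop tunes hyperparameters, outer loop validates
# the tuned model; repeated over several runs to see the distributions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_validate)
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]}

for run in range(5):   # several runs/repetitions of the outer CV
    inner = StratifiedKFold(5, shuffle=True, random_state=run)
    outer = StratifiedKFold(5, shuffle=True, random_state=100 + run)
    tuner = GridSearchCV(SVC(), param_grid, cv=inner)
    res = cross_validate(tuner, X, y, cv=outer, return_estimator=True)
    for est, outer_score in zip(res["estimator"], res["test_score"]):
        # distribution of: chosen hyperparameters, reported (inner)
        # tuning performance, and observed outer performance -- the
        # gap between the last two hints at "skimming the variance"
        print(run, est.best_params_,
              "inner %.3f" % est.best_score_,
              "outer %.3f" % outer_score)
```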
When writing a report, how would I know that a classification is 'good' or 'acceptable'? In the field, it seems we don't have something like a goodness-of-fit or p-value threshold that is commonly accepted. Since I am adding to the data iteratively, I would like to know when to stop: what is a good N at which the model does not significantly improve?
(What are you adding? Cases or variates/features?)
First of all, if you do iterative modeling, you need to report that, due to your fitting procedure, the performance estimate is subject to an optimistic bias and is not to be taken at face value. The better alternative is to validate the final model. However, the test data for that must be independent of all data that ever went into training or into your decisions during modeling (so you may not have any such data left).
My understanding is that AIC, DIC, and WAIC are all estimating the same thing: the expected out-of-sample deviance associated with a model. This is also the same thing that cross-validation estimates. In Gelman et al. (2013), they say this explicitly:
A natural way to estimate out-of-sample prediction error is cross-validation (see Vehtari and Lampinen, 2002, for a Bayesian perspective), but researchers have always sought alternative measures, as cross-validation requires repeated model fits and can run into trouble with sparse data. For practical reasons alone, there remains a place for simple bias corrections such as AIC (Akaike, 1973), DIC (Spiegelhalter, Best, Carlin, and van der Linde, 2002, van der Linde, 2005), and, more recently, WAIC (Watanabe, 2010), and all these can be viewed as approximations to different versions of cross-validation (Stone, 1977).
BIC estimates something different, which is related to minimum description length. Gelman et al. say:
BIC and its variants differ from the other information criteria considered here in being motivated not by an estimation of predictive fit but by the goal of approximating the marginal probability density of the data, p(y), under the model, which can be used to estimate relative posterior probabilities in a setting of discrete model comparison.
I don't know anything about the other information criteria you listed, unfortunately.
Can you use the AIC-like information criteria interchangeably? Opinions may differ, but given that AIC, DIC, WAIC, and cross-validation all estimate the same thing, then yes, they're more-or-less interchangeable. BIC is different, as noted above. I don't know about the others.
Why have more than one?
AIC works well when you have a maximum likelihood estimate and flat priors, but doesn't really have anything to say about other scenarios. The penalty is also too small when the number of parameters approaches the number of data points. AICc over-corrects for this, which can be good or bad depending on your perspective.
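For concreteness, here is a small sketch (my illustration) of the two penalties; with the log-likelihood held fixed, it shows how the AICc correction $2k(k+1)/(n-k-1)$ blows up as the number of parameters $k$ approaches the number of data points $n$:

```python
# AIC = -2*loglik + 2k; AICc adds a small-sample correction term.
def aic(loglik, k):
    return -2.0 * loglik + 2.0 * k

def aicc(loglik, k, n):
    # correction is undefined for n <= k + 1
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)

# same fit (log-likelihood fixed at 0 for illustration), n = 20 cases
for k in (2, 5, 10, 15):
    print(k, aic(0.0, k), round(aicc(0.0, k, 20), 1))
# AICc's penalty grows much faster than AIC's 2k as k approaches n
```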
DIC uses a smaller penalty if parts of the model are heavily constrained by priors (e.g. in some multi-level models where variance components are estimated). This is good, since heavily constrained parameters don't really constitute a full degree of freedom. Unfortunately, the formulas usually used for DIC assume that the posterior is essentially Gaussian (i.e. that it is well-described by its mean), and so one can get strange results (e.g. negative penalties) in some situations.
WAIC uses the whole posterior density more effectively than DIC does, so Gelman et al. prefer it although it can be a pain to calculate in some cases.
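As a sketch of the difference (my implementation of the formulas in Gelman et al., 2013, assuming you already have an $S \times n$ matrix of pointwise log-likelihoods $\log p(y_i \mid \theta_s)$ over $S$ posterior draws):

```python
# DIC and WAIC from posterior draws; log_lik is an S x n matrix of
# pointwise log-likelihoods, one row per posterior draw.
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    S = log_lik.shape[0]
    # lppd: log pointwise predictive density, uses the whole posterior
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))
    # effective number of parameters: pointwise posterior variances
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)

def dic(log_lik, log_lik_at_post_mean):
    # p_DIC = 2 * (log p(y|theta_bar) - mean_s log p(y|theta_s));
    # this summary via the posterior mean is where the (roughly)
    # Gaussian-posterior assumption, and the possibility of negative
    # penalties, comes in
    mean_dev = np.mean(np.sum(log_lik, axis=1))
    p_dic = 2.0 * (log_lik_at_post_mean - mean_dev)
    return -2.0 * log_lik_at_post_mean + 2.0 * p_dic
```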
Cross-validation does not rely on any particular formula, but it can be computationally prohibitive for many models.
In my view the decision about which one of the AIC-like criteria to use depends entirely on these sorts of practical issues, rather than a mathematical proof that one will do better than the other.
References:
Gelman, A., Hwang, J., and Vehtari, A. (2013). Understanding predictive information criteria for Bayesian models. Available from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.295.3501&rep=rep1&type=pdf
If you have read Burnham & Anderson's monograph, you know just why they discourage AIC(c)-based model selection: because they subscribe to the theory of tapering effect sizes. In a nutshell, they posit that everything has an effect - it's just that most effects are pretty small (sort of a "long tail"). Thus, an AIC(c)-selected model may be more parsimonious, but it will be systematically too small (the bias-variance trade-off). Therefore they recommend averaging models.
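For reference, model averaging in the B&A style is usually done with Akaike weights; here is a minimal sketch (my illustration, with placeholder AICc values):

```python
# Akaike weights: w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2),
# where delta_i is the AICc difference from the best candidate model.
import numpy as np

def akaike_weights(aicc_values):
    a = np.asarray(aicc_values, dtype=float)
    delta = a - a.min()
    w = np.exp(-0.5 * delta)
    return w / w.sum()          # weights sum to 1

print(akaike_weights([100.0, 101.2, 104.5]))  # ~[0.61, 0.33, 0.06]
# an averaged coefficient is then sum_i w_i * beta_i over the models
```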
This is also the reason why statistical significance and p values are not en vogue in the Burnham & Anderson worldview. Tapering effect sizes are another way of saying that the true coefficients are almost always nonzero, just perhaps very small. Thus, the null hypothesis is already false a priori. P values pose a question that we already know the answer to.
Thus, if you follow B&A's philosophy far enough that you do AICc-based model averaging, it seems a bit incongruous to also discuss p values and/or "marginal significance".
Now, one possibility would be to simply discuss "averaged coefficients" and their CIs, without even discussing whether CIs contain zero. Conversely, if you are in a field that deifies p values (like psychology), it may make more sense to disregard these implications of B&A in the interest of talking in a way your readers will understand, rather than follow strict AICc purity.
(Anyway, my impression is that AICc and B&A have more of a following among non-statisticians, especially ecologists. So the nuances we are discussing here may already be far away from your readership's main interests.)