Setting NaNs to the mean of the training or test set is ok?
You need to define a procedure that you always follow. I see two valid options here:
- either use some value (e.g. mean) calculated for that subject (see also below)
- or some value calculated for the training set, basically a hyperparameter "value to be used for replacing NAs". This should not be calculated from the whole test set (independent testing also means that no parameters calculated from other test subjects should be used: the processing should not depend on the composition of the test set).
Edit:
Which method to follow (replace by value computed within subject or by value computed within training set) should IMHO be decided from knowledge about the application and the data; we cannot give you more than very general guidelines here.
Why not by value computed within test set?: That would mean that the value used to replace NAs in test subject A depends on whether subject A is tested together with subject B or subject C – which doesn't seem to be a desirable or sensible behaviour to me.
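A minimal sketch of the two options, assuming missing values are coded as NaN and using made-up toy numbers (the subject names and helper functions are hypothetical, not from the question):

```python
import math

def mean_ignoring_nan(values):
    """Mean of the non-missing entries; NaN marks a missing value."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)

def impute(values, fill):
    """Replace every NaN in values by the fixed number fill."""
    return [fill if math.isnan(v) else v for v in values]

nan = float("nan")
# Hypothetical toy data: one feature, a few measurements per subject.
train = {"s1": [1.0, nan, 3.0], "s2": [5.0, 7.0, nan]}
test_subject = [10.0, nan, 14.0]

# Option 1: value computed within the subject being processed.
within_subject = impute(test_subject, mean_ignoring_nan(test_subject))

# Option 2: value computed once from the training set, then treated as fixed.
train_fill = mean_ignoring_nan([v for vals in train.values() for v in vals])
from_training = impute(test_subject, train_fill)

print(within_subject, from_training)
```

In both variants the result for this test subject is the same no matter which other subjects are tested alongside it, which is the property that a value computed from the whole test set would lack.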
You may also want to look up "Imputation" which is the general term for techniques that fill in missing values.
Centering and scaling (standardization): if you have "external" (scientific) knowledge that suggests that a standardization within the subjects should take place, then go ahead with that. Whether this is sensible depends on your application and data, so we cannot answer this question.
For a more general discussion of centering and standardization, see e.g. "Variables are often adjusted (e.g. standardised) before making a model - when is this a good idea, and when is it a bad one?" and "When conducting multiple regression, when should you center your predictor variables & when should you standardize them?"
Now within each outer fold I plan to tune a classifier's parameter with help of another cross-validation.
With 4 subjects you probably won't be able to compare classifiers anyways: the uncertainty due to having only 4 test cases is far too high. Can't you fix this parameter by experience with similar data?
To illustrate the problem: assume you observe 4 correct classifications out of 4 test subjects. That gives you a point estimate of 100% correct predictions. If you look at confidence interval calculations for this, you get e.g. for the Agresti-Coull method a 95% confidence interval of 45 - 105% (obviously not very precise with the small sample size); the Bayes method with uniform prior makes it 55 - 100%. In any case it means that even if you observe perfect test results, it is not quite clear whether you can claim that the model is actually better than guessing. As long as you do not need to fear that fixing the parameter beforehand will produce a model that is clearly worse than guessing, you cannot measure improvements in the practically important range anyway.
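For anyone who wants to check these numbers, here is a small sketch of both calculations. It assumes the standard Agresti-Coull formula and, for the Bayesian interval, the highest-density interval of the Beta(n + 1, 1) posterior that a uniform prior gives when all n cases are correct; the function names are mine. The results should land close to the figures quoted above, up to rounding.

```python
import math

def agresti_coull(successes, n, z=1.959964):
    """Agresti-Coull interval for a binomial proportion (not clipped to [0, 1])."""
    n_adj = n + z * z
    p_adj = (successes + z * z / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half

def bayes_all_correct(n, level=0.95):
    """Highest-density interval for n correct out of n under a uniform prior.

    The posterior is Beta(n + 1, 1) with CDF p**(n + 1). Its density is
    increasing in p, so the interval always ends at 1.
    """
    return (1 - level) ** (1 / (n + 1)), 1.0

ac_low, ac_high = agresti_coull(4, 4)
b_low, b_high = bayes_all_correct(4)
print(f"Agresti-Coull:        {ac_low:.1%} - {ac_high:.1%}")
print(f"Bayes, uniform prior: {b_low:.1%} - {b_high:.1%}")
```

The upper Agresti-Coull limit above 100% is not a bug: the method is a normal approximation and does not know that a proportion cannot exceed 1.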
The situation may be less drastic if you optimize e.g. Brier score but with 4 subjects I'd suspect that you still do not reach the precision you need for the expected improvement during the optimization.
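To see why a score like Brier's can carry more information than a proportion of correct predictions, consider this toy sketch. The predicted probabilities are invented for illustration only: both hypothetical classifiers get 4 out of 4 right, yet their Brier scores differ.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)

def accuracy(probabilities, outcomes, threshold=0.5):
    """Fraction of cases where thresholding the probability gives the true label."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probabilities, outcomes))
    return hits / len(outcomes)

# Hypothetical predictions for 4 test subjects, all with true label 1.
outcomes = [1, 1, 1, 1]
confident = [0.9, 0.9, 0.9, 0.9]
hesitant = [0.6, 0.6, 0.6, 0.6]

for name, probs in [("confident", confident), ("hesitant", hesitant)]:
    print(name, accuracy(probs, outcomes), brier_score(probs, outcomes))
```

The proportion cannot separate the two, while the Brier score (lower is better) can. Whether that gain is enough with 4 subjects is a different matter, as said above.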
Edit: Unfortunately, while 20 subjects are far more than 4, from a classifier validation statistics point of view, 20 is still very few.
We recently concluded that if you need to stick with the frequently used proportions for characterizing your classifier, at least in our field you cannot expect to have a useful precision in the test results with less than 75 - 100 test subjects (in the denominator of the proportion!). Again, you may be better off if you can switch to e.g. Brier's score, and with a paired design for classifier comparison, but I'd call it lucky if that gains you a factor of 5 in sample size.
You can find our thoughts here: Beleites, C. and Neugebauer, U. and Bocklitz, T. and Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007
accepted manuscript on arXiv: 1211.1323
AFAIK, dealing with the random uncertainty on test results during classifier optimization is an unsolved problem. (If not, I'd be extremely interested in papers about the solution!)
So my recommendation would be to do a preliminary experiment/analysis at the end of which you try to estimate the random uncertainty on the comparison results. If this uncertainty does not allow you to optimize (which I'd unfortunately expect to be the outcome), report this result and argue that in consequence you do not have any choice at the moment but to fix the hyper-parameters to some sensible (though not optimized) value.
Does the inner cross-validation necessarily need to be leave-one-subject-out as well?
If you do inner cross validation, it would be better to do it subject-wise as well: without this, you'll get overly optimistic inner results. This would not be a problem iff the bias were constant. However, it usually isn't, and you have the additional problem that, due to the random uncertainty together with the optimistic bias, you may observe many models that seem to be perfect. You cannot distinguish among these (after all, they all seem to be perfect), which can completely mess up the optimization.
Again, with so few subjects I'd avoid this inner optimization and fix the parameter.
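If you do go for the inner loop anyway, the subject-wise splitting could look roughly like the following sketch. It assumes every row of the data carries a subject label; the labels and function names are hypothetical, and no model is fitted here, only the row indices are produced.

```python
def leave_one_subject_out(subjects):
    """Yield (held_out, remaining) for every distinct subject, in sorted order."""
    unique = sorted(set(subjects))
    for held_out in unique:
        yield held_out, [s for s in unique if s != held_out]

def rows_of(subject_of_row, subjects):
    """Indices of all rows that belong to one of the given subjects."""
    wanted = set(subjects)
    return [i for i, s in enumerate(subject_of_row) if s in wanted]

def nested_splits(subject_of_row):
    """Yield row indices for every outer/inner leave-one-subject-out combination."""
    for outer_test, outer_train in leave_one_subject_out(subject_of_row):
        # The inner loop only ever sees subjects from the outer training set.
        for inner_val, inner_train in leave_one_subject_out(outer_train):
            yield {
                "outer_test": rows_of(subject_of_row, [outer_test]),
                "inner_validation": rows_of(subject_of_row, [inner_val]),
                "inner_train": rows_of(subject_of_row, inner_train),
            }

# Hypothetical layout: 4 subjects with 2 rows each.
subject_of_row = ["A", "A", "B", "B", "C", "C", "D", "D"]
splits = list(nested_splits(subject_of_row))
print(len(splits), splits[0])
```

With 4 subjects this gives 4 outer folds of 3 inner folds each, and each inner model is validated on a single subject, which again shows how little data the optimization has to work with.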
After thinking a bit about your problem it is essentially coreset selection, i.e., finding a small subset (the coreset) of the training data such that the model trained on the subset is as close as possible to the model trained on the full dataset.
I'm not familiar with the area and it isn't a very easy term to google, but the paper Near-optimal Coresets For Least-Squares Regression (and the papers it references) is probably very relevant. It doesn't look like they directly solve the problem of finding the best $k$ points though. Instead, their algorithm produces a coreset of size polynomially bounded in terms of a desired accuracy parameter and the rank of the data matrix. You might be able to massage their technique into what you want though by playing around with the bounds.
One thing I do want to mention: the idea of using some sort of CV procedure as your search technique seems flawed, e.g. fitting a model for each of the $\binom{20}{4}$ splits of your data and selecting the 4 points that result in a model with the minimum error on the remaining 16 points. Notice that there is a sort of symmetry, as you could also view this procedure as picking the 16 points that are most easily predicted from 4, which almost certainly isn't what you really want. Essentially all such a procedure would be doing is searching for the split of your data that makes the problem easiest.
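To make the criticized procedure concrete (this is a sketch of what I would not recommend, not a suggested solution), here it is on synthetic data. The data, the straight-line model and all names are assumptions made purely for illustration.

```python
import itertools
import math
import random

def fit_line(points):
    """Ordinary least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(line, points):
    """Mean squared prediction error of the line on the given points."""
    slope, intercept = line
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)

rng = random.Random(0)
# Synthetic data, purely illustrative: 20 noisy points around y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)) for x in range(20)]

errors = {}
for chosen in itertools.combinations(range(20), 4):
    train = [data[i] for i in chosen]
    rest = [data[i] for i in range(20) if i not in chosen]
    errors[chosen] = mse(fit_line(train), rest)

best = min(errors, key=errors.get)
print(len(errors), "splits evaluated; selected points:", best)
```

Whatever 4 points come out, they were selected because the other 16 happened to be easy to predict from them, so the winning error is an optimistically selected minimum over $\binom{20}{4} = 4845$ candidates rather than an honest estimate.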
If there is one subject per row, then method = LOOCV would do it. You will have to set up your own resampling indicators and supply them via index. At that point, the value of method doesn't matter.
You could do something like:
(note that your test data set converts var-var4 to character, so I didn't test this.)
Max