I don't have a full answer to this question, but I can give a partial answer on some of the analytical aspects. Warning: I've been working on other problems since the first paper below, so it's very likely there is other good stuff out there I'm not aware of.
First, I think it's worth noting that despite the title of their paper, "When Is 'Nearest Neighbor' Meaningful?", Beyer et al. actually answered a different question, namely when NN is *not* meaningful. We proved the converse of their theorem, under some additional mild assumptions on the sample size, in "When Is 'Nearest Neighbor' Meaningful: A Converse Theorem and Implications", Journal of Complexity, 25(4), August 2009, pp. 385-397, and showed that there are situations where (in theory) the concentration of distances will not arise. We give examples, but in essence the number of non-noise features needs to grow with the dimensionality, so such situations seldom arise in practice.
The references 1 and 7 cited in our paper give some examples of ways in which the distance concentration can be mitigated in practice.
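The concentration phenomenon itself is easy to see in a quick simulation (a toy sketch of my own, not taken from any of the papers above): with i.i.d. noise features, the relative gap between the farthest and nearest point shrinks as the dimensionality grows.

```python
import random

def relative_contrast(n_points, dim, seed=0):
    """Relative contrast (d_max - d_min) / d_min of the Euclidean distances
    from the origin to n_points i.i.d. uniform points in [0, 1]^dim.
    As dim grows, this contrast shrinks: distances concentrate."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        d = sum(rng.random() ** 2 for _ in range(dim)) ** 0.5
        dists.append(d)
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(200, 2)       # low-dimensional: large contrast
high = relative_contrast(200, 10000)  # high-dimensional: tiny contrast
print(f"contrast in 2-d:     {low:.3f}")
print(f"contrast in 10000-d: {high:.3f}")
```

The exact numbers depend on the seed, but the qualitative gap between the two regimes is robust: in 10,000 dimensions the nearest and farthest points are almost equally far away.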
A paper by my supervisor, Ata Kaban, examines whether these distance concentration issues persist after applying dimensionality reduction techniques: "On the Distance Concentration Awareness of Certain Data Reduction Techniques", Pattern Recognition, 44(2), February 2011, pp. 265-277. There's some nice discussion in there too.
A recent paper by Radovanovic et al., "Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data", JMLR, 11(Sep), September 2010, pp. 2487-2531, discusses the issue of "hubness": a small subset of points appear among the $k$ nearest neighbours of many of the other observations. See also the first author's PhD thesis, which is on the web.
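To make "hubness" concrete, here is a sketch of the $N_k$ score used in that line of work: for each point, count how often it occurs in the $k$-NN lists of the other points. The data below is synthetic Gaussian noise of my own invention, purely to illustrate the measurement.

```python
import random

def k_occurrence_counts(points, k):
    """For each point, count how many other points include it among
    their k nearest neighbours (the N_k score). The mean of the counts
    is always exactly k; hubs are points with N_k far above that mean."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Squared Euclidean distances from point i to every other point.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            counts[j] += 1
    return counts

rng = random.Random(0)
n, dim, k = 300, 50, 10
points = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
counts = k_occurrence_counts(points, k)
print("mean N_k:", sum(counts) / n)  # exactly k by construction
print("max  N_k:", max(counts))      # hubs sit well above the mean
```

Even on featureless i.i.d. data like this, the $N_k$ distribution in high dimensions is noticeably skewed, which is the paper's central observation.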
- Under which circumstances would each be preferable?
There is no universal rule; you should try each on your dataset to be sure. Broadly, a fixed radius works well when the data density is roughly uniform, while $k$-NN adapts the effective radius to the local density.
- How do you approach setting the optimal radius or k value?
You tune this value to give you the best results on held-out data,
and then report the performance it gives you on a separate test set.
For example, train on a random 90% partition of your data,
and use the remaining 10% to evaluate the value you have chosen.
You can also use 5-fold cross-validation if you want a more robust
way to assess the performance of your classifier.
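The selection procedure above can be sketched in plain Python (a toy illustration: the two-blob synthetic data, the candidate $k$ values, and the fold count are all made up for the example, not part of the original answer):

```python
import random

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    neighbours = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y) for x, y in train
    )[:k]
    votes = {}
    for _, y in neighbours:
        votes[y] = votes.get(y, 0) + 1
    return max(votes, key=votes.get)

def cv_accuracy(data, k, folds=5):
    """Mean accuracy of k-NN over `folds` cross-validation folds."""
    accs = []
    for f in range(folds):
        test = data[f::folds]
        train = [d for i, d in enumerate(data) if i % folds != f]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        accs.append(correct / len(test))
    return sum(accs) / folds

# Two noisy 2-d Gaussian blobs stand in for a real labelled dataset.
rng = random.Random(42)
data = [([rng.gauss(c, 1.0), rng.gauss(c, 1.0)], c)
        for c in (0, 3) for _ in range(60)]
rng.shuffle(data)

# Pick the odd k with the best cross-validated accuracy.
best_k = max(range(1, 16, 2), key=lambda k: cv_accuracy(data, k))
print("best k:", best_k, "cv accuracy:", round(cv_accuracy(data, best_k), 3))
```

Restricting the search to odd $k$ avoids ties in the two-class majority vote; once $k$ is chosen this way, you would retrain on all the training data and report accuracy on the untouched test set.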
For 1NN outlier detection:
For each object, compute the distance to its nearest neighbour (or its $k$-th nearest neighbour) and use that distance as the outlier score; points far from all others score highest.
Usually k=1 to k=10 will be enough. See for example:
They ran an enormous number of experiments, but on most data sets, kNN with k=1 was one of the best methods, if I recall correctly.
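The $k$-th-nearest-neighbour distance score is a few lines of code; here is a minimal sketch on synthetic data (the cluster and the planted outlier are invented for the example):

```python
import random

def knn_outlier_scores(points, k=1):
    """Score each point by the distance to its k-th nearest neighbour;
    large scores mark points far from all others (likely outliers)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores

rng = random.Random(1)
# A dense 2-d Gaussian cluster around the origin, plus one planted outlier.
points = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
points.append([10.0, 10.0])
scores = knn_outlier_scores(points, k=1)
print("most outlying point:", points[scores.index(max(scores))])
```

This brute-force version is $O(n^2)$; for large datasets you would use a spatial index (k-d tree, ball tree) to find the neighbours instead.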