One practical approach (at least for supervised learning) is to include all potentially relevant features and use a (generalized) linear model (logistic regression, linear SVM, etc.) with regularization (L1 and/or L2). There are open-source tools (e.g., Vowpal Wabbit) that can handle trillions of example/feature combinations for these types of models, so scalability is not an issue (and one can always sub-sample). The regularization helps with feature selection.
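A minimal sketch of this idea, using scikit-learn rather than Vowpal Wabbit and synthetic data (the data-generating setup here is purely illustrative): an L1 penalty drives the coefficients of irrelevant features to exactly zero, so the surviving nonzero coefficients act as the selected feature set.

```python
# Sketch: L1-regularized logistic regression as an embedded feature selector.
# Uses scikit-learn (not Vowpal Wabbit); the data here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
# Only the first 3 features actually drive the label.
y = (X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=n) > 0).astype(int)

# C is the inverse regularization strength; smaller C -> sparser model.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

# Nonzero coefficients = selected features.
selected = np.flatnonzero(clf.coef_[0])
print("kept features:", selected)
```

Tuning `C` (e.g., by cross-validation) trades off sparsity against predictive performance; here it is just fixed at an arbitrary value.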
The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute value) of their coefficients (loadings). You may recall that PCA seeks to replace $p$ (more or less correlated) variables by $k<p$ uncorrelated linear combinations (projections) of the original variables. Let us ignore how to choose an optimal $k$ for the problem at hand. Those $k$ principal components are ranked by importance through their explained variance, and each variable contributes to varying degrees to each component. Using the largest-variance criterion would be akin to feature extraction, where principal components are used as new features instead of the original variables. However, we can decide to keep only the first component and select the $j<p$ variables that have the highest absolute coefficients; the number $j$ might be based on a proportion of the number of variables (e.g., keep only the top 10% of the $p$ variables), or a fixed cutoff (e.g., a threshold on the normalized coefficients). This approach bears some resemblance to the Lasso operator in penalized regression (or to PLS regression). Neither the value of $j$ nor the number of components to retain is an obvious choice, though.
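The loading-based selection described above can be sketched in a few lines of NumPy (the data are synthetic, with two variables deliberately made to dominate the shared variance): compute the loadings of the first principal component and keep the $j$ variables with the largest absolute coefficients.

```python
# Sketch: rank the original variables by the absolute size of their loadings
# on the first principal component and keep the top j. Pure NumPy; synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n, p, j = 300, 6, 2
latent = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, 0] += 3 * latent   # variables 0 and 1 carry most of the shared variance
X[:, 1] += 3 * latent

Xc = X - X.mean(axis=0)
# Eigendecomposition of the covariance matrix; columns of V are the loadings.
cov = np.cov(Xc, rowvar=False)
eigvals, V = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc1 = V[:, -1]                     # loadings of the first component

top_j = np.argsort(np.abs(pc1))[::-1][:j]
print("selected variables:", sorted(top_j.tolist()))
```

Here $j$ is fixed at 2 for illustration; as noted above, choosing $j$ (or the number of components) in a principled way is the hard part.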
The problem with using PCA is that (1) measurements from all of the original variables are used in the projection to the lower-dimensional space, (2) only linear relationships are considered, and (3) PCA- or SVD-based methods, as well as univariate screening methods (t-test, correlation, etc.), do not take into account the potential multivariate nature of the data structure (e.g., higher-order interactions between variables).
About point 1, some more elaborate screening methods have been proposed, for example principal feature analysis or stepwise methods, like the one used for 'gene shaving' in gene expression studies. Also, sparse PCA might be used to perform dimension reduction and variable selection based on the resulting variable loadings. About point 2, it is possible to use kernel PCA (using the kernel trick) if one needs to embed nonlinear relationships into a lower-dimensional space. Decision trees, or better, the random forest algorithm, are probably better able to address point 3. The latter allows one to derive Gini- or permutation-based measures of variable importance.
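To illustrate point 3, here is a small sketch (scikit-learn, synthetic data) where the label depends on an XOR-style interaction between two variables, so neither is predictive on its own and univariate screening would miss both; a random forest's Gini-based importances still rank them at the top.

```python
# Sketch: Gini-based variable importance from a random forest picks up an
# interaction that univariate screening misses. Uses scikit-learn; synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n, p = 600, 6
X = rng.normal(size=(n, p))
# The label depends on an interaction between variables 0 and 1 only:
# neither variable is predictive on its own (an XOR-like structure).
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print("importance ranking:", ranking.tolist())
```

Permutation-based importance (e.g., scikit-learn's `permutation_importance`) is an alternative that is less biased toward high-cardinality variables.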
A last point: If you intend to perform feature selection before applying a classification or regression model, be sure to cross-validate the whole process (see §7.10.2 of The Elements of Statistical Learning, or Ambroise and McLachlan, 2002).
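A minimal sketch of what "cross-validate the whole process" means in practice, using scikit-learn (the selector and model here are illustrative choices): wrapping the feature-selection step in a pipeline ensures it is refit on each training fold only, so pure-noise labels correctly yield chance-level accuracy instead of the optimistic estimate that selecting on the full data would give.

```python
# Sketch: keep feature selection inside the cross-validation loop by wrapping
# it in a Pipeline, so each fold re-selects features on its training split only.
# Uses scikit-learn; the selector, model, and k are illustrative.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))
y = rng.integers(0, 2, size=200)   # labels are pure noise

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),
    ("model", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("honest CV accuracy: %.2f" % scores.mean())  # hovers near chance (0.5)
```

Selecting the 5 "best" features on the full dataset first and then cross-validating only the model would instead report accuracy well above 0.5 here, which is exactly the selection bias Ambroise and McLachlan describe.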
As you seem to be interested in an R solution, I would recommend taking a look at the caret package, which includes many handy functions for data preprocessing and variable selection in classification and regression contexts.
The first cutoff (retaining the principal components that explain 50% of the total variance) is indeed suggested based on the authors' experiments on the KDD CUP 99 dataset. Underneath Table 2 they explain that they tested cutoffs between 30% and 70%, and that 50% achieved the highest detection rate at the lowest false-alarm rate.
As far as I can tell, they do not mention any reasoning behind choosing eigenvalues less than 0.2, but I suspect they used a similar method: testing various cutoffs over some range and choosing the one that gave the best results on this dataset.
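Both cutoffs are easy to apply once you have the eigenvalue spectrum. A small NumPy sketch, using made-up eigenvalues rather than anything from the KDD CUP 99 data: (a) keep the leading components whose cumulative explained variance first reaches 50%, and (b) keep components whose eigenvalue is at least 0.2.

```python
# Sketch of both cutoffs on a toy eigenvalue spectrum (values are made up
# for illustration, not taken from the paper's data).
import numpy as np

eigvals = np.array([3.1, 1.8, 1.2, 0.9, 0.6, 0.25, 0.1, 0.05])
ratio = eigvals / eigvals.sum()

# Cutoff (a): smallest k whose cumulative explained variance reaches 50%.
k_50 = int(np.searchsorted(np.cumsum(ratio), 0.50) + 1)
# Cutoff (b): drop components with eigenvalue below 0.2.
kept_b = int((eigvals >= 0.2).sum())

print("components for 50%% variance: %d" % k_50)
print("components with eigenvalue >= 0.2: %d" % kept_b)
```

Note that the two criteria can disagree substantially (here 2 vs. 6 components), which is presumably why the authors tuned both empirically.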
Slightly unrelated, but very important if you are doing research in intrusion detection: be very careful with the DARPA 1998 and KDD CUP 99 datasets. It has been known for a very long time that these datasets are inherently flawed, and that techniques cannot be accurately evaluated using them [1][2]. The NSL-KDD dataset [2] may provide a more reliable evaluation but is still not ideal. Furthermore, there is some interesting debate on the overwhelming use of machine learning and other anomaly detection techniques in intrusion detection research [3]. You might want to read the papers in the reference list for more details.
References: