Solved – Applying SMOTE and PCA to high-dimensional data gives low accuracy

k-nearest-neighbour, oversampling, pca, resampling

I have a high-dimensional dataset with around 2,300+ columns. The dataset has two class labels, one of which is extremely under-represented and occurs in less than 10% of the instances. I looked at various algorithms and found that for high-dimensional imbalanced data we can first apply a resampling algorithm such as SMOTE, then apply PCA, and then build a training model. At first I resampled the entire dataset and did cross-validation, and I achieved a recall for the minority label close to 84%. Then I found that we should only resample the training set, not the test set. So I applied the resampling algorithm to 90% of the data (the training set) and tested on the remaining 10%. In this case I also applied PCA as a filtering mechanism in WEKA and then tested the model. But that figure dropped significantly to 46% when I used kNN in WEKA, and other classifiers gave even worse results. Can anyone tell me whether I have followed the proper approach? If so, how can I improve the accuracy?
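For reference, here is a minimal sketch of the split-then-resample step described above, written against the WEKA Java API rather than the Explorer GUI. The file name data.arff, the last-column class attribute, the random seed, and the SMOTE defaults are all placeholder assumptions, not details taken from the question.

```java
import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.SMOTE;

public class SplitThenResample {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file (path and class position are assumptions).
        Instances data = new DataSource("data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 90/10 train/test split after shuffling with a fixed seed.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.9);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        // SMOTE is applied to the training portion only; the test portion
        // keeps the original (imbalanced) class distribution.
        SMOTE smote = new SMOTE();
        smote.setInputFormat(train);
        Instances trainBalanced = Filter.useFilter(train, smote);

        System.out.println("Train (after SMOTE): " + trainBalanced.numInstances());
        System.out.println("Test  (untouched):   " + test.numInstances());
    }
}
```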

Best Answer

You correctly adopted the train-and-test approach instead of cross-validation. In fact, you should test the model on non-resampled data, in order to preserve the class distribution of the population and obtain reliable performance estimates. To do this, you can use a FilteredClassifier wrapping SMOTE and/or undersampling, so that the resampling is fitted on the training data only. Still, accuracy (correct guesses over total instances) is not a reliable performance index on highly skewed datasets, and sensitivity alone does not express the trade-off in terms of reduced specificity. You should therefore report the AUC or the Youden index (J = TPR − FPR) to verify your results. Also, after applying PCA, you may try a decision tree or a rule-based algorithm. Hope this helps.
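To make the FilteredClassifier idea concrete, below is a sketch using the WEKA Java API: SMOTE and PCA are chained in a MultiFilter inside a FilteredClassifier, so both are fitted on the training data only and never touch the test set, and the evaluation reports AUC and the Youden index rather than plain accuracy. The data path, the 90/10 split, k = 5 for IBk, and the minority-class index are illustrative assumptions.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.instance.SMOTE;
import weka.filters.unsupervised.attribute.PrincipalComponents;

public class FilteredKnnExample {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        int trainSize = (int) Math.round(data.numInstances() * 0.9);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Chain SMOTE and PCA; the FilteredClassifier fits the chain on the
        // training data only, so the test set is never resampled.
        SMOTE smote = new SMOTE();
        PrincipalComponents pca = new PrincipalComponents();
        MultiFilter chain = new MultiFilter();
        chain.setFilters(new Filter[] { smote, pca });

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(chain);
        fc.setClassifier(new IBk(5));   // k = 5 is an arbitrary choice
        fc.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);

        int minority = 1;  // index of the rare class (assumption)
        double tpr = eval.truePositiveRate(minority);
        double fpr = eval.falsePositiveRate(minority);
        System.out.println("AUC:      " + eval.areaUnderROC(minority));
        System.out.println("Youden J: " + (tpr - fpr));
        System.out.println("Recall:   " + eval.recall(minority));
    }
}
```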
