Solved – Feature selection for very sparse data

classification, feature selection, high-dimensional, large data, sparse

I have a dataset of dimension 3,000 x 24,000 (approximately) with 6 class labels, but the data is very sparse: the number of non-zero values per sample ranges from roughly 10 to 300 out of 24,000. The non-zero values are real numbers. I need to perform feature selection/reduction before classification. Which technique would work best for such a dataset?

Best Answer

Feature selection and feature reduction are two very different strategies.

Generally speaking, non-parametric tests are probably your best option. I'd go for a Kruskal-Wallis rank sum test to find overall differential features, or a Mann-Whitney rank sum test for each class label (one versus the rest); see the sketch below.
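A minimal sketch of the Kruskal-Wallis route (my own illustration, not from the original answer), assuming the data is held in a SciPy sparse matrix `X` with a label vector `y` of 6 classes; the toy matrix here is just a stand-in for your real data:

```python
import numpy as np
from scipy import sparse
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Toy stand-in: ~120 non-zero values per row out of 24,000, 6 class labels
X = sparse.random(3000, 24000, density=0.005, format="csc", random_state=0)
y = rng.integers(0, 6, size=X.shape[0])

pvals = np.ones(X.shape[1])
for j in range(X.shape[1]):
    col = np.asarray(X[:, j].todense()).ravel()
    groups = [col[y == k] for k in np.unique(y)]
    # kruskal raises an error if every value is identical (e.g. an all-zero column), so guard it
    if col.max() > col.min():
        pvals[j] = kruskal(*groups).pvalue

# Keep, say, the 500 most differential features (the cutoff is up to you)
top_features = np.argsort(pvals)[:500]
```

With heavily zero-inflated columns the test is largely driven by the zero/non-zero split, which is often exactly what you want here.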

For feature reduction, I believe zero-inflated factor analysis (ZIFA) is a good solution (Pierson, Emma, and Christopher Yau. "ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis." Genome Biology 16.1 (2015): 241). However, more classical approaches such as factor analysis, NMF, or t-SNE may also work.
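A rough sketch of the classical routes (again my own illustration, not from the ZIFA paper), assuming the same sparse matrix `X` as above; note that NMF only applies if your values are non-negative, whereas TruncatedSVD handles arbitrary real values:

```python
from scipy import sparse
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.manifold import TSNE

X = sparse.random(3000, 24000, density=0.005, format="csr", random_state=0)

# Option 1: NMF, only valid here because this toy matrix is non-negative
X_nmf = NMF(n_components=50, init="nndsvd", max_iter=300).fit_transform(X)

# Option 2: TruncatedSVD works directly on sparse input of any sign
X_svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)

# t-SNE for a 2-D view, usually run on the reduced matrix rather than the raw 24,000 columns
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_svd)
```

Both NMF and TruncatedSVD accept sparse input directly, so you never have to densify the full 3,000 x 24,000 matrix.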

Now the data you describe looks a lot like a single-cell RNA-seq dataset. If that is the case, I encourage you to take a look at the following resource: https://hemberg-lab.github.io/scRNA.seq.course/biological-analysis.html#de-in-a-real-dataset