You may want to have a look at this Wikipedia article. It gives a nice overview of the most popular algorithms, although some more recent developments, like t-SNE, are missing.
In feature extraction, the fundamental idea is to look for an alternative representation in which the underlying structure of the data is more apparent. This is done by minimizing some error or energy functional, which yields the mapping.
Some approaches, like PCA, CCA, locally linear embedding (LLE), local linear projections (LLP), and others, are attractive because you end up solving a linear problem, for which there are efficient numerical methods. The nice thing about many of them (like LLE) is that they can capture non-linear mappings, yet you still only solve a linear system.
The idea is that you introduce a matrix whose elements encode relationships (some distance) between samples. You then find projections of the original data, according to that matrix, into a lower-dimensional space in such a way that the distortion is minimal. In this lower-dimensional space (usually two dimensions, so you can visualize it on your screen) the different patterns in your data are more evident. Usually, to describe those relationships, only the k closest samples to each data point are considered (which makes the relationship non-linear).
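If it helps to see that in code, here is a minimal sketch of the neighbourhood-graph idea using scikit-learn's LocallyLinearEmbedding; the dummy data and parameter values are placeholders, not recommendations.

```python
# Minimal sketch of embedding via k-nearest-neighbour relationships (LLE).
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.rand(500, 50)          # 500 samples, 50 original features (dummy data)

# Relationships are built from the k closest samples to each point,
# then a linear eigenproblem gives the low-dimensional coordinates.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
X_2d = lle.fit_transform(X)          # shape (500, 2), ready to scatter-plot
```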
Still, non-linear cases like t-SNE and others can also be solved efficiently by means of gradient-based optimization algorithms.
Which method is better depends on your data and on how much of it you have (computational cost). I am not aware of any objective criterion that would let you decide beforehand which one to use, so just try them out (unless you know your data follows some trivial linear pattern). For t-SNE, LLE, and other methods there are a number of implementations, often in Matlab, but also in other languages and packages.
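For instance, scikit-learn in Python ships a t-SNE implementation you can just try on your data; this is only a sketch, and the perplexity value is a placeholder you should tune.

```python
# Quick try-out of t-SNE, a non-linear method fit by gradient-based optimization.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 50)                    # dummy data, as above

tsne = TSNE(n_components=2, perplexity=30.0)   # perplexity should be tuned per dataset
X_2d = tsne.fit_transform(X)                   # 2-D coordinates for visualization
```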
An alternative to dimensionality reduction is to use the hashing trick to train a classifier on the entire feature set, without reducing it beforehand.* The Vowpal Wabbit project is an implementation of various learning algorithms that use the hashing trick to speed up computation:
VW is the essence of speed in machine learning, able to learn from terafeature datasets with ease. Via parallel learning, it can exceed the throughput of any single machine network interface when doing linear learning, a first amongst learning algorithms.
I don't know if VW will end up being right for you (if you have billions of features, a lot of your choices may end up being dictated by software engineering considerations), but hopefully it's a pointer in the right direction!
* Well, the hashing trick is technically a kind of dimensionality reduction, but only in a very silly sense.
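To make the footnote concrete, here is a toy sketch of the hashing trick; it uses scikit-learn's FeatureHasher rather than VW itself (VW does its own hashing internally), just to show the idea.

```python
# Toy illustration of the hashing trick.
from sklearn.feature_extraction import FeatureHasher

# Each sample is a dict of raw feature names -> values; there can be
# arbitrarily many distinct names across the dataset.
samples = [
    {"word_cat": 1, "word_dog": 2},
    {"word_dog": 1, "word_fish": 4},
]

# Every feature name is hashed into one of n_features buckets, so the model
# only ever sees a fixed-width vector regardless of the original vocabulary.
hasher = FeatureHasher(n_features=2**10, input_type="dict")
X = hasher.transform(samples)        # sparse matrix of shape (2, 1024)
```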
Best Answer
Feature selection and feature reduction are two very different strategies.
Generally speaking, non-parametric tests are probably your best option for feature selection; I would go towards a Kruskal-Wallis rank-sum test to get overall differential features, or a Mann-Whitney rank-sum test for each class label.
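As a rough sketch of what that looks like for a single feature, SciPy has both tests; the data below is made up, and in practice you would loop over your features and correct for multiple testing.

```python
# Non-parametric differential tests for one feature across class labels.
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(0)
# Expression of one feature in three class labels (dummy values)
group_a = rng.poisson(5, size=30)
group_b = rng.poisson(8, size=30)
group_c = rng.poisson(5, size=30)

# Kruskal-Wallis: is the feature differential across all classes at once?
h_stat, p_overall = kruskal(group_a, group_b, group_c)

# Mann-Whitney: compare one class against another (repeat per class label)
u_stat, p_pairwise = mannwhitneyu(group_a, group_b)
```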
For feature reduction, I believe zero-inflated factor analysis (ZIFA) is a good solution (Pierson, Emma, and Christopher Yau. "ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis." Genome Biology 16.1 (2015): 241.). However, more classical factor analysis, NMF, or t-SNE may also work.
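If you want a quick classical baseline before something like ZIFA, NMF on the raw counts is a one-liner in scikit-learn; the shapes and parameters below are placeholders.

```python
# NMF as a simple feature-reduction baseline for non-negative count data.
import numpy as np
from sklearn.decomposition import NMF

counts = np.random.poisson(2, size=(200, 1000)).astype(float)  # cells x genes, dummy

nmf = NMF(n_components=10, init="nndsvda", max_iter=500)
W = nmf.fit_transform(counts)    # low-dimensional representation per cell
H = nmf.components_              # gene loadings per component
```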
Now, the data you describe really looks a lot like a single-cell RNA-seq dataset. If this is the case, I encourage you to take a look at the following resource: https://hemberg-lab.github.io/scRNA.seq.course/biological-analysis.html#de-in-a-real-dataset