Solved – In natural language processing (NLP), how do you perform efficient dimensionality reduction?

machine-learning, natural-language, svm, text-mining

In NLP, the dimensionality of the feature space is usually very large. For example, in one project at hand, the number of features is almost 20 thousand (p = 20,000), and each feature is a 0-1 indicator of whether a specific word or bi-gram is present in a paper (one paper is a data point $x \in R^{p}$).

I know there is a lot of redundancy among the features, so dimensionality reduction is necessary. I have three questions:

1) I have 10 thousand data points (n = 10,000), and each data point has 10 thousand features (p = 10,000). What is an efficient way to conduct dimensionality reduction? The matrix $X \in R^{n \times p}$ is so huge that both PCA (or SVD; truncated SVD is OK, but I don't think SVD is a good way to reduce dimension for binary features) and Bag of Words (or K-means) are hard to run directly on $X$ (sure, it is sparse). I don't have a server, I just use my PC :-(.

2) How do you judge the similarity or distance between two data points? I think Euclidean distance may not work well for binary features. How about the L0 norm? What do you use?

3) If I want to use an SVM (or other kernel methods) for classification, which kernel should I use?

Many Thanks!

Best Answer

My suggestions to your questions:

1) If you have class labels, you can simply filter out the features (words) that score low on information-theoretic statistics (Information Gain, Chi-Squared). Some papers show that keeping only the top 1-10% of terms by Information Gain can even increase classification accuracy. If you do not have class labels, that is, in unsupervised tasks like text retrieval or clustering, the experience in Information Retrieval (Salton's Vector Space Model) is that terms occurring in between 1% and 10% of the documents are the most discriminative ones, that is, the ones that best help to separate the space of documents.
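As a concrete illustration, here is a rough sketch of both filters with scikit-learn, on stand-in data. The sparse 0-1 matrix `X`, the labels `y`, and the cutoff of 1,000 kept terms are assumptions for illustration, not taken from the question.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in data: a sparse 0-1 document-term matrix and binary class labels.
rng = np.random.default_rng(0)
X = (sparse_random(10_000, 20_000, density=0.001, format="csr", random_state=0) > 0).astype(np.int8)
y = rng.integers(0, 2, size=10_000)

# Supervised: keep the 1,000 terms with the highest chi-squared score w.r.t. y.
selector = SelectKBest(chi2, k=1_000)
X_supervised = selector.fit_transform(X, y)

# Unsupervised: keep terms whose document frequency is between 1% and 10%.
df = np.asarray(X.sum(axis=0)).ravel() / X.shape[0]
keep = np.where((df >= 0.01) & (df <= 0.10))[0]
X_unsupervised = X[:, keep]

print(X_supervised.shape, X_unsupervised.shape)
```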

2) The similarity between two documents is most often computed with cosine similarity, the scalar product of the two vectors divided by the product of their norms. This measure is independent of document length (that is, it allows comparing reports and tweets) and, for non-negative vectors, it lies in the [0, 1] range, so you can say e.g. "these documents are 75% similar".
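For example, a small sketch of cosine similarity on two sparse 0-1 document vectors; `doc_a` and `doc_b` are made-up rows of a document-term matrix:

```python
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

doc_a = csr_matrix([[1, 1, 0, 1, 0]])
doc_b = csr_matrix([[1, 0, 0, 1, 1]])

sim = cosine_similarity(doc_a, doc_b)[0, 0]   # dot(a, b) / (||a|| * ||b||)
print(f"documents are {sim:.0%} similar")      # prints "documents are 67% similar"
```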

3) The experience in text classification is that nonlinear kernels do not pay off: classes in text classification problems are most often nearly linearly separable. For instance, Mitchell notes the surprising effectiveness of Naive Bayes for text classification, and that method is linear (being naive, it assumes feature independence). My recommendation is to train a linear model if possible, and to also try random forests and deep learning, as they are currently showing excellent results in NLP and in Machine Learning in general.
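A minimal sketch of such a linear baseline with scikit-learn's LinearSVC; the synthetic `X` and `y` and the value `C=1.0` are placeholders, and in practice `X` would be your (possibly feature-selected) document-term matrix:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in sparse 0-1 data, as in the sketch above.
rng = np.random.default_rng(0)
X = (sparse_random(10_000, 20_000, density=0.001, format="csr", random_state=0) > 0).astype(np.int8)
y = rng.integers(0, 2, size=10_000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LinearSVC(C=1.0)  # linear kernel; handles large, sparse feature spaces well
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```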
