Solved – Feature Reduction for Text Classification

boosting, feature selection, machine learning, text mining

I have a dataset of ~43,000 8-K filings (ad-hoc announcements from companies listed on the stock exchange), and I am trying to use decision trees with gradient boosting to develop a text classification algorithm (3 classes: positive, negative, and neutral abnormal returns).

So far I have removed stopwords from my documents and stemmed the remaining words. Still, counting the unique 1-grams and 2-grams across all documents, I get about 130,000 unique 1-grams and 1,450,871 unique bigrams in total. I fear that without further feature reduction I will run into the curse of dimensionality.
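In case it helps, my preprocessing looks roughly like the following sketch (scikit-learn plus NLTK's Porter stemmer; the two toy documents stand in for the real filings):

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Two toy stand-ins for the ~43,000 filing texts.
documents = [
    "The company reported unexpectedly strong quarterly earnings.",
    "Quarterly earnings fell sharply after the product recall.",
]

stemmer = PorterStemmer()

def tokenize(text):
    # CountVectorizer lowercases before calling the tokenizer, so a simple
    # [a-z]+ pattern suffices; stopwords are dropped and the rest is stemmed.
    return [stemmer.stem(t) for t in re.findall(r"[a-z]+", text)
            if t not in ENGLISH_STOP_WORDS]

# ngram_range=(1, 2) builds both 1-grams and bigrams from the stemmed tokens.
vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None,
                             ngram_range=(1, 2))
vectorizer.fit(documents)
vocab = vectorizer.get_feature_names_out()
print(sum(" " not in term for term in vocab), "unique 1-grams")
print(sum(" " in term for term in vocab), "unique bigrams")
```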

Now, looking at the distribution of the number of words per document, I find that only 356 documents have more than 2,000 words, and that without these 356 documents my feature set would shrink to only 89,000 features.

So the question that comes to my mind is: would it be OK to treat these documents with more than 2,000 words as outliers and exclude them from my sample?
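For concreteness, the exclusion I have in mind is just:

```python
# The filter under consideration: drop filings longer than 2,000 words
# before building the vocabulary (`documents` as in the sketch above).
MAX_WORDS = 2000
kept = [doc for doc in documents if len(doc.split()) <= MAX_WORDS]
print(f"excluded {len(documents) - len(kept)} of {len(documents)} documents")
```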

From there, my plan is to apply TF-IDF and LSI, probably remove features with very low variance, and then feed the result to my algorithm.
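As a sketch, applied to the filtered `kept` list from above (min_df, the variance threshold, the number of LSI components, and the step order are all placeholders I would still have to tune on the full corpus):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

# The variance filter is applied before LSI, since TruncatedSVD outputs
# dense components that are rarely near-constant anyway.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
    ("low_var", VarianceThreshold(threshold=1e-5)),
    ("lsi", TruncatedSVD(n_components=300)),
])
X = pipeline.fit_transform(kept)  # dense (n_documents, 300) matrix for the booster
```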

Best Answer

A good measure of a term's or n-gram's "importance" is its TF-IDF (or one of its derivatives). There is another good explainer here: https://cran.r-project.org/web/packages/tidytext/vignettes/tf_idf.html

You can cull terms from your term space by picking a set of terms that you know are low-information (e.g., stop-words, or terms that appear with roughly equal frequency in most documents), taking the maximum TF-IDF value within that set as a "screening TF-IDF", and then removing all terms whose TF-IDF falls below this cutoff.
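A minimal sketch of that cutoff, assuming each term is scored by its maximum TF-IDF over the corpus (TF-IDF is defined per term and document, so some aggregation is needed; the screening terms below are hypothetical):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical terms you already believe carry little information.
screening_terms = ["company", "quarter", "report"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)  # `documents` as above

# Score each term by its maximum TF-IDF across all documents.
scores = np.asarray(tfidf.max(axis=0).todense()).ravel()

# The screening cutoff is the highest score any low-information term reaches.
vocab = vectorizer.vocabulary_
cutoff = max(scores[vocab[t]] for t in screening_terms if t in vocab)

keep = scores > cutoff
kept_terms = vectorizer.get_feature_names_out()[keep]
print(f"kept {keep.sum()} of {len(scores)} terms")
```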

Since text mining is generally an unsupervised process, you really don't have a way to "validate" your results, but you can review them and make corrections.

You can also seed it with documents that you believe belong to the same topic and see whether it recovers that clustering as well.
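For instance, sketched with k-means on TF-IDF vectors (the clusterer, k=3, and the seed indices are all assumptions, not part of any prescribed recipe):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Indices of documents you already believe share a topic (hypothetical).
seed_indices = [0, 5, 12]

tfidf = TfidfVectorizer().fit_transform(documents)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(tfidf)

# If all seeds end up with one label, the clustering agrees with your prior.
seed_labels = {labels[i] for i in seed_indices}
print("seeds share a cluster" if len(seed_labels) == 1
      else f"seeds split across clusters {sorted(seed_labels)}")
```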
