Solved – Text analysis: what comes after the term-document matrix?

machine learning, natural language, scikit learn, svm, text mining

I am trying to build predictive models from text data. I built a document-term matrix from the text (unigrams and bigrams) and trained several types of models on it (SVM, random forest, nearest neighbor, etc.). All of them gave decent results, but I want to improve on them. I tried tuning the models by changing their parameters, but that did not improve performance much. What are the possible next steps?

Best Answer

Natural language data is usually "noisy" because of problems like synonymy (different words have the same meaning) and polysemy (the same word has different meanings). You can try to "de-noise" the data by applying dimensionality reduction techniques.

One possibility is to apply SVD to decompose your document-term matrix as $D = U \Sigma V^T$. If you keep only the $k$ largest singular values and approximate $D$ as $D \approx U_k \Sigma_k V_k^T$, you get what is called "Latent Semantic Analysis" (LSA): it discovers "latent" concepts in the data set, and each document is then described by $k$ concept coordinates instead of the original term counts. You can apply this to your problem and see whether it gives a better solution.
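To make the truncation step concrete, here is a minimal NumPy sketch on a made-up 4×5 document-term matrix. The matrix values and the choice $k = 2$ are assumptions for illustration only; they do not come from the question.

```python
import numpy as np

# Toy document-term matrix (hypothetical): 4 documents (rows) x 5 terms (columns).
# Documents 0-1 use only the first two terms, documents 2-3 only the last three.
D = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

# Thin SVD: D = U @ diag(s) @ Vt, with singular values in decreasing order.
U, s, Vt = np.linalg.svd(D, full_matrices=False)

k = 2  # number of latent concepts to keep (arbitrary for this toy example)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation D_k = U_k Sigma_k V_k^T.
D_k = U_k @ np.diag(s_k) @ Vt_k

# Documents expressed in the k-dimensional concept space (rows of U_k Sigma_k).
doc_concepts = U_k * s_k

# Frobenius norm of the approximation error.
err = np.linalg.norm(D - D_k)
print(doc_concepts.shape, err)
```

The rows of `doc_concepts` are the reduced features you would feed to a classifier in place of the raw term counts.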

In scikit-learn, it would look something like this (code adapted from here):

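from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hash raw term counts, then re-weight them with tf-idf.
# alternate_sign=False stands in for non_negative=True, which recent scikit-learn releases no longer accept.
hasher = HashingVectorizer(n_features=n_features,
                           stop_words='english', alternate_sign=False,
                           norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())
X = vectorizer.fit_transform(dataset.data)

# LSA: keep k components, then L2-normalize each document vector.
svd = TruncatedSVD(n_components=k)
lsa = make_pipeline(svd, Normalizer(copy=False))

X = lsa.fit_transform(X)

# LSA features can be negative, which MultinomialNB rejects, so fit a linear model instead.
clf = LogisticRegression().fit(X, labels)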
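The snippet above relies on names defined elsewhere (`dataset`, `n_features`, `k`, `labels`). As a self-contained sketch of the same idea, the following chains tf-idf, truncated SVD, normalization and a classifier into one pipeline. The toy corpus, the labels, the choice `k = 2` and the use of `LogisticRegression` are all assumptions made for illustration; a linear model is used because LSA features can be negative.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus and labels, only here to make the sketch runnable.
docs = [
    "the car engine needs new oil",
    "an automobile with a powerful engine",
    "the driver parked the car in the garage",
    "automobile repair and engine maintenance",
    "the team won the football match",
    "a soccer player scored a late goal",
    "the football coach praised the team",
    "the goal keeper saved the match",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = cars, 1 = sport

k = 2  # number of latent components (assumed; normally tuned on held-out data)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),  # unigrams + bigrams
    TruncatedSVD(n_components=k, random_state=0),               # LSA projection
    Normalizer(copy=False),                                     # unit-length document vectors
    LogisticRegression(),                                       # accepts negative features
)
model.fit(docs, labels)

# Every step except the classifier: maps raw text to k-dimensional LSA features.
X_lsa = model[:-1].transform(docs)
pred = model.predict(["engine oil for my automobile"])
print(X_lsa.shape, pred)
```

Wrapping everything in a single pipeline means `k` can be searched together with the classifier's parameters, and the SVD is refit on each training fold rather than on the full data.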

Alternatively, you can apply a different decomposition technique called "Non-Negative Matrix Factorization" (NMF), which also gives an approximate factorization $D \approx U V^T$, but with all elements of $U$ and $V$ constrained to be non-negative.

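In scikit-learn (code from here):

from sklearn.decomposition import NMF

# `vectorizer` is a tf-idf vectorizer; NMF requires non-negative input.
tfidf = vectorizer.fit_transform(dataset.data)
nmf = NMF(n_components=k, random_state=1)
X = nmf.fit_transform(tfidf)

# X holds non-negative document-by-component weights; use it as classifier input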