Solved – Does it make sense to use Machine Learning techniques on a sparse feature matrix?

machine-learning, predictive-models, sparse, topic-models

I am trying to predict the sentiment (negative/neutral/positive) of a given text.
To do so I use an LDA (Latent Dirichlet Allocation) model, which is a topic-discovery model.

The LDA model works as follows: given a number N (the number of topics the model has to find) and a training corpus, it returns N weighted combinations of words; these combinations are the topics.

Once trained, and given a document, the LDA model returns a vector of topic loadings for that document. I apply this to my test set, which gives me my feature matrix X.
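Concretely, my pipeline looks roughly like this (a simplified sketch using scikit-learn's LatentDirichletAllocation as a stand-in for my actual LDA implementation; the texts and variable names are placeholders):

```python
# Simplified sketch of the pipeline, not my exact code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_texts = ["a placeholder training document", "another training document"]
test_texts = ["a placeholder test document"]

vectorizer = CountVectorizer()                       # bag-of-words counts
lda = LatentDirichletAllocation(n_components=200,    # N = 200 topics
                                random_state=0)

lda.fit(vectorizer.fit_transform(train_texts))       # learn the topics

# Each document becomes a length-200 vector of topic loadings.
X = lda.transform(vectorizer.transform(test_texts))  # shape: (n_docs, 200)
```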

I then use a machine learning classifier to train on these features and predict the sentiment (my label vector Y).

A problem with LDA is that, depending on your parameters, each document can have a very concentrated distribution across the topics (for instance, a document could be fully explained by 2 or 3 topics). This gives me a very sparse matrix.
To give you an example, I usually set N to 200, and each document is fully explained by 2 to 5 topics, so each document's topic vector in X has about 195 to 198 zeros out of 200 entries.

It seems counter-intuitive to me to use ML techniques on such a matrix. The first example that comes to mind is randomForest, which will be bagging my observations and randomly picking a few features at each split, so most of the features it samples will be zeros.

It appears that in my case, logistic regression does better than any of the more complex classification methods I tried (random forest, adaptive and gradient boosting, KNN, and SVM).

Is this a result that makes sense, or am I missing something that could make ML more useful in this case?
Many thanks for your advice!

Best Answer

Does it make sense to use Machine Learning techniques on a sparse feature matrix?

Yes. But you need to use the right methods. In fact, there is a whole subdiscipline that covers this - there is even a textbook called Statistical Learning with Sparsity.

Logistic regression is a good example of a method that can exploit sparsity. If you use an $l_1$ penalty (adding $\lambda \|w\|_1 = \lambda \sum_{i}|w_i|$ to the error term), you encourage $w$ to be sparse.
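For example, here is a minimal sketch with scikit-learn (random placeholder data standing in for your X and Y; C, the inverse of $\lambda$, is something you would tune):

```python
# Sketch: l1-penalized logistic regression in scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 200))          # stand-in for the topic features
y = rng.integers(0, 3, size=100)    # stand-in for neg/neutral/pos labels

# penalty="l1" adds the lambda * ||w||_1 term; C is the inverse of lambda,
# so a smaller C gives a stronger penalty and a sparser w.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

print("fraction of weights exactly zero:", (clf.coef_ == 0).mean())
```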

The usefulness of sparsity is reflected even in implementations: in scikit-learn, for example, linear methods can be used with both dense and sparse matrices, and they exploit the fact that the input is sparse (there are solvers written specifically for sparse matrices).
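For instance (a toy sketch with random placeholder data, not your actual setup), you can pass a scipy.sparse matrix straight to the estimator, so the features are never densified:

```python
# Toy sketch: scikit-learn linear models accept scipy.sparse input directly.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dense = rng.random((100, 200)) * (rng.random((100, 200)) < 0.02)
X_sparse = csr_matrix(dense)          # ~98% zeros, stored compactly
y = rng.integers(0, 3, size=100)

LogisticRegression(max_iter=1000).fit(X_sparse, y)  # no densification needed
```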

If you want to read more about the theory, I encourage you to read the textbook I mentioned, especially the Lasso part (for linear models the theory is easier to understand than for logistic regression, and there are even proofs that, under suitable conditions, if a sparse solution of the OLS problem exists, then the Lasso or related methods will recover it).
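For reference, the Lasso is just the least-squares objective with exactly this $l_1$ penalty (a standard formulation, not a quote from the book):

$$\hat{w} = \underset{w}{\arg\min}\; \frac{1}{2n}\|y - Xw\|_2^2 + \lambda \|w\|_1$$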

If you want to dive into applications, you can look at what the SLwS authors have written: they have two other textbooks, one of which contains many R examples. Or you can check scikit-learn's documentation for examples if you prefer Python.