Solved – Does it make sense to use Machine Learning techniques on a sparse feature matrix?

machine-learning, predictive-models, sparse, topic-models

I am trying to predict the sentiment (negative/neutral/positive) of a given text.
To do so I use an LDA (Latent Dirichlet Allocation) model, which is a topic-discovery model.

The LDA model works as follows: given a number N (the number of topics the model has to find) and a training corpus, it returns N weighted combinations of words; these combinations are the topics.

Once trained, and given a document, the LDA model returns a vector of topic loadings for that document. I apply this to my test set, which gives me my feature matrix X.
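Concretely, my pipeline looks roughly like this (a simplified sketch using scikit-learn's LatentDirichletAllocation as a stand-in for my actual LDA implementation; the texts and variable names are placeholders):

```python
# Simplified sketch of the pipeline, not my exact code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_texts = ["a placeholder training document", "another training document"]
test_texts = ["a placeholder test document"]

vectorizer = CountVectorizer()                       # bag-of-words counts
lda = LatentDirichletAllocation(n_components=200,    # N = 200 topics
                                random_state=0)

lda.fit(vectorizer.fit_transform(train_texts))       # learn the topics

# Each document becomes a length-200 vector of topic loadings.
X = lda.transform(vectorizer.transform(test_texts))  # shape: (n_docs, 200)
```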

I then use a machine learning classifier to train on these features and predict the sentiment (my label vector Y).

A problem with LDA is that, depending on your parameters, each document can have a very concentrated distribution across the topics (for instance, a document could be fully explained by 2 or 3 topics). This gives me a very sparse matrix.
To give you an example, I usually set N to 200, and each document is fully explained by 2 to 5 topics, so each document's topic vector in X has about 195 to 198 zeros out of 200 entries.

It seems counter-intuitive to me to use ML techniques on such a matrix. The first example that comes to mind is randomForest, which will be bagging my observations and randomly picking a few features at each split, so most of the features it samples will be zeros.

It appears that in my case, logistic regression does better than any of the more complex classification methods I tried (random forest, adaptive and gradient boosting, KNN, and SVM).

Is this a result that makes sense, or am I missing something that could make ML more useful in this case?
Many thanks for your advice!

Best Answer

Does it make sense to use Machine Learning techniques on a sparse feature matrix?

Yes. But you need to use the right methods. In fact, there is a whole subdiscipline that covers this - there is even a textbook called Statistical Learning with Sparsity.

Logistic regression is a good example of a method that can exploit sparsity. If you use an $l_1$ penalty (adding $\lambda \|w\|_1 = \lambda \sum_{i}|w_i|$ to the error term), you encourage $w$ to be sparse.
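For example, here is a minimal sketch with scikit-learn (random placeholder data standing in for your X and Y; C, the inverse of $\lambda$, is something you would tune):

```python
# Sketch: l1-penalized logistic regression in scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 200))          # stand-in for the topic features
y = rng.integers(0, 3, size=100)    # stand-in for neg/neutral/pos labels

# penalty="l1" adds the lambda * ||w||_1 term; C is the inverse of lambda,
# so a smaller C gives a stronger penalty and a sparser w.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

print("fraction of weights exactly zero:", (clf.coef_ == 0).mean())
```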

The usefulness of sparsity is reflected even in implementations: in scikit-learn, for example, linear methods can be used with both dense and sparse matrices, and they exploit the fact that the input is sparse (there are solvers written specifically for sparse matrices).
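For instance (a toy sketch with random placeholder data, not your actual setup), you can pass a scipy.sparse matrix straight to the estimator, so the features are never densified:

```python
# Toy sketch: scikit-learn linear models accept scipy.sparse input directly.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dense = rng.random((100, 200)) * (rng.random((100, 200)) < 0.02)
X_sparse = csr_matrix(dense)          # ~98% zeros, stored compactly
y = rng.integers(0, 3, size=100)

LogisticRegression(max_iter=1000).fit(X_sparse, y)  # no densification needed
```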

If you want to read more about the theory, I encourage you to read the textbook I mentioned, especially the Lasso part (for linear models the theory is easier to understand than for logistic regression, and there are even proofs that, under suitable conditions, if a sparse solution of the OLS problem exists, then the Lasso or related methods will recover it).
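For reference, the Lasso is just the least-squares objective with exactly this $l_1$ penalty (a standard formulation, not a quote from the book):

$$\hat{w} = \underset{w}{\arg\min}\; \frac{1}{2n}\|y - Xw\|_2^2 + \lambda \|w\|_1$$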

If you want to dive into applications, you can look at what the SLwS authors have written: they have two other textbooks, one of which contains many R examples. Or you can check scikit-learn's documentation for examples if you prefer Python.