Solved – Why is it a big problem to have sparsity issues in Natural Language Processing (NLP)?

deep learning, machine learning, natural language, neural networks

I was watching a Stanford lecture on deep learning for NLP, and the lecturer pointed out a bunch of issues with co-occurrence matrices (which I assume also apply to n-gram models and any other count-based method like that). The slide that listed the issues was:

Problems with simple co-occurrence vectors

  • Increase in size with vocabulary
  • Very high dimensional: require a lot of storage
  • Subsequent classification models have sparsity issues

$\implies$ Models are less robust

(Richard Socher, 4/1/15)

I think the first two issues sort of make sense: it seems that a co-occurrence matrix can be as big as $\binom{V}{k}$, where $V$ is the vocabulary size and $k$ is the window size. That expression is usually really big, I believe (plug in Stirling's approximation or something; it's essentially exponential unless $k$ is really small). If it's really big, it's hard to store, update, compute predictions with, etc.
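
For concreteness, here's a rough sketch of how I understand the co-occurrence counting to work (a toy example I made up, not from the lecture); the matrix is $V \times V$, so it grows with the vocabulary, and most entries stay zero even on a tiny corpus:

```python
# Toy sketch of windowed co-occurrence counting (my own illustration,
# not from the lecture).
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2  # symmetric context window (size chosen arbitrarily here)

vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

counts = np.zeros((V, V))
for i, word in enumerate(corpus):
    lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            counts[index[word], index[corpus[j]]] += 1

print("shape:", counts.shape)                      # (V, V), grows with the vocabulary
print("zero entries:", f"{(counts == 0).mean():.0%}")
```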

The issue I don't understand is the sparsity. Why is it a problem if the matrix is sparse? I thought sparsity was usually a nice feature in ML (like what the $\ell_1$ norm encourages, etc.). I also don't understand what robustness has to do with sparsity. I usually think of robustness as something like having low variance (not changing too much if the data changes). How does sparsity affect this overfitting/underfitting issue? I don't see the relation.

The only issue I could possibly see is if the model used for prediction somehow involves ratios of counts (e.g., a PGM): if the denominator is zero, then numerical errors happen (like with n-grams), and if the numerator is zero, then saying something is impossible is usually a bit of a stretch… is that what the issue is?
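
For instance, with the maximum-likelihood bigram estimate (my own notation, just to make the point concrete):

$$P(w_i \mid w_{i-1}) = \frac{\operatorname{count}(w_{i-1}, w_i)}{\operatorname{count}(w_{i-1})}$$

an unseen pair $(w_{i-1}, w_i)$ makes the whole sentence probability zero, and an unseen history word $w_{i-1}$ leaves the estimate undefined (division by zero).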

His use of the word "classifier" made me suspect that these models might use logistic regression or something… but then he goes on to talk about SVD. How does SVD help with respect to robustness? (It's clear it makes things smaller.)

In short, what is the issue with sparsity, and why does it affect robustness and generalization so much?

Best Answer

There are two kinds of sparsity: data sparsity and model sparsity. Model sparsity can be good because it means that there is a concise explanation for the effect that we are modeling. Data sparsity is usually bad because it means that we are missing information that might be important. That slide is talking about data sparsity.
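
A rough numerical contrast, with made-up numbers just to illustrate the two notions:

```python
# Made-up numbers to contrast the two kinds of sparsity.
import numpy as np

rng = np.random.default_rng(0)

# "Model sparsity": e.g. an L1-regularized model keeps few nonzero weights;
# the zeros are a deliberate, concise description of the effect being modeled.
weights = np.zeros(10_000)
nonzero = rng.choice(10_000, size=50, replace=False)
weights[nonzero] = rng.normal(size=50)
print(f"nonzero weights: {(weights != 0).mean():.2%}")

# "Data sparsity": a 50k-word vocabulary has 2.5 billion possible word pairs,
# but even a large corpus observes only a sliver of them; every zero cell is
# a pair the model has simply never seen, i.e. missing evidence, not a choice.
V, observed_pairs = 50_000, 5_000_000
print(f"observed word-pair cells: {observed_pairs / (V * V):.3%}")
```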

Using the SVD to compress the matrix gives us dense low-dimensional vectors for each word. This is a way of sharing information between similar words to help deal with the data sparsity. Another thing that people sometimes do to deal with sparsity is to use sub-word units instead of words, or to use stemming or lemmatization to reduce the vocabulary size.
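
A minimal sketch of the truncated-SVD step (toy random counts stand in for a real co-occurrence matrix, and the rank $k$ is a hyperparameter picked arbitrarily here):

```python
# Compress a V x V count matrix into dense k-dimensional word vectors
# via truncated SVD (toy stand-in matrix so the snippet runs on its own).
import numpy as np

rng = np.random.default_rng(0)
V = 7
counts = rng.poisson(0.3, size=(V, V)).astype(float)  # mostly zeros, like real counts

U, S, Vt = np.linalg.svd(counts, full_matrices=False)

k = 2                               # keep only the top-k singular directions
word_vectors = U[:, :k] * S[:k]     # one dense k-dimensional vector per word

print(word_vectors.shape)           # (V, k): small and dense instead of V x V and sparse
```

Because similar words end up with similar vectors, evidence observed for one word can inform predictions about its neighbours, which is the information sharing described above.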

Data sparsity is more of an issue in NLP than in other machine learning fields because we typically deal with large vocabularies where it is impossible to have enough data to actually observe examples of all the things that people can say. There will be many real phrases that we will just never see in the training data.
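
As a tiny illustration of how quickly unseen phrases show up (a made-up pair of sentences):

```python
# Count how many bigrams in a held-out sentence never appear in "training".
train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()

train_bigrams = set(zip(train, train[1:]))
test_bigrams = set(zip(test, test[1:]))

unseen = test_bigrams - train_bigrams
print(f"unseen test bigrams: {len(unseen)} of {len(test_bigrams)}")
# Perfectly ordinary phrases get zero counts, and the problem only grows
# with a realistic vocabulary.
```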
