Solved – When does data sparsity become a problem?


I read this in a paper:

Developing an approach toward virtual synthesis parameter screening introduces two primary computational challenges: data sparsity and data scarcity.

…Such canonical representations, however, are necessarily sparse as there are many more actions that one might perform during the synthesis of a material, compared to the number of actions actually used.

However, I had previously learnt that data sparsity is a desirable property.

When the data points are sparsely distributed, there will be a clear pattern that a machine can learn.

My understanding of data sparsity becoming a challenge is that if our data matrix looks like this:

$$(x_1, x_2, \cdots, x_N) = \begin{pmatrix}
1 &&&\\
& 1 \\
&& \ddots\\
&&&1
\end{pmatrix}
$$
Clearly, there will be no pattern that a machine could learn because there is data scarcity in each dimension. So, I think the problem of data sparsity is still a problem of data scarcity.
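As a small numeric sketch of this point (purely illustrative, with N = 5):

```python
# A small numeric check of the argument above: with an identity-like data
# matrix, every feature is non-zero in exactly one of the N samples, so any
# per-feature estimate rests on a single observation.
import numpy as np

N = 5
X = np.eye(N)                                # columns are the data points x_1, ..., x_N
samples_per_feature = (X != 0).sum(axis=1)   # non-zero samples for each feature (row)
print(samples_per_feature)                   # [1 1 1 1 1] -> data scarcity in every dimension
```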

Questions:

  1. Is my understanding correct?

  2. Is there a way to tell when data sparsity poses a challenge?

  3. When data sparsity becomes a problem, will reasonable dimensionality reduction help?

Best Answer

Data sparsity is mostly a computational problem. Think of a recommender system that recommends thousands of products to hundreds of thousands of users. If you stored the user-product interactions in a dense matrix, it would be a huge amount of data consisting mostly of zeros, since most users interact with only a small subset of products. It is wiser to store such data in a sparse matrix representation that records only the non-zero entries. This helps with storage, but you still need algorithms that can operate on such sparsely represented data, and in many cases these are not available off-the-shelf.
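As a rough sketch of the storage argument (the user and product counts below are hypothetical, and the non-zero interactions are just random placeholders), scipy's sparse CSR format can be compared against the equivalent dense matrix:

```python
# A minimal sketch of the storage point, assuming a hypothetical
# user-product interaction matrix; the sizes and density are made up.
import numpy as np
from scipy import sparse

n_users, n_items = 100_000, 5_000          # hypothetical catalogue size
density = 0.001                            # ~5 interactions per user on average

# Sparse representation: only the non-zero entries are stored.
interactions = sparse.random(n_users, n_items, density=density,
                             format="csr", random_state=0)

sparse_bytes = (interactions.data.nbytes
                + interactions.indices.nbytes
                + interactions.indptr.nbytes)
dense_bytes = n_users * n_items * np.dtype(np.float64).itemsize

print(f"non-zeros stored: {interactions.nnz:,}")
print(f"CSR storage:   {sparse_bytes / 1e6:.1f} MB")
print(f"dense storage: {dense_bytes / 1e6:.1f} MB")   # ~4 GB of mostly zeros
```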

Moreover, since most of the values are zeros, each sample carries relatively little information, and if you train a model on such data you end up with a huge number of parameters that are useless most of the time. Models with a huge number of parameters are problematic in their own right, because estimating those parameters requires huge amounts of data and optimization algorithms that work well in such settings. So in domains where sparse data is common, such as language data, where the distribution of words is very skewed, we usually apply some kind of dimensionality reduction, like embeddings (this is what word2vec does).
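As a sketch of that idea on text data, using truncated SVD (LSA) from scikit-learn rather than word2vec itself, and a tiny made-up corpus: a sparse bag-of-words matrix gets projected onto a few dense latent dimensions.

```python
# A sketch of the dimensionality-reduction idea on sparse text data.
# This uses truncated SVD (LSA), not word2vec; the corpus is a placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "sparse matrices store only non zero entries",
    "recommender systems deal with sparse user item data",
    "word embeddings give dense low dimensional vectors",
    "dense vectors are easier for downstream models",
]

# Bag-of-words: a sparse matrix with one column per vocabulary word.
X = CountVectorizer().fit_transform(corpus)
print(X.shape, f"{X.nnz} non-zeros")       # high-dimensional, mostly zeros

# Project onto a few latent dimensions: every document becomes a dense vector.
svd = TruncatedSVD(n_components=2, random_state=0)
X_dense = svd.fit_transform(X)
print(X_dense.shape)                       # (4, 2) dense representation
```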
