Machine Learning – Understanding Input Space and Importance of Fraction of Input Space Covered by Training Set

feature-selection, machine-learning, overfitting, terminology

In this paper giving an overview of machine learning, the author writes:

Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of $100$ and a huge training set of a trillion examples, the latter covers only a fraction of about $10^{-18}$ of the input space. This is what makes machine learning both necessary and hard.

I don't understand what he means by the input space in this context. I know he's referring to a vector space, and I think he might mean a vector space that somehow represents all of the parameters. But I don't understand how this relates to the training examples, or where he gets the fraction of $10^{-18}$ from $100$ features and a trillion training examples – is there some way of calculating this, or is it some kind of estimate?

I have the impression that the fraction of the input space covered by the training set relates to the idea that one cannot solve a system of equations with more unknowns (columns) than equations (rows), and I can vaguely see how in machine learning there could be problems if one had too few rows (examples) relative to columns (features). But I'm not sure, and I wish I understood why this mattered; I don't even know where to look to find this information.

References

Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10), 78-87.

Best Answer

The "input space" is just all the possible inputs.

In this example, he is assuming that each of the $100$ dimensions is binary, which means there are $2^{100} \approx 1.3 \times 10^{30}$ possible inputs. A trillion examples ($10^{12}$) would therefore cover only $10^{12}/2^{100} \approx 10^{-18}$ of that input space.
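If it helps to see the arithmetic, here is a minimal sketch (in Python, not part of the original answer) that reproduces the figure under the binary-feature assumption above:

```python
# Back-of-the-envelope check of the 10^-18 figure, assuming each of
# the 100 features is binary, so the input space has 2^100 points.
input_space_size = 2 ** 100       # number of distinct possible inputs
training_set_size = 10 ** 12      # "a trillion examples"

fraction_covered = training_set_size / input_space_size
print(f"{fraction_covered:.2e}")  # about 7.9e-19, i.e. roughly 10^-18
```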

As for why it's important to cover a large part of the input space, the short answer is that you need to observe the behavior of whatever you are trying to learn across enough of the input space to build a good learner.

As a toy example, suppose you have two points, $(0, 5)$ and $(10, 15)$. The best you can do to fit them is the line $y = x + 5$, but you can't be confident in that fit because you have no idea how the function behaves between $x=0$ and $x=10$ (much less outside that interval). This example might seem silly and obvious, but what he is talking about is essentially a generalization of it. The problem is even worse than that, though, because the amount of data you need grows exponentially with the number of dimensions. This is the curse of dimensionality.
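To make the toy example concrete, here is a small illustrative sketch (Python with NumPy, not from the original answer) showing that a very different function can pass through the same two training points, so the data alone cannot distinguish it from $y = x + 5$ in between:

```python
import numpy as np

# The two training points from the toy example.
x_train = np.array([0.0, 10.0])
y_train = np.array([5.0, 15.0])

def line(x):
    """The straight-line fit y = x + 5."""
    return x + 5

def wiggly(x):
    """A very different function that also passes through (0, 5) and
    (10, 15), because sin(pi * x / 10) vanishes at x = 0 and x = 10."""
    return x + 5 + 20 * np.sin(np.pi * x / 10)

# Both candidates fit the training data (up to floating-point error)...
assert np.allclose(line(x_train), y_train)
assert np.allclose(wiggly(x_train), y_train)

# ...but disagree badly in the region where we have no observations.
x_mid = np.array([2.5, 5.0, 7.5])
print(line(x_mid))    # [ 7.5 10.  12.5]
print(wiggly(x_mid))  # roughly [21.6 30.  26.6]
```

The two training points give no way to prefer one of these candidates over the other; only more data in the uncovered region could do that, and in high dimensions the uncovered region is almost everything.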
