Mathematical Statement of Theorem vs. Visual Statement of Theorem

combinatorics, geometry, machine learning, probability

I was reading about "Cover's Theorem" here: https://web.mit.edu/course/other/i2course/www/vision_and_learning/perceptron_notes.pdf (page 3).

1) Visual Statement: https://en.wikipedia.org/wiki/Cover%27s_theorem

Cover's Theorem is a famous theorem in Computer Science – it is said to be the basis of kernel-based algorithms in Machine Learning. The theorem states that a dataset that is "not linearly separable" in a lower dimension, if projected into a higher dimension using the right kernel, might become "linearly separable". This is a desirable outcome, since in Machine Learning it is generally easier to build better-performing models on linearly separable data than on non-linearly separable data.

As such, the visual statement of this theorem is easy to understand:

[figure: the visual illustration from the Wikipedia article linked above]
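
To make this concrete, here is a minimal numpy sketch (an illustration on made-up data, not taken from either reference) of the classic example: two concentric classes that no straight line separates in 2-D become separable by a plane after the lift (x1, x2) -> (x1, x2, x1² + x2²):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes in 2-D that no straight line separates:
# class 0 fills a small disc, class 1 fills a ring around it.
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([rng.uniform(0, 1, 100),    # class 0: radius in [0, 1]
                    rng.uniform(2, 3, 100)])   # class 1: radius in [2, 3]
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.repeat([0, 1], 100)

# Lift to 3-D with the feature map (x1, x2) -> (x1, x2, x1^2 + x2^2).
Z = np.column_stack([X, (X ** 2).sum(axis=1)])

# In the lifted space the horizontal plane z = 2 separates the classes:
# class 0 has z = r^2 <= 1, class 1 has z = r^2 >= 4.
pred = (Z[:, 2] > 2).astype(int)
print("accuracy:", (pred == y).mean())         # 1.0
```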

2) Mathematical Statement:

From the first reference (page 3), here is the mathematical statement of Cover's Theorem:

$$C(P,N)=2\sum_{k=0}^{N-1}\binom{P-1}{k}$$ where $C(P,N)$ is the number of linearly separable dichotomies of $P$ points in $N$ dimensions.

From a Machine Learning perspective, the theorem starts by assuming you have a dataset with "P" rows (points) and "N" columns (dimensions). A graph from the reference shows that, for a fixed number of dimensions, the fraction of "linearly separable dichotomies" decreases as the number of points increases:

[figure from the reference: fraction of linearly separable dichotomies (y-axis) versus the number of points P, for fixed dimension N]

Can someone please help me understand:

  • What is meant by "distinct dichotomies"? (I think "distinct dichotomies" refers to the number of ways a hyperplane can be placed through these points)

  • How does Cover's Theorem show that data points in a higher dimension are more likely to be linearly separable than in a lower dimension? How do we determine whether, in a given dimension, a "dichotomy" is "linearly separable"? (i.e. how did they calculate the y-axis of the graph?)

Thanks!

Best Answer

A dichotomy, in this context, is a way to label all of the points with one of two labels, say red and blue. Two dichotomies are distinct if at least one point is labeled differently, so there are $2^P$ dichotomies of $P$ points. We say a dichotomy is realized by a hyperplane if there is a hyperplane with all the red points on one side and all the blue points on the other. Cover's counting theorem tells you the number of dichotomies that are realized by hyperplanes.
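
For small $P$ and $N$ you can see this count directly by brute force. Here is a sketch in Python (using numpy and scipy, and assuming the setting of the counting formula below: hyperplanes through the origin and points in general position); it enumerates all $2^P$ labelings of a random point set and tests each one for separability with a small feasibility LP:

```python
import itertools
from math import comb

import numpy as np
from scipy.optimize import linprog

def realized_by_hyperplane(X, y):
    """True if some w satisfies y_i * (w . x_i) >= 1 for every point,
    i.e. the labeling y is realized by a hyperplane through the origin."""
    P, N = X.shape
    A_ub = -(y[:, None] * X)                  # encodes -y_i * (x_i . w) <= -1
    res = linprog(c=np.zeros(N), A_ub=A_ub, b_ub=-np.ones(P),
                  bounds=[(None, None)] * N, method="highs")
    return res.status == 0                    # status 0 = feasible

def cover_count(P, N):
    """The counting formula proved below: 2 * sum_{k=0}^{N-1} binom(P-1, k)."""
    return 2 * sum(comb(P - 1, k) for k in range(N))

rng = np.random.default_rng(0)
P, N = 6, 3
X = rng.standard_normal((P, N))               # general position, almost surely
brute = sum(realized_by_hyperplane(X, np.array(labels))
            for labels in itertools.product((-1, 1), repeat=P))
print(brute, cover_count(P, N))               # both print 32
```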

Once you have proved the formula $C(P,N)=2\sum_{k=0}^{N-1}\binom{P-1}{k}$, it easily follows* from properties of the binomial coefficients that...

  • $C(P,N)=2^P$ whenever $P\le N$. This means that every dichotomy can be realized by a hyperplane when $P\le N$, i.e. when there aren't too many data points.

  • $C(P,N)<2^P$ whenever $P>N$. This means that when there are too many data points, there will exist some dichotomies that are not realized by hyperplanes. Therefore, if $P>N$ and you are trying to solve a classification problem using a hyperplane, there is a possibility you will not succeed. (Both facts are checked numerically in the sketch below.)
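
Both facts are easy to verify numerically. A minimal Python sketch (standard library only; `cover_count` is a name introduced here for the formula above):

```python
from math import comb

def cover_count(P, N):
    """Cover's counting function: C(P, N) = 2 * sum_{k=0}^{N-1} binom(P-1, k)."""
    return 2 * sum(comb(P - 1, k) for k in range(N))

for P in range(1, 11):
    for N in range(1, 11):
        if P <= N:
            assert cover_count(P, N) == 2 ** P   # every dichotomy is realized
        else:
            assert cover_count(P, N) < 2 ** P    # some dichotomies are missed
print("both facts hold for all P, N up to 10")
```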

Finally, consider the ratio $$ \frac{C(P,N)}{2^P}=\frac{\text{# dichotomies realized by hyperplanes}}{\text{total # of dichotomies}} $$ This is always between $0$ and $1$. The closer it is to $1$, the more likely a random dichotomy is to be one of those realizable by a hyperplane. This is the $y$-axis of the graph. Cover's theorem implies that if $P$ remains fixed and $N$ increases, the above ratio increases or stays the same. This is what is meant by the statement "In higher dimensions, a dichotomy is more likely to be realized by a hyperplane."
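
To reproduce the $y$-axis numerically, here is a short Python sketch (standard library only; `separable_fraction` is a name introduced here). Note that the ratio equals exactly $1/2$ at $P=2N$, the well-known perceptron-capacity point, and decays toward $0$ as $P$ grows beyond that:

```python
from math import comb

def separable_fraction(P, N):
    """C(P, N) / 2^P: the fraction of all dichotomies realized by hyperplanes."""
    return 2 * sum(comb(P - 1, k) for k in range(N)) / 2 ** P

for N in (2, 5, 25):
    print(N, [round(separable_fraction(P, N), 4) for P in (N, 2 * N, 4 * N)])
# Each row prints 1.0 at P = N, exactly 0.5 at P = 2N, and a small value at P = 4N.
```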


* Let me prove these two facts. First, suppose $N=P$. In this case, $\sum_{k=0}^{N-1}\binom{P-1}{k}$ is simply the sum of all binomial coefficients in row number $P-1$ of Pascal's triangle, which is well known to be $2^{P-1}$. This can be proven with the binomial theorem: $$ 2^{P-1}=(1+1)^{P-1}=\sum_{k=0}^{P-1}\binom{P-1}k 1^{k}1^{P-1-k}=\sum_{k=0}^{P-1}\binom{P-1}k $$ Multiplying by $2$, you get $C(P,N)=2\cdot 2^{P-1}=2^P$.

If instead $N<P$, then the sum $\sum_{k=0}^{N-1}\binom{P-1}{k}$ only extends up to $k=N-1$, so it misses some of the nonzero binomial coefficients in row $P-1$. Therefore, the sum must be less than $2^{P-1}$, and hence $C(P,N)<2^P$. Finally, if $N>P$, then the sum extends past $k=P-1$, containing coefficients like $\binom{P-1}{P},\binom{P-1}{P+1}$, etc. These are all zero, by definition, so the sum is still $2^{P-1}$ and $C(P,N)=2^P$ in this case as well.
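
As a concrete instance of the second fact, take $P=4$ points in $N=2$ dimensions: $$ C(4,2)=2\left[\binom{3}{0}+\binom{3}{1}\right]=2(1+3)=8<2^4=16, $$ so $8$ of the $16$ possible dichotomies of $4$ points are not realized by any hyperplane.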

When $N=5$ and $P=30$, $$ \frac{C(P,N)}{2^P}=\frac{2\sum_{k=0}^{5-1}\binom{30-1}{k}}{2^{30}}=\frac{\binom{29}0+\binom{29}1+\binom{29}2+\binom{29}3+\binom{29}4}{2^{29}}\approx 5.2\times 10^{-5} $$
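
A quick numerical check of this value:

```python
from math import comb
print(sum(comb(29, k) for k in range(5)) / 2 ** 29)  # ≈ 5.186e-05
```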