Reference: Machine Learning Foundations lecture by Yishay Mansour. I am struggling to understand the explanation when they arrive at $\chi_S(e_j)$.
Could someone provide some details? Thanks.
Best Answer
The VC dimension is the largest value of $N$ for which there exists a data set of $N$ points that can be shattered by the concept class. Shattering a data set means that every possible labeling of the data set can be realized by picking the right function from the concept class.
In this case, the feature space is the set of all possible length-$n$ bit vectors. If we want to say that the VC dimension is at least $N$, we need to exhibit a data set of size $N$ that can be shattered by $C_6 = \{\chi_S: S \subseteq \{1, \dots, n\} \}$. The example above says to pick the data set to be the $n$ unit vector columns of the $n \times n$ identity matrix. Now suppose we have some arbitrary labeling of these vectors $b_1, \dots, b_n$. The claim is that we can pick an element of $C_6$ that achieves this labeling.
Indeed, choose $\chi_S \in C_6$ with $S = \{i: b_i = 1 \}$. Then for any $e_j$ with label $b_j$, our chosen $\chi_S$ will classify this vector as 1 if and only if $j \in S$. But by construction, $j \in S$ if and only if $b_j = 1$. So the classifier recovers the true labels for all $n$ points in the data set. We have exhibited a size $n$ data set that can be shattered by the concept class, so we know the VC dimension is at least $n$.
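If it helps to see the argument mechanically, here is a small Python sketch (my own illustration, not from the lecture) that enumerates every possible labeling of the $n$ unit vectors and verifies that the $\chi_S$ constructed above realizes it, using the bitwise sum mod 2 definition of $\chi_S$:

```python
# Check that the n unit vectors are shattered by the parity functions
# chi_S(x) = sum_{i in S} x_i mod 2, taking S = {i : b_i = 1} for each labeling b.
from itertools import product

n = 4
# unit vectors e_1, ..., e_n (1-based indexing to match the math)
e = {j: tuple(1 if i == j else 0 for i in range(1, n + 1)) for j in range(1, n + 1)}

def chi(S, x):
    return sum(x[i - 1] for i in S) % 2      # bitwise sum mod 2 over the indices in S

for b in product([0, 1], repeat=n):          # every possible labeling b_1, ..., b_n
    S = {i for i in range(1, n + 1) if b[i - 1] == 1}   # the construction from the proof
    assert all(chi(S, e[j]) == b[j - 1] for j in range(1, n + 1))

print("all", 2 ** n, "labelings of the", n, "unit vectors are realized")
```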
As a concrete example, let $n = 3$, so the data set is $e_1 = 100, e_2 = 010, e_3 = 001$. Now dream up any labeling you want of these three vectors, say $(b_1, b_2, b_3) = (0, 1, 1)$. Then the set $S$ is $\{2, 3\}$. Now we check that the classifier $\chi_S$ recovers the labels $b_i$ for each $i = 1,2,3$. Indeed,
$$
\chi_S(e_1) = 0 \\
\chi_S(e_2) = 1 \\
\chi_S(e_3) = 1
$$
which you can check using either definition of $\chi_S$: the general one (the bitwise sum, i.e. sum mod 2, of the coordinates of $x$ indexed by $S$) or the special case for elementary unit vectors, $\chi_S(e_j) = 1$ if and only if $j \in S$.
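For completeness, here is the $n = 3$ example checked both ways (again, just my own illustration):

```python
# Compare the general definition (bitwise sum mod 2) with the unit-vector
# special case chi_S(e_j) = 1 iff j in S, for the labeling (0, 1, 1).
e1, e2, e3 = (1, 0, 0), (0, 1, 0), (0, 0, 1)
S = {2, 3}                                   # S = {i : b_i = 1}

def chi_general(S, x):
    return sum(x[i - 1] for i in S) % 2      # indices in S are 1-based

def chi_unit(S, j):
    return 1 if j in S else 0                # special case for e_j

for j, x in enumerate((e1, e2, e3), start=1):
    print(j, chi_general(S, x), chi_unit(S, j))   # both print 0, 1, 1 in turn
```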
Distortion of statistical properties can occur when you "fit to the data", so I think of this more in terms of specifying the number of parameters that I can afford to estimate and that I want to devote to the portion of the model pertaining to that one predictor. I use regression splines, place knots where $X$ is dense, and specify the number of knots (or the number of parameters, and back-calculate the number of knots) by asking (1) what the sample size and the distribution of $Y$ support and (2) what the signal:noise ratio in this dataset is. When $n$ increases or the signal:noise ratio increases, I can use more knots. There is no set formula for the number of parameters that should be fitted, although in a minority of situations you can use cross-validation or AIC to determine it. As you mentioned, shrinkage is a great alternative, because you can start out with many parameters and then shrink the coefficients down to what cross-validation or effective AIC dictate.
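As a rough illustration of that workflow (my own sketch, not part of the answer), here is how one might compare regression-spline fits with different numbers of parameters by AIC in Python, assuming statsmodels and patsy are available and using a toy dataset; patsy's `cr()` basis places knots at quantiles of $x$, i.e. where $x$ is dense:

```python
# Compare natural cubic regression splines with different numbers of
# parameters (df) by AIC on a toy dataset with a moderate signal:noise ratio.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = np.sin(x) + rng.normal(scale=0.5, size=n)
data = pd.DataFrame({"x": x, "y": y})

# cr(x, df=k) is a natural cubic regression spline with knots at quantiles of x;
# k is the number of parameters spent on this predictor.
for k in (3, 4, 5, 6, 8):
    fit = smf.ols(f"y ~ cr(x, df={k})", data=data).fit()
    print(f"df = {k}: AIC = {fit.aic:.1f}")
```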
The book by Prince, recommended by @seanv507, is indeed an excellent book on the topic (+1). And while it is not really compact, it has a very logical structure, a generous refresher chapter on probability, and a strong focus on machine learning in the computer vision context.
However, I'd like to recommend another excellent book on the topic (also freely downloadable) which, while more focused on computer vision per se, IMHO contains enough machine learning material to qualify as an answer. The book I'm talking about is "Computer Vision: Algorithms and Applications" by Richard Szeliski (Microsoft Research). One of the advantages of this book over the one by Prince is... narrower margins, which allow for a larger font size and thus better readability. Also, the book by Szeliski is very practical. Since the two books share significant content but have somewhat different foci, in my opinion they complement each other very well. All this, among other advantages, makes it very easy for me to highly recommend Szeliski's book.