Connection between Vapnik-Chervonenkis dimension and regularization

machine-learning, regularization

I think this is my first research-level question, so I'm going to ask it here first before going to Math Overflow.

In most tutorial papers, such as Burges's, the Vapnik-Chervonenkis dimension is introduced as a way of bounding the expected error of a learning machine by its training error plus a slightly more complicated confidence term of order $\sqrt{h\log(N)/N}$:

$\text{Error} \leq \text{training error} + \sqrt{\frac{h\left(\log(2N/h)+1\right)-\log(\eta/4)}{N}}$
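
For concreteness, here is a minimal Python sketch (the function name is my own, not from any paper) that evaluates the confidence term on the right-hand side, where $N$ is the sample size, $h$ the VC dimension, and $\eta$ the probability with which the bound is allowed to fail:

```python
import math

def vc_complexity_term(h, N, eta):
    """Confidence term of the VC bound:
    sqrt((h*(ln(2N/h) + 1) - ln(eta/4)) / N)."""
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

# The penalty shrinks with more data and grows with the VC dimension.
for h in (3, 10, 100):
    print(h, vc_complexity_term(h, N=10_000, eta=0.05))
```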

The VC dimension $h$ is the largest number of points in the feature space that the classifier can shatter: 'shatter' means separating the two classes with zero error for every possible assignment of labels to those points. An example is the perceptron in the plane, which can realize any labelling of three points in general position but not the XOR labelling of four points, so its VC dimension is 3. The bound is highly conservative, but it lets one select the learning machine with the smallest VC dimension $h$ that still gives low training error.
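
As a sanity check on the shattering idea, the following brute-force sketch (the helper names `separable` and `shattered` are my own) tests whether every labelling of a small 2-D point set is linearly separable, using a zero-objective linear program from SciPy for each labelling:

```python
from itertools import product
from scipy.optimize import linprog

def separable(points, labels):
    """Is there (w, b) with labels[i] * (w . x_i + b) >= 1 for all i?
    Checked as a feasibility LP with a zero objective."""
    A = [[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)]
    b = [-1.0] * len(points)
    res = linprog(c=[0, 0, 0], A_ub=A, b_ub=b, bounds=[(None, None)] * 3)
    return res.status == 0  # 0 = feasible solution found, 2 = infeasible

def shattered(points):
    """True if every +/-1 labelling of the points is linearly separable."""
    return all(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

# Three points in general position are shattered; the XOR square is not.
print(shattered([(0, 0), (1, 0), (0, 1)]))          # True
print(shattered([(0, 0), (1, 1), (1, 0), (0, 1)]))  # False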

Regularization, on the other hand, penalizes some property of the solution produced by the learning machine. An example is $L_2$ (Tikhonov) regularization:

$\text{Cost}(\mathbf{w}) = \|\mathbf{y} - \mathbf{w}^H\mathbf{X}\|^2 + \lambda\|\mathbf{w}\|^2$

The first term is the squared error, where $\mathbf{y}$ is a vector of desired outputs and $\mathbf{X}$ is a matrix whose columns hold the observed features for the corresponding entries of $\mathbf{y}$. The weight vector $\mathbf{w}$ represents a linear map from the features to an estimate of the desired output (the nonlinear case is normally handled by kernel methods). The important part is the penalty parameter $\lambda$, which sets how much you pay for a 'big' weight vector (in the $L_2$ sense).
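
If a concrete form of the penalized solution helps, here is a small ridge-regression sketch in the more common row-per-sample convention $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2$ (a transposed version of the cost above, with illustrative names of my own); the minimizer has the closed form $\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}\mathbf{X}^T\mathbf{y}$:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X w||^2 + lam * ||w||^2 via the normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Larger lambda shrinks the fitted weights toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=50)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.linalg.norm(ridge_fit(X, y, lam)))
```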

I have done a couple of searches and not seen anything characterizing the connection between these two methods, other than one informal ranking by conservativeness.

Does anyone know what regularization does to the VC dimension? Intuitively I feel that something like $L_1$ regularization would immediately start cutting it down, but I can't visualize what would happen. If there are relevant papers, I wouldn't mind being referred to those.

Best Answer

The book Learning from Data by Yaser S. Abu-Mostafa et al. gives a nice introductory treatment of the VC dimension and then tackles regularization in chapter 4. There is a section on page 137 that discusses the connection between the two concepts. Essentially, it says that with regularization (the augmented error) the VC dimension does not change, so it proposes using the effective number of parameters as a surrogate for the VC dimension.
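
To make the 'effective number of parameters' concrete for the ridge cost above (this is the standard ridge effective-degrees-of-freedom formula, not a quote from the book), one can compute $d_{\text{eff}}(\lambda) = \operatorname{tr}\big(\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}\mathbf{X}^T\big) = \sum_i s_i^2/(s_i^2+\lambda)$, which falls smoothly from the raw parameter count as $\lambda$ grows:

```python
import numpy as np

def effective_dof(X, lam):
    """Effective number of parameters of ridge regression:
    trace of the hat matrix X (X^T X + lam I)^{-1} X^T,
    i.e. sum_i s_i^2 / (s_i^2 + lam) over singular values s_i of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + lam)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
for lam in (0.0, 1.0, 10.0, 1000.0):
    print(lam, effective_dof(X, lam))  # 10.0 at lam=0, shrinking toward 0
```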
