Solved – Why do we use the Unregularized Cost to plot a Learning Curve

Tags: bias, regularization, supervised learning, variance

I'm taking Andrew Ng's Machine Learning Course.
In the section on determining the variance/bias of your model, he suggests the following.

For a given regularization parameter and set of features:

- Create differently sized subsets of your training data.
- For each training-data subset, using regularization:
  - train a model,
  - then calculate the error on that subset and the error on the validation set.

Once that's done, plot the unregularized cost for both the training and validation sets as a function of the training-subset size (see the sketch below).
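A minimal NumPy sketch of this procedure, for concreteness (my own illustration, not the course's Octave code; `train_linear_reg` and `cost` are hypothetical helpers, the fit uses a closed-form ridge solve in place of fmincg, and `X` is assumed to already contain a bias column):

```python
import numpy as np

def train_linear_reg(X, y, lam):
    """Fit theta with L2 regularization via the normal equation
    (a stand-in for the course's fmincg); theta_0 is not penalized."""
    n = X.shape[1]
    L = lam * np.eye(n)
    L[0, 0] = 0.0                        # do not regularize the intercept
    # assumes X.T @ X + L is invertible (true for lam > 0)
    return np.linalg.solve(X.T @ X + L, X.T @ y)

def cost(X, y, theta):
    """Unregularized squared-error cost: J = (1 / 2m) * sum(residuals^2)."""
    m = X.shape[0]
    err = X @ theta - y
    return (err @ err) / (2 * m)

def learning_curve(X_train, y_train, X_val, y_val, lam):
    """Train on growing subsets of the training set (with regularization),
    then record the unregularized error on that subset and on the full
    validation set."""
    train_err, val_err = [], []
    for i in range(1, X_train.shape[0] + 1):
        theta = train_linear_reg(X_train[:i], y_train[:i], lam)
        train_err.append(cost(X_train[:i], y_train[:i], theta))
        val_err.append(cost(X_val, y_val, theta))
    return np.array(train_err), np.array(val_err)
```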

The idea is that if the training error and validation error remain far apart even at large training-set sizes, the model has high variance; if the two errors converge quickly to a high value, the model has high bias.

My question is: since the models we're testing were fitted using a regularization constant, why aren't we plotting the regularized cost as a function of the training-data size?

Best Answer

Background: I believe you are referring to this lecture dealing with Regularization and Bias/Variance in the context of polynomial regression.

The optimizer fmincg produces the estimated coefficients (or parameters) $\hat \theta$ by minimizing, via conjugate-gradient steps, the objective function:

$$J(\theta)=\frac{1}{2m}\left(\displaystyle\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2\right)+\frac{\lambda}{2m}\left(\sum_{j=1}^n\theta_j^2\right)$$

where $m$ is the number of examples (or subjects/observations), each denoted $x^{(i)}$; $n$ the number of features, indexed by $j$; and $\lambda$ the regularization parameter. Differentiating the equation above shows that the gradient carries a regularization term $\frac{\lambda}{m}\theta_j$ for every parameter except $\theta_0$.
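As a concrete sketch of this objective and its gradient (again a NumPy illustration under the assumption that `X` carries a leading bias column, not the course's Octave implementation):

```python
import numpy as np

def regularized_cost_grad(theta, X, y, lam):
    """Objective J(theta) from above and its gradient.
    theta_0 (the intercept) is excluded from the penalty."""
    m = X.shape[0]
    err = X @ theta - y                       # h_theta(x^(i)) - y^(i)
    reg = theta.copy()
    reg[0] = 0.0                              # no penalty on theta_0
    J = (err @ err) / (2 * m) + (lam / (2 * m)) * (reg @ reg)
    grad = (X.T @ err) / m + (lam / m) * reg  # note the lambda/m * theta_j term
    return J, grad
```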


The issue at hand is to select the $\lambda$ value that prevents overfitting the data while also avoiding high bias.

To this end, a vector of candidate lambda values, which in the course exercise is $[0,0.001,0.003,0.01,0.03,0.1,0.3,1,3,10]$, is supplied, and the coefficients $\Theta$ are optimized once for each value. Throughout these iterations over the different lambda values, all other factors (basically the model matrix) remain constant.

Consequently, the differences between the resulting $\Theta$ vectors are a direct consequence of the different regularization parameters $\lambda$ chosen.

At each iteration, the parameters that minimize the objective function are computed on the entire training set, in order to eventually plot a validation curve of squared errors against the lambda values. This differs from the learning curves (cost vs. number of examples) described in the question, where the training set is segmented into subsets of increasing size.

At this point, we have obtained optimal estimated parameters on the training set, and their differences are directly related to the regularization parameter.

Therefore, it makes sense to now set aside the regularization and see what the cost (or error) would be when each different set of $\Theta$'s is applied to both the training and cross-validation sets, looking for a minimum in the cross-validation errors. We are not trying to optimize the parameters $\theta$ any further; we are just checking how the choice of different $\lambda$ values (with their associated coefficients) is reflected in the loss (or cost) function: the errors initially drop, but eventually, once overfitting has been taken care of, they rise again because of bias.
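A sketch of that loop, reusing the hypothetical `train_linear_reg` and `cost` helpers from the first sketch above: the fit uses $\lambda$, the evaluation does not.

```python
# Candidate values from the course exercise
lambdas = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]

def validation_curve(X_train, y_train, X_val, y_val):
    """For each lambda: fit theta WITH regularization on the full training
    set, then score both sets WITHOUT regularization."""
    train_err, val_err = [], []
    for lam in lambdas:
        theta = train_linear_reg(X_train, y_train, lam)  # regularized fit
        train_err.append(cost(X_train, y_train, theta))  # unregularized J_train
        val_err.append(cost(X_val, y_val, theta))        # unregularized J_cv
    return train_err, val_err  # choose lambda at the minimum of val_err
```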

This explains why the training error (cost or loss function) is defined as:

$$J_{train}=\frac{1}{2m}\left[\displaystyle\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2\right]$$

and accordingly, the CV error as:

$$J_{cv}=\frac{1}{2m_{cv}}\left[\displaystyle\sum_{i=1}^{m_{cv}}(h_{\theta}(x^{(i)}_{cv})-y^{(i)}_{cv})^2\right]$$
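As a quick sanity check on these definitions, here is a self-contained toy computation (the numbers are invented purely for illustration):

```python
import numpy as np

# h_theta(x) = theta_0 + theta_1 * x, with theta = [1, 2]
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])         # bias column plus one feature
y = np.array([3.0, 4.0])
theta = np.array([1.0, 2.0])

m = X.shape[0]
residuals = X @ theta - y          # predictions [3, 5] -> residuals [0, 1]
J = (residuals @ residuals) / (2 * m)
print(J)                           # (0^2 + 1^2) / (2 * 2) = 0.25
```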

Basically, these are just the squared errors. The confusion stems from the similarity between the function minimized to choose the optimal parameters (the objective function) and the cost or loss function meant to assess the errors.