Background: I believe you are referring to this lecture dealing with Regularization and Bias/Variance in the context of polynomial regression.
The algorithm fmincg
produces optimized estimates of the $\hat \theta$ coefficients (or parameters), based on a gradient-based (conjugate gradient) minimization of the objective function:
$$J(\theta)=\frac{1}{2m}\left(\displaystyle\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2\right)+\frac{\lambda}{2m}\left(\sum_{j=1}^n\theta_j^2\right)$$
where $m$ is the number of examples (or subjects/observations), each denoted as $x^{(i)}$; $n$ the number of features (indexed by $j$); and $\lambda$ the regularization parameter. The optimization gradients include the regularization term $\frac{\lambda}{m}\theta_j$ for each parameter other than $\theta_0$, obtained by differentiating the equation above.
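As a minimal sketch of these two expressions (in Python/NumPy rather than the course's Octave; the function name and variable layout are my own), the regularized cost and its gradient could be computed like this:

```python
import numpy as np

def linear_reg_cost_grad(theta, X, y, lam):
    """Regularized linear-regression cost J(theta) and its gradient.

    theta : (n+1,) parameters, theta[0] is the intercept
    X     : (m, n+1) design matrix with a leading column of ones
    y     : (m,) targets
    lam   : regularization parameter lambda
    """
    m = len(y)
    err = X @ theta - y                      # h_theta(x) - y for every example
    J = (err @ err) / (2 * m) + lam / (2 * m) * np.sum(theta[1:] ** 2)
    grad = (X.T @ err) / m                   # unregularized gradient
    grad[1:] += (lam / m) * theta[1:]        # theta_0 is not regularized
    return J, grad
```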
The issue at hand is to select the optimal $\lambda$ value to prevent overfitting the data while also avoiding high bias.
To this end, a vector of candidate lambda values is supplied, which in the course exercise is $[0,0.001,0.003,0.01,0.03,0.1,0.3,1,3,10]$, and the coefficients $\Theta$ are optimized for each value. In this process, and for each iteration through the different lambda values, all other factors (basically the model matrix) remain constant.
Consequently, the differences between the resulting $\Theta$ parameter vectors are due solely to the different regularization parameters $\lambda$ chosen.
At each iteration, and using the same optimization routine, the parameters that minimize the objective function are computed on the entire training set, with the goal of eventually plotting a validation curve of squared errors over the lambda values. This is different from the learning curves (cost vs. number of examples), where the training set is segmented into increasing numbers of observations, as explained here.
At this point, we have obtained optimal estimated parameters on the training set, and their differences are directly related to the regularization parameter.
Therefore, it makes sense to now set the regularization aside and see what the cost (or error) would be when each different set of $\Theta$'s is applied to both the training and cross-validation sets, looking for a minimum in the cross-validation errors. We are not trying to optimize the parameters $\theta$ any further; we are just checking how the choice of different $\lambda$ values (with their associated coefficients) is reflected in the loss (or cost) function: at first the errors drop as overfitting is curbed, but eventually, once overfitting has been taken care of, they progressively increase again due to bias.
This explains why the training error (cost or loss function) is defined as:
$$J_{train}=\frac{1}{2m}\left[\displaystyle\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2\right]$$
and accordingly, the CV error as:
$$J_{cv}=\frac{1}{2m_{cv}}\left[\displaystyle\sum_{i=1}^{m_{cv}}(h_{\theta}(x^{(i)}_{cv})-y^{(i)}_{cv})^2\right]$$
where $m_{cv}$ is the number of cross-validation examples.
Basically, the (mean) squared errors. In a way, the confusion stems from the similarity between the function minimized to choose the optimal parameters (the objective function) and the cost or loss function used to assess the errors.
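Putting the procedure together, here is a minimal sketch (Python/NumPy, with scipy's general-purpose minimizer standing in for fmincg and synthetic data standing in for the exercise's; it reuses the `linear_reg_cost_grad` helper sketched above):

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic stand-ins for the exercise's training and cross-validation design matrices
rng = np.random.default_rng(0)
X_train = np.c_[np.ones(12), rng.normal(size=(12, 8))]
y_train = rng.normal(size=12)
X_cv = np.c_[np.ones(21), rng.normal(size=(21, 8))]
y_cv = rng.normal(size=21)

def unregularized_error(theta, X, y):
    """Squared-error cost with the regularization term dropped."""
    err = X @ theta - y
    return (err @ err) / (2 * len(y))

lambdas = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
train_err, cv_err = [], []
for lam in lambdas:
    # Fit theta on the full training set with this lambda (regularized objective)
    res = minimize(lambda t: linear_reg_cost_grad(t, X_train, y_train, lam),
                   x0=np.zeros(X_train.shape[1]), jac=True, method="CG")
    theta = res.x
    # Evaluate both sets with lambda set aside (errors only, no penalty term)
    train_err.append(unregularized_error(theta, X_train, y_train))
    cv_err.append(unregularized_error(theta, X_cv, y_cv))

best_lambda = lambdas[int(np.argmin(cv_err))]  # minimum of the validation curve
```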
First of all, while I'd usually agree that hold-out does not make efficient use of the available samples and that the typical set-up is prone to the same mistakes as cross validation, repeated set validation / repeated hold-out is a resampling technique that I think is well suited for your learning curve calculation. This way, you can reflect what is going on inside the data set you have, covering the variation due to different splits (but not fully the variation you'd have to expect with a new data set of size $n$). You also get the fine-grained control over training set size of hold-out together with resampling properties close to those of k-fold.
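For illustration, a minimal sketch of such a repeated hold-out learning curve (hypothetical Python/NumPy; the `fit` and `error` callables are placeholders for whatever model and figure of merit you use):

```python
import numpy as np

def repeated_holdout_learning_curve(X, y, train_sizes, n_repeats, fit, error, seed=0):
    """For each training-set size, repeatedly draw a random train/validation split,
    fit on the training part, and record the validation error of the fitted model."""
    rng = np.random.default_rng(seed)
    curve = []
    for n_train in train_sizes:
        errs = []
        for _ in range(n_repeats):
            idx = rng.permutation(len(y))
            train_idx, val_idx = idx[:n_train], idx[n_train:]
            model = fit(X[train_idx], y[train_idx])
            errs.append(error(model, X[val_idx], y[val_idx]))
        # the spread over repeats reflects the variation due to different splits
        curve.append((n_train, np.mean(errs), np.std(errs)))
    return curve
```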
However, here's a caveat for an informed decision: if you are dealing with small-sample-size classification, the usual figures of merit (sensitivity, specificity, overall accuracy, etc.) are subject to very high testing variance. This testing variance is limited by the number of actual independent cases in the denominator of the calculation, and it can easily be so large that you cannot sensibly use such measured learning curves (keep in mind, "use" typically means extrapolation).
See our paper for details: Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33. DOI: 10.1016/j.aca.2012.11.007; accepted manuscript on arXiv: 1211.1323.
Best Answer
The training error refers to the error found when testing an algorithm on the data it was trained with. The training error curve slopes up because, with very few training samples relative to the number of features, the model can overfit the training data and create a near-perfect fit. As the number of training examples increases, the model can no longer fit the data perfectly.
Suppose you are classifying email as spam or not spam and you have only 4 features. Let's say the features are whether the email contains the words buy, deal, offer, or try. There are $2^4 = 16$ possible combinations of feature vectors. Now if you have 10 training examples, it is feasible that they all have a unique combination of feature values. So when a model is trained on this data, it is possible to fit the training examples exactly, and the training error will be 0. If you use 100 training examples instead, this is no longer possible: some of the training examples will share the same feature vector, and if they have different classifications, the training error will increase.
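To make this concrete, here is a small simulation with made-up data (hypothetical Python/NumPy, not from the lecture): it computes the lowest training error any classifier on those 4 binary features could possibly achieve, which is forced above zero as soon as examples sharing a feature vector carry conflicting labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def lowest_training_error(n_examples):
    """Lowest training error any classifier on these 4 binary features could achieve:
    whenever two examples share a feature vector but carry different labels,
    at least one of them must be misclassified."""
    X = rng.integers(0, 2, size=(n_examples, 4))   # features: buy, deal, offer, try
    y = rng.integers(0, 2, size=n_examples)        # spam / not spam (random labels here)
    forced_errors = 0
    for pattern in np.unique(X, axis=0):
        labels = y[(X == pattern).all(axis=1)]
        forced_errors += min(labels.sum(), len(labels) - labels.sum())
    return forced_errors / n_examples

# With few examples many feature vectors are distinct, so the floor can be 0;
# with 100 examples drawn over only 16 possible vectors, conflicts are unavoidable.
print(lowest_training_error(10), lowest_training_error(100))
```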