Solved – What’s in a name: hyperparameters

definition, hyperparameter, parameterization, terminology

So in a normal distribution, we have two parameters: mean $\mu$ and variance $\sigma^2$. In the book Pattern Recognition and Machine Learning, there suddenly appears a hyperparameter $\lambda$ in the regularization terms of the error function.

What are hyperparameters? Why are they named as such? And how are they intuitively different from parameters in general?

Best Answer

The term hyperparameter is pretty vague. I will use it to refer to a parameter that sits at a higher level of the hierarchy than the other parameters. For example, consider a regression model with a known variance (1 in this case):

$$ y \sim N(X\beta,I) $$

and then a prior on the parameters, e.g.

$$ \beta \sim N(0,\lambda I) $$

Here $\lambda$ determines the distribution of $\beta$ and $\beta$ determines the distribution for $y$. When I want to just refer to $\beta$ I may call it the parameter and when I want to just refer to $\lambda$, I may call it the hyperparameter.
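As a rough sketch of this two-level structure, the snippet below draws $\beta$ from its prior for a fixed $\lambda$, then draws $y$ given $\beta$. It also computes the posterior mean of $\beta$, which under these assumptions (unit noise variance, prior variance $\lambda$) is $(X^\top X + I/\lambda)^{-1} X^\top y$. The design matrix, sample sizes and function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_coef = 50, 3
X = rng.normal(size=(n_obs, n_coef))  # hypothetical design matrix


def sample_hierarchy(lam):
    """Draw beta ~ N(0, lam * I), then y ~ N(X beta, I)."""
    beta = rng.normal(scale=np.sqrt(lam), size=n_coef)
    y = X @ beta + rng.normal(size=n_obs)
    return beta, y


def posterior_mean(y, lam):
    """Posterior mean of beta given y for a fixed hyperparameter lam.

    With unit noise variance this is (X'X + I / lam)^{-1} X'y,
    i.e. a ridge-type estimate with penalty 1 / lam.
    """
    A = X.T @ X + np.eye(n_coef) / lam
    return np.linalg.solve(A, X.T @ y)


beta_true, y = sample_hierarchy(lam=4.0)
for lam in (0.01, 1.0, 100.0):
    est = posterior_mean(y, lam)
    print(f"lambda={lam:>6}: norm of posterior mean = {np.linalg.norm(est):.3f}")
```

A small $\lambda$ (a tight prior around zero) shrinks the estimate of $\beta$ strongly, while a large $\lambda$ leaves it close to the least-squares fit. This is one way to see why the $\lambda$ in a regularization term plays the role of a hyperparameter.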

The naming gets more complicated when parameters show up on multiple levels or when there are more hierarchical levels (and you don't want to use the term hyperhyperparameters). It is best if authors specify exactly what they mean when they use the term hyperparameter, or parameter for that matter.

Related Solutions

Solved – difference between on-line learning, incremental learning and sequential learning

Regarding the first two, online and incremental learning, some authors underline that

Online learning has to discard a sample after learning from it (no memory) and, unlike incremental learning, is not allowed to store it. Source: this paper

Solved – What are Regularities and Regularization

Regularization is employed in almost all machine learning algorithms where we're trying to learn from finite samples of training data.

I'll attempt to indirectly answer your specific questions by explaining the genesis of the concept of regularization. The full theory is much more detailed and this explanation should not be interpreted as complete; it's intended simply to point you in the right direction for further exploration. Since your primary objective is to get an intuitive understanding of regularization, I've summarized and heavily simplified the following explanation from Chapter 7 of "Neural Networks and Learning Machines", 3rd edition by Simon Haykin (and omitted several details while doing so).

Let's revisit the supervised learning problem with independent variables $x_i$ and dependent variable $y_i$: we are trying to find a function $f$ that will be able to "map" the input $X$ to an output $Y$.

To take this further, let's look at Hadamard's terminology of a "well-posed" problem - a problem is well-posed if it satisfies the following three conditions:

Existence: for every input $x_i$, an output $y_i$ exists.
Uniqueness: for a pair of inputs $x_1$ and $x_2$, $f(x_1) = f(x_2)$ if and only if $x_1 = x_2$.
Continuity: the mapping $f$ is continuous (the stability criterion).

For supervised learning, these conditions may be violated since:

A distinct output may not exist for a given input.
There may not be enough information in the training samples to construct a unique input-output mapping (since running the learning algorithm on different training samples results in different mapping functions).
Noise in the data adds uncertainty to the reconstruction process, which may affect its stability.

For solving such "ill-posed" problems, Tikhonov proposed a regularization method to stabilize the solution by including a non-negative functional that embeds prior information about the solution.
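To make the idea concrete, here is a minimal sketch of an ill-posed linear problem and its Tikhonov-regularized version. The toy data, the choice $\lambda = 0.1$ and the squared-norm penalty $\lambda \lVert\theta\rVert^2$ are assumptions for illustration; Tikhonov's functional is more general than this.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ill-posed toy problem: 5 training samples, 10 unknown coefficients.
X = rng.normal(size=(5, 10))
y = rng.normal(size=5)

gram = X.T @ X
print("rank of X'X:", np.linalg.matrix_rank(gram), "of", gram.shape[0])

# Two different coefficient vectors that fit the training data equally well.
theta_a = np.linalg.pinv(X) @ y
null_dir = np.linalg.svd(X)[2][-1]  # a direction with X @ null_dir close to 0
theta_b = theta_a + 3.0 * null_dir
print("same fit on training data:", np.allclose(X @ theta_a, X @ theta_b))


def tikhonov(X, y, lam):
    """Minimize ||X theta - y||^2 + lam * ||theta||^2 (unique for lam > 0)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)


theta_reg = tikhonov(X, y, lam=0.1)
print("regularized solution norm:", np.linalg.norm(theta_reg))
```

Without the penalty, $X^\top X$ is singular and infinitely many coefficient vectors fit the samples. Adding $\lambda I$ with $\lambda > 0$ makes the system invertible, so a single, stable solution is selected.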

The most common form of prior information involves the assumption that the input-output mapping function is smooth - i.e. similar inputs produce similar outputs.

Tikhonov's regularization theory adds a regularization term to the cost function (the loss function to be minimized); the term includes the regularization parameter $\lambda$ and the assumed form of the mapping $f$. The value of $\lambda$ is chosen between 0 and $\infty$. A value of 0 implies the solution is determined completely from the training samples, whereas a value of $\infty$ implies the training examples are unreliable and the solution is determined by the prior assumption alone.

So the regularization parameter $\lambda$ is selected and optimized to achieve the desired balance between model bias and model variance by incorporating the right amount of prior information into it.

Some examples of such regularized cost functions are:

Linear Regression:

$ J(\theta) = \frac 1m \sum_{i=1}^m [ h_\theta(x^i) - y^i]^2 + \frac \lambda{2m} \sum_{j=1}^n \theta_j^2 $

Logistic Regression:

$ J(\theta) = \frac 1m \sum_{i=1}^m [ -y^i \log(h_\theta(x^i)) - (1-y^i) \log(1 - h_\theta(x^i))] + \frac \lambda{2m} \sum_{j=1}^n \theta_j^2 $

Here $\theta$ are the coefficients we've identified for $x$, and $h_\theta(x)$ is the estimate of $y$.
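The linear-regression cost above can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: $h_\theta(x) = \theta^\top x$, the first column of `X` is a constant so that $\theta_0$ is an unpenalized intercept (the sum over $j$ starts at 1), and the polynomial toy data is invented.

```python
import numpy as np


def cost(theta, X, y, lam):
    """J(theta) for linear regression; theta[0] is an unpenalized intercept."""
    m = len(y)
    residual = X @ theta - y
    data_term = residual @ residual / m
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return data_term + penalty


def fit(X, y, lam):
    """Closed-form minimizer of J: solve (X'X + (lam / 2) D) theta = X'y."""
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0  # do not penalize the intercept
    return np.linalg.solve(X.T @ X + 0.5 * lam * D, X.T @ y)


rng = np.random.default_rng(2)
m = 30
x = rng.uniform(-1, 1, size=m)
X = np.column_stack([x ** p for p in range(6)])  # degree-5 polynomial features
y = np.sin(3 * x) + 0.2 * rng.normal(size=m)

for lam in (0.0, 0.1, 10.0):
    theta = fit(X, y, lam)
    print(
        f"lambda={lam:>5}: training error={cost(theta, X, y, 0.0):.4f}, "
        f"sum of squared coefficients={np.sum(theta[1:] ** 2):.3f}"
    )
```

The factor `lam / 2` comes from setting the gradient of $J$ to zero: the data term contributes $\frac{2}{m} X^\top (X\theta - y)$ and the penalty contributes $\frac{\lambda}{m}\theta_j$ for $j \ge 1$. Larger $\lambda$ trades a higher training error for smaller coefficients.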

The second summation term in each example is the regularization term. Since this term is always non-negative and grows with the size of the coefficients, it pulls the optimizer away from the minimum of the unregularized cost (the first term alone) and toward smaller coefficients. The form of the term shown here is $L_2$ regularization, as used in ridge regression. There are many variations in the form of the regularization function; commonly used ones are lasso ($L_1$), ridge ($L_2$) and elastic net (a combination of the two). Each has its own advantages and disadvantages, which help decide where it is best applied.
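For reference, the penalty terms behind the names ridge, lasso and elastic net are usually written as follows. The symbols $\lambda_1$ and $\lambda_2$ are notation assumed here, and scaling conventions (such as the $\frac{1}{2m}$ factor above) vary between texts.

$$ \text{ridge } (L_2):\ \lambda \sum_{j=1}^n \theta_j^2, \qquad \text{lasso } (L_1):\ \lambda \sum_{j=1}^n |\theta_j|, \qquad \text{elastic net}:\ \lambda_1 \sum_{j=1}^n |\theta_j| + \lambda_2 \sum_{j=1}^n \theta_j^2 $$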

The net effect of applying regularization is to reduce model complexity, which reduces over-fitting. Other approaches to regularization (not listed in the examples above) include modifications to structural models such as regression/classification trees, boosted trees, etc. by dropping out nodes to make simpler trees. More recently this has been applied in so-called "deep learning" by randomly dropping out units (and their connections) in a neural network during training.

A specific answer to Q3 is that some ensembling methods such as Random Forest (or similar voting schemes) achieve regularization due to their inherent method, i.e. voting and electing the response from a collection of un-regularized Trees. Even though the individual trees have overfit, the process of "averaging out" their outcome stops the ensemble from overfitting to the training set.
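One way to make the "averaging out" argument concrete is the variance of an average. It rests on a simplifying assumption not stated in the answer: the predictions $T_b(x)$ of the $B$ trees at a point $x$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$.

$$ \operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 $$

The second term shrinks as more trees are added, so the variance of the ensemble falls toward $\rho\,\sigma^2$ even though each individual tree has overfit. The less correlated the trees are, the larger the benefit.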

EDIT:

The concept of regularity belongs to axiomatic set theory. You could refer to this article for pointers - en.wikipedia.org/wiki/Axiom_of_regularity - and explore the topic further if you're interested in the details.

On regularization for neural nets: When adjusting the weights while running the back-propagation algorithm, the regularization term is added to the cost function in the same manner as in the examples for linear and logistic regression. So the addition of the regularization term keeps training from settling at the minimum of the unregularized cost, pulling the weights toward smaller values instead.
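A minimal sketch of how an $L_2$ penalty enters a weight update, assuming the same $\frac{\lambda}{2m}\sum \theta_j^2$ form as in the regression examples. The function names and the list-of-matrices layout are hypothetical, and `data_grads` stands in for whatever gradients back-propagation produces for the unregularized cost.

```python
import numpy as np


def l2_penalty(weights, lam, m):
    """Return lam/(2m) * sum of squared weights and its gradient per matrix."""
    value = lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)
    grads = [lam / m * W for W in weights]
    return value, grads


def regularized_step(weights, data_grads, lam, m, lr):
    """One gradient-descent update: back-propagated gradient plus penalty gradient."""
    _, penalty_grads = l2_penalty(weights, lam, m)
    return [
        W - lr * (g + p)
        for W, g, p in zip(weights, data_grads, penalty_grads)
    ]


# With zero data gradients the update only shrinks the weights.
layers = [np.ones((2, 3)), 2.0 * np.ones((3, 1))]
no_data_grad = [np.zeros_like(W) for W in layers]
shrunk = regularized_step(layers, no_data_grad, lam=0.5, m=10, lr=0.1)
print(shrunk[0][0, 0], shrunk[1][0, 0])
```

Each step multiplies the weights by $(1 - \text{lr}\cdot\lambda/m)$ before applying the data gradient, which is why this form of regularization is often called weight decay.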

The article describing batch normalization for neural networks is - Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe, Szegedy, 2015. It has long been known that training a neural network with backpropagation works better when the input variables are normalized. In this paper, the authors apply normalization to each mini-batch used in Stochastic Gradient Descent to avoid the problem of "vanishing gradients" when training many layers of a neural network. The algorithm described in their paper normalizes each layer's activations using the mean and variance computed over the mini-batch, and adds a learned scale and shift for each activation as another set of parameters optimized in mini-batch SGD (in addition to the NN weights). At inference time, the activations are normalized using statistics estimated over the training set. You may refer to their paper for full details of this algorithm. By using this method, they were able in some cases to avoid using dropout for regularization, hence their claim that this is another type of regularization.
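A simplified sketch of the batch-normalization transform for a single fully connected layer, written from the description in the paper. The function names and the `(batch, features)` layout are assumptions, and the backward pass and the bookkeeping of training-set statistics are omitted. `gamma` and `beta` are the learned scale and shift.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch, features), then scale and shift."""
    mu = x.mean(axis=0)   # per-feature mean over the mini-batch
    var = x.var(axis=0)   # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, mu, var


def batch_norm_infer(x, gamma, beta, mean, var, eps=1e-5):
    """At inference, use fixed statistics estimated from the training data."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


rng = np.random.default_rng(3)
activations = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
out, mu, var = batch_norm_train(activations, gamma=np.ones(4), beta=np.zeros(4))
print("per-feature mean after normalization:", np.round(out.mean(axis=0), 6))
print("per-feature std after normalization: ", np.round(out.std(axis=0), 3))
```

With `gamma = 1` and `beta = 0` each feature comes out with roughly zero mean and unit variance. During training the network is free to learn other values of `gamma` and `beta`, so the normalization does not limit what the layer can represent.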
