Solved – Why don't we just learn the hyperparameters?

hyperparameter, machine learning, neural networks

I was implementing a pretty popular paper, "Explaining and Harnessing Adversarial Examples", which trains an adversarial objective function

$$J''(\theta) = \alpha J(\theta) + (1 - \alpha)J'(\theta).$$

It treats $\alpha$ as a hyperparameter: $\alpha$ can be 0.1, 0.2, 0.3, etc.
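For concreteness, here is a minimal sketch of what that objective can look like in training code (PyTorch-style; `model`, `x`, `y`, and the `fgsm_perturb` helper are illustrative names, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.25):
    """Hypothetical FGSM helper: x + eps * sign(grad_x J)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad_x, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad_x.sign()).detach()

alpha = 0.5  # hyperparameter, chosen by hand (0.1, 0.2, 0.3, ...)

def adversarial_objective(model, x, y):
    J_clean = F.cross_entropy(model(x), y)        # J(theta)
    x_adv = fgsm_perturb(model, x, y)
    J_adv = F.cross_entropy(model(x_adv), y)      # J'(theta)
    return alpha * J_clean + (1 - alpha) * J_adv  # J''(theta)
```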

Regardless of this specific paper, I'm wondering: why don't we just include $\alpha$ among our parameters and learn the best $\alpha$?

What is the disadvantage of doing so?
Is it because of overfitting? If so, why would learning just one more parameter cause so much overfitting?

Best Answer

Let's see what the first-order condition looks like if we plug in the hyperparameter $\alpha$ and try to learn it from data the same way as $\theta$: $$\frac{\partial}{\partial\alpha} J''(\theta) = \frac{\partial}{\partial\alpha}\,\alpha J(\theta) + \frac{\partial}{\partial\alpha}(1 - \alpha)J'(\theta) = J(\theta) - J'(\theta) = 0.$$ Hence, $$J(\theta) = J'(\theta).$$

When this hyperparameter is optimized, it forces $J$ and $J'$ to become the same function, i.e. to carry equal weight. You end up with a trivial solution.
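To see how degenerate this is, here is a tiny numeric sketch (the loss values are made up): the gradient of $J''$ with respect to $\alpha$ is simply $J - J'$, independent of $\alpha$ and of the data, so gradient descent on $\alpha$ just pushes it toward whichever term currently has the smaller loss.

```python
# Made-up loss values, purely for illustration.
J_clean = 0.9   # J(theta): loss on clean examples
J_adv   = 1.4   # J'(theta): loss on adversarial examples

def J_combined(alpha):
    return alpha * J_clean + (1 - alpha) * J_adv

# Finite-difference gradient w.r.t. alpha: the same value at every alpha.
eps = 1e-6
for alpha in (0.1, 0.5, 0.9):
    grad_alpha = (J_combined(alpha + eps) - J_combined(alpha - eps)) / (2 * eps)
    print(round(grad_alpha, 3))   # ~ J_clean - J_adv = -0.5 each time

# Gradient descent on alpha therefore runs it off toward whichever term has
# the smaller loss; the first-order condition only holds when J = J'.
```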

If you want a more general, philosophical take, consider this: hyperparameters are usually not entangled with the data. What do I mean? In a neural network, or even a simple regression, your model parameters interact directly with the data: $$y_L = X_L\beta_L$$ $$a_L = \sigma(y_L)$$ $$X_{L+1} = a_L$$ and so on down the layers. You can see how $\beta_L$ gets tangled up with your data. So when you take the derivative of the objective function with respect to any $\beta$, the data points enter the result in non-obvious ways: through matrices, Hessians, cross products, etc.

However, if you write the first-order conditions for the hyperparameters, you don't get this effect. Derivatives with respect to hyperparameters tend to operate on entire chunks of your model, without shuffling its parts the way derivatives with respect to parameters do. That's why optimizing hyperparameters often leads to trivial solutions, like the one above for this specific paper. Optimizing hyperparameters doesn't perturb your data set enough, so to speak, to make it produce anything interesting.
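A small sketch of that contrast, assuming a one-layer linear model with squared error (all names and values here are illustrative): the gradient with respect to the parameters $\beta$ pulls the data matrix in directly, while the gradient with respect to a hyperparameter like $\alpha$ only sees two scalar loss values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # data
y = rng.normal(size=(5,))
beta = rng.normal(size=(3,))         # model parameters

resid = X @ beta - y
# Gradient of the mean squared error w.r.t. beta: X enters directly.
grad_beta = (2 / len(y)) * X.T @ resid

# Gradient of alpha*J + (1 - alpha)*J' w.r.t. alpha: the data only enters
# through the two scalar losses J and J'.
J  = np.mean(resid ** 2)                        # loss on the clean data
Jp = np.mean((X @ beta - (y + 0.1)) ** 2)       # some second loss (illustrative)
grad_alpha = J - Jp
```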
