Suppose you are minimizing an objective function iteratively, and its current value is $100.0$. In the given data set there is no "irreducible error", so you can drive the loss on your training data down to $0.0$. Now you have two ways to do it.
The first way is a "large learning rate" with few iterations: if you can reduce the loss by $10.0$ in each iteration, then in $10$ iterations you reduce it to $0.0$.
The second way is a "small learning rate" with more iterations: if you reduce the loss by $1.0$ in each iteration, you need $100$ iterations to reach $0.0$ loss on your training data.
Now think about this: are the two approaches equal? And if not, which is better in the optimization context and in the machine learning context?
In the optimization literature the two approaches are the same, since both converge to the optimal solution. In machine learning, however, they are not equal, because in most cases we do not drive the training loss to $0$, which would cause over-fitting.
We can think of the first approach as a "coarse-grid search" and the second approach as a "fine-grid search". The second approach usually works better, but it needs more computational power for the extra iterations.
To prevent over-fitting we can do different things. The first is to restrict the number of iterations: using the first approach, suppose we limit the number of iterations to $5$; at the end, the loss on the training data is $50$. (By the way, this is very strange from the optimization point of view: it means we could further improve our solution, i.e., it has not converged, but we choose not to. In optimization we usually add explicit constraints or penalization terms to the objective function, rather than limiting the number of iterations.)
On the other hand, we can also use the second approach: if we set the learning rate to be small, say reducing the loss by $0.1$ per iteration, then even with a large number of iterations, say $500$, we still have not minimized the loss to $0.0$.
This is why a small learning rate is, in a sense, equal to "more regularization".
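To make the arithmetic above concrete, here is a minimal sketch (my own, not part of the original answer) that replays the numbers in code: start at a loss of $100$, subtract a fixed amount per iteration, and cap the number of iterations.

```python
def run(loss, step, max_iters):
    """Reduce `loss` by `step` each iteration, never below 0, for at most `max_iters` iterations."""
    for _ in range(max_iters):
        if loss <= 0.0:
            break
        loss = max(loss - step, 0.0)
    return loss

# Large step, enough iterations: reaches 0 training loss.
print(run(100.0, step=10.0, max_iters=10))   # 0.0
# Large step, capped at 5 iterations: training loss stuck at 50 (the example in the text).
print(run(100.0, step=10.0, max_iters=5))    # 50.0
# Small step: even after 500 iterations the loss is still 50 -> acts like regularization.
print(run(100.0, step=0.1, max_iters=500))   # 50.0
```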
Here is an example of using different learning rates on an experimental data set with xgboost. Please check the following two links to see what eta and n_iterations mean:
Parameters for Tree Booster
XGBoost Control overfitting
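Since eta and the number of boosting rounds are the knobs involved, here is a minimal sketch (synthetic data and parameter values of my own choosing, not the original experiment) of how such a comparison could look with xgboost's scikit-learn wrapper:

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for eta in (0.5, 0.05):                        # large vs. small learning rate
    model = xgb.XGBRegressor(n_estimators=50,  # same iteration budget for both
                             learning_rate=eta,
                             max_depth=4,
                             random_state=0)
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"eta={eta}: train MSE={train_err:.1f}, test MSE={test_err:.1f}")
```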
For the same number of iterations, say $50$, a small learning rate is "under-fitting" (the model has "high bias") and a large learning rate is "over-fitting" (the model has "high variance").
PS. The evidence of under-fitting is that both the training and testing sets have large error and the two error curves are close to each other. The sign of over-fitting is that the training set's error is very low while the testing set's error is very high, and the two curves are far apart.
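As a rough illustration only (the thresholds below are my own, not from the answer), that diagnostic could be expressed in code like this:

```python
def diagnose(train_error, test_error, gap_tol=0.05, high_error=0.3):
    """Illustrative rule of thumb: classify a fit from its train/test errors."""
    gap = test_error - train_error
    if train_error > high_error and gap < gap_tol:
        return "under-fitting (high bias): both errors large, curves close together"
    if gap >= gap_tol and train_error <= high_error:
        return "over-fitting (high variance): train error low, test error much higher"
    return "reasonable fit"

print(diagnose(train_error=0.40, test_error=0.42))  # under-fitting
print(diagnose(train_error=0.02, test_error=0.25))  # over-fitting
```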
Well, the parameters that represent the higher exponents ($x^3$, $x^4$) drastically increase the complexity of our model. So shouldn't we penalize high $w_3$, $w_4$ values more than we penalize high $w_1$, $w_2$ values?
The reason we say that adding quadratic or cubic terms increases model complexity is that it leads to a model with more parameters overall. We don't expect a quadratic term to be in and of itself more complex than a linear term. The one thing that's clear is that, all other things being equal, a model with more covariates is more complex.
For the purposes of regularization, one generally rescales all the covariates to have equal mean and variance so that, a priori, they are treated as equally important. If some covariates do in fact have a stronger relationship with the dependent variable than others, then, of course, the regularization procedure won't penalize those covariates as strongly, because they'll have greater contributions to the model fit.
But what if you really do think a priori that one covariate is more important than another, and you can quantify this belief, and you want the model to reflect it? Then what you probably want to do is use a Bayesian model and adjust the priors for the coefficients to match your preexisting belief. Not coincidentally, some familiar regularization procedures can be construed as special cases of Bayesian models. In particular, ridge regression is equivalent to a normal prior on the coefficients, and lasso regression is equivalent to a Laplacian prior.
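To make the last two points concrete, here is a minimal sketch (my own toy data and penalty strengths, not from the answer): the powers of $x$ are standardized so the penalty treats them equally a priori, and ridge and lasso are the penalized fits that correspond to normal and Laplacian priors, respectively.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))
y = 1.5 * x[:, 0] - 0.5 * x[:, 0] ** 3 + rng.normal(scale=0.5, size=200)

# Build x, x^2, x^3, x^4 and standardize them so no power is penalized more a priori.
features = make_pipeline(PolynomialFeatures(degree=4, include_bias=False), StandardScaler())
X = features.fit_transform(x)

ridge = Ridge(alpha=1.0).fit(X, y)   # equivalent to a normal (Gaussian) prior on the coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # equivalent to a Laplacian prior on the coefficients
print("ridge coefs:", np.round(ridge.coef_, 2))
print("lasso coefs:", np.round(lasso.coef_, 2))
```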
Best Answer
Did you look at the distribution of the classes? It is most likely due to imbalanced class distributions. For example, suppose your sample contains two class labels, 'A' and 'B', and 'A' occurs 80% of the time in your dataset. Assume your classifier almost always classifies any test data as belonging to class 'A'. Then your training accuracy score will most likely be around 0.8. However, since you are choosing your test samples at random, if the number of samples belonging to class 'A' happens to exceed the number belonging to class 'B', say in a 90/10 ratio, then your test accuracy would be 0.9, i.e., test accuracy > training accuracy.
Typically you would also see a low cross-validation score, and if you are using Python scikit-learn with StratifiedKFold, for some values of K you would receive warning messages.
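As a hedged sketch (synthetic data; the scikit-learn classes are real, but the setup and numbers are my own), checking the class distribution and cross-validating with StratifiedKFold could look like this:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.dummy import DummyClassifier

# Imbalanced problem: roughly 80% of samples in the majority class (playing the role of 'A').
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
print(Counter(y))  # check the class distribution first

# A classifier that always predicts the majority class already scores about 0.8 accuracy,
# so a "good-looking" test accuracy may just reflect the imbalance.
majority = DummyClassifier(strategy="most_frequent")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(majority, X, y, cv=cv, scoring="accuracy").mean())
```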