Regression – How Does the Size of Training Set Affect Regularization Parameter Found by Cross-Validation?

cross-validation, overfitting, regression, regularization

Is it true that:

Suppose you perform linear regression with $L_2$ regularization and use cross-validation to select
the value of the regularization parameter $\lambda$ on two datasets drawn from the same distribution:
$D_1$ of 500 examples and $D_2$ of 50,000 examples. The value of $\lambda$ found by cross-validation
will likely be higher on $D_2$ than on $D_1$.

Best Answer

Regularization constrains the parameter space of a model that would otherwise overfit the sample. The optimal amount of regularization depends on the model complexity relative to the sample size ($n$): the smaller the sample and/or the more complex the model, the more prone the model is to overfitting. Cross-validation should select a value that regularizes the model just enough that it no longer overfits. Hence, the optimal value $\lambda_{\text{CV}}$ grows roughly with the ratio $\frac{p}{n}$, where $p$ is the number of parameters.
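To make the $\frac{p}{n}$ scaling explicit, here is a back-of-the-envelope derivation under simplifying assumptions that are not in the question: an orthonormal design ($X^\top X = nI$), noise variance $\sigma^2$, a fixed true coefficient vector $\beta$, and the objective $\frac{1}{n}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert^2$. Ridge then shrinks the OLS estimate by a constant factor, and choosing $\lambda$ to minimize the estimator's mean squared error gives

$$\hat\beta_\lambda = \frac{1}{1+\lambda}\,\hat\beta_{\text{OLS}}, \qquad \lambda^{*} = \frac{p\,\sigma^{2}}{n\,\lVert\beta\rVert^{2}} \;\propto\; \frac{p}{n}.$$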

Because of this, if the same model is fit on $D_1$ ($n=500$) and $D_2$ ($n=50,000$), then the optimal value of $\lambda$ will almost certainly be lower for $D_2$ than for $D_1$, because $p$ is constant and $\frac{p}{500} > \frac{p}{50,000}$.

Put differently, the relative model complexity is lower when you have more observations, so $D_2$ requires less regularization (if any) to combat overfitting. The statement in the question is therefore false: cross-validation will likely select a *lower* $\lambda$ on $D_2$, not a higher one.
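A quick simulation makes this concrete. Below is a minimal NumPy sketch (the data-generating process, the $\lambda$ grid, and the fold count are illustrative assumptions, not taken from the question) that fits ridge regression in closed form, scores each $\lambda$ by $k$-fold cross-validation, and compares the selected value for $n=500$ and $n=50,000$. It penalizes the *average* squared error, $\frac{1}{n}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert^2$, so that $\lambda$ is comparable across sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of (1/n) * ||y - X b||^2 + lam * ||b||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def cv_lambda(X, y, lambdas, k=5):
    """Return the lambda with the lowest k-fold cross-validated MSE."""
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), k)  # same folds for every lambda
    scores = []
    for lam in lambdas:
        errs = []
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            b = ridge_fit(X[train], y[train], lam)
            errs.append(np.mean((y[fold] - X[fold] @ b) ** 2))
        scores.append(np.mean(errs))
    return lambdas[int(np.argmin(scores))]

p, sigma = 20, 2.0                  # fixed model size and noise level (assumed)
beta = rng.normal(size=p)           # one "true" coefficient vector
lambdas = np.logspace(-6, 1, 30)    # candidate grid

for n in (500, 50_000):             # D1 and D2
    X = rng.normal(size=(n, p))
    y = X @ beta + sigma * rng.normal(size=n)
    print(f"n = {n:>6}: lambda_CV = {cv_lambda(X, y, lambdas):.2e}")
```

On a typical run, the selected $\lambda$ comes out markedly smaller for $n=50,000$ than for $n=500$, matching the argument above. One note on parameterization: scikit-learn's `Ridge` penalizes the *unnormalized* residual sum of squares, $\lVert y - X\beta\rVert^2 + \alpha\lVert\beta\rVert^2$, so its `alpha` corresponds to roughly $n\lambda$ here, and there it is the effective per-sample regularization $\alpha/n$ that shrinks as $n$ grows.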