Solved – Weight Decay in Neural Networks

neural networks

Can the weight decay in a neural network be negative? I know that it usually takes a value slightly less than 1.

Best Answer

This does not make sense. Let's consider (without loss of generality) the L2 regularizer. In this case the (regularized) error function to be minimized takes the form $$\widetilde{J}(\mathbf{w})=J(\mathbf{w})+\lambda\|\mathbf{w}\|_2^2.$$ Now if $\lambda<0$, $\widetilde{J}$ can be minimized trivially by letting $\|\mathbf{w}\|_2 \rightarrow +\infty$, and the neural network won't learn at all. So only non-negative values of $\lambda$ are of interest.
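To see why this penalty is called weight *decay*, look at the gradient-descent update for $\widetilde{J}$: the penalty contributes $2\lambda\mathbf{w}$ to the gradient, so each step multiplies the weights by $(1-2\eta\lambda)$ (with $\eta$ the learning rate) before the data gradient is applied. For $\lambda>0$ this factor is slightly less than 1, so the weights shrink a little each step; for $\lambda<0$ it exceeds 1 and the weights are inflated instead. A minimal sketch of that factor, with made-up values of $\eta$ and $\lambda$:

```python
# Sketch of the per-step effect of the L2 penalty in gradient descent
# (eta and the lambda values below are made up for illustration):
#   w <- w - eta * (grad_J(w) + 2 * lam * w)
#     = (1 - 2 * eta * lam) * w - eta * grad_J(w)
eta = 0.1  # assumed learning rate
for lam in (0.1, -0.1):
    factor = 1.0 - 2.0 * eta * lam
    print(f"lambda = {lam:+.2f}  ->  per-step weight factor = {factor:.2f}")
# lambda = +0.10  ->  0.98: weights shrink slightly each step ("decay")
# lambda = -0.10  ->  1.02: weights are inflated instead of decayed
```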

Regarding $\lambda<1$: this actually depends on the scale of the data, and typically the optimal value of $\lambda$ is found by cross-validation.
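For concreteness, here is a minimal sketch of choosing $\lambda$ by cross-validation, assuming scikit-learn is available (its `alpha` parameter plays the role of $\lambda$ here, and the data below are synthetic):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# Cross-validate over a grid of candidate penalties spanning several scales
model = RidgeCV(alphas=np.logspace(-4, 2, 13)).fit(X, y)
print("lambda selected by cross-validation:", model.alpha_)
```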

UPDATE: Even though $\lambda<0$ does indeed make no sense, the explanation above was not completely precise: $J(\mathbf{w})$ might also go to infinity as $\|\mathbf{w}\|_2 \rightarrow +\infty$.

Let's consider a simple example, linear regression $\mathbb{R}\ni\widehat{y}(\mathbf{x}):=\mathbf{w}\cdot\mathbf{x}+b$, $\mathbf{w}\in\mathbb{R}^d$, for a single data pair $(\mathbf{x},y)$. The loss function is $J(\mathbf{w})=(\mathbf{w}\cdot\mathbf{x}+b-y)^2$. The set $H_{\mathbf{x},\alpha}=\{\mathbf{w}\in\mathbb{R}^d \mid \mathbf{w}\cdot\mathbf{x}=\alpha\}$ is a $(d-1)$-dimensional hyperplane in $\mathbb{R}^d$ (an affine subspace). If we choose any $\mathbf{w}\in H_{\mathbf{x},y-b}$, then $J(\mathbf{w})=0$, and we can let $\|\mathbf{w}\|_2 \rightarrow +\infty$ while keeping $\mathbf{w}$ in $H_{\mathbf{x},y-b}$. If we have multiple points $\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_N,y_N)\}$, we can do the same for $\mathbf{w}\in H:=\cap_{n=1}^N H_{\mathbf{x}_n,y_n-b}$, provided $H$ is non-empty (which generically fails once $N>d$, since the system of constraints $\mathbf{w}\cdot\mathbf{x}_n=y_n-b$, $n=1,\ldots,N$, becomes overdetermined).
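A quick numeric check of this single-pair example (a sketch with made-up values): pick $\mathbf{w}_0\in H_{\mathbf{x},y-b}$ and a direction $\mathbf{v}$ orthogonal to $\mathbf{x}$; moving along $\mathbf{v}$ keeps $J$ at zero while $\|\mathbf{w}\|_2$ grows, so $\widetilde{J}$ with $\lambda<0$ decreases without bound:

```python
import numpy as np

# All values below are made up for illustration.
rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
y, b = 2.0, 0.5
lam = -0.1                       # a negative "weight decay" coefficient

# w0 lies on the hyperplane H_{x, y-b}:  w0 . x = y - b
w0 = (y - b) * x / np.dot(x, x)

# v is orthogonal to x, so moving along v stays inside H_{x, y-b}
v = rng.normal(size=d)
v -= np.dot(v, x) / np.dot(x, x) * x

def J(w):                        # unregularized squared error
    return (np.dot(w, x) + b - y) ** 2

def J_reg(w):                    # regularized objective with lam < 0
    return J(w) + lam * np.dot(w, w)

for t in [0.0, 1e1, 1e2, 1e3]:
    w = w0 + t * v
    print(f"t={t:7.1f}  J={J(w):.2e}  J_reg={J_reg(w):.2e}")
# J stays (numerically) zero while J_reg decreases without bound.
```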

Thus $\widetilde{J}(\mathbf{w})$ may or may not be driven to $-\infty$ in this way. But in any case, back to the original question: $\lambda<0$ runs counter to the regularization goal of weight decay, which is to keep the weights small. On the contrary, $\lambda<0$ favors larger weights, which will likely result in overfitting. Therefore, in this case the generalization performance of the learning algorithm is expected to be worse.
