Solved – Weight Decay in Neural Networks

neural networks

Can the weight decay in a neural network be negative? I know that it usually takes a value slightly less than 1.

Best Answer

This does not make sense. Let's consider (without loss of generality) the L2 regularizer. In this case the (regularized) error function to be minimized takes the form $$\widetilde{J}(\mathbf{w})=J(\mathbf{w})+\lambda\|\mathbf{w}\|_2^2.$$ Now if $\lambda<0$, $\widetilde{J}$ can be minimized trivially by letting $\|\mathbf{w}\|_2 \rightarrow +\infty$, and the neural network won't learn at all. So only non-negative values of $\lambda$ are of interest.
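To see why this penalty is called weight *decay*, look at the gradient-descent update for $\widetilde{J}$: the penalty contributes $2\lambda\mathbf{w}$ to the gradient, so each step multiplies the weights by $(1-2\eta\lambda)$ (with $\eta$ the learning rate) before the data gradient is applied. For $\lambda>0$ this factor is slightly less than 1, so the weights shrink a little each step; for $\lambda<0$ it exceeds 1 and the weights are inflated instead. A minimal sketch of that factor, with made-up values of $\eta$ and $\lambda$:

```python
# Sketch of the per-step effect of the L2 penalty in gradient descent
# (eta and the lambda values below are made up for illustration):
#   w <- w - eta * (grad_J(w) + 2 * lam * w)
#     = (1 - 2 * eta * lam) * w - eta * grad_J(w)
eta = 0.1  # assumed learning rate
for lam in (0.1, -0.1):
    factor = 1.0 - 2.0 * eta * lam
    print(f"lambda = {lam:+.2f}  ->  per-step weight factor = {factor:.2f}")
# lambda = +0.10  ->  0.98: weights shrink slightly each step ("decay")
# lambda = -0.10  ->  1.02: weights are inflated instead of decayed
```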

Regarding $\lambda<1$: this actually depends on the scale of the data, and typically the optimal value of $\lambda$ is found by cross-validation.
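For concreteness, here is a minimal sketch of choosing $\lambda$ by cross-validation, assuming scikit-learn is available (its `alpha` parameter plays the role of $\lambda$ here, and the data below are synthetic):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# Cross-validate over a grid of candidate penalties spanning several scales
model = RidgeCV(alphas=np.logspace(-4, 2, 13)).fit(X, y)
print("lambda selected by cross-validation:", model.alpha_)
```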

UPDATE: Even though $\lambda<0$ does indeed make no sense, the explanation above was not completely precise: $J(\mathbf{w})$ might also go to infinity as $\|\mathbf{w}\|_2 \rightarrow +\infty$.

Let's consider a simple example, linear regression $\mathbb{R}\ni\widehat{y}(\mathbf{x}):=\mathbf{w}\cdot\mathbf{x}+b$, $\mathbf{w}\in\mathbb{R}^d$, for a single data pair $(\mathbf{x},y)$. The loss function is $J(\mathbf{w})=(\mathbf{w}\cdot\mathbf{x}+b-y)^2$. The set $H_{\mathbf{x},\alpha}=\{\mathbf{w}\in\mathbb{R}^d \mid \mathbf{w}\cdot\mathbf{x}=\alpha\}$ is a $(d-1)$-dimensional hyperplane in $\mathbb{R}^d$ (an affine subspace). If we choose any $\mathbf{w}\in H_{\mathbf{x},y-b}$, then $J(\mathbf{w})=0$, and we can let $\|\mathbf{w}\|_2 \rightarrow +\infty$ while keeping $\mathbf{w}$ in $H_{\mathbf{x},y-b}$. If we have multiple points $\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_N,y_N)\}$, we can do the same for $\mathbf{w}\in H:=\cap_{n=1}^N H_{\mathbf{x}_n,y_n-b}$, provided $H$ is non-empty (which generically fails once $N>d$, since the system of constraints $\mathbf{w}\cdot\mathbf{x}_n=y_n-b$, $n=1,\ldots,N$, becomes overdetermined).
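A quick numeric check of this single-pair example (a sketch with made-up values): pick $\mathbf{w}_0\in H_{\mathbf{x},y-b}$ and a direction $\mathbf{v}$ orthogonal to $\mathbf{x}$; moving along $\mathbf{v}$ keeps $J$ at zero while $\|\mathbf{w}\|_2$ grows, so $\widetilde{J}$ with $\lambda<0$ decreases without bound:

```python
import numpy as np

# All values below are made up for illustration.
rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
y, b = 2.0, 0.5
lam = -0.1                       # a negative "weight decay" coefficient

# w0 lies on the hyperplane H_{x, y-b}:  w0 . x = y - b
w0 = (y - b) * x / np.dot(x, x)

# v is orthogonal to x, so moving along v stays inside H_{x, y-b}
v = rng.normal(size=d)
v -= np.dot(v, x) / np.dot(x, x) * x

def J(w):                        # unregularized squared error
    return (np.dot(w, x) + b - y) ** 2

def J_reg(w):                    # regularized objective with lam < 0
    return J(w) + lam * np.dot(w, w)

for t in [0.0, 1e1, 1e2, 1e3]:
    w = w0 + t * v
    print(f"t={t:7.1f}  J={J(w):.2e}  J_reg={J_reg(w):.2e}")
# J stays (numerically) zero while J_reg decreases without bound.
```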

Thus $\widetilde{J}(\mathbf{w})$ may or may not be driven to $-\infty$ in this way. But in any case, back to the original question: $\lambda<0$ runs counter to the regularization goal of weight decay, which is to keep the weights small. On the contrary, $\lambda<0$ favors larger weights, which will likely result in overfitting. Therefore, in this case the generalization performance of the learning algorithm is expected to be worse.
