Machine Learning – Reasons Not to Prune Your Neural Network

machine learning, neural networks, optimization, pruning

Han et al. (2015) used iterative pruning to reduce their network to only 10% of its original size with no loss of accuracy, by removing weights with very low values, since these contribute very little to the output. I am new to machine learning, so I wonder: why wouldn't you do this (unless your network is already very small)? It seems to me that for deep learning your network would be smaller, faster, more energy efficient, etc. at no real cost. Should we all use this method for larger neural networks?

Best Answer

Pruning is indeed remarkably effective and I think it is pretty commonly used on networks which are "deployed" for use after training.

The catch with pruning is that you only gain efficiency, speed, etc. after training is done. You still have to train with the full-size network. Most computation time throughout the lifetime of a model's development and deployment is spent during development: training networks, playing with model architectures, tweaking parameters, etc. You might train a network several hundred times before you settle on the final model. Reducing the computation of the deployed network is a drop in the bucket compared to this.
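To make the "prune after training" step concrete, here is a minimal sketch of one-shot magnitude pruning in Python with NumPy. It is not Han et al.'s exact procedure, which alternates pruning with retraining. The function name `magnitude_prune`, the random `trained` matrix standing in for a trained layer, and the 90% sparsity level are all illustrative assumptions.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))  # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # The k-th smallest absolute value serves as the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold       # True for weights that survive
    return weights * mask, mask

rng = np.random.default_rng(0)
trained = rng.normal(size=(100, 50))         # hypothetical trained layer
pruned, mask = magnitude_prune(trained, sparsity=0.9)
kept_fraction = mask.mean()                  # fraction of weights that remain
```

Note that the pruned matrix here is still stored densely. The zeros only translate into speed or memory savings if the deployment runtime uses a sparse representation or otherwise skips them.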

ML researchers are mainly trying to improve training techniques for DNNs. We usually aren't concerned with deployment, so pruning isn't used there.

There is some research on utilizing pruning techniques to speed up network training, but not much progress has been made. See, for example, my own paper from 2018 which experimented with training on pruned and other structurally sparse NN architectures: https://arxiv.org/abs/1810.00299

Related Solutions

Regularization – Why Do Smaller Weights Result in Simpler Models?

If you use regularization, you are not only minimizing the in-sample error but also a bound on the out-of-sample error: $OutOfSampleError \le InSampleError + ModelComplexityPenalty$.

More precisely, $J_{aug}(h(x),y,\lambda,\Omega)=J(h(x),y)+\frac{\lambda}{2m}\Omega$ for a hypothesis $h \in H$, where $\lambda$ is some parameter, usually $\lambda \in (0,1)$, $m$ is the number of examples in your dataset, and $\Omega$ is some penalty that is dependent on the weights $w$, $\Omega=w^Tw$. This is known as the augmented error. Now, you can only minimize the function above if the weights are rather small.

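Here is some R code to toy with:

w <- c(0.1, 0.2, 0.3)   # example weight vector
out <- t(w) %*% w       # penalty Omega = w^T w, returned as a 1x1 matrix
print(out)              # the single entry is 0.14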
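The claim that the augmented error can only be minimized with fairly small weights can be checked numerically. The sketch below is in Python with NumPy and rests on one assumption the text does not state: that $J$ is the halved mean squared error of a linear hypothesis, $J = \frac{1}{2m}\lVert Xw - y\rVert^2$. Under that assumption the minimizer has the closed form $w = (X^TX + \lambda I)^{-1}X^Ty$. The data and the $\lambda$ values are made up, and the $\lambda$ values are deliberately pushed beyond $(0,1)$ to make the effect visible.

```python
import numpy as np

def augmented_error(w, X, y, lam):
    """J_aug = J + lam / (2m) * w^T w, with J assumed to be halved MSE."""
    m = X.shape[0]
    residual = X @ w - y
    J = residual @ residual / (2 * m)   # in-sample error (assumed form)
    omega = w @ w                       # penalty Omega = w^T w
    return J + lam / (2 * m) * omega

def minimize_augmented_error(X, y, lam):
    """Closed-form minimizer of J_aug under the squared-error assumption."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # toy inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(minimize_augmented_error(X, y, lam)) for lam in lambdas]
# norms shrinks as lam grows: a larger penalty forces smaller weights.
```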

So, instead of penalizing the whole hypothesis space $H$, we penalize each hypothesis $h$ individually. We sometimes refer to the hypothesis $h$ by its weight vector $w$.

As for why small weights go along with low model complexity, let's look at the following hypothesis: $h_1(x)=x_1 \times w_1 + x_2 \times w_2 + x_3 \times w_3$. In total we have three active weight parameters $\{w_1,\dotsc,w_3\}$. Now, let's shrink $w_3$ all the way down to $w_3=0$. This reduces the model to $h_1(x)=x_1 \times w_1 + x_2 \times w_2$. Instead of three active weight parameters, only two remain.
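A small check of this argument, sketched in Python with NumPy: once $w_3=0$, the hypothesis no longer reacts to $x_3$ and behaves exactly like the two-weight model. The weight and input values are arbitrary illustrative choices.

```python
import numpy as np

def h(x, w):
    """Linear hypothesis h(x) = sum_i x_i * w_i."""
    return float(np.dot(x, w))

w_full = np.array([0.5, -1.0, 0.0])   # w_3 has been driven to zero
w_reduced = w_full[:2]                # the two remaining active weights

x_a = np.array([2.0, 3.0, 10.0])
x_b = np.array([2.0, 3.0, -7.0])      # differs from x_a only in x_3

# All three evaluations agree, so x_3 has no influence on the output.
same_output = h(x_a, w_full) == h(x_b, w_full) == h(x_a[:2], w_reduced)
```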

Solved – Why scale cost functions by 1/n in a neural network

In terms of mini-batch learning, $n$ should be the size of the batch instead of the total amount of training data (which in your case should be infinite).

Gradients are scaled by $1/n$ because we are taking the average of the batch, so the same learning rate can be used regardless of the size of the batch.
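The effect of the $1/n$ factor can be seen in a small Python/NumPy sketch. It assumes a linear model with per-example loss $\frac{1}{2}(x \cdot w - y)^2$, and it builds the larger batch by repeating the smaller one so that the comparison is exact. All names and values are illustrative.

```python
import numpy as np

def per_example_gradients(w, X, y):
    """Gradient of 0.5 * (x.w - y)^2 with respect to w, one row per example."""
    residual = X @ w - y
    return residual[:, None] * X

rng = np.random.default_rng(0)
w = np.array([0.5, -0.5])
X_small = rng.normal(size=(8, 2))
y_small = rng.normal(size=8)

# Build a batch 16 times larger by repeating the same examples.
X_large = np.tile(X_small, (16, 1))
y_large = np.tile(y_small, 16)

g_small = per_example_gradients(w, X_small, y_small)
g_large = per_example_gradients(w, X_large, y_large)

# Averaged gradient: same magnitude for both batch sizes (ratio close to 1).
mean_ratio = np.linalg.norm(g_large.mean(axis=0)) / np.linalg.norm(g_small.mean(axis=0))
# Summed gradient: grows with the batch size (ratio close to 16).
sum_ratio = np.linalg.norm(g_large.sum(axis=0)) / np.linalg.norm(g_small.sum(axis=0))
```

With the summed gradient, a learning rate tuned for the small batch would take steps 16 times larger on the big batch. With the averaged gradient, the step size stays the same.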

Edit

I found this later in that page, which shows how to set hyper-parameters in the context of mini-batch learning. So it seems that the formula and the paragraph you refer to are talking about full-batch learning, in which case $n$ should be the total number of training examples.

Edit

So now the question becomes: "why do you use the total training data size in regularization, instead of the batch size?"

I don't know whether there's a theoretical interpretation for this. For me, it just keeps the hyper-parameters independent of the batch size, since we are not summing the regularization term across the batch.
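One way to see this, as a hedged sketch in Python with NumPy rather than a definitive answer: if the data term is averaged over the mini-batch while the regularization term is scaled by the total training set size, then the mini-batch gradients average out to the full-batch gradient for any batch size. The squared-error loss, the names `batch_size` and `total_size`, and all values below are assumptions made for illustration.

```python
import numpy as np

def minibatch_gradient(w, X_batch, y_batch, lam, total_size):
    """Batch-averaged data gradient plus the gradient of lam / (2 * total_size) * w^T w."""
    residual = X_batch @ w - y_batch
    data_grad = X_batch.T @ residual / len(y_batch)   # averaged over the batch
    reg_grad = lam / total_size * w                   # not summed over the batch
    return data_grad + reg_grad

rng = np.random.default_rng(0)
total_size, lam = 120, 0.5
X = rng.normal(size=(total_size, 3))
y = rng.normal(size=total_size)
w = np.array([1.0, -1.0, 2.0])

full = minibatch_gradient(w, X, y, lam, total_size)   # full-batch gradient

def average_over_batches(batch_size):
    """Average the mini-batch gradients over one pass through the data."""
    grads = [
        minibatch_gradient(w, X[i:i + batch_size], y[i:i + batch_size], lam, total_size)
        for i in range(0, total_size, batch_size)
    ]
    return np.mean(grads, axis=0)

avg_10 = average_over_batches(10)   # matches `full` up to rounding
avg_40 = average_over_batches(40)   # matches `full` up to rounding
```

Because the regularization gradient `lam / total_size * w` never involves the batch size, the same `lam` keeps the same meaning whether the batches hold 10 or 40 examples.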
