If you use regularization you are not merely minimizing the in-sample error; you are also controlling a bound of the form $OutOfSampleError \le InSampleError + ModelComplexityPenalty$.
More precisely, $J_{aug}(h(x),y,\lambda,\Omega)=J(h(x),y)+\frac{\lambda}{2m}\Omega$ for a hypothesis $h \in H$, where $\lambda$ is a regularization parameter (usually $\lambda \in (0,1)$), $m$ is the number of examples in your dataset, and $\Omega$ is a penalty that depends on the weights $w$; here $\Omega=w^Tw$. This is known as the augmented error. The augmented error can only be small if the weights themselves are rather small.
Here is some R code to toy with:

# Compute the penalty Omega = w^T w for a small weight vector
w <- c(0.1, 0.2, 0.3)
out <- t(w) %*% w   # 1x1 matrix holding the squared L2 norm, 0.14
print(out)
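Extending the toy snippet above, here is a minimal sketch of the full augmented error $J_{aug}$; the dataset, cost function, and $\lambda$ value are made up purely for illustration:

```r
# Augmented error: J_aug = J + (lambda / (2m)) * w^T w
# Toy setup: squared-error cost on a tiny made-up dataset
X <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)  # m = 2 examples, 3 features
y <- c(1, 0)
w <- c(0.1, 0.2, 0.3)
lambda <- 0.5
m <- nrow(X)

J     <- mean((X %*% w - y)^2) / 2     # plain in-sample error
Omega <- as.numeric(t(w) %*% w)        # penalty Omega = w^T w
J_aug <- J + (lambda / (2 * m)) * Omega

print(J_aug)   # strictly larger than J whenever w != 0
```

Note that $J_{aug}$ exceeds the plain error $J$ by exactly the weight penalty term, so minimizing it trades fit against weight magnitude.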
So, instead of penalizing the whole hypothesis space $H$, we penalize each hypothesis $h$ individually. We sometimes refer to the hypothesis $h$ by its weight vector $w$.
As for why small weights go along with low model complexity, let's look at the following hypothesis: $h_1(x)=x_1 \times w_1 + x_2 \times w_2 + x_3 \times w_3$. In total we have three active weight parameters $\{w_1, w_2, w_3\}$. Now, let's set $w_3$ to zero, $w_3=0$. This reduces the hypothesis to $h_1(x)=x_1 \times w_1 + x_2 \times w_2$: instead of three active weight parameters, only two remain.
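This can be checked directly in R; the weight and input values below are made up for illustration:

```r
# Setting w3 = 0 deactivates the third feature:
# h(x) = w1*x1 + w2*x2 + 0*x3 is the same function as w1*x1 + w2*x2
h <- function(x, w) sum(x * w)

x         <- c(1.0, 2.0, 3.0)
w_full    <- c(0.5, -0.3, 0.8)  # three active weights
w_reduced <- c(0.5, -0.3, 0.0)  # w3 zeroed out

# With w3 = 0 the hypothesis depends on x1 and x2 only:
print(h(x, w_reduced))
print(h(x[1:2], w_reduced[1:2]))  # identical value
```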
Regularization (traditionally in the context of shrinkage) adds prior knowledge to a model; a prior, literally, is specified for the parameters. Augmentation is also a form of adding prior knowledge to a model; e.g. images are rotated, which you know does not change the class label. Increasing training data (as with augmentation) decreases a model's variance. Regularization also decreases a model's variance. They do so in different ways, but ultimately both decrease generalization error.
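As a toy illustration of the augmentation idea, here the "image" is just a small made-up matrix; a 90-degree rotation changes the pixel layout but, by assumption, not the label, so the rotated copy can be added as an extra training example:

```r
# Toy augmentation: rotate a tiny "image" (a matrix) 90 degrees clockwise.
# The class label is assumed invariant under rotation, so both copies share it.
img <- matrix(1:9, nrow = 3, byrow = TRUE)
rotate90 <- function(m) t(apply(m, 2, rev))  # clockwise rotation

aug <- rotate90(img)
print(img)
print(aug)  # same content, new orientation -> one more training example
```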
Section 5.2.2 of Goodfellow et al.'s Deep Learning proposes a much broader definition:
Regularization is any modification we make to a learning algorithm that
is intended to reduce its generalization error but not its training
error.
There is a tendency to associate regularization with shrinkage because of the term "$L_p$-norm regularization"... perhaps "augmentation regularization" is equally valid, although it doesn't roll off the tongue.
Best Answer
Your description of regularization is correct, but overfitting and getting stuck in a local minimum are not the same thing. You could get stuck in a local minimum yet generalize well.
A good way to think about the difference between overfitting and local minima is the following setup: imagine a problem whose solution space has many local minima and no clear gradient direction toward the optimum. You set up three neural networks that are identical in number of layers and parameters. The second network, however, adds weight decay, and the third adds both weight decay and an evolution strategy to estimate and optimize the gradients (rather than backpropagation with gradient descent). While both the second and third networks should generalize better than the first, the third might also be better able to escape local minima and reach a better global solution.
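A minimal sketch of what weight decay does to a plain gradient-descent update (the toy data, learning rate, and $\lambda$ are made up; this shows the weight-decay piece only, not the evolution-strategy variant):

```r
# Gradient descent with weight decay on a toy least-squares cost.
# Update: w <- w - eta * (grad_J + lambda * w); the extra lambda * w term
# shrinks the weights toward zero at every step.
set.seed(1)
X <- matrix(rnorm(20 * 3), nrow = 20)
y <- X %*% c(1, -2, 0.5) + rnorm(20, sd = 0.1)

eta <- 0.05; lambda <- 0.1
w_plain <- rep(0, 3); w_decay <- rep(0, 3)
for (step in 1:200) {
  grad_plain <- t(X) %*% (X %*% w_plain - y) / nrow(X)
  grad_decay <- t(X) %*% (X %*% w_decay - y) / nrow(X)
  w_plain <- w_plain - eta * grad_plain
  w_decay <- w_decay - eta * (grad_decay + lambda * w_decay)
}
# The decayed weights end up with a smaller squared norm:
print(sum(w_plain^2))
print(sum(w_decay^2))
```

Both runs start from the same point and see the same data; only the $\lambda w$ term differs, which is exactly the gradient of the $\frac{\lambda}{2}w^Tw$ penalty discussed above.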
Also, regularization strategies don't necessarily converge faster; it depends very much on the problem and the strategy.