I came across "Tikhonov regularization" and have only a passing knowledge of it. It seems to be a type of regularization that is important for deconvolution. Are there any good resources and examples? Also, can you explain briefly what it is?
Solved – Tikhonov regularization in the context of deconvolution
convolution, machine-learning, references, regularization, tikhonov-regularization
Related Solutions
I think so.
The Fourier transform of a Gaussian is a Gaussian, so we can go to the f domain.
Furthermore, dividing a Gaussian by a Gaussian (deconvolution in the f domain) also yields a Gaussian, so we now have a Gaussian f domain quotient.
The inverse Fourier of a Gaussian is also a Gaussian.
Voila (I won't say QED as this is far from formal!!!)
Regarding the parameters I would say yes as well.
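The argument above can be checked numerically. A minimal sketch, using the analytic fact that a Gaussian with variance $s^2$ has Fourier transform proportional to $e^{-s^2\omega^2/2}$; the variances 3 and 1 are arbitrary illustrative choices:

```python
import numpy as np

# Sketch: deconvolving one Gaussian from another in the f (Fourier) domain.
# A Gaussian with variance s2 transforms to exp(-0.5 * s2 * w**2) (up to
# normalization), so pointwise division of transforms subtracts variances.
w = np.linspace(-5, 5, 1001)           # frequency axis
G_obs = np.exp(-0.5 * 3.0 * w**2)      # FT of Gaussian with variance 3
G_psf = np.exp(-0.5 * 1.0 * w**2)      # FT of Gaussian with variance 1
G_quot = G_obs / G_psf                 # division = f-domain deconvolution

# The quotient matches the FT of a Gaussian with variance 3 - 1 = 2,
# so the inverse transform is again a Gaussian.
G_expected = np.exp(-0.5 * 2.0 * w**2)
print(np.allclose(G_quot, G_expected))  # True
```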
If you are interested in pursuing neural networks (or other machine learning), you would be well served to spend some time on the mathematical fundamentals. I started skimming this trendy new treatise, which IMO is not too difficult mathematically (although I have seen some complain otherwise, and I was a math major, so YMMV!).
That said, I will try to address your questions as best I can at a high level.
- Can L1, L2 and Tikhonov calculations be appended as extra terms directly to a cost function, such as MSE?
Yes, these are all penalty methods which add terms to the cost function.
- If #1 is correct, is it possible to apply them in supervised learning situations where there are no cost functions?
I am not sure what you are thinking here, but in my mind the supervised case is the most clear in terms of cost functions and regularization. In general you can think of the error conceptually as
$$\text{Total Error}=\left(\text{ Data Error }\right) + \left(\text{ Prior Error }\right)$$
where the first term penalizes the misfit of the model predictions vs. the training data, while the second term penalizes overfitting of the model (with the goal of lowering generalization errors).
For a problem where features $x$ are used to predict data $y$ via a function $f(x,w)$ parameterized by weights $w$, the above cost function will typically be realized mathematically by something like
$$E[w]=\|\,f(x,w)-y\,\|_p^p + \|\Lambda w\|_q^q$$
Here the terms have the same interpretation as above, with the errors measured by "$L_p$ norms": The data misfit usually has $p=2$, which corresponds to MSE, while the regularization term may use $q=2$ or $q=1$. (The $\Lambda$ I will get to below.)
The latter corresponds to absolute(-value) errors rather than square errors, and is commonly used to promote sparsity in the solution vector $w$ (i.e. many 0 weights, effectively limiting the connectivity paths). The $L_1$ norm can be used for the data misfit term also, typically to reduce sensitivity to outliers in the data $y$. (Conceptually, predictions will target the median of $y$ rather than the mean of $y$.)
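To make the cost function above concrete, here is a minimal sketch for a linear model $f(x,w)=Xw$ with $p=2$ (MSE-style misfit) and $q=1$ (sparsity-promoting penalty, with $\Lambda=\lambda I$); `X`, `y`, `w`, and `lam` are illustrative placeholders, not from the original post:

```python
import numpy as np

# Composite cost: ||f(x,w) - y||_2^2 + ||lam * w||_1
def cost(w, X, y, lam=0.1, q=1):
    misfit = np.sum((X @ w - y) ** 2)       # data misfit, p = 2 (MSE-like)
    penalty = np.sum(np.abs(lam * w) ** q)  # prior term, q = 1 (sparsity)
    return misfit + penalty

X = np.eye(2)                # trivial "features" for illustration
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])     # fits the data exactly, so misfit is 0
print(cost(w, X, y))         # only the L1 penalty remains: |0.1| + |0.2|
```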
Note that in many unsupervised learning scenarios there are effectively two "machines", with each taking turns as "data" vs. "prediction" (e.g. the coder and decoder parts of an autoencoder).
- I've gathered that L2 is a special case of Tikhonov Regularization ...
The two can be used as synonyms. Some communities tend to use "$L_2$" to refer to the special case where the Tikhonov matrix $\Lambda$ is simply a scalar $\lambda$. Terminology varies quite a bit.
- L2 is differentiable and therefore compatible with gradient descent ... however, it appears that we can substitute other operations like difference and Fourier operators to derive other brands of Tikhonov Regularization that are not equivalent to L2.
This is not correct. Tikhonov regularization always uses the $L_2$ norm, so is always a differentiable $L_2$ regularization. The matrix $\Lambda$ does not impact this (it is constant, it does not matter if it is a scalar, a diagonal covariance, a finite difference operator, a Fourier transform, etc.).
For differentiability, the comparison is typically between $L_2$ and $L_1$. You can think of it like $$(x^2)'=2x \quad \text{ vs. } \quad |x|'=\mathrm{sgn}(x)$$ i.e. the $L_2$ norm has a continuous derivative while the $L_1$ norm has a discontinuous derivative.
This difference is not as big as you might imagine in terms of optimize-ability. For example the popular ReLU activation function also has a discontinuous derivative, i.e. $$\max(0,x)'=(x>0)$$ More importantly, the $L_1$ regularizer is convex, so helps suppress local minima in the composite cost function. (Note I am avoiding technical issues around convex optimization & differentiability, such as "subgradients", as these are not essential to the point.)
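A quick numeric sketch of the three derivatives above (the sample points are arbitrary):

```python
import numpy as np

# The L2 penalty has a continuous gradient (2x); the L1 penalty's gradient
# jumps at 0 (sgn(x)), much like ReLU's derivative (x > 0) jumps at 0.
x = np.array([-2.0, -0.5, 0.5, 2.0])
grad_l2 = 2 * x                     # (x^2)'      = 2x
grad_l1 = np.sign(x)                # |x|'        = sgn(x), jumps at 0
grad_relu = (x > 0).astype(float)   # max(0, x)'  = (x > 0), jumps at 0
print(grad_l2)    # [-4. -1.  1.  4.]
print(grad_l1)    # [-1. -1.  1.  1.]
print(grad_relu)  # [0. 0. 1. 1.]
```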
- If #4 is true, can someone post an example of a Tikhonov formula that is not equivalent to L2, yet would be useful in neural nets?
This was answered above: No, because all Tikhonov regularizers use the $L_2$ norm. (Whether they are useful for NN or not!)
- There are arcane references in the literature to a "Tikhonov Matrix," which I can't seem to find any definitions of (although I've run across several unanswered questions scattered across the Internet about its meaning). An explanation of its meaning would be helpful.
This is simply the matrix $\Lambda$. There are many forms that it can take, depending on the goals of the regularization:
In the simplest case it is a constant ($\lambda>0$). This penalizes large weights (i.e. promotes $w_i^2\approx 0$ on average), which can be useful to prevent over-fitting.
In the next simplest case, $\Lambda$ is diagonal, which allows per-weight regularization (i.e. $\lambda_iw_i^2\approx 0$). For example the regularization might vary with level in a deep network.
Many other forms are possible, so I will end with one example of a sparse but non-diagonal $\Lambda$ that is common: A finite difference operator. This only makes sense when the weights $w$ are arranged in some definite spatial pattern. In this case $\Lambda w\approx \nabla w$, so the regularization promotes smoothness (i.e. $\nabla w\approx 0$). This is common in image processing (e.g. tomography), but could conceivably be applied in some types of neural-network architectures as well (e.g. ConvNets).
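A minimal sketch of two such $\Lambda$ matrices (the size $n=5$, $\lambda$, and the weight values are arbitrary): the scalar case $\Lambda=\lambda I$ and a first-order finite-difference operator $D$ with $(Dw)_i = w_{i+1}-w_i$:

```python
import numpy as np

n, lam = 5, 0.3
L_scalar = lam * np.eye(n)        # simplest Tikhonov matrix: lambda * I

# Forward-difference operator: (n-1) x n matrix with rows like [-1, 1, 0, ...]
D = np.diff(np.eye(n), axis=0)

w = np.array([1.0, 1.0, 1.0, 5.0, 5.0])  # piecewise-constant weights
print(D @ w)                 # [0. 0. 4. 0.] -- nonzero only at the jump
print(np.sum((D @ w) ** 2))  # 16.0 -- the smoothness penalty ||D w||^2
```

Note how the penalty $\|Dw\|^2$ is zero on the constant segments and only charges the jump, which is exactly the "promote smoothness" behavior described above.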
In the last case, note that this type of "Tikhonov" matrix $\Lambda$ can really be used in $L_1$ regularization as well. A great example I once worked on used this smoothing setup in an image-processing context: I had a least squares ($L_2$) cost function set up to do denoising, and by literally just putting the same matrices into a "robust least squares" solver* I got an edge-preserving smoother "for free"! (*Essentially this just changed the $p$'s and $q$'s in my first equation from 2 to 1.)
I hope this answer has helped you to understand regularization better! I would be happy to expand on any part that is still not clear.
Best Answer
Deconvolution is part of a more general class of problems called inverse problems. In the case of deconvolution you want to recover the original image from an observed one that has been altered by some process/system modeled through a filter and then corrupted by noise, that is, $$ \text{observed}(x) = h * \text{input}(x) + \epsilon $$ where the noise is often assumed to be white. Notice that this problem is ill-posed.

First, assume that the image is not affected by noise. $h$ is usually a blur filter, which is band limited. If you take the Fourier transform, $$ \text{Observed}(\omega) = H(\omega)\,\text{Input}(\omega) $$ then the quotient $\text{Observed}(\omega)/H(\omega)$ is not defined for frequencies outside the band, where $H(\omega)=0$. Hence, you need to introduce an additional constraint on the filter in order to be able to solve it. If you have noise in addition, $$ \text{Input}(\omega) = \frac{\text{Observed}(\omega)}{H(\omega)} - \frac{\text{Noise}(\omega)}{H(\omega)} $$ so that small variations in the noise produce very different solutions wherever $H(\omega)$ is small.
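The noise amplification can be demonstrated with a small sketch (the Gaussian blur width, signal, and noise level are arbitrary illustrative choices):

```python
import numpy as np

# Sketch of the ill-posedness: naive Fourier-domain division by H(w) blows
# up wherever |H(w)| is tiny, so tiny noise yields wildly wrong "solutions".
rng = np.random.default_rng(0)
n = 64
x = np.zeros(n); x[n // 2] = 1.0                            # true signal: a spike
h = np.exp(-0.5 * ((np.arange(n) - n // 2) / 3.0) ** 2)     # Gaussian blur
h /= h.sum()

H = np.fft.fft(np.fft.ifftshift(h))
observed = np.real(np.fft.ifft(H * np.fft.fft(x))) \
    + 1e-6 * rng.standard_normal(n)                          # blur + tiny noise

naive = np.real(np.fft.ifft(np.fft.fft(observed) / H))       # divide by H(w)
print(np.max(np.abs(naive)))   # orders of magnitude larger than the true spike
```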
The idea of Tikhonov regularization is to stabilize the problem by adding some constraint on the possible solutions. This constraint encodes some a priori knowledge about the problem. Concretely, you would solve, $$ H(f) = \sum_{i}\left(f(x_{i})-y_{i}\right)^{2} + \lambda ||Df||^{2} $$ where $D$ is a differential operator, for example $\frac{d^{2}}{dx^{2}}$. This condition is basically imposing some smoothness on the solution. See the paper by Poggio et al. (Regularization Networks and Neural Networks Architectures) for a detailed derivation and some concrete cases.
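A minimal sketch of Tikhonov-regularized deconvolution in the Fourier domain, using the simplest case $D=I$, for which the closed-form frequency-domain solution is $\hat f(\omega) = \overline{H}(\omega)\,\text{Observed}(\omega)\,/\,(|H(\omega)|^2 + \lambda)$; the setup and $\lambda$ are illustrative choices:

```python
import numpy as np

# Tikhonov deconvolution: minimize ||h * f - observed||^2 + lam ||f||^2,
# solved per-frequency as conj(H) * Observed / (|H|^2 + lam).
rng = np.random.default_rng(0)
n = 64
x = np.zeros(n); x[n // 2] = 1.0                            # true spike at 32
h = np.exp(-0.5 * ((np.arange(n) - n // 2) / 3.0) ** 2)     # Gaussian blur
h /= h.sum()

H = np.fft.fft(np.fft.ifftshift(h))
obs = np.real(np.fft.ifft(H * np.fft.fft(x))) \
    + 1e-6 * rng.standard_normal(n)

lam = 1e-6
f_hat = np.conj(H) * np.fft.fft(obs) / (np.abs(H) ** 2 + lam)
f = np.real(np.fft.ifft(f_hat))
print(np.argmax(f))   # 32 -- the spike is recovered at the right location
```

The $+\lambda$ in the denominator is what prevents the division from blowing up where $|H(\omega)|$ is tiny; that is the stabilization described above.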
Now it can be proved that this results in the following solution, $$ f(x) = \sum_{i}c_{i}G(x-x_{i}) $$ where $G$ is the Green function (a.k.a. kernel in the context of regression) associated with the regularizer. By means of cross-validation you can search for good values of $\lambda$ and the order of the differential operator.
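As a sketch of tuning $\lambda$ by validation, here is plain ridge (Tikhonov) regression with its closed-form solution $w = (X^TX + \lambda I)^{-1}X^Ty$ and a hold-out search over a small grid; all data and the grid are synthetic/illustrative, and a real application would use proper cross-validation:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(40)
X_tr, y_tr, X_va, y_va = X[:30], y[:30], X[30:], y[30:]

def ridge(X, y, lam):
    # Closed-form Tikhonov-regularized least squares.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lams = [1e-4, 1e-2, 1.0, 100.0]
errs = [np.mean((X_va @ ridge(X_tr, y_tr, lam) - y_va) ** 2) for lam in lams]
best = lams[int(np.argmin(errs))]
print(best)   # a small lambda wins on this low-noise problem
```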