Solved – How to calculate derivative of the contractive auto-encoder regularization term

checking, mathematical-statistics, neural networks, regularization

Setup

I found a paper that describes a variant of the standard auto-encoder (the contractive auto-encoder), whose training objective includes the following regularization penalty:

$$\left|\left|J_f(x)\right|\right|^2_F = \sum_{ij}{\left( \frac{\partial h_j(x)}{\partial x_i} \right)}^2$$

where $\left|\left|\cdot\right|\right|_F^2$ is the squared Frobenius norm, $h$ is the vector of hidden units, and $x$ is the input. The paper also gives an alternative form of the penalty (for the case where $f$ is a sigmoid):

$$\left|\left|J_f(x)\right|\right|^2_F = \sum_{i=1}^{d_h}(h_i(1-h_i))^2\sum_{j=1}^{d_x}W^2_{ij}$$

Question 1

As per usual, no actual derivation is given to get the second form of the equation in the paper. I attempted to derive it myself, but would greatly appreciate it if someone could check my work and let me know what mistakes I might have made.

$$ a(x) = W^T x + b $$
$$ h(x) = f(a(x)) $$
$$ f(x) = \frac{1}{1+e^{-x}} $$

Then, using the chain rule:

$$ \frac{\partial h(x)}{\partial x} = \frac{\partial h(x)}{\partial a(x)} \frac{\partial a(x)}{\partial x} $$

Using the standard derivative of the sigmoid:
$$
\frac{\partial h(x)}{\partial a(x)} = f(a(x))(1 - f(a(x))) = h(1 - h)
$$

and:

$$
\frac{\partial a(x)}{\partial x} = W
$$

Thus, finally:

$$
\frac{\partial h(x)}{\partial x} = h(1-h)W
$$

Question 2

The problem I run into is that I'm not entirely sure of how to take the derivative of $\left(\frac{\partial h(x)}{\partial x}\right)^2$ with respect to $W$ in order to be able to get the gradient. If I square my previous result, I should get:

$$
\frac{\partial \left[h(1-h)\right]^2W^2}{\partial W}
$$

which I would think would give:

$$
2\left[h(1-h)\right]^2 W
$$

I'm having difficulty interpreting this. I am a bit confused about whether $h(1-h)$ is then a dot product. If so, does that just give me a scalar multiplied by $W$? If not, and it's an element-wise multiplication, then I think the dimensionality would be all wrong.

Or perhaps I did all of this incorrectly. Any help would be greatly appreciated!

Best Answer

If I interpret your equations correctly, $W$ is supposed to be a matrix. This means that $a(x)$ is a vector, and the chain rule then actually reads: $$\frac{\partial h_i}{\partial x_j} = \sum_k \frac{\partial h_i}{\partial a_k} \frac{\partial a_k}{\partial x_j}.$$

In matrix notation, the second term is $\frac{\partial a}{\partial x} = W$, i.e. $\frac{\partial a_k}{\partial x_j} = W_{kj}$, if we take the convention $a = Wx + b$ that matches the indices in the paper's formula (with your $a = W^\top x + b$, replace $W$ by $W^\top$ throughout). If I interpret your equations correctly, $f$ is applied to each element of $a$ individually. Writing $h(a_j)$ for $f(a_j) = h_j$, this gives $\frac{\partial h_i}{\partial a_j} = \delta_{ij}h(a_j)(1-h(a_j))$, which is a diagonal matrix with the term $h(a_j)(1-h(a_j))$ as the $j$th diagonal entry. Therefore $$\frac{\partial h_i}{\partial x_j} = \sum_k \delta_{ik} h(a_k)(1-h(a_k)) W_{kj} = h(a_i)(1-h(a_i)) W_{ij}.$$ Thus, $$\sum_{ij}\left(\frac{\partial h_i}{\partial x_j}\right)^2 =\sum_{ij} h(a_i)^2(1-h(a_i))^2 W_{ij}^2 = \sum_{i} h(a_i)^2(1-h(a_i))^2 \sum_{j} W_{ij}^2$$
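The identity above can be sanity-checked numerically. The following is a minimal sketch in plain Python, assuming the convention $a = Wx + b$ with `W` stored as a list of rows (one row per hidden unit). The function names and the random test values are illustrative and do not come from the paper. It compares the closed form with the sum of squared central finite differences of $h$ with respect to $x$.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden(W, b, x):
    # h_i = sigmoid(sum_j W[i][j] * x[j] + b[i]); assumes a = W x + b
    return [sigmoid(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def penalty_closed_form(W, b, x):
    # sum_i (h_i (1 - h_i))^2 * sum_j W_ij^2
    h = hidden(W, b, x)
    return sum((hi * (1.0 - hi)) ** 2 * sum(w * w for w in row)
               for hi, row in zip(h, W))


def penalty_finite_diff(W, b, x, eps=1e-6):
    # Approximate each dh_i/dx_j by a central difference, then sum the squares
    total = 0.0
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        h_plus = hidden(W, b, x_plus)
        h_minus = hidden(W, b, x_minus)
        total += sum(((p - m) / (2.0 * eps)) ** 2
                     for p, m in zip(h_plus, h_minus))
    return total


# Arbitrary small test problem (sizes and seed are illustrative)
random.seed(0)
d_h, d_x = 3, 4
W = [[random.uniform(-1, 1) for _ in range(d_x)] for _ in range(d_h)]
b = [random.uniform(-1, 1) for _ in range(d_h)]
x = [random.uniform(-1, 1) for _ in range(d_x)]

closed = penalty_closed_form(W, b, x)
numeric = penalty_finite_diff(W, b, x)
print(closed, numeric)  # the two values should agree to several digits
```

In matrix terms the Jacobian is $\operatorname{diag}\big(h \odot (1-h)\big)\,W$: row $i$ of $W$ is scaled by its own factor $h_i(1-h_i)$. So $h(1-h)$ is neither a dot product nor a single scalar multiplying all of $W$.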

Edit (second derivative): Since I am guessing that you want the derivative of the regularizer w.r.t. $W$, here is what I get (please check it numerically for correctness):

$$\frac{\partial}{\partial W_{kl}}\|J_{f}\|_{F}^{2}=\frac{\partial}{\partial W_{kl}}\sum_{i}h(a_{i})^{2}(1-h(a_{i}))^{2}\sum_{j}W_{ij}^{2}$$
$$=\sum_{i}\left(h(a_{i})^{2}(1-h(a_{i}))^{2}\sum_{j}\delta_{ik}\delta_{jl}2W_{ij}+\left(\sum_{j}W_{ij}^{2}\right)\frac{\partial}{\partial W_{kl}}h(a_{i})^{2}(1-h(a_{i}))^{2}\right)$$
$$=h(a_{k})^{2}(1-h(a_{k}))^{2}2W_{kl}+\sum_{i}\left(\sum_{j}W_{ij}^{2}\right)\left(2h(a_{i})h'(a_{i})\frac{\partial a_{i}}{\partial W_{kl}}\cdot(1-h(a_{i}))^{2}-h(a_{i})^{2}2(1-h(a_{i}))h'(a_{i})\frac{\partial a_{i}}{\partial W_{kl}}\right)$$
$$=2h(a_{k})^{2}(1-h(a_{k}))^{2}W_{kl}+\sum_{i}\left(\sum_{j}W_{ij}^{2}\right)2\left(h(a_{i})^{2}(1-h(a_{i}))^{3}\delta_{ik}x_{l}-h(a_{i})^{3}(1-h(a_{i}))^{2}\delta_{ik}x_{l}\right)$$
$$=2h(a_{k})^{2}(1-h(a_{k}))^{2}W_{kl}+\left(\sum_{j}W_{kj}^{2}\right)2h(a_{k})^{2}(1-h(a_{k}))^{3}x_{l}-\left(\sum_{j}W_{kj}^{2}\right)2h(a_{k})^{3}(1-h(a_{k}))^{2}x_{l}$$
$$=2h(a_{k})^{2}(1-h(a_{k}))^{2}\left(W_{kl}+x_{l}\left(1-2h(a_{k})\right)\left(\sum_{j}W_{kj}^{2}\right)\right)$$

Sorry for the mess, but I wanted you to be able to follow my calculations in case the numerical derivative turns out not to match.
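A numerical check of the final expression could look like the sketch below. It is plain Python and assumes $a = Wx + b$ (so that $\partial a_i / \partial W_{kl} = \delta_{ik} x_l$, as used in the derivation). The helper names, sizes, and random values are illustrative only. It compares the analytic gradient $2h_k^2(1-h_k)^2\big(W_{kl} + x_l(1-2h_k)\sum_j W_{kj}^2\big)$ with central finite differences of the penalty.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def penalty(W, b, x):
    # ||J_f(x)||_F^2 = sum_i (h_i (1 - h_i))^2 * sum_j W_ij^2, with a = W x + b
    total = 0.0
    for row, bi in zip(W, b):
        h = sigmoid(sum(w * xj for w, xj in zip(row, x)) + bi)
        total += (h * (1.0 - h)) ** 2 * sum(w * w for w in row)
    return total


def penalty_grad_W(W, b, x):
    # Analytic gradient: 2 h_k^2 (1-h_k)^2 (W_kl + x_l (1 - 2 h_k) sum_j W_kj^2)
    grad = []
    for row, bk in zip(W, b):
        h = sigmoid(sum(w * xj for w, xj in zip(row, x)) + bk)
        s2 = (h * (1.0 - h)) ** 2
        row_sq = sum(w * w for w in row)
        grad.append([2.0 * s2 * (w_kl + x_l * (1.0 - 2.0 * h) * row_sq)
                     for w_kl, x_l in zip(row, x)])
    return grad


def numeric_grad_W(W, b, x, eps=1e-6):
    # Central finite differences, perturbing one weight at a time
    grad = [[0.0] * len(row) for row in W]
    for k in range(len(W)):
        for l in range(len(W[k])):
            original = W[k][l]
            W[k][l] = original + eps
            f_plus = penalty(W, b, x)
            W[k][l] = original - eps
            f_minus = penalty(W, b, x)
            W[k][l] = original  # restore the weight
            grad[k][l] = (f_plus - f_minus) / (2.0 * eps)
    return grad


# Arbitrary small test problem (sizes and seed are illustrative)
random.seed(1)
d_h, d_x = 3, 4
W = [[random.uniform(-1, 1) for _ in range(d_x)] for _ in range(d_h)]
b = [random.uniform(-1, 1) for _ in range(d_h)]
x = [random.uniform(-1, 1) for _ in range(d_x)]

analytic = penalty_grad_W(W, b, x)
numeric = numeric_grad_W(W, b, x)
max_err = max(abs(a - n)
              for row_a, row_n in zip(analytic, numeric)
              for a, n in zip(row_a, row_n))
print(max_err)  # should be tiny if the derivation is right
```

Note that this is the gradient of the penalty for a single input $x$; for a training set it would be summed or averaged over the examples.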
