Solved – How does the back-propagation work in a siamese neural network

neural networks

I have been studying the architecture of the siamese neural network introduced by Yann LeCun and his colleagues in 1994 for signature verification (“Signature Verification using a Siamese Time Delay Neural Network”, NIPS 1994).

I understood the general idea of this architecture, but I really cannot understand how the backpropagation works in this case.
I cannot understand what the target values of the network are, i.e. the values that would allow backpropagation to properly set the weights of each neuron.

Image from  “Probabilistic Siamese Network for Learning Representations” by Chen Liu (University of Toronto 2013).

In this architecture, the algorithm computes the cosine similarity between the final representations produced by the two subnetworks.
The paper states: "The desired output is for a small angle between the outputs of the two subnetworks (f1 and f2) when two genuine signatures are presented, and a large angle if one of the signatures is a forgery".

I cannot really understand how they could use a binary function (the cosine similarity between two vectors) as the target to run backpropagation against.

How is backpropagation computed in siamese neural networks?

Best Answer

Both networks share the same architecture and are constrained to have the same weights, as the publication describes in section 4 [1].

Their goal is to learn features that maximize the cosine similarity between their output vectors when the signatures are genuine, and minimize it when one of the signatures is a forgery (this is also what backpropagation is driven towards, although the actual loss function is not presented in the paper).

The cosine similarity $\cos(A,B) = {A \cdot B \over \|A\| \|B\|}$ of two vectors $A, B$ is a measure of similarity that gives you the cosine of the angle between them (therefore, its output is not binary). If your concern is how you can backpropagate when the desired output is either true or false, think of the case of binary classification.
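To see that this value is continuous rather than binary, here is a minimal NumPy sketch; the vectors are made-up examples, not outputs of the actual network:

```python
# Minimal NumPy sketch: cosine similarity is a continuous value in [-1, 1],
# not a binary output. The vectors below are made-up examples.
import numpy as np

def cosine_similarity(a, b):
    """cos(A, B) = (A . B) / (||A|| ||B||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

o1 = np.array([0.9, 0.1, 0.3])    # hypothetical output of subnetwork 1
o2 = np.array([0.8, 0.2, 0.25])   # hypothetical output of subnetwork 2
print(cosine_similarity(o1, o2))  # close to 1 -> small angle -> "similar"
```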

You shouldn't change the output layer; it consists of trained neurons with linear activations and is a higher-level abstraction of your input. The whole network should be trained together. Both outputs $O_1$ and $O_2$ are passed through a $\cos(O_1,O_2)$ function that returns their cosine similarity (close to $1$ if they are similar, down to $-1$ if they are not). Given that, and given two sets of input pairs $X_{Forged}$ and $X_{Genuine}$, the simplest possible loss function you could train against is something like:

$$\mathcal{L}=\sum_{(x_A,x_B) \in X_{Forged}} \cos\big(f(x_A),f(x_B)\big) \;-\; \sum_{(x_C,x_D) \in X_{Genuine}} \cos\big(f(x_C),f(x_D)\big)$$

where $f$ denotes the shared subnetwork.
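To make the mechanics concrete, here is a rough PyTorch sketch of one training step against a loss of this form; the subnetwork architecture, input sizes, and data are placeholders I made up, not those of the paper. Because both branches call the same module, weight sharing (and the backward pass through both branches) comes for free:

```python
# Rough PyTorch sketch of one training step for a siamese setup with the loss above.
# The subnetwork, input sizes, and data are made-up placeholders, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

subnet = nn.Sequential(                      # f: the shared subnetwork (toy stand-in)
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32),
)
optimizer = torch.optim.SGD(subnet.parameters(), lr=1e-2)

def siamese_loss(forged_a, forged_b, genuine_a, genuine_b):
    # cosine similarity of forged pairs is pushed down, of genuine pairs pushed up
    cos_forged = F.cosine_similarity(subnet(forged_a), subnet(forged_b), dim=1)
    cos_genuine = F.cosine_similarity(subnet(genuine_a), subnet(genuine_b), dim=1)
    return cos_forged.sum() - cos_genuine.sum()

# one step on random placeholder data (8 forged pairs, 8 genuine pairs)
x_a, x_b = torch.randn(8, 128), torch.randn(8, 128)   # "forged" pairs
x_c, x_d = torch.randn(8, 128), torch.randn(8, 128)   # "genuine" pairs
optimizer.zero_grad()
loss = siamese_loss(x_a, x_b, x_c, x_d)
loss.backward()    # gradients flow through both branches into the shared weights
optimizer.step()
```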

After you have trained your network, you just feed in the two signatures, take the two outputs $O_1$ and $O_2$, pass them to the $\cos(O_1,O_2)$ function, and check their similarity.
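Continuing the hypothetical sketch above, verification after training could look like this, with a made-up decision threshold:

```python
# Verification with the trained (toy) subnet from the sketch above;
# the inputs and the 0.8 threshold are made-up placeholders.
with torch.no_grad():
    o1 = subnet(torch.randn(1, 128))   # representation of the reference signature
    o2 = subnet(torch.randn(1, 128))   # representation of the questioned signature
    similarity = F.cosine_similarity(o1, o2, dim=1).item()
print("genuine" if similarity > 0.8 else "forgery")
```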

Finally, there are several ways to keep the weights of the two networks identical (weight sharing is used in Recurrent Neural Networks too); a common approach is to average the gradients of the two networks before performing the gradient descent update step.
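If you do keep two explicit copies of the weights, a hedged sketch of the "average the gradients, then apply the same update to both copies" approach might look like this (architecture and data are again placeholders):

```python
# Sketch of gradient averaging across two explicit weight copies (toy example).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

net1 = nn.Linear(128, 32)
net2 = copy.deepcopy(net1)                  # start from identical weights

# one (genuine) pair: push the cosine similarity of the two outputs up
o1, o2 = net1(torch.randn(8, 128)), net2(torch.randn(8, 128))
loss = -F.cosine_similarity(o1, o2, dim=1).sum()
loss.backward()

lr = 1e-2
with torch.no_grad():
    for p1, p2 in zip(net1.parameters(), net2.parameters()):
        avg_grad = (p1.grad + p2.grad) / 2  # average the two branches' gradients
        p1 -= lr * avg_grad                 # identical updates keep the copies identical
        p2 -= lr * avg_grad
```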

[1] http://papers.nips.cc/paper/769-signature-verification-using-a-siamese-time-delay-neural-network.pdf