Solved – How does one Initialize Neural Networks as suggested by Saxe et al using Orthogonal matrices and a gain factor

Tags: conv-neural-network, machine-learning, neural-networks, optimization, python

I was reading the deep learning book by Bengio, Goodfellow and Courville, and in chapter 8 (the optimization chapter) they mention that Saxe et al. have an initialization based on orthogonal matrices and a gain factor $g$ that depends on the non-linearity. The chapter doesn't actually say how to do this initialization. I tried reading the paper, but it seems to be a bit beyond my level of mathematical sophistication. Does anyone understand how the initialization they refer to is done?

For example, the questions I would like answered are:

How does one choose orthogonal matrices? Just any K orthogonal matrices for any weight matrices?
How is $g$ chosen depending on the non-linearity?

I probably should have mentioned this, but I am planning to use it with Python/TensorFlow if possible.

Reference: Andrew M. Saxe, James L. McClelland, Surya Ganguli. "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks." arXiv:1312.6120.

Best Answer

Here is what Lasagne does; it should answer both of your questions:

class Orthogonal(Initializer):
    """Initialize weights as Orthogonal matrix.
    Orthogonal matrix initialization [1]_. For n-dimensional shapes where
    n > 2, the n-1 trailing axes are flattened. For convolutional layers, this
    corresponds to the fan-in, so this makes the initialization usable for
    both dense and convolutional layers.
    Parameters
    ----------
    gain : float or 'relu'
        Scaling factor for the weights. Set this to ``1.0`` for linear and
        sigmoid units, to 'relu' or ``sqrt(2)`` for rectified linear units, and
        to ``sqrt(2/(1+alpha**2))`` for leaky rectified linear units with
        leakiness ``alpha``. Other transfer functions may need different
        factors.
    References
    ----------
    .. [1] Saxe, Andrew M., James L. McClelland, and Surya Ganguli.
           "Exact solutions to the nonlinear dynamics of learning in deep
           linear neural networks." arXiv preprint arXiv:1312.6120 (2013).
    """
    def __init__(self, gain=1.0):
        if gain == 'relu':
            gain = np.sqrt(2)

        self.gain = gain

    def sample(self, shape):
        if len(shape) < 2:
            raise RuntimeError("Only shapes of length 2 or more are "
                               "supported.")

        flat_shape = (shape[0], np.prod(shape[1:]))
        a = get_rng().normal(0.0, 1.0, flat_shape)
        u, _, v = np.linalg.svd(a, full_matrices=False)
        # pick the one with the correct shape
        q = u if u.shape == flat_shape else v
        q = q.reshape(shape)
        return floatX(self.gain * q)
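To make the two answers concrete, here is a minimal NumPy-only sketch of the same sampling steps, without the Lasagne dependencies. The names `gain_for` and `orthogonal` and the `seed` argument are hypothetical, introduced only for illustration; the gain values are the ones listed in the docstring above. It is a sketch under those assumptions, not Lasagne's API.

```python
import numpy as np

def gain_for(nonlinearity, alpha=0.0):
    """Hypothetical helper: gain values as listed in the Lasagne docstring."""
    if nonlinearity in ("linear", "sigmoid"):
        return 1.0
    if nonlinearity == "relu":
        return float(np.sqrt(2.0))
    if nonlinearity == "leaky_relu":
        # alpha is the leakiness of the leaky rectifier
        return float(np.sqrt(2.0 / (1.0 + alpha ** 2)))
    raise ValueError("no gain listed for %r" % (nonlinearity,))

def orthogonal(shape, gain=1.0, seed=None):
    """NumPy-only version of the sampling steps shown above."""
    if len(shape) < 2:
        raise ValueError("Only shapes of length 2 or more are supported.")
    rng = np.random.default_rng(seed)
    # flatten the trailing axes so that any shape becomes a 2-D matrix
    flat_shape = (shape[0], int(np.prod(shape[1:])))
    a = rng.normal(0.0, 1.0, flat_shape)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    # u is (rows, k) and v is (k, cols) with k = min(rows, cols);
    # exactly one of them has the flattened shape unless the matrix is square
    q = u if u.shape == flat_shape else v
    return gain * q.reshape(shape)

# a convolutional weight in Lasagne's layout: (out, in, height, width)
W = orthogonal((64, 32, 3, 3), gain=gain_for("relu"), seed=0)
flat = W.reshape(64, -1)  # (64, 288), so the rows are the orthogonal set
# rows are mutually orthogonal with squared norm gain**2 = 2
print(np.allclose(flat @ flat.T, 2.0 * np.eye(64)))
```

So the orthogonal matrix is not chosen by hand: it is read off the SVD of a random Gaussian matrix, one per weight matrix, and the gain is a scalar multiplied in afterwards.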

This RNN tutorial does the same thing (minus the gain):

import numpy

# orthogonal initialization for weights
# see Saxe et al. ICLR'14
def ortho_weight(ndim):
    W = numpy.random.randn(ndim, ndim)
    # the left singular vectors of a square Gaussian matrix form an orthogonal matrix
    u, s, v = numpy.linalg.svd(W)
    return u.astype('float32')

So I assume it's correct (I hope so since this is the code I use).
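One way to test that assumption is to check the defining property directly: for an orthogonal matrix $Q$, $Q^T Q = I$, and multiplying by $Q$ leaves the Euclidean norm of a vector unchanged. The sketch below repeats the tutorial function with a `seed` argument added here only for reproducibility (an assumption, not part of the tutorial code).

```python
import numpy as np

def ortho_weight(ndim, seed=None):
    # same idea as the tutorial snippet; the seed argument is added for this check
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((ndim, ndim))
    u, s, v = np.linalg.svd(W)
    return u.astype('float32')

Q = ortho_weight(128, seed=0)
x = np.random.default_rng(1).standard_normal(128).astype('float32')

# Q^T Q should equal the identity up to float32 rounding
max_err = np.abs(Q.T @ Q - np.eye(128, dtype='float32')).max()
# an orthogonal map preserves the Euclidean norm, so this ratio should be close to 1
norm_ratio = np.linalg.norm(Q @ x) / np.linalg.norm(x)
print(max_err, norm_ratio)
```

A small `max_err` and a `norm_ratio` close to 1 indicate the matrix is orthogonal to within floating-point precision.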

In TensorFlow:

import numpy as np
import tensorflow as tf

def orthogonal_initializer(scale=1.1):
    ''' From Lasagne and Keras. Reference: Saxe et al., http://arxiv.org/abs/1312.6120
    '''
    def _initializer(shape, dtype=tf.float32):
        # flatten the trailing axes so that any shape becomes a 2-D matrix
        flat_shape = (shape[0], int(np.prod(shape[1:])))
        a = np.random.normal(0.0, 1.0, flat_shape)
        u, _, v = np.linalg.svd(a, full_matrices=False)
        # pick the one with the correct shape
        q = u if u.shape == flat_shape else v
        q = q.reshape(shape)
        # tf.constant converts the float64 NumPy array to the requested dtype
        return tf.constant(scale * q, dtype=dtype)
    return _initializer
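One caveat about layout: the initializer above flattens the shape as `(shape[0], prod(shape[1:]))`, which matches Lasagne's convolutional layout `(out, in, height, width)`. TensorFlow's `tf.nn.conv2d` kernels are laid out as `(height, width, in, out)`, so for those it may be preferable to flatten every axis except the last one instead. The following NumPy sketch shows that variant; the function name `orthogonal_last_axis_out` and the `seed` argument are hypothetical, and this is an illustration under that layout assumption rather than the method of either answer.

```python
import numpy as np

def orthogonal_last_axis_out(shape, gain=1.0, seed=None):
    """Orthogonal init for weights whose LAST axis is the output axis."""
    if len(shape) < 2:
        raise ValueError("Only shapes of length 2 or more are supported.")
    rng = np.random.default_rng(seed)
    # fan-in = product of all leading axes, fan-out = last axis
    flat_shape = (int(np.prod(shape[:-1])), shape[-1])
    a = rng.normal(0.0, 1.0, flat_shape)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    # pick whichever factor has the flattened shape
    q = u if u.shape == flat_shape else v
    return (gain * q.reshape(shape)).astype(np.float32)

# a 3x3 kernel with 16 input and 32 output channels, TensorFlow layout
K = orthogonal_last_axis_out((3, 3, 16, 32), seed=0)
flat = K.reshape(-1, 32)  # (144, 32): the 32 columns are orthonormal
print(np.allclose(flat.T @ flat, np.eye(32), atol=1e-4))
```

The resulting array can be passed to `tf.constant` in the same way as in the initializer above.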
