Solved – Confused about Dropout implementations in Tensorflow

deep learning, dropout, neural networks, optimization, tensorflow

I have a network whose input size is 100 and output size is 2, with no hidden layers in between. I applied dropout with a keep_prob of 0.8 to the input and tried to understand the outcome.

As expected, the dropout mask has around 17-23 zeros every time I run it. However, almost all of the weights are updated. According to the paper:

Forward and back-propagation for that training case are done only on this thinned network.

So I was expecting that around 80 of my weights would change in each training step, but in reality all of them change (in the beginning around 90-95 change, and in the following iterations all of them change).

I don't know whether this has to do with the way dropout is implemented in TensorFlow. Does anybody know why this is happening?

This is the code I'm running to check it.

import numpy as np
import tensorflow as tf

# As input, 100 random numbers.
input_size = 100
output_size = 2

x = tf.placeholder(tf.float32,[None, input_size],name="input")
y = tf.placeholder(tf.float32,[None, output_size],name="labels")

with tf.variable_scope("dense1") as scope:
    W = tf.get_variable("W",shape=[input_size,output_size],initializer=tf.keras.initializers.he_uniform())
    b = tf.get_variable("b",initializer=tf.zeros([output_size]))
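    # Note: in the TF 1.x API the second argument of tf.nn.dropout is keep_prob,
    # so each input element is kept with probability 0.8 and scaled by 1/0.8.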
    dropped = tf.nn.dropout(x,0.8)
    dense = tf.matmul(dropped,W)+b

eval_pred = tf.nn.sigmoid(dense,name="prediction")

cost = tf.reduce_mean(tf.losses.absolute_difference(labels=y, predictions=eval_pred))
train_step = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)


# 20 epochs, batch size of 1
epochs = 20

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    allWeights = []
    for i in range(epochs):

        x_raw = np.random.random((1,input_size))
        y_raw = np.random.random((1,output_size))
        [_,c,d,w]=sess.run([train_step,cost,dropped,W], feed_dict={x: x_raw, y: y_raw})
        #print("Epoch {0}/{1}. Loss: {2}".format(i+1,epochs,c))

        # Numbers will be around 20% of input_size (17-22)
        print(np.sum(d==0))
        allWeights.append(w)

print("Calculate the difference between W_i and W_{i-1}")
for wi in range(1,len(allWeights)):
    difference = allWeights[wi]-allWeights[wi-1]
    # I expect that there will be around 20 weights that won't be updated
    # so the difference between the current weight and the previous one
    # should be zero.
    print(np.sum(difference==0))

Best Answer

This is because you're using the Adam optimizer. Adam is a momentum-style optimizer: it keeps running estimates of the first and second moments of the gradients, and each parameter update is computed from those estimates rather than from the current gradient alone. A weight whose input was dropped gets a zero gradient in that step, but its stored moments from earlier steps are generally nonzero, so Adam still moves it. That also matches what you observed: in the very first iterations the moment estimates are still zero, so weights whose inputs have only ever been dropped receive a zero update, but once a weight has seen at least one nonzero gradient it keeps changing on every subsequent step, even when dropout zeroes its gradient.
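As a quick illustration, here is a minimal NumPy sketch of a single-weight Adam update (my own toy code, using the paper's notation and the common default hyperparameters, not TensorFlow's internal implementation). It shows that once the moment estimates are nonzero, a step with a zero gradient still changes the weight.

import numpy as np

# Toy Adam step for one scalar weight, with the usual default hyperparameters.
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

def adam_step(w, g, m, v, t):
    t += 1
    m = beta1 * m + (1 - beta1) * g        # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v, t

w, m, v, t = 0.5, 0.0, 0.0, 0

# Step 1: the weight's input was kept, so it receives a nonzero gradient.
w, m, v, t = adam_step(w, g=0.3, m=m, v=v, t=t)

# Step 2: the weight's input is dropped, so its gradient is exactly zero,
# yet the stored moments still push the weight.
w_before = w
w, m, v, t = adam_step(w, g=0.0, m=m, v=v, t=t)
print(w - w_before)   # nonzero (about -0.0067 with these numbers)

If you swap tf.train.AdamOptimizer(learning_rate=0.01) for tf.train.GradientDescentOptimizer(learning_rate=0.01) in your script, you should see the behaviour you expected: the rows of W that correspond to dropped inputs come back with a difference of exactly zero between consecutive iterations, because plain SGD applies the raw gradient and nothing else.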