Solved – Confused about Dropout implementations in Tensorflow

deep learning, dropout, neural networks, optimization, tensorflow

I have a network whose input size is 100 and output size is 2, with no hidden layers in between. I applied dropout with a keep_prob of 0.8 to the input and tried to understand the outcome.

As expected, the dropout mask has around 17-23 zeros every time I run it. However, almost all of the weights are updated. According to the paper:

Forward and back-propagation for that training case are done only on this thinned network.

So I was expecting that around 80 of my weights would change in each training step, but in reality all of them change (in the beginning around 90-95 change, and in the following iterations all of them change).

I don't know whether this has to do with the way dropout is implemented in TensorFlow. Does anybody know why this is happening?

This is the code I'm running to check it.

import numpy as np
import tensorflow as tf

# As input, 100 random numbers.
input_size = 100
output_size = 2

x = tf.placeholder(tf.float32,[None, input_size],name="input")
y = tf.placeholder(tf.float32,[None, output_size],name="labels")

with tf.variable_scope("dense1") as scope:
    W = tf.get_variable("W",shape=[input_size,output_size],initializer=tf.keras.initializers.he_uniform())
    b = tf.get_variable("b",initializer=tf.zeros([output_size]))
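    # Note: in the TF 1.x API the second argument of tf.nn.dropout is keep_prob,
    # so each input element is kept with probability 0.8 and scaled by 1/0.8.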
    dropped = tf.nn.dropout(x,0.8)
    dense = tf.matmul(dropped,W)+b

eval_pred = tf.nn.sigmoid(dense,name="prediction")

cost = tf.reduce_mean(tf.losses.absolute_difference(labels=y, predictions=eval_pred))
train_step = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)


# 20 epochs, batch size of 1
epochs = 20

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    allWeights = []
    for i in range(epochs):

        x_raw = np.random.random((1,input_size))
        y_raw = np.random.random((1,output_size))
        [_,c,d,w]=sess.run([train_step,cost,dropped,W], feed_dict={x: x_raw, y: y_raw})
        #print("Epoch {0}/{1}. Loss: {2}".format(i+1,epochs,c))

        # Numbers will be around 20% of input_size (17-22)
        print(np.sum(d==0))
        allWeights.append(w)

print("Calculate the difference between W_i and W_{i-1}")
for wi in range(1,len(allWeights)):
    difference = allWeights[wi]-allWeights[wi-1]
    # I expect that there will be around 20 weights that won't be updated
    # so the difference between the current weight and the previous one
    # should be zero.
    print(np.sum(difference==0))

Best Answer

This is because you're using the Adam optimizer. Adam is a momentum-style optimizer: it keeps running estimates of the first and second moments of the gradients, and each parameter update is computed from those estimates rather than from the current gradient alone. A weight whose input was dropped gets a zero gradient in that step, but its stored moments from earlier steps are generally nonzero, so Adam still moves it. That also matches what you observed: in the very first iterations the moment estimates are still zero, so weights whose inputs have only ever been dropped receive a zero update, but once a weight has seen at least one nonzero gradient it keeps changing on every subsequent step, even when dropout zeroes its gradient.
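As a quick illustration, here is a minimal NumPy sketch of a single-weight Adam update (my own toy code, using the paper's notation and the common default hyperparameters, not TensorFlow's internal implementation). It shows that once the moment estimates are nonzero, a step with a zero gradient still changes the weight.

import numpy as np

# Toy Adam step for one scalar weight, with the usual default hyperparameters.
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

def adam_step(w, g, m, v, t):
    t += 1
    m = beta1 * m + (1 - beta1) * g        # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v, t

w, m, v, t = 0.5, 0.0, 0.0, 0

# Step 1: the weight's input was kept, so it receives a nonzero gradient.
w, m, v, t = adam_step(w, g=0.3, m=m, v=v, t=t)

# Step 2: the weight's input is dropped, so its gradient is exactly zero,
# yet the stored moments still push the weight.
w_before = w
w, m, v, t = adam_step(w, g=0.0, m=m, v=v, t=t)
print(w - w_before)   # nonzero (about -0.0067 with these numbers)

If you swap tf.train.AdamOptimizer(learning_rate=0.01) for tf.train.GradientDescentOptimizer(learning_rate=0.01) in your script, you should see the behaviour you expected: the rows of W that correspond to dropped inputs come back with a difference of exactly zero between consecutive iterations, because plain SGD applies the raw gradient and nothing else.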