I did some experimenting with Keras' MNIST tutorial.
If I edit the model to be fully convolutional, then train it, I encounter the same problem.
If I instead train the model as written, save the weights, and then import them to a convolutionalized model (reshaping where appropriate), it tests as perfectly equivalent. However, training it further causes accuracy to drop drastically.
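The equivalence after reshaping can be checked in isolation with NumPy. This is a minimal sketch using the shapes from the code below (a 14x14x64 feature map feeding a 128-unit Dense layer): a Dense layer applied to the flattened map computes exactly the same dot products as a 14x14 'valid' convolution whose kernel is the Dense weight matrix reshaped.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 14x14x64 feature map (channels_last), as produced by the conv stack.
fmap = rng.standard_normal((14, 14, 64))

# Dense-layer weights acting on the flattened map: (14*14*64, 128).
W = rng.standard_normal((14 * 14 * 64, 128))

# Dense path: flatten, then matrix multiply.
dense_out = fmap.reshape(-1) @ W

# Convolutional path: the same weights viewed as a 14x14 'valid' kernel
# with 64 input channels and 128 output channels.
kernel = W.reshape(14, 14, 64, 128)
conv_out = np.einsum('hwc,hwcf->f', fmap, kernel)

assert np.allclose(dense_out, conv_out)
```

This only confirms the forward passes match at import time; it says nothing about why further training then diverges.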
So changing the network to be fully convolutional changes the gradient in some way, such that the network no longer converges to an optimum. This page claims that there is some way to train a network as fully convolutional from the start, but does not say how. Possibly it involves the use of a different loss function.
For those interested, my code for convolutionalizing the MNIST tutorial and reimporting the weights is below.
from __future__ import print_function
import keras
from keras.utils import plot_model
from keras.datasets import mnist
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import numpy as np
# *same as tutorial*
weights = load_model('CNN.h5').get_weights()
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape,
                 padding='same'))
model.add(Conv2D(64, (3, 3), activation='relu',
                 padding='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
#model.add(Flatten())
#model.add(Dense(128, activation='relu'))
model.add(Conv2D(128, (14,14), activation='relu', padding='valid'))
model.add(Dropout(0.5))
#model.add(Dense(num_classes, activation='softmax'))
model.add(Conv2D(num_classes, (1,1), activation='softmax'))
model.add(Flatten())
plot_model(model, 'model.png', show_shapes=True)
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
# Conv layers keep their weights unchanged; the former Dense layers take
# the same weights reshaped into convolution kernels.
model.layers[0].set_weights([weights[0], weights[1]])
model.layers[1].set_weights([weights[2], weights[3]])
model.layers[4].set_weights([weights[4].reshape([14, 14, 64, 128]), weights[5]])
model.layers[6].set_weights([weights[6].reshape([1, 1, 128, num_classes]), weights[7]])
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
          verbose=1, validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
#model.save('CNN.h5')
print('Test loss:', score[0])
print('Test accuracy:', score[1])
The first thing to do if your NN is not converging is to repeatedly reduce the learning rate. It's the most important hyperparameter.
Divide the LR parameter by 10, try again, rinse, repeat. You might find, for example, that it needs to be 10,000 times smaller before you stop bouncing around and actually start descending the gradient of your loss surface.
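As a sketch, that divide-by-10 sweep is just a list of candidate learning rates to try in turn (`lr_candidates` is an illustrative helper here, not a Keras API):

```python
def lr_candidates(initial_lr, trials):
    """Divide the learning rate by 10 for each successive trial."""
    return [initial_lr / 10 ** i for i in range(trials)]

# Try each candidate until the loss stops bouncing and starts descending.
print(lr_candidates(1.0, 5))  # [1.0, 0.1, 0.01, 0.001, 0.0001]
```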
Adam is often regarded as the best "out of the box" optimiser; you might want to start with that instead of the Adadelta used above.
opt = keras.optimizers.Adam(lr=nnParams['lr'], beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
Best Answer
Does the code below mean we are doing 2 convolutions before max pooling? Yes, it means you are doing two convolutions before pooling.
If so, why are we doing it twice and then pooling? Why not? This is just a different model; the results are not going to change a whole lot, and it is by no means wrong to do this. In fact, it will probably improve the accuracy of the model, since performing more convolutions before the feature maps are shrunk by pooling can lead to more interesting representations of the data.
The intuition is: before pooling you have more pixels than after (and before the first pooling you still have all the original pixels). The filters can therefore slide over more positions in the image and perform more convolution operations, leading to a richer representation.
The trade-off, of course, is computational time. That richer representation is why more modern models started stacking many more convolutional layers before the pooling layers.
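One standard way to see why stacking works, sketched here for stride-1 convolutions: each extra kxk layer grows the receptive field by k-1, so two 3x3 convolutions together see a 5x5 patch, and three see 7x7, at a lower parameter cost than a single large kernel.

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer extends the field by (kernel - 1)
    return rf

print(receptive_field([3, 3]))     # 5: two 3x3 convs cover a 5x5 patch
print(receptive_field([3, 3, 3]))  # 7: three 3x3 convs cover 7x7
```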