Solved – feature extraction: freezing convolutional base vs. training on extracted features

Tags: conv-neural-network, deep learning, keras, machine learning

[Note: To clarify, this question is about the theory; the code is included only to explain the issue better. This is not in any way a programming question.]

Section 5.3 of "Deep Learning with Python" by François Chollet explains how to use a pre-trained network for deep learning on small image datasets. Two different approaches to feature extraction (using only the convolutional base of VGG16) are introduced:

1. FAST FEATURE EXTRACTION WITHOUT DATA AUGMENTATION: in this approach, the features of each image in the dataset are first extracted by calling the predict method of the conv_base model. Here is the code for reference:

from keras.applications import VGG16
conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(150, 150, 3))

import os
import numpy as np

from keras.preprocessing.image import ImageDataGenerator
base_dir = '/Users/fchollet/Downloads/cats_and_dogs_small'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')

datagen = ImageDataGenerator(rescale=1./255)
batch_size = 20

def extract_features(directory, sample_count):
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    labels = np.zeros(shape=(sample_count))
    generator = datagen.flow_from_directory(
        directory,
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode='binary')
    i = 0
    for inputs_batch, labels_batch in generator:
        features_batch = conv_base.predict(inputs_batch)
        features[i * batch_size : (i + 1) * batch_size] = features_batch
        labels[i * batch_size : (i + 1) * batch_size] = labels_batch
        i += 1
        if i * batch_size >= sample_count:
            break

    return features, labels

train_features, train_labels = extract_features(train_dir, 2000)
validation_features, validation_labels = extract_features(validation_dir, 1000)
test_features, test_labels = extract_features(test_dir, 1000)

These features are then fed to a densely connected classifier, which is trained from scratch:

train_features = np.reshape(train_features, (2000, 4 * 4 * 512))
validation_features = np.reshape(validation_features, (1000, 4 * 4 * 512))
test_features = np.reshape(test_features, (1000, 4 * 4 * 512))

from keras import models
from keras import layers
from keras import optimizers

model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_dim=4 * 4 * 512))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=optimizers.RMSprop(lr=2e-5),
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(train_features, train_labels,
                    epochs=30,
                    batch_size=20,
                    validation_data=(validation_features, validation_labels))

2. FEATURE EXTRACTION WITH DATA AUGMENTATION: in this approach (which is much slower), the convolutional base is extended by adding a densely connected classifier on top of it, and the whole model is trained end-to-end. However, the convolutional layers are frozen to prevent their weights from being updated:

from keras import models
from keras import layers

model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

conv_base.trainable = False

from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')

test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_dir,
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary')

model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=2e-5),
              metrics=['acc'])

history = model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=30,
    validation_data=validation_generator,
    validation_steps=50)

My questions:

1) I don't understand the difference between the first and the second approach (apart from the data augmentation in the second approach and the additional dropout layer in the first). The weights of the convolutional base in the second approach are not updated, so it is only used in the forward pass. Therefore it is essentially the same as the convolutional base in the first approach, and the classifiers are identical as well, so I think they should give us the same accuracy (and speed). What am I missing?

2) One thing that makes me more anxious is that I have tried both approaches on my machine. The second approach is much slower, but both reach an accuracy of 90% on the validation data, whereas the book suggests that the first and second approaches reach 90% and 96%, respectively, on the validation data. (If the approaches are different) Why does this happen?

3) It is suggested in the book that in the first approach we could not use data augmentation. It is not clear to me why this is so. In particular, what prevents us from using an ImageDataGenerator in the first approach, like the one used in the second approach, to generate the training data? (Further, although the second approach is said to use data augmentation, considering the values of batch_size and steps_per_epoch, the number of images used for training in both approaches is the same, i.e. 2000.)

Best Answer

I think you understand it correctly. To address your questions:

(1.) Therefore it is essentially the same as the convolutional base in the first approach and the classifiers are identical as well so I think they should give us the same accuracy (and speed). What am I missing?

The methods are essentially equivalent. You are not missing anything. The difference in accuracy is due to data augmentation (see below).

The second method is slower because you need to 1) generate augmented images on the fly, 2) compute the convnet features for every augmented image. The first method skips this and just uses precomputed convnet features for a fixed set of images.
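To make the equivalence and the cost difference concrete, here is a minimal NumPy sketch. It is not the book's code: `frozen_base`, `W_frozen` and the toy "augmentation" (adding noise) are hypothetical stand-ins for VGG16 and ImageDataGenerator. It only illustrates that a frozen feature extractor returns the same features every epoch when the inputs do not change, so caching them is safe, while changed inputs force a new forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen convolutional base: a fixed random projection
# followed by ReLU. The weights are never updated (hypothetical, not VGG16).
W_frozen = rng.normal(size=(12, 8))

def frozen_base(images):
    """Forward pass through the frozen 'base'."""
    return np.maximum(images @ W_frozen, 0.0)

images = rng.normal(size=(20, 12))  # 20 fake "images" with 12 values each
epochs = 3

# Approach 1: run the base once and cache the features.
cached_features = frozen_base(images)
forward_passes_cached = len(images)

# Approach 2: run the base inside the training loop, every epoch.
forward_passes_end_to_end = 0
for _ in range(epochs):
    features_this_epoch = frozen_base(images)  # no augmentation here
    forward_passes_end_to_end += len(images)
    # Same inputs and same frozen weights give the same features as the cache.
    assert np.array_equal(features_this_epoch, cached_features)

# With augmentation the inputs change, so the cached features no longer apply.
augmented = images + 0.1 * rng.normal(size=images.shape)  # toy "augmentation"
features_differ = not np.allclose(frozen_base(augmented), cached_features)

print(forward_passes_cached, forward_passes_end_to_end, features_differ)
```

In this toy setting the end-to-end loop performs `epochs` times as many forward passes through the base as the cached version, which is the same reason the second method is slower in the real setup.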

(2.) In the book it is suggested that the first and second approach reach an accuracy of 90% and 96%, respectively on the validation data. (If the approaches are different) Why does this happen?

The second method should work better because it uses data augmentation. Data augmentation is a very powerful technique, so an improvement of about 6 percentage points in accuracy is plausible.

(3.) It is suggested in the book that in the first approach we could not use data augmentation. It is not clear for me why this is so.

In principle, you could use data augmentation with the first method. Instead of generating augmented samples on the fly, you would first generate a large number of them (say, 1000 variants of every sample) and compute their convnet features, which you would then use to train the classifier. The downsides of this approach are 1) higher memory requirements and 2) a "limited" number of augmented samples (after every 1000 epochs, you just start using the same samples again). On the other hand, it is faster than the second approach.
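As a rough check on the memory point, the size of such a precomputed feature cache can be estimated with plain arithmetic. The sketch below assumes the (4, 4, 512) feature shape and the 2000 training images from the question; the helper name `feature_cache_bytes` is made up for illustration, and the figures ignore any overhead beyond the raw array.

```python
# Feature-map shape from the question's extract_features: (4, 4, 512) per image.
FEATURES_PER_SAMPLE = 4 * 4 * 512

def feature_cache_bytes(n_images, variants_per_image, bytes_per_value):
    """Raw size of an array holding features for every augmented variant."""
    n_samples = n_images * variants_per_image
    return n_samples * FEATURES_PER_SAMPLE * bytes_per_value

# No augmentation, float64 (the default dtype of np.zeros in the question).
baseline = feature_cache_bytes(2000, 1, 8)

# 1000 precomputed variants per image, as in the answer's example.
augmented_f64 = feature_cache_bytes(2000, 1000, 8)
augmented_f32 = feature_cache_bytes(2000, 1000, 4)

for name, size in [("baseline", baseline),
                   ("1000 variants, float64", augmented_f64),
                   ("1000 variants, float32", augmented_f32)]:
    print(f"{name}: {size / 1e9:.2f} GB")
```

Under these assumptions the cache grows from roughly 0.13 GB to more than 100 GB in float64, so in practice you would precompute far fewer variants per image or store the features on disk.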

(3.) considering the values of batch_size and steps_per_epoch, the number of images used for training in both approaches is the same, i.e. 2000

In every epoch, both methods use 2000 images. However, the first method uses the same 2000 images in every epoch. The second method uses different, augmented versions of those images every epoch.
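The counting argument can be written out with the numbers from the question (batch_size=20, steps_per_epoch=100, epochs=30, 2000 training images). This is only arithmetic, not a measurement; the count for the augmented case is an upper bound, since random transformations could in principle repeat.

```python
# Values taken from the question's training setup.
batch_size = 20
steps_per_epoch = 100
epochs = 30
n_train_images = 2000

# Both methods draw the same number of samples per epoch.
samples_per_epoch = batch_size * steps_per_epoch
total_samples_seen = samples_per_epoch * epochs

# Method 1: the same cached features are reused in every epoch.
distinct_inputs_cached = n_train_images

# Method 2: every draw is a freshly transformed image, so the number of
# distinct inputs is at most the total number of samples drawn.
distinct_inputs_augmented_max = total_samples_seen

print(samples_per_epoch, distinct_inputs_cached, distinct_inputs_augmented_max)
```

So the per-epoch count is identical, but over 30 epochs the classifier in the second method can see up to 30 times as many distinct inputs.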