Should a CNN generalize to arbitrary positions in the data

conv-neural-network, image processing, machine learning, neural networks, signal processing

I have trained a CNN on one-dimensional data, namely the power spectral density (PSD) of $N$ different classes of signals ($N=4$). Each of the $N$ signals has a different spectral shape (not shown here). For illustrative purposes, the plots referenced below are from the same signal class. The idea is to treat this as an image classification problem.

The model performs exceptionally well when the training data has all the examples centered around the same frequency (i.e., within a few hundred hertz):

[Figure: all training examples centered on or around 0 Hz]

The CNN fails to properly classify new examples that are outside the frequency range of the training data:

[Figure: invalid classification of new data outside the frequency range of the training data]

Model Details and Assumptions:

The CNN model is implemented in PyTorch using the following layers:

    import torch.nn as nn

    n_classes = 4  # N = 4 signal classes

    model = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=64, kernel_size=128, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=16, stride=2),
            nn.Dropout(p=0.25),
            nn.Flatten(),
            # in_features = 64 channels x 4026 pooled positions, so the model
            # is tied to one specific input length
            nn.Linear(257664, n_classes)
    )
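A quick shape check with a dummy batch confirms the sizes (the input length of 8192 samples is an assumption, inferred from the Flatten output $257664 = 64 \times 4026$):

    import torch

    # Assumed input length of 8192 samples, consistent with
    # in_features = 257664 in the Linear layer above
    x = torch.rand(8, 1, 8192)   # (batch, channels, length), values in [0, 1]
    print(model(x).shape)        # torch.Size([8, 4])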

The optimizer is torch.optim.Adam. The batch size has been varied from 8 to 128 and the number of epochs from 10 to 50. The input data is normalized to $[0,1]$. The number of training examples varied from 2000 to 8000, of which 20% are used for validation and 20% for test.
I have also tried adding additional convolutional layers and varying the kernel sizes, number of neurons, number of layers, etc.
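For reference, a minimal sketch of this training setup; the loss function is an assumption (it is not named above), and train_loader stands in for a DataLoader over (PSD, label) pairs:

    import torch
    import torch.nn as nn

    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()  # assumed loss for multi-class output

    for epoch in range(10):
        for xb, yb in train_loader:   # hypothetical DataLoader of (PSD, label) batches
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()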

Questions:

  1. Shouldn't the CNN model generalize such that new examples outside the frequency range of the training data (i.e., not centered around the middle) are identified as the correct class?
  2. Are there other steps I need to take, whether in the model or training data?

Best Answer

A CNN does not automatically generalize to arbitrary translations. I can see two approaches that allow a CNN to generalize to arbitrary translations:

  • Data augmentation

    Image classification models are unaware of variations that may occur in new data that were not present in the training data.

    If you train a model to recognize shapes that always occur around a specific coordinate, then it may bake that location into the model. It will not recognize the shape independently of its location; instead, it does something more like recognizing the values at specific locations. E.g., it learns that if there is a step function around the specific locations $\pm 1000\ \text{Hz}$, then the signal is of the class in your example.

    When, during the training phase, you feed the model multiple variations of the data (you can use the same data and apply some transformations), the model will be able to adapt to those potential variations; see the augmentation sketch after this list. See also: https://stats.stackexchange.com/search?q=data+augmentation

  • Network architecture

    You can make a convolutional neural network automatically translation invariant by adding a (global) pooling layer after the convolutional layers (provided no layers before it have already mixed away the spatial information; additional convolutional layers may still be necessary to make up for the information that is lost by a global pooling layer).
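As a sketch of the augmentation idea for this 1-D case (random_shift is a hypothetical helper, and the circular shift and the max_shift value are assumptions; any frequency shift that matches the data would do):

    import torch

    def random_shift(x, max_shift):
        # Circularly shift each example by a random offset along the
        # frequency axis; x has shape (batch, channels, length)
        shifts = torch.randint(-max_shift, max_shift + 1, (x.shape[0],))
        return torch.stack([torch.roll(xi, int(s), dims=-1)
                            for xi, s in zip(x, shifts)])

    # e.g., applied to each batch during training:
    # xb = random_shift(xb, max_shift=2000)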


I see now that I missed your architecture description. You have no global pooling, and therefore the network is not automatically translation invariant.

I don't believe that more convolutional layers are really necessary (as you describe in your own answer) if the model already works when the samples have no translation. What you need is a global pooling layer. (You could potentially place it after more than one convolutional layer, but those additional layers do not create translation invariance.)
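A minimal sketch of that change, reusing the single-convolution model from the question (nn.AdaptiveMaxPool1d with output size 1 is one way to express global pooling in PyTorch; global average pooling would work too):

    import torch.nn as nn

    n_classes = 4

    model = nn.Sequential(
        nn.Conv1d(in_channels=1, out_channels=64, kernel_size=128),
        nn.ReLU(),
        nn.AdaptiveMaxPool1d(1),   # global pooling: keep each kernel's strongest
                                   # response, discard where it occurred
        nn.Flatten(),              # (batch, 64, 1) -> (batch, 64)
        nn.Linear(64, n_classes)   # classifies from 64 position-free features
    )

The linear layer now sees only how strongly each kernel fired somewhere in the signal, not where, so the same weights apply regardless of the center frequency.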

The actual shapes in your signal classes may help you figure out a design (whether to use more layers or not). You should be able to work out, more or less, how the network connects the layers (or fails to connect them) and how it recognizes the different classes.

Currently you have a single convolution with a filter of size 128, which gets filtered with a sliding max-pooling window, and finally a linear model determines the classes. That final linear model will only learn to recognize the output of the kernel in the particular places that occur in the training data. When the signal pattern occurs somewhere else, at a different place, there is no reason why the trained model should recognize it: different neurons become activated (because the kernel produces a high value at a different place during the convolution), and your model did not learn to connect to those different neurons.

In short, this is how I see it:

  • Input signal
  • Several kernels (whose shapes you train) convert this into a convolved signal. E.g., a filter like $1\cdot x_{k-2} + 1 \cdot x_{k-1} + 0\cdot x_{k} - 1 \cdot x_{k+1} - 1\cdot x_{k+2}$ will turn the signal into positive and negative spikes wherever there is a step function; see the demonstration after this list.
  • You only keep the positive part.
  • Then you fit a linear model to predict the classes. But this model will only create strong links to the places where the convolved training signals often have a spike. There is no reason why training should create a response to spikes at different places.
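A small demonstration of that filter (the rectangular test signal is chosen purely for illustration; the kernel is the one from the list above):

    import torch
    import torch.nn.functional as F

    # The example kernel: y_k = x_{k-2} + x_{k-1} - x_{k+1} - x_{k+2}
    kernel = torch.tensor([[[1.0, 1.0, 0.0, -1.0, -1.0]]])  # (out_ch, in_ch, width)

    x = torch.zeros(1, 1, 20)
    x[..., 8:12] = 1.0           # a rectangular bump: step up, then step down

    y = F.conv1d(x, kernel, padding=2)
    print(y)                     # negative spike at the rising edge, positive at the falling edge
    print(y.relu())              # the ReLU keeps only the positive spike

Shifting the bump moves the spikes by the same amount, which is exactly the positional variation that the flattened linear layer never learns to handle.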

