RNN/LSTM networks on spectrograms underfitting massively – is a CNN encoder a prerequisite?

audio, keras, lstm, recurrent-neural-network, tensorflow

I am prototyping a pipeline on the FSDD dataset (10-class audio classification); the audio clips are loaded with librosa, zero-padded/trimmed to 0.5 s (4000-dimensional numpy vectors) and converted to mel spectrograms with a frame size of 512, a hop size of 256 and 80 mel bands. That yields mel spectrograms of shape (80, 16).

I wanted to run a model that exploits the temporal aspect of the data, so I am using LSTMs with Keras. From tutorials (e.g. https://machinelearningmastery.com/understanding-simple-recurrent-neural-networks-in-keras/) I have seen that Keras reads RNN inputs as (batch_size, time_steps, features). Therefore, I created a dataloader that yields the transposed mel spectrograms to match that layout. Essentially, as I understand it, when feeding a 2D array to a Keras RNN, rows correspond to time steps and columns to features.
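As a minimal sketch of that layout (shapes taken from the setup above, batch size chosen arbitrarily):

```python
import numpy as np

# One mel spectrogram as produced by librosa: (mel_bands, frames) = (80, 16)
S = np.zeros((80, 16), dtype=np.float32)

# Transpose so rows are time steps and columns are features: (16, 80)
x = S.T

# Stack into a batch for the Keras RNN: (batch_size, time_steps, features)
batch = np.stack([x] * 32)
print(batch.shape)  # (32, 16, 80)
```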

I am running a really basic LSTM on the data:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

IN_SHAPE = (16, 80)  # 16 time steps, 80 mel features
model = keras.Sequential()
model.add(layers.Input(shape=IN_SHAPE))
model.add(layers.LSTM(128))
model.add(layers.Dense(100, activation='relu'))
model.add(layers.Dense(10, activation=tf.keras.activations.softmax))

model.summary()

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

history = model.fit(
    train_set,
    epochs=N_EPOCHS,
    validation_data=val_set
)

It seems to be underfitting badly (I have tried different learning rates and adding a subsequent LSTM layer), and what is most peculiar is that, both for training and validation, the accuracy fluctuates among the same few values. Below I list the training accuracies from the printed history as evidence:

[0.10355556011199951, 0.09955555945634842, 0.1137777790427208, 0.1088888868689537, 0.09022222459316254, 0.10711111128330231, 0.1088888868689537, 0.10488889366388321, 0.10355556011199951, 0.109333336353302, 0.10533333569765091, 0.10311111062765121,
0.1088888868689537, 0.10355556011199951 …]

  • Firstly, I was wondering whether my conceptual understanding of how
    the Keras RNN reads the transposed mel spectrograms is right or wrong.
  • Secondly, I was wondering whether the results are bad because RNNs and sequence models in general do not model spectrograms/multidimensional data well.

Best Answer

The problem had to do with the preprocessing of the data. Conceptually, the understanding of how RNNs read the transposed spectrogram is correct, i.e. rows correspond to time steps and columns to features.

On the second question, it follows from the first point: RNNs can model spectrograms/multidimensional data fine, as the results obtained after fixing the data issues were good.