Solved – Replacing RNNs with dilated convolutions

conv-neural-network, recurrent-neural-network

I'm currently working on a neural network for Handwritten Text Recognition (HTR) which takes images of words as input and outputs labels for those words. My HTR system is inspired by CRNN [1], which works quite well and contains the following components (see the sketch after this list):

  1. CNN for feature extraction
  2. RNN (LSTM) for sequential modelling and per-frame character predictions
  3. CTC as a loss function and to decode the per-frame predictions into the final label
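For context, a minimal PyTorch-style sketch of such a pipeline might look as follows; the layer sizes, pooling choices and the `CRNN` class name are illustrative assumptions, not the exact configuration from [1]:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Hypothetical CRNN-style model: CNN features -> BiLSTM -> per-frame logits."""
    def __init__(self, num_chars):
        super().__init__()
        # 1) CNN feature extraction: collapse the image height, keep the width as the time axis
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),            # height -> 1, width stays
        )
        # 2) RNN (bidirectional LSTM) for sequential modelling
        self.rnn = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        # 3) Per-frame character predictions, later trained/decoded with CTC
        self.fc = nn.Linear(2 * 128, num_chars + 1)     # +1 for the CTC blank label

    def forward(self, x):                   # x: (batch, 1, height, width)
        f = self.cnn(x)                     # (batch, 256, 1, time-steps)
        f = f.squeeze(2).permute(0, 2, 1)   # (batch, time-steps, 256)
        f, _ = self.rnn(f)                  # (batch, time-steps, 256)
        return self.fc(f)                   # (batch, time-steps, num_chars + 1)

# Training would apply nn.CTCLoss to the log-softmaxed, time-major outputs.
```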

Recently I came across a paper about sequence labelling with dilated convolutions [2]. They used it in the domain of Natural Language Processing (NLP) to replace their bidirectional LSTMs. Only a few dilated convolution layers are needed to propagate information through all time-steps of the input sequence.
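To make this concrete: with kernel size 3 and dilation rates 1, 2 and 4, the receptive field after three layers already spans 1 + 2 * (1 + 2 + 4) = 15 time-steps, so it grows exponentially with depth rather than linearly. A minimal sketch (PyTorch, with arbitrary channel counts and sequence length):

```python
import torch
import torch.nn as nn

# Stack of 1D dilated convolutions over a feature sequence; padding equals the
# dilation so the sequence length is preserved.
dilated_stack = nn.Sequential(
    nn.Conv1d(64, 64, kernel_size=3, dilation=1, padding=1), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=3, dilation=4, padding=4), nn.ReLU(),
)

seq = torch.randn(1, 64, 100)   # (batch, features, time-steps)
out = dilated_stack(seq)        # same shape: (1, 64, 100)
```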

This paper got me thinking: shouldn't that also work in the domain of HTR? After all, HTR and NLP share some similarities. The idea would be to replace the RNN part with such dilated convolutions, i.e. to combine feature extraction and information propagation in the CNN part of the net. As I'm new to deep learning, I'd really appreciate feedback on whether it makes sense to invest time trying this out, or whether there is some reason why it will never work well. (Of course I'm also implementing a prototype, but this question should be seen from a theoretical point of view.)

[1] https://arxiv.org/abs/1507.05717

[2] https://arxiv.org/abs/1702.02098

Best Answer

Just in case anyone is interested: yes, it is possible to replace the RNN layers with dilated convolutions (DCs). The architectures described in the speech recognition literature did not work out of the box for HTR, but with some modifications the results improved. Here is a short summary.

The NN contains CNN layers and a final CTC layer. In between, I integrated DCs, grouped into blocks: each block has one layer with a dilation rate of 1, one with rate 2 and one with rate 4. Each kernel has size 3x3, and the kernel weights (k1, k2, k4) are shared across blocks. The tested NN contains 2 blocks, and all intermediate outputs (o1, o2, ..., o6) are concatenated into one large feature matrix. Finally, for each time-step the features are mapped to all possible characters, and these per-frame predictions are fed into the CTC layer.
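To make the structure concrete, here is a rough PyTorch-style sketch of that part of the net. Only the block structure, the shared kernels (k1, k2, k4) and the concatenation of the intermediate outputs follow the description above; channel counts, the ReLU activations and collapsing the height axis by averaging are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DilatedHead(nn.Module):
    """2 blocks of dilated 3x3 convolutions (rates 1, 2, 4) with weights shared
    across blocks; all 6 intermediate outputs are concatenated and mapped to
    per-time-step character logits for the CTC layer."""
    def __init__(self, feat_channels, num_chars):
        super().__init__()
        # One kernel per dilation rate; reused in both blocks (shared weights)
        self.k1 = nn.Conv2d(feat_channels, feat_channels, 3, padding=1, dilation=1)
        self.k2 = nn.Conv2d(feat_channels, feat_channels, 3, padding=2, dilation=2)
        self.k4 = nn.Conv2d(feat_channels, feat_channels, 3, padding=4, dilation=4)
        # 1x1 convolution maps the concatenated features to characters (+1 for CTC blank)
        self.out = nn.Conv2d(6 * feat_channels, num_chars + 1, 1)

    def forward(self, x):                    # x: (batch, C, height, width) CNN features
        outputs = []
        for _ in range(2):                   # 2 blocks, the same three kernels each time
            for conv in (self.k1, self.k2, self.k4):
                x = torch.relu(conv(x))
                outputs.append(x)            # collect o1 .. o6
        feats = torch.cat(outputs, dim=1)    # (batch, 6*C, height, width)
        logits = self.out(feats)             # (batch, num_chars + 1, height, width)
        # Collapse the height axis so every image column is one time-step for CTC
        return logits.mean(dim=2)            # (batch, num_chars + 1, time-steps)
```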

