Neural Networks – Using Transformers Without CTC Loss and Language Models

Tags: neural networks, python, recurrent neural network

When building an ASR (Automatic Speech Recognition) system using transformer neural networks, we don't need to use a CTC loss and a language model, do we?

I've seen that RNN-based systems do need language models and a CTC loss.

By language models, I'm referring to models like KenLM that help check whether a sentence makes sense.

But transformers use attention, right? The input also goes through a positional embedding, so is a CTC loss still useful?

And why use a language model if the attention mechanism already does that?

Best Answer

You're mixing things up here.

  1. CTC is a loss function for neural networks used for sequence-to-sequence tasks (e.g. audio to text)
  2. RNN, CNN, transformers, ... describe the architecture/type of a model
  3. Language models are a post-processing step, applied after the model has computed an output for the input

You can use the CTC loss with CNN-only models, RNN-only models, and all other sorts of models, as long as the data has this sequential nature. So I see no reason not to use it with transformers too.
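To make this concrete, here is a minimal sketch (not from the answer above) of a transformer encoder trained with PyTorch's `nn.CTCLoss`. All shapes and hyperparameters are illustrative assumptions, and the positional embedding is omitted for brevity; the point is only that CTC attaches to the model's per-frame outputs regardless of architecture:

```python
import torch
import torch.nn as nn

class TransformerCTC(nn.Module):
    """Illustrative transformer acoustic model with a CTC output head."""
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=4, vocab_size=32):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)          # project audio features to model dim
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size + 1)  # +1 class for the CTC blank token

    def forward(self, feats):                           # feats: (batch, time, n_mels)
        x = self.encoder(self.proj(feats))              # positional encoding omitted here
        return self.head(x).log_softmax(-1)             # CTC expects log-probabilities

model = TransformerCTC()
ctc = nn.CTCLoss(blank=32)                              # blank index = vocab_size

feats = torch.randn(2, 100, 80)                         # dummy batch: 2 utterances, 100 frames
targets = torch.randint(0, 32, (2, 20))                 # dummy transcripts (token ids)
log_probs = model(feats).transpose(0, 1)                # CTCLoss wants (time, batch, classes)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 100, dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
loss.backward()                                         # trains exactly like any other loss
```

Swapping `self.encoder` for an RNN or a stack of convolutions would leave the CTC part of this code unchanged, which is the whole point: the loss only cares about the sequence of per-frame class scores.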

Language models are a post-processing step and are optional. Often they can fix small errors: e.g. when the model predicts "Hella", a language model might be able to correct that and turn it into "Hello". But don't expect too much from them.
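As a sketch of what that post-processing can look like, here is a simple rescoring step using the KenLM Python bindings. The model path `'lm.arpa'` is a hypothetical pretrained n-gram model, and `rescore` and its weighting are illustrative, not a standard API:

```python
import kenlm

lm = kenlm.Model('lm.arpa')  # hypothetical pretrained n-gram language model

def rescore(hypotheses, acoustic_scores, lm_weight=0.5):
    """Combine acoustic scores with KenLM log10 probabilities and keep the best hypothesis."""
    best = max(
        zip(hypotheses, acoustic_scores),
        key=lambda pair: pair[1] + lm_weight * lm.score(pair[0]),
    )
    return best[0]

# "Hella" may score slightly higher acoustically, but the LM prefers "Hello".
print(rescore(["Hella world", "Hello world"], [-1.0, -1.2]))
```

Because this runs entirely on the model's candidate outputs, you can add or drop it without retraining anything, which is why it is optional.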
