When building an ASR (Automatic Speech Recognition) system with transformer neural networks, we don't need CTC loss or a language model, do we?
I've seen that RNN-based models do use language models and CTC loss.
By language models, I mean models like KenLM that help check whether a sentence makes sense.
But transformers already use attention, and the input goes through a positional embedding, so is CTC loss still useful?
And why use a language model if the attention mechanism already does that?
Best Answer
You're mixing things up here.
You can use CTC loss with CNN-only models, RNN-only models, and all sorts of other models, as long as the data has this sequential nature. CTC's job is to handle the alignment between the long sequence of acoustic frames and the much shorter label sequence, and in an encoder-only setup attention alone doesn't solve that alignment for you. So I see no reason not to use it with transformers too.
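To make that concrete, here's a minimal PyTorch sketch of CTC loss on top of a transformer encoder. All sizes, the blank index, and the layer configuration are made-up placeholders, and a real model would also add positional encodings before the encoder:

```python
import torch
import torch.nn as nn

vocab_size = 32          # placeholder: characters + 1 blank token (index 0)
feat_dim = 80            # placeholder: e.g. log-mel filterbank features
batch, in_frames, out_len = 4, 200, 20

# Transformer encoder as the acoustic model (positional encodings omitted here)
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
proj_in = nn.Linear(feat_dim, 256)
proj_out = nn.Linear(256, vocab_size)

features = torch.randn(batch, in_frames, feat_dim)
hidden = encoder(proj_in(features))              # (batch, frames, 256)
log_probs = proj_out(hidden).log_softmax(-1)     # frame-level label distributions

targets = torch.randint(1, vocab_size, (batch, out_len))  # 0 is reserved for blank
input_lengths = torch.full((batch,), in_frames, dtype=torch.long)
target_lengths = torch.full((batch,), out_len, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
# nn.CTCLoss expects log_probs shaped (frames, batch, vocab)
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
```

The key point: the loss only needs frame-level log-probabilities of the right shape; it doesn't care whether they came from an RNN, a CNN, or a transformer.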
Language models are a post-processing step and are optional. Often they can fix small errors: e.g., when the model predicts "Hella", a language model might be able to correct that to "Hello". But don't expect too much from them.
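To illustrate the post-processing idea, here's a toy rescoring sketch using kenlm's Python bindings. The model path and candidate list are hypothetical, and in practice you'd combine the LM score with the acoustic model's score using a tuned weight rather than using the LM score alone:

```python
import kenlm

# Hypothetical path to a trained KenLM ARPA model
lm = kenlm.Model("lm.arpa")

# Hypothetical candidate transcripts, e.g. from beam-search decoding
candidates = ["hella world", "hello world"]

# score() returns a log10 probability; higher is better
best = max(candidates, key=lambda s: lm.score(s, bos=True, eos=True))
print(best)  # a reasonable LM should prefer "hello world"
```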