
Do transformers have the potential to replace RNN end-to-end models for online speech recognition? This mainly depends on accuracy/latency and deployment cost, not training cost. Can a transformer support the low-latency online use case with comparable deployment cost and better results than RNN models?

jw_
  • I found this article to be a nice starting point: https://desh2608.github.io/2020-01-08-transformer-asr – arr_sea Oct 16 '20 at 06:06

1 Answer


> Are there examples where transformers achieve better accuracy than RNN end-to-end models such as the RNN transducer for speech recognition? Can transformers be used for online speech recognition, which requires low speech-end-to-result latency? Do transformers have the potential to replace RNN end-to-end models for speech recognition in most cases in the future? This may mainly depend on accuracy and deployment cost, not training cost.

You can check Facebook's wav2letter results on all of this:

https://ai.facebook.com/blog/online-speech-recognition-with-wav2letteranywhere/

https://research.fb.com/publications/scaling-up-online-speech-recognition-using-convnets/
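
To make the latency point concrete: the usual way to stream a transformer acoustic encoder is to restrict self-attention to a bounded left context and a small fixed lookahead, so the output for a frame never waits on the whole utterance. Below is a minimal NumPy sketch of that idea; the window sizes and the single-head attention are illustrative assumptions, not taken from the linked papers.

```python
# A minimal sketch of limited-context ("streaming") self-attention.
# The lookahead below bounds how many future frames a query may see,
# which is what keeps speech-end-to-result latency low. Window sizes
# are illustrative assumptions, not values from the linked papers.
import numpy as np

def streaming_attention_mask(num_frames: int, left_context: int, lookahead: int) -> np.ndarray:
    """Boolean mask: mask[q, k] is True if query frame q may attend to key frame k."""
    q = np.arange(num_frames)[:, None]
    k = np.arange(num_frames)[None, :]
    return (k >= q - left_context) & (k <= q + lookahead)

def masked_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Single-head self-attention over frames x (T, d), with disallowed positions masked out."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # (T, T) similarity scores
    scores = np.where(mask, scores, -1e9)         # block frames outside the allowed context
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                            # (T, d) contextualised frames

if __name__ == "__main__":
    T, d = 50, 16                                 # toy values: 50 frames, 16-dim features
    frames = np.random.randn(T, d)
    mask = streaming_attention_mask(T, left_context=20, lookahead=2)
    out = masked_attention(frames, mask)
    # Each output frame depends on at most 2 future frames, so emission can
    # start after a 2-frame delay instead of waiting for the whole utterance.
    print(out.shape)                              # (50, 16)
```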

Transformers definitely have potential in speech, especially when combined with faster computation methods such as hashing, just as in NLP.
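
The "hashing" here presumably refers to locality-sensitive-hashing attention in the Reformer style: frames are hashed with random projections, and attention is only computed among frames that fall into the same bucket, which cuts the quadratic cost of full self-attention. A rough sketch, with the bucket count and projection scheme chosen purely for illustration:

```python
# A rough sketch of hashing-based (LSH) attention: random rotations assign
# each frame to a bucket, and attention is computed only within buckets.
# Bucket count and projection setup are illustrative assumptions.
import numpy as np

def lsh_buckets(x: np.ndarray, num_buckets: int, rng: np.random.Generator) -> np.ndarray:
    """Assign each vector in x (T, d) to one of num_buckets via random projections."""
    proj = rng.standard_normal((x.shape[-1], num_buckets // 2))
    rotated = x @ proj                                       # (T, num_buckets // 2)
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

def bucketed_attention(x: np.ndarray, buckets: np.ndarray) -> np.ndarray:
    """Attend only within each hash bucket instead of over all T frames."""
    out = np.zeros_like(x)
    d = x.shape[-1]
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        chunk = x[idx]                                       # frames sharing a bucket
        scores = chunk @ chunk.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ chunk
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((1000, 64))
    buckets = lsh_buckets(frames, num_buckets=16, rng=rng)
    print(bucketed_attention(frames, buckets).shape)         # (1000, 64)
```

The trade-off is that bucketed attention is an approximation: frames that hash into different buckets cannot attend to each other, so accuracy depends on the hash grouping similar frames together.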

The problem with transformers is that you need a lot of GPUs to train them.

Nikolay Shmyrev
• This [answer over here](https://ai.stackexchange.com/a/20084) states that _transformers were introduced [...] with the purpose to avoid recursion in order to allow parallel computation (to reduce training time)_. Your last statement suggests that this may not be the case, at least from a practical perspective. Any clarification would be appreciated. – bluenote10 Jul 30 '22 at 16:02