Is there a detailed description or implementation of an end-to-end speech recognition system?

Question

I am trying to implement an end-to-end speech recognition system from scratch, that is, without using any existing framework (TensorFlow, Keras, etc.). I am building my own library, in which functions such as the exponential, log, sigmoid and ReLU are computed by polynomial approximation. I am looking for a clear description of the neural networks used in an end-to-end speech recognition system, with the architecture (layers, activation functions, etc.) laid out explicitly, so that I can implement it.

Most of the academic and industry papers I find refer to various previous works, toolkits or papers, which makes them tedious for me to follow. I am new to the field, which makes it harder still, so I am looking for some help here.

I took the term "Large Vocabulary Continuous Speech Recognition" from the Jurafsky-Martin book; I think it means converting real-world speech to text, as in the Wall Street Journal corpus or the Switchboard corpus. Unlike small datasets, these have vocabularies on the order of thousands of words, covering the words used in a genre. I also came across the recent "end-to-end speech recognition" papers by Baidu, Google and Microsoft. I am fine with any approach that gets me started on implementing my ideas. — Jaswin, Nov 11 '19 at 11:23
Welcome to AI.SE @jaswin. Can you include some links to papers you have already read? Are you looking for a description of the _architecture_ of a speech recognition network, or something else? It sounds like maybe you are curious about lower level concepts? — John Doucette, Nov 11 '19 at 17:46

0 Answers