For questions related to speech recognition, also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT), which is a sub-field of computational linguistics that enables the recognition and translation of spoken language into text by computers.
Questions tagged [speech-recognition]
28 questions
7
votes
2 answers
Term for algorithms that are not trained
Before the advent of neural architectures, many AI domains (e.g. speech recognition and computer vision) used algorithms that consisted of a series of hand-crafted transformations for feature extraction.
In speech recognition everything to do with…

Mew
- 181
- 2
4
votes
1 answer
How do AIs like Siri and Alexa respond to their names being called?
AIs like Siri and Alexa respond to their names being called. How does the system recognize the name by ignoring all the other words that have been said before their name? For example, "Hey Siri" would trigger Siri to start listening for commands,…

Sarem Hailemeskel
- 43
- 3
4
votes
1 answer
How does the CTC loss work?
I am trying to implement CTC loss in TensorFlow, but their documentation is pretty limited. So I am not sure how to approach the problem. I found a good example in Theano.
Are there any other resources that explain the CTC loss?
I am also trying to…

user26787
- 41
- 2
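As background for the CTC question above, here is a minimal sketch of the forward (alpha) recursion that computes the CTC negative log-likelihood for a single utterance. It is written in plain Python for readability and is not the TensorFlow or Theano implementation. The function name `ctc_neg_log_likelihood` and the choice of index 0 for the blank symbol are assumptions made for this illustration.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """probs[t][k] is the probability of symbol k at frame t (each row sums to 1).

    target is a list of label indices without blanks.
    """
    # Extended target: a blank before, between, and after every label.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    num_frames, num_positions = len(probs), len(ext)

    # alpha[s] = total probability of all alignments of the frames seen so far
    # that end at position s of the extended target.
    alpha = [0.0] * num_positions
    alpha[0] = probs[0][ext[0]]
    if num_positions > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [0.0] * num_positions
        for s in range(num_positions):
            total = alpha[s]                 # stay on the same position
            if s >= 1:
                total += alpha[s - 1]        # advance by one position
            # Skipping over a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if num_positions > 1 else 0.0)
    return math.inf if likelihood == 0.0 else -math.log(likelihood)

# Two frames, two symbols (blank=0, "a"=1), uniform probabilities.
# The alignments "a-", "-a" and "aa" all collapse to "a": 3 * 0.25 = 0.75.
loss = ctc_neg_log_likelihood([[0.5, 0.5], [0.5, 0.5]], [1])
```

Practical implementations run the same recursion in log space to avoid underflow on long utterances; this sketch multiplies raw probabilities only to keep the recursion easy to follow.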
3
votes
1 answer
Can transformers be better than RNNs for online speech recognition?
Do transformers have the potential to replace end-to-end RNN models for online speech recognition? This mainly depends on accuracy, latency, and deployment cost, not training cost. Can transformers support low-latency online use…

jw_
- 199
- 1
- 5
3
votes
0 answers
Speaker identification / recognition for small audio files
I am working on a speaker identification problem using a GMM (Gaussian Mixture Model). I only have to identify one user present in the given audio, so for the second class I may or may not use noise or silent audio, just as in image classification for an object…

Posi2
- 358
- 2
- 16
2
votes
2 answers
What is a beam?
For example, faster-whisper's transcribe function takes an argument
beam_size: Beam size to use for decoding.
What does "beam" mean?

Geremia
- 163
- 6
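To illustrate the general idea behind a parameter such as `beam_size`, here is a small sketch of beam search over a toy model. It is not faster-whisper's implementation; `TOY_MODEL` and its probabilities are hypothetical and chosen so that keeping only the single best hypothesis (a beam of 1, i.e. greedy decoding) misses the most probable sequence.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(next_token_probs, num_steps, beam_size):
    """Keep the beam_size highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(num_steps):
        candidates = []
        for prefix, score in beams:
            for token, p in next_token_probs(prefix).items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Prune: only the best beam_size hypotheses survive to the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams

greedy = beam_search(TOY_MODEL.get, num_steps=2, beam_size=1)[0][0]
wider = beam_search(TOY_MODEL.get, num_steps=2, beam_size=2)[0][0]
```

With a beam of 1 the search commits to "A" (0.6) and ends with probability 0.6 × 0.55 = 0.33. With a beam of 2 the hypothesis starting with "B" survives the first step and wins with 0.4 × 0.9 = 0.36. A larger beam explores more alternatives at the cost of more computation per step.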
2
votes
2 answers
Open-source vocal cloning (speech-to-speech neural style transfer)
I want to program and train a voice cloner, in part to learn about this area of AI, and in part to use as a prototype of audio for testing and getting feedback from early adopters before recording in a studio with voice actors. For the prototype, I…

miguelmorin
- 101
- 5
2
votes
3 answers
Has there been research done regarding processing speech then building a "speaker profile" based off the processed speech?
Has there been research done regarding processing speech then building a "speaker profile" based off the processed speech? Things like matching the voice with a speaker profile and matching speech patterns and wordage for the speaker profile would…

Tory
- 175
- 6
2
votes
0 answers
How do I train a multiple-speaker model (speech synthesis) based on Tacotron 2 and espnet?
I'm new to Speech Synthesis & Deep Learning. Recently, I got a task as described below:
I have a problem training a multi-speaker model, which should be created with Tacotron 2. I was told I can get some ideas from espnet, which is an end-to-end…

Envelo Lee
- 21
- 1
2
votes
1 answer
How to use AI for language recognition?
Given an audio track, I'm trying to find a way to recognize the spoken language, only within a small set (e.g. English vs. Spanish). Is there a simple way to detect the language of speech?

Tina J
- 973
- 6
- 13
2
votes
1 answer
What is the difference between Kaldi and DeepSpeech speech recognition systems in their approach?
How do the Kaldi and DeepSpeech speech recognition systems differ algorithmically? Which one would be more accurate for continuous speech over time?

Hanu
- 31
- 1
- 3
2
votes
0 answers
Is there a detailed description or implementation of an end-to-end speech recognition system?
I am currently trying to implement an end-to-end speech recognition system from scratch, that is, without using any of the existing frameworks (like TensorFlow, Keras, etc.). I am building my own library, where I am trying to do a polynomial…

Jaswin
- 121
- 4
1
vote
0 answers
What is the number of channels of an input audio mel spectrogram?
What is the number of channels of an input audio mel spectrogram? For example, in CV an RGB picture always has 3 input channels. But what about audio?

randomuser228
- 11
- 1
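As a rough illustration for the channel question above: a mono recording yields one two-dimensional time-frequency array, so when it is treated like an image it has a single channel, and a multi-channel recording yields one such array per audio channel. The sketch below computes a plain magnitude spectrogram in pure Python. The mel filterbank step is deliberately omitted, since it would only change the number of frequency bins, and `magnitude_spectrogram`, `frame_len` and `hop` are names assumed for this example.

```python
import cmath
import math

def magnitude_spectrogram(samples, frame_len=8, hop=4):
    """Return a list of frames; each frame holds frame_len // 2 + 1 magnitudes."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        magnitudes = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT coefficient for frequency bin k of this frame.
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(frame)
            )
            magnitudes.append(abs(coeff))
        frames.append(magnitudes)
    return frames

# A mono test signal: a sine with 2 cycles per 8-sample frame.
mono = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = magnitude_spectrogram(mono)

num_frames, num_bins = len(spec), len(spec[0])

# Image-style layout (channels, time, frequency): mono audio gives 1 channel.
mono_input = [spec]
# A two-channel recording would give one spectrogram per channel.
stereo_input = [magnitude_spectrogram(mono), magnitude_spectrogram(mono)]
```

Here `mono_input` has one channel and `stereo_input` has two, which is the audio analogue of the 3 channels of an RGB picture.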
1
vote
0 answers
How to align or synchronize YouTube captions with audio accurately
I need to use the automatic captions from YouTube to precisely isolate excerpts from the video, aligned to the text, and generate a dataset to train a model in French.
So I've already written the script, but when I compare the audio with the matching…

Cara Duf
- 11
- 3
1
vote
0 answers
Looking for help on initializing continuous HMM model for word level ASR
I have been studying HMM implementation approaches on ASR for the last couple of weeks. This probabilistic model is very new to me. I am currently using a Python package called Pomegranate to implement an ASR model of my own for the Librispeech…

Zander
- 11
- 1
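One common way to initialize a continuous left-to-right HMM for a word, offered here only as a sketch, is uniform segmentation (sometimes called a flat start): split each training utterance into equal consecutive chunks, one per state, and estimate each state's Gaussian from its chunk. The code below does not use the Pomegranate API. It assumes one-dimensional features for brevity (real systems would use feature vectors such as MFCCs), and the names `flat_start`, `left_to_right_transitions` and `var_floor` are hypothetical.

```python
import statistics

def flat_start(frames, num_states, var_floor=1e-3):
    """Estimate (mean, variance) per state from equal consecutive chunks."""
    if len(frames) < num_states:
        raise ValueError("need at least one frame per state")
    params = []
    for s in range(num_states):
        lo = s * len(frames) // num_states
        hi = (s + 1) * len(frames) // num_states
        segment = frames[lo:hi]
        mean = statistics.fmean(segment)
        # Floor the variance so a constant segment does not give a degenerate Gaussian.
        var = max(statistics.pvariance(segment), var_floor)
        params.append((mean, var))
    return params

def left_to_right_transitions(num_states, stay=0.5):
    """Each state either loops on itself or moves to the next state."""
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for s in range(num_states):
        if s + 1 < num_states:
            matrix[s][s] = stay
            matrix[s][s + 1] = 1.0 - stay
        else:
            matrix[s][s] = 1.0  # the final state only loops in this sketch
    return matrix

frames = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
emissions = flat_start(frames, num_states=2)
transitions = left_to_right_transitions(2)
```

These values are only a starting point; the usual next step is to refine them with Baum-Welch or Viterbi re-estimation on the training data.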