Questions tagged [speech-synthesis]
17 questions

For questions related to the synthesis of speech, not to be confused with the synthesis of text, of formal-language expressions, or of expressions in context-free grammars. Speech in this context is a sequence of audio samples, a sequence of spectral representations in the frequency domain, or a sequence of phonetic symbols representing natural speech.
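The spectral representation mentioned above is, in practice, usually a (log-)mel spectrogram. As a minimal sketch, assuming librosa is installed and "speech.wav" is a hypothetical input file:

```python
# Minimal sketch: compute a log-mel spectrogram, one of the spectral
# representations of speech mentioned in the tag description.
# Assumes librosa is installed; "speech.wav" is a hypothetical file.
import librosa

y, sr = librosa.load("speech.wav", sr=22050)   # speech as a sequence of audio samples
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)             # common target/input in TTS models
print(log_mel.shape)                           # (80 mel bands, number of frames)
```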
3
votes
0 answers
Can computers recognise "grouping" from voice tonality?
In human communication, tonality or tonal languages carry a lot of complex information, including emotions and motives. But excluding such complex aspects, tonality serves a very basic purpose of "grouping" or "taking common" functions, such as:
The…

Always Confused
- 171
- 3
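A common first step toward machine detection of such grouping cues is extracting the fundamental-frequency (pitch) contour, which is what the tonal cues ride on. A minimal sketch using librosa's pYIN tracker; the file name is a hypothetical example:

```python
# Sketch: extract an F0 (pitch) contour, the raw signal from which tonal
# "grouping" cues would have to be detected downstream.
# Assumes librosa is installed; "utterance.wav" is a hypothetical file.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav")
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Rising vs. falling pitch at phrase boundaries is one crude grouping cue.
print("mean F0 over voiced frames:", np.nanmean(f0[voiced_flag]), "Hz")
```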
3
votes
2 answers
What is the difference between automatic transcription and automatic speech recognition?
What is the difference between automatic transcription and automatic speech recognition? Are they the same?
Is my following interpretation correct?
Automatic transcription: it converts the speech to text by looking at the whole spoken input…

Murugesh
- 141
- 2
2
votes
2 answers
Open-source vocal cloning (speech-to-speech neural style transfer)
I want to program and train a voice cloner, in part to learn about this area of AI, and in part to use as a prototype of audio for testing and getting feedback from early adopters before recording in a studio with voice actors. For the prototype, I…

miguelmorin
- 101
- 5
2
votes
1 answer
How to measure the similarity of the pronunciation of two words?
I would like to know how I could measure the similarity of the pronunciation of two words. These two words are quite similar and differ only in one vowel.
I know there are, e.g., the Hamming distance or the Levenshtein distance, but they measure the "general"…

Ben
- 205
- 1
- 7
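One standard answer here is to run the edit distance over phoneme sequences rather than letters, so that a one-vowel difference is scored phonetically. A minimal sketch in pure Python; the ARPAbet transcriptions are hard-coded assumptions rather than looked up in a real pronunciation dictionary such as CMUdict:

```python
# Sketch: Levenshtein distance over phoneme sequences instead of letters.
# The ARPAbet transcriptions below are hard-coded assumptions; in practice
# they would come from a pronunciation dictionary such as CMUdict.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

# "bat" vs. "bet" differ in exactly one vowel phoneme:
print(levenshtein(["B", "AE", "T"], ["B", "EH", "T"]))  # 1
# On spellings the distance can mislead: "through" and "threw" are
# pronounced identically (TH R UW) yet have letter distance 4.
print(levenshtein("through", "threw"))                  # 4
```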
2
votes
0 answers
Model for direct audio-to-audio speech re-encoding
There are many resources available for text-to-audio (or vice versa) synthesis, for example Google's WaveNet.
These tools do not allow the finer degree of control that may be required over the degree of inflection / tonality retained in…

NeverWasMyRealName
- 21
- 1
2
votes
0 answers
How do I train a multiple-speaker model (speech synthesis) based on Tacotron 2 and espnet?
I'm new to Speech Synthesis & Deep Learning. Recently, I was given a task as described below:
I have a problem training a multi-speaker model that should be built with Tacotron 2. I was told I can get some ideas from espnet, which is an end-to-end…

Envelo Lee
- 21
- 1
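Independent of espnet's particular recipes, the core idea in multi-speaker Tacotron-style models is conditioning on a learned per-speaker embedding. A minimal PyTorch sketch of that conditioning step; all dimensions are illustrative assumptions:

```python
# Sketch: the speaker-conditioning idea behind multi-speaker Tacotron models.
# A learned per-speaker embedding is broadcast along the time axis and
# concatenated to the text-encoder outputs; all sizes here are assumptions.
import torch
import torch.nn as nn

num_speakers, spk_dim, enc_dim = 10, 64, 512
speaker_table = nn.Embedding(num_speakers, spk_dim)

encoder_out = torch.randn(2, 37, enc_dim)    # (batch, text length, channels)
speaker_ids = torch.tensor([3, 7])           # one speaker id per utterance

e = speaker_table(speaker_ids)               # (batch, spk_dim)
e = e.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
conditioned = torch.cat([encoder_out, e], dim=-1)   # fed to attention/decoder
print(conditioned.shape)                     # torch.Size([2, 37, 576])
```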
2
votes
0 answers
What is the State-of-the-Art open source Voice Cloning tool right now?
I would like to clone a voice as precisely as possible. Lately, impressive models have been released that only need about 10 s of voice input (cf. https://github.com/CorentinJ/Real-Time-Voice-Cloning), but I would like to go beyond that and clone a…

Remind
- 21
- 1
1
vote
0 answers
Is Speech to Speech with changing the voice to a given other voice possible?
Background:
I am working on a research project to use (demonstrate) the possibilities of Machine Learning and AI in artistic projects. One thing we are exploring is demonstrating deep fakes on stage. Of course, a deep fake is not easy to make.…

Nathan
- 143
- 4
1
vote
0 answers
How many spectrogram frames per input character does text-to-speech (TTS) system Tacotron-2 generate?
I've been reading about Tacotron-2, a text-to-speech system that generates speech just like humans (indistinguishable from humans), using the GitHub repository https://github.com/Rayhane-mamah/Tacotron-2.
I'm very confused about a simple aspect of text-to-speech…

Joe Black
- 181
- 6
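A back-of-the-envelope answer: the number of frames is fixed by the audio length and the STFT hop size, not by a per-character ratio; the attention mechanism learns the character-to-frame alignment. With hyperparameters typical of Tacotron-2 setups (assumed here, not read from the linked repository):

```python
# Rough arithmetic: average spectrogram frames per input character.
# Hop size and sample rate are typical Tacotron-2 values, assumed here
# rather than read from the linked repository.
sample_rate = 22050                               # Hz
hop_length = 256                                  # samples per frame
frames_per_second = sample_rate / hop_length      # ~86.1 frames/s

chars_per_second = 14                             # rough English speaking rate (assumed)
print(frames_per_second / chars_per_second)       # ~6 frames per character on
                                                  # average; attention learns the
                                                  # actual, variable alignment
```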
1
vote
0 answers
Can't figure out what's going wrong with my dataset construction for multivariate regression
TL;DR: I can't figure out why my neural network won't give me a sensible output. I assume it has something to do with how I'm presenting the input data to it, but I have no idea how to fix it.
Background:
I am using matched pairs of speech samples to…

NotQuiteHere
- 19
- 1
1
vote
0 answers
Improving the performance of a DNN model
I have been running the open-source text-to-speech system Ossian. It uses feed-forward DNNs for its acoustic modeling. The error graph I got after running the acoustic model looks like this:
Here is some relevant information:
Size of Data: 7…

Arif Ahmad
- 111
- 1
0
votes
1 answer
Adding voices to voice synthesis corpuses
If one uses one of the open-source implementations of the WaveNet generative speech synthesis design, such as https://r9y9.github.io/wavenet_vocoder/, and trains using something like CMU's ARCTIC corpus, how can one add a voice that sounds…

Douglas Daseeco
- 7,423
- 1
- 26
- 62
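One common route to adding a voice is speaker adaptation: start from the pretrained checkpoint and continue training on the new speaker's recordings at a reduced learning rate. A generic PyTorch sketch of that loop; the tiny model and random "recordings" are placeholders, not the wavenet_vocoder API:

```python
# Generic sketch of speaker adaptation by fine-tuning. The tiny model and
# random tensors below are placeholders, not the wavenet_vocoder API.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv1d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv1d(16, 1, 3, padding=1))    # stand-in for a WaveNet
# model.load_state_dict(torch.load("pretrained.pt"))     # hypothetical checkpoint

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # small LR for adaptation
loss_fn = nn.MSELoss()

new_voice = torch.randn(8, 1, 1024)           # placeholder for new-speaker audio
for step in range(3):                         # real adaptation runs much longer
    loss = loss_fn(model(new_voice), new_voice)   # toy reconstruction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(step, loss.item())
```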
0
votes
1 answer
What is the best Text-to-speech model available open-source?
I tried a couple of different websites and libraries. I also found this topic from 3.5 years ago: What are the current open source text-to-audio libraries?
It looks like nobody published anything in the last couple of years and most solutions are…
0
votes
0 answers
How exactly to create voice audio snippets that blend together into an AI voice?
I just asked the more general question, How to create AI voice generator for fantasy language? Now after asking ChatGPT for some details on how that works, I am concerned about how you would go about creating the "database" of sound snippets…

Lance
- 153
- 4
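In classic concatenative synthesis, the blending asked about is done by overlapping adjacent snippets and crossfading them. A minimal NumPy sketch; the two "snippets" are synthetic tones standing in for recorded units:

```python
# Sketch: join two audio snippets with a linear crossfade, the basic blend
# operation of concatenative synthesis. The tones stand in for recorded units.
import numpy as np

sr = 16000
t = np.arange(sr // 4) / sr                  # 250 ms per snippet
a = np.sin(2 * np.pi * 220 * t)              # stand-in snippet 1
b = np.sin(2 * np.pi * 330 * t)              # stand-in snippet 2

overlap = sr // 100                          # 10 ms crossfade region
fade = np.linspace(0.0, 1.0, overlap)
joined = np.concatenate([
    a[:-overlap],
    a[-overlap:] * (1 - fade) + b[:overlap] * fade,   # crossfaded overlap
    b[overlap:],
])
print(joined.shape)                          # one continuous waveform
```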
0
votes
0 answers
Why was Tacotron trained on <1000h of speech data?
Tacotron TTS models (e.g. Tacotron 2 and Parallel Tacotron 2) were trained on 25h and 405h of speech data respectively. By comparison, more recent TTS systems are trained on >50,000h of speech data. Why were Tacotron models trained on such a…

Nik
- 1