Questions tagged [speech-synthesis]
17 questions

For questions related to the synthesis of speech, not to be confused with the synthesis of text, of formal-language expressions, or of expressions in context-free grammars. Speech in this context is a sequence of audio samples, a sequence of spectral representations in the frequency domain, or a sequence of phonetic symbols representing natural speech.
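The spectral representation mentioned above is, in practice, usually a (log-)mel spectrogram. As a minimal sketch, assuming librosa is installed and "speech.wav" is a hypothetical input file:

```python
# Minimal sketch: compute a log-mel spectrogram, one of the spectral
# representations of speech mentioned in the tag description.
# Assumes librosa is installed; "speech.wav" is a hypothetical file.
import librosa

y, sr = librosa.load("speech.wav", sr=22050)   # speech as a sequence of audio samples
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)             # common target/input in TTS models
print(log_mel.shape)                           # (80 mel bands, number of frames)
```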
3
votes
0 answers
Can computers recognise "grouping" from voice tonality?
In human communication, tonality or tonal languages carry a lot of complex information, including emotions and motives. But excluding such complex aspects, tonality serves a very basic purpose of "grouping" or "taking common" functions, such as:
The…

Always Confused
- 171
- 3
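A common first step toward machine detection of such grouping cues is extracting the fundamental-frequency (pitch) contour, which is what the tonal cues ride on. A minimal sketch using librosa's pYIN tracker; the file name is a hypothetical example:

```python
# Sketch: extract an F0 (pitch) contour, the raw signal from which tonal
# "grouping" cues would have to be detected downstream.
# Assumes librosa is installed; "utterance.wav" is a hypothetical file.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav")
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Rising vs. falling pitch at phrase boundaries is one crude grouping cue.
print("mean F0 over voiced frames:", np.nanmean(f0[voiced_flag]), "Hz")
```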
3
votes
2 answers
What is the difference between automatic transcription and automatic speech recognition?
What is the difference between automatic transcription and automatic speech recognition? Are they the same?
Is my following interpretation correct?
Automatic transcription: it converts the speech to text by looking at the whole spoken input…

Murugesh
- 141
- 2
2
votes
2 answers
Open-source vocal cloning (speech-to-speech neural style transfer)
I want to program and train a voice cloner, in part to learn about this area of AI, and in part to use as a prototype of audio for testing and getting feedback from early adopters before recording in a studio with voice actors. For the prototype, I…

miguelmorin
- 101
- 5
2
votes
1 answer
How to measure the similarity of the pronunciation of two words?
I would like to know how I could measure the similarity of the pronunciation of two words. These two words are quite similar and differ only in one vowel.
I know there are, e.g., the Hamming distance or the Levenshtein distance, but they measure the "general"…

Ben
- 205
- 1
- 7
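One standard answer here is to run the edit distance over phoneme sequences rather than letters, so that a one-vowel difference is scored phonetically. A minimal sketch in pure Python; the ARPAbet transcriptions are hard-coded assumptions rather than looked up in a real pronunciation dictionary such as CMUdict:

```python
# Sketch: Levenshtein distance over phoneme sequences instead of letters.
# The ARPAbet transcriptions below are hard-coded assumptions; in practice
# they would come from a pronunciation dictionary such as CMUdict.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

# "bat" vs. "bet" differ in exactly one vowel phoneme:
print(levenshtein(["B", "AE", "T"], ["B", "EH", "T"]))  # 1
# On spellings the distance can mislead: "through" and "threw" are
# pronounced identically (TH R UW) yet have letter distance 4.
print(levenshtein("through", "threw"))                  # 4
```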
2
votes
0 answers
Model for direct audio-to-audio speech re-encoding
There are many resources available for text-to-audio (or vice versa) synthesis, for example Google's WaveNet.
These tools do not allow the finer degree of control that may be required over the degree of inflection / tonality retained in…

NeverWasMyRealName
- 21
- 1
2
votes
0 answers
How do I train a multiple-speaker model (speech synthesis) based on Tacotron 2 and espnet?
I'm new to Speech Synthesis & Deep Learning. Recently, I was given a task as described below:
I have a problem training a multi-speaker model that should be built with Tacotron 2. I was told I can get some ideas from espnet, which is an end-to-end…

Envelo Lee
- 21
- 1
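Independent of espnet's particular recipes, the core idea in multi-speaker Tacotron-style models is conditioning on a learned per-speaker embedding. A minimal PyTorch sketch of that conditioning step; all dimensions are illustrative assumptions:

```python
# Sketch: the speaker-conditioning idea behind multi-speaker Tacotron models.
# A learned per-speaker embedding is broadcast along the time axis and
# concatenated to the text-encoder outputs; all sizes here are assumptions.
import torch
import torch.nn as nn

num_speakers, spk_dim, enc_dim = 10, 64, 512
speaker_table = nn.Embedding(num_speakers, spk_dim)

encoder_out = torch.randn(2, 37, enc_dim)    # (batch, text length, channels)
speaker_ids = torch.tensor([3, 7])           # one speaker id per utterance

e = speaker_table(speaker_ids)               # (batch, spk_dim)
e = e.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
conditioned = torch.cat([encoder_out, e], dim=-1)   # fed to attention/decoder
print(conditioned.shape)                     # torch.Size([2, 37, 576])
```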
2
votes
0 answers
What is the State-of-the-Art open source Voice Cloning tool right now?
I would like to clone a voice as precisely as possible. Lately, impressive models have been released that only need about 10 s of voice input (cf. https://github.com/CorentinJ/Real-Time-Voice-Cloning), but I would like to go beyond that and clone a…

Remind
- 21
- 1
1
vote
0 answers
Is Speech to Speech with changing the voice to a given other voice possible?
Background:
I am working on a research project to use (demonstrate) the possibilities of Machine Learning and AI in artistic projects. One thing we are exploring is demonstrating deep fakes on stage. Of course, a deep fake is not easy to make.…

Nathan
- 143
- 4
1
vote
0 answers
How many spectrogram frames per input character does text-to-speech (TTS) system Tacotron-2 generate?
I've been reading about Tacotron-2, a text-to-speech system that generates speech just like humans (indistinguishable from humans), using the GitHub repository https://github.com/Rayhane-mamah/Tacotron-2.
I'm very confused about a simple aspect of text-to-speech…

Joe Black
- 181
- 6
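A back-of-the-envelope answer: the number of frames is fixed by the audio length and the STFT hop size, not by a per-character ratio; the attention mechanism learns the character-to-frame alignment. With hyperparameters typical of Tacotron-2 setups (assumed here, not read from the linked repository):

```python
# Rough arithmetic: average spectrogram frames per input character.
# Hop size and sample rate are typical Tacotron-2 values, assumed here
# rather than read from the linked repository.
sample_rate = 22050                               # Hz
hop_length = 256                                  # samples per frame
frames_per_second = sample_rate / hop_length      # ~86.1 frames/s

chars_per_second = 14                             # rough English speaking rate (assumed)
print(frames_per_second / chars_per_second)       # ~6 frames per character on
                                                  # average; attention learns the
                                                  # actual, variable alignment
```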
1
vote
0 answers
Can't figure out what's going wrong with my dataset construction for multivariate regression
TL;DR: I can't figure out why my neural network won't give me a sensible output. I assume it has something to do with how I'm presenting the input data to it, but I have no idea how to fix it.
Background:
I am using matched pairs of speech samples to…

NotQuiteHere
- 19
- 1
1
vote
0 answers
Improving the performance of a DNN model
I have been running the open-source text-to-speech system Ossian. It uses feed-forward DNNs for its acoustic modeling. The error graph I got after running the acoustic model looks like this:
Here is some relevant information:
Size of Data: 7…

Arif Ahmad
- 111
- 1
0
votes
1 answer
Adding voices to voice synthesis corpuses
If one uses one of the open-source implementations of the WaveNet generative speech synthesis design, such as https://r9y9.github.io/wavenet_vocoder/, and trains using something like CMU's ARCTIC corpus, how can one add a voice that sounds…

Douglas Daseeco
- 7,423
- 1
- 26
- 62
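One common route to adding a voice is speaker adaptation: start from the pretrained checkpoint and continue training on the new speaker's recordings at a reduced learning rate. A generic PyTorch sketch of that loop; the tiny model and random "recordings" are placeholders, not the wavenet_vocoder API:

```python
# Generic sketch of speaker adaptation by fine-tuning. The tiny model and
# random tensors below are placeholders, not the wavenet_vocoder API.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv1d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv1d(16, 1, 3, padding=1))    # stand-in for a WaveNet
# model.load_state_dict(torch.load("pretrained.pt"))     # hypothetical checkpoint

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # small LR for adaptation
loss_fn = nn.MSELoss()

new_voice = torch.randn(8, 1, 1024)           # placeholder for new-speaker audio
for step in range(3):                         # real adaptation runs much longer
    loss = loss_fn(model(new_voice), new_voice)   # toy reconstruction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(step, loss.item())
```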
0
votes
1 answer
What is the best Text-to-speech model available open-source?
I tried a couple of different websites and libraries. I also found this topic from 3.5 years ago: What are the current open source text-to-audio libraries?
It looks like nobody published anything in the last couple of years and most solutions are…
0
votes
0 answers
How exactly to create voice audio snippets that blend together into an AI voice?
I just asked the more general question, How to create AI voice generator for fantasy language? Now after asking ChatGPT for some details on how that works, I am concerned about how you would go about creating the "database" of sound snippets…

Lance
- 153
- 4
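In classic concatenative synthesis, the blending asked about is done by overlapping adjacent snippets and crossfading them. A minimal NumPy sketch; the two "snippets" are synthetic tones standing in for recorded units:

```python
# Sketch: join two audio snippets with a linear crossfade, the basic blend
# operation of concatenative synthesis. The tones stand in for recorded units.
import numpy as np

sr = 16000
t = np.arange(sr // 4) / sr                  # 250 ms per snippet
a = np.sin(2 * np.pi * 220 * t)              # stand-in snippet 1
b = np.sin(2 * np.pi * 330 * t)              # stand-in snippet 2

overlap = sr // 100                          # 10 ms crossfade region
fade = np.linspace(0.0, 1.0, overlap)
joined = np.concatenate([
    a[:-overlap],
    a[-overlap:] * (1 - fade) + b[:overlap] * fade,   # crossfaded overlap
    b[overlap:],
])
print(joined.shape)                          # one continuous waveform
```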
0
votes
0 answers
Why was Tacotron trained on <1000h of speech data?
Tacotron TTS models (e.g. Tacotron 2 and Parallel Tacotron 2) were trained on 25h and 405h of speech data respectively. By comparison, more recent TTS systems are trained on >50,000h of speech data. Why were Tacotron models trained on such a…

Nik
- 1