Questions tagged [speech-synthesis]

For questions related to the synthesis of speech, not to be confused with text synthesis or the generation of formal-language expressions (e.g. from context-free grammars). Speech in this context is a sequence of audio samples, a sequence of spectral representations in the frequency domain, or a sequence of phonetic symbols representing natural speech.

17 questions
3
votes
0 answers

Can computers recognise "grouping" from voice tonality?

In human communication, tonality conveys many kinds of complex information, including emotions and motives. But setting such complex aspects aside, tonality also serves a very basic purpose of "grouping" or "factoring out" common functions, such as: The…
3
votes
2 answers

What is the difference between automatic transcription and automatic speech recognition?

What is the difference between automatic transcription and automatic speech recognition? Are they the same? Is my following interpretation correct? Automatic transcription: it converts the speech to text by looking at the whole spoken input…
2
votes
2 answers

Open-source vocal cloning (speech-to-speech neural style transfer)

I want to program and train a voice cloner, in part to learn about this area of AI, and in part to use as a prototype of audio for testing and getting feedback from early adopters before recording in a studio with voice actors. For the prototype, I…
2
votes
1 answer

How to measure the similarity of the pronunciation of two words?

I would like to know how I could measure the similarity of the pronunciation of two words. These two words are quite similar and differ only in one vowel. I know there is, e.g., the Hamming distance or the Levenshtein distance, but they measure the "general"…
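One common approach the excerpt hints at is to apply edit distance not to the letters but to the words' phoneme sequences, so that a one-vowel difference counts as exactly one edit. The sketch below is a minimal illustration; the ARPAbet-style transcriptions are hand-written examples, not the output of a real grapheme-to-phoneme tool.

```python
# Sketch: comparing two words by Levenshtein distance over their
# phoneme sequences rather than their spellings. The phoneme lists
# below are illustrative ARPAbet-style transcriptions (assumed, not
# produced by a real G2P system).

def levenshtein(a, b):
    """Edit distance between two sequences; works on phoneme lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# "bit" vs "bet" differ only in the vowel -> phoneme distance 1
print(levenshtein(["B", "IH", "T"], ["B", "EH", "T"]))  # → 1
```

A refinement would be a weighted substitution cost (e.g. vowels substituting for vowels more cheaply than for consonants), which plain Levenshtein distance does not capture.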
2
votes
0 answers

Model for direct audio-to-audio speech re-encoding

There are many resources available for text-to-audio (or vice versa) synthesis, for example Google's WaveNet. These tools do not allow the finer control that may be required over the degree of inflection/tonality retained in…
2
votes
0 answers

How do I train a multiple-speaker model (speech synthesis) based on Tacotron 2 and espnet?

I'm new to Speech Synthesis &amp; Deep Learning. Recently, I got a task as described below: I have a problem training a multi-speaker model that should be built with Tacotron 2. I was told I could get some ideas from espnet, which is an end-to-end…
2
votes
0 answers

What is the State-of-the-Art open source Voice Cloning tool right now?

I would like to clone a voice as precisely as possible. Lately, impressive models have been released that only need about 10 s of voice input (cf. https://github.com/CorentinJ/Real-Time-Voice-Cloning), but I would like to go beyond that and clone a…
1
vote
0 answers

Is Speech to Speech with changing the voice to a given other voice possible?

Background: I am working on a research project to use (demonstrate) the possibilities of Machine Learning and AI in artistic projects. One thing we are exploring is demonstrating deep fakes on stage. Of course, a deep fake is not easy to make.…
1
vote
0 answers

How many spectrogram frames per input character does text-to-speech (TTS) system Tacotron-2 generate?

I've been reading about Tacotron-2, a text-to-speech system that generates speech indistinguishable from humans, using the GitHub repo https://github.com/Rayhane-mamah/Tacotron-2. I'm very confused about a simple aspect of text-to-speech…
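A rough back-of-envelope answer to this question follows from the spectrogram hop size. The sketch below assumes the hyperparameters commonly used in Tacotron-2 implementations (22050 Hz audio, hop length of 256 samples) and a rough English speaking rate; the actual values depend on the hparams of the specific repo.

```python
# Rough estimate of mel-spectrogram frames per input character.
# The sample rate and hop length below are assumptions based on
# common Tacotron-2 configurations; check the hparams file of the
# implementation you actually use.

sample_rate = 22050   # audio samples per second (assumed)
hop_length = 256      # samples between successive spectrogram frames (assumed)

frames_per_second = sample_rate / hop_length   # ≈ 86.1 frames/s

# Rough speaking rate for read English text (assumed, varies widely):
chars_per_second = 14

frames_per_char = frames_per_second / chars_per_second
print(round(frames_per_char, 1))  # → 6.2
```

So on average a handful of frames correspond to each character, but the mapping is learned by the attention mechanism and is not uniform: pauses, long vowels, and punctuation stretch or shrink it per character.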
1
vote
0 answers

Can't figure out what's going wrong with my dataset construction for multivariate regression

TL;DR: I can't figure out why my neural network won't give me a sensible output. I assume it's something to do with how I'm presenting the input data to it, but I have no idea how to fix it. Background: I am using matched pairs of speech samples to…
1
vote
0 answers

Improving the performance of a DNN model

I have been running an open-source text-to-speech system, Ossian. It uses feed-forward DNNs for its acoustic modeling. The error graph I got after running the acoustic model looks like this: Here is some relevant information: Size of Data: 7…
0
votes
1 answer

Adding voices to voice synthesis corpuses

If one uses one of the open-source implementations of the WaveNet generative speech synthesis design, such as https://r9y9.github.io/wavenet_vocoder/, and trains using something like CMU's Arctic corpus, how can one add a voice that sounds…
0
votes
1 answer

What is the best Text-to-speech model available open-source?

I tried a couple of different websites and libraries. I also found this topic from 3.5 years ago - What are the current open source text-to-audio libraries? It looks like nobody has published anything in the last couple of years and most solutions are…
0
votes
0 answers

How exactly to create voice audio snippets that blend together into an AI voice?

I just asked the more general question, How to create AI voice generator for fantasy language? Now after asking ChatGPT for some details on how that works, I am concerned about how you would go about creating the "database" of sound snippets…
0
votes
0 answers

Why was Tacotron trained on <1000h of speech data?

Tacotron TTS models (e.g. Tacotron 2 and Parallel Tacotron 2) were trained on 25h and 405h of speech data respectively. By comparison, more recent TTS systems are trained on >50,000h of speech data. Why were Tacotron models trained on such a…