AI natural voice generator

Question

I want to create a solution, which clones my voice. I tried my commercial solutions or implementation of Tacotron. Unfortunately, results not sound natural, generated voice sounds like a robot. Anybody could recommend good alternative?

hi @fuwiak I posted an answer. Hope it can help you and have a nice day! — Clement, Nov 12 '19 at 15:57

score 2 · Accepted Answer · answered Nov 12 '19 at 15:57

The reason for robot like speech may be because tacotron uses griffin lim for vocoder, which cannot reproduce sound with perfection, often introducing robot like sound artifects.

A vocoder is a network that transforms a transform a spectrogram image back to speech waveform. Tacotron and many other speech generation neural network uses CNN to generate spectrogram instead of raw waveforms as output. Spectrogram is a lossy representation of raw audio waveform, so a perfect reconstruction of audio waveform is not possible. Griffin-Lim is a vocoder that uses algorithmic way to transform spectrogram to audio waveform, but often introduces a robot-like quality to generated waveforms. A neural network based vocoder can solve the problem. The wavenet vocoder is often used in speech generation as it can transform the spectrogram to audio with little artifects. Many new speech generation models use the wavenet vocoder as the deafult vocoder of the generation model. For a public implementation, this is a good github repository: https://github.com/r9y9/wavenet_vocoder

You can also use the newer tacotron 2 which uses the wavenet vocoder as the default vocoder. You can check it out here: https://github.com/Rayhane-mamah/Tacotron-2

In Tacotron-2, do you for each character in the input-text, how many frames of spectogram are generated during synthesis (inference)? The original paper at https://github.com/Rayhane-mamah/Tacotron-2 refers to sampling the "training dataset" at 1 frame per 12.5ms, which is different than number of frames per second. — Joe Black, May 14 '20 at 21:01
Also, how does it associate spectogram frames with a portion of the input text (e.g. an input character) during training phase, and what is number of frames per character during training? It must need to do that in order to learn using seq-2seq encoder/decoder system. — Joe Black, May 14 '20 at 21:03

AI natural voice generator

1 Answers1