I've been reading about Tacotron-2, a text-to-speech system that generates speech nearly indistinguishable from a human's, using the implementation at https://github.com/Rayhane-mamah/Tacotron-2.
Even after reading the paper several times, I'm confused about a basic aspect of text-to-speech. Tacotron-2 generates spectrogram frames for a given input text. During training, each example in the dataset is a text sentence paired with the spectrogram of its recording (apparently at a rate of one frame per 12.5 ms).
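To make the numbers concrete, here is a toy calculation of my own (the 12.5 ms frame shift is the only value taken from the paper; the function name is just for illustration):

```python
def num_frames(duration_s, frame_shift_ms=12.5):
    """Approximate number of spectrogram frames for an audio clip
    of the given duration, with one frame every frame_shift_ms."""
    return int(duration_s * 1000 / frame_shift_ms)

# A 3-second utterance yields roughly 240 target frames.
print(num_frames(3.0))  # -> 240
```

So even a short sentence corresponds to hundreds of target frames.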
If the input is provided as a character string, how many spectrogram frames does the model predict for each character?
And during training, how is the model told which frames form the expected output for which part of the input? Since the dataset provides only, say, a thousand frames for an entire sentence, how does the model know which frames are the ideal output for a given character?
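Here is a small sketch (hypothetical numbers, not from the repo) of the mismatch I'm asking about:

```python
# One training example: a sentence and its spectrogram length.
text = "The quick brown fox jumps over the lazy dog."
n_chars = len(text)   # 44 input characters
n_frames = 240        # e.g. a 3 s recording at one frame per 12.5 ms

# The dataset pairs the whole sentence with all 240 frames;
# there are no per-character frame labels telling the model
# which of the 240 frames belong to which of the 44 characters.
print(n_chars, n_frames)  # -> 44 240
```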
This basic aspect just doesn't seem to be explained clearly anywhere, and I'm having a hard time figuring it out.