I've been reading about Tacotron-2, a text-to-speech system that generates speech said to be indistinguishable from human speech, and working from the GitHub implementation at https://github.com/Rayhane-mamah/Tacotron-2.

I'm very confused about a simple aspect of text-to-speech, even after reading the paper several times. Tacotron-2 generates spectrogram frames for a given input text. During training, each example in the dataset is a text sentence paired with its spectrogram (apparently at a rate of one spectrogram frame every 12.5 ms).
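To make the scale concrete, here is a minimal sketch of the frame arithmetic as I understand it. It assumes the 12.5 ms figure above is the shift between consecutive frames; the sentence, the 3-second duration, and the helper `num_frames` are hypothetical and not taken from the repository.

```python
import math

# Assumption: consecutive spectrogram frames are 12.5 ms apart.
FRAME_SHIFT_MS = 12.5


def num_frames(duration_s: float, frame_shift_ms: float = FRAME_SHIFT_MS) -> int:
    """Approximate number of spectrogram frames covering `duration_s` seconds."""
    return math.ceil(duration_s * 1000.0 / frame_shift_ms)


# Hypothetical training example: a short sentence spoken in about 3 seconds.
sentence = "The quick brown fox jumps over the lazy dog."
duration_s = 3.0

frames = num_frames(duration_s)
print(frames)  # 240

# Average frames per input character. This is only an average over the
# utterance: nothing in the dataset says which frames belong to which character.
print(frames / len(sentence))
```

So a single sentence maps to a few hundred frames, with no per-character labels in between, which is exactly what the questions below are about.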

  • If the input is provided as a character string, how many spectrogram frames does the model predict for each character?

  • How does training determine which frames from the dataset are the expected output? The training data for a sentence is simply a long sequence of frames (on the order of a thousand), so how does the model know which frames are the ideal output for a given character?

This basic aspect does not seem to be stated clearly anywhere, and I'm having a hard time figuring it out.

nbro
Joe Black

0 Answers