I've been reading about Tacotron-2, a text-to-speech system that generates speech nearly indistinguishable from a human's, using the implementation at https://github.com/Rayhane-mamah/Tacotron-2.
Even after reading the paper several times, I'm confused about a basic aspect of text-to-speech. Tacotron-2 generates spectrogram frames for a given input text. During training, each example in the dataset is a text sentence paired with the spectrogram of its recording (apparently at a rate of one frame per 12.5 ms).
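To make the numbers concrete, here is a toy calculation of my own (the 12.5 ms frame shift is the only value taken from the paper; the function name is just for illustration):

```python
def num_frames(duration_s, frame_shift_ms=12.5):
    """Approximate number of spectrogram frames for an audio clip
    of the given duration, with one frame every frame_shift_ms."""
    return int(duration_s * 1000 / frame_shift_ms)

# A 3-second utterance yields roughly 240 target frames.
print(num_frames(3.0))  # -> 240
```

So even a short sentence corresponds to hundreds of target frames.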
If the input is provided as a character string, how many spectrogram frames does the model predict for each character?
And during training, how is the model told which frames form the expected output for which part of the input? Since the dataset provides only, say, a thousand frames for an entire sentence, how does the model know which frames are the ideal output for a given character?
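Here is a small sketch (hypothetical numbers, not from the repo) of the mismatch I'm asking about:

```python
# One training example: a sentence and its spectrogram length.
text = "The quick brown fox jumps over the lazy dog."
n_chars = len(text)   # 44 input characters
n_frames = 240        # e.g. a 3 s recording at one frame per 12.5 ms

# The dataset pairs the whole sentence with all 240 frames;
# there are no per-character frame labels telling the model
# which of the 240 frames belong to which of the 44 characters.
print(n_chars, n_frames)  # -> 44 240
```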
This basic aspect just doesn't seem to be explained clearly anywhere, and I'm having a hard time figuring it out.