I'm new to speech synthesis and deep learning. Recently, I was given the task described below:
I need to train a multi-speaker TTS model based on Tacotron2, and I was told I could get some ideas from ESPnet, an end-to-end speech processing toolkit. That is how I found a good dataset called LibriTTS: http://www.openslr.org/60/. It is also covered in the ESPnet results: https://github.com/espnet/espnet#tts-results
Here is my initial plan:
- Download the LibriTTS corpus and read the ESPnet recipe ../espnet/egs/libritts/tts1/run.sh to learn how to train on LibriTTS with the PyTorch backend.
Difficulty: I cannot figure out how the author trained libritts.tacotron2.v1, because I didn't find any mention of Tacotron2 in the shell scripts around run.sh (the first sketch after this list shows how I searched). Maybe that part of the code isn't open-source.
- Read the Tacotron2 code and adapt it into a multi-speaker network.
Difficulty: I found the code really complex. I got lost reading it and don't have a clear idea of how to adapt the model, since Tacotron2 was designed for the LJSpeech dataset (a single speaker). The second sketch after this list shows the kind of change I have in mind.
- Train the multi-speaker model on a tiny subset of the dataset (http://www.openslr.org/60/) to save time; a rough subsetting sketch follows this list.
It contains about 110 speakers' data, which should be enough for my scenario.
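For the first step, this is how I searched the recipe for any mention of Tacotron2. It is just a quick script over my local checkout; the recipe path is simply wherever ESPnet is cloned on my machine.

```python
# Search the local ESPnet libritts recipe for any mention of "tacotron2",
# to find out where the model architecture is actually selected.
from pathlib import Path

recipe_dir = Path("espnet/egs/libritts/tts1")  # path to my local clone

for path in sorted(recipe_dir.rglob("*")):
    if not path.is_file() or path.suffix not in {".sh", ".yaml", ".py"}:
        continue
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
        if "tacotron2" in line.lower():
            print(f"{path}:{lineno}: {line.strip()}")
```

This didn't turn up anything in the shell scripts I looked at, which is what I meant above.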
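For the second step, this is the kind of change I imagine for making Tacotron2 multi-speaker: look up a learned speaker embedding and concatenate it to the encoder outputs before attention/decoding. It is only a minimal sketch of the idea; the class name, the dimensions, and the assumption that encoder outputs have shape (batch, time, enc_dim) are mine, not taken from the Tacotron2 or ESPnet code.

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Look up a per-speaker embedding and broadcast it across encoder time steps."""

    def __init__(self, n_speakers: int, spk_embed_dim: int, enc_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(n_speakers, spk_embed_dim)
        # Project back to enc_dim so the attention/decoder modules keep
        # their expected input size unchanged.
        self.projection = nn.Linear(enc_dim + spk_embed_dim, enc_dim)

    def forward(self, encoder_outputs: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
        # encoder_outputs: (batch, time, enc_dim); speaker_ids: (batch,)
        spk = self.embedding(speaker_ids)                               # (batch, spk_embed_dim)
        spk = spk.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)  # (batch, time, spk_embed_dim)
        return self.projection(torch.cat([encoder_outputs, spk], dim=-1))

# Rough usage with made-up sizes: 110 speakers, 512-dim encoder states.
cond = SpeakerConditioning(n_speakers=110, spk_embed_dim=64, enc_dim=512)
enc_out = torch.randn(2, 100, 512)
spk_ids = torch.tensor([3, 42])
print(cond(enc_out, spk_ids).shape)  # torch.Size([2, 100, 512])
```

Is this roughly the right direction, or does ESPnet already provide a standard way to condition on the speaker?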
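For the third step, I was thinking of building the tiny subset by keeping only a few utterances per speaker. A rough sketch is below; it assumes the LibriTTS layout of subset/speaker/chapter/*.wav, and the path and the per-speaker budget are just placeholders.

```python
# Build a small per-speaker subset of LibriTTS to keep training fast.
import random
from pathlib import Path

libritts_root = Path("LibriTTS/train-clean-100")  # placeholder: one downloaded subset
max_utts_per_speaker = 20                         # placeholder budget

random.seed(0)
speaker_dirs = sorted(p for p in libritts_root.iterdir() if p.is_dir())
subset = []
for speaker_dir in speaker_dirs:
    wavs = sorted(speaker_dir.rglob("*.wav"))
    random.shuffle(wavs)
    subset.extend(wavs[:max_utts_per_speaker])

print(f"Selected {len(subset)} utterances from {len(speaker_dirs)} speakers")
```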
In the end:
Could you please help me with these questions? I've been puzzled by this problem for a long time...