
I want to make a screencast substituting my voice with audio generated by a TTS engine. The problem is that if I use STT followed by TTS, the generated audio will not be in sync with my voice and won't make any sense in the screencast. Is there a way to do this so that the generated audio stays in sync with my voice?

I'm not talking about a live stream, just a video. I can still edit it, but I would prefer not to have to copy every piece of the generated audio into its proper place. I would expect something that cuts the audio into pieces by itself and moves them to roughly match my voice's spectrogram (I think that's what it's called), so I don't have to do everything manually.

I had thought of just creating the TTS audio separately by writing out what I say into a text file and using coqui-tts to run tts --text "$(cat tts_input.txt)". Then I'd split the audio from the video, import both the TTS audio and the original audio into Audacity, cut and move the generated audio around until it matches the audio from the video, and finally merge the audio with the video again.
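For concreteness, the steps around the Audacity work would look roughly like this. This is only a sketch: it assumes ffmpeg is available, that the coqui-tts tts CLI accepts an --out_path flag, and the file names are placeholders I made up.

    # 1. Pull the original audio out of the screencast, to use as a reference track in Audacity.
    ffmpeg -i screencast.mp4 -vn -acodec pcm_s16le original_audio.wav

    # 2. Generate the TTS narration from the prepared script.
    tts --text "$(cat tts_input.txt)" --out_path tts_raw.wav

    # 3. (Manual step) Align tts_raw.wav against original_audio.wav in Audacity,
    #    then export the result as tts_aligned.wav.

    # 4. Remux the video with the aligned TTS track, dropping the original audio.
    ffmpeg -i screencast.mp4 -i tts_aligned.wav -map 0:v -map 1:a -c:v copy -shortest screencast_tts.mp4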

It's step 3 above that I'm looking to automate, because honestly the whole process seems very involved and I would get tired of it really quickly.

1 Answer


You’re absolutely correct that going from Speech-to-Text, then Text-to-Speech would be too slow. Even on the most powerful computers you will not get the latency below 400ms. This is primarily because the STT component cannot convert a spoken word to a text representation until after you’ve finished saying the word, and a TTS component cannot convert a written piece of text to a semi-accurate representation of sound in the English language until after the first few syllables are in place. Add to this the time many packages require while waiting for context behind a statement, and you’re at half-a-second in no time at all. This is much easier for languages such as Japanese where each character has its own sound that does not change depending on what characters are surrounding it (leaving aside ゃ, ゅ, and ょ for the time being).

However, if the goal is to substitute your voice in real-time without burning a hole in your CPU, you may want to look at voice modulators. These are often depicted as tools used by hackers or nefarious individuals, but they’re really quite good at giving people a different sound. Lyrebird is a pretty solid tool for this, but there are many others that can integrate with streaming applications, simplifying your processes.

matigo
  • +1 for the Japanese reference. And as for voice modulators, they're just as much a tool for scambaiters, the good guys, like Trilogy Media, Kitboga, Pierogi, etc. – ChanganAuto Mar 16 '22 at 13:17
  • 3
    @ChanganAuto I immigrated to Japan many years ago and have written applications that make heavy use of STT and TTS across different languages, dialects, and accents. There's not much that I can say I know, but when it comes to having computers understand and work with spoken language in a Star Trek-like way, I have learned what not to do – matigo Mar 16 '22 at 13:23