I want to make a screencast, substituting my voice with audio generated by a TTS engine. The problem is that if I use an STT engine followed by a TTS engine, the generated audio will not be in sync with my voice and won't make any sense in the screencast. Is there a way to do this so that the generated audio stays in sync with my voice?
I'm not talking about a live stream, just a video. I can still edit it, but I would prefer not to have to copy every piece of the generated audio into its proper place by hand. I would expect something that cuts the audio into pieces by itself and moves them around to roughly match the spectrogram (I think that's the term) of my voice, so I don't have to do everything manually.
I had thought of just creating the TTS audio separately, by writing my script into a text file and feeding it to coqui-tts:

```sh
tts --text "$(cat tts_input.txt)"
```

Then I would split the audio from the video, import both the TTS audio and the original audio into Audacity, cut the generated audio and move it around until it matches the original, and finally merge the audio back into the video.
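For the split and merge steps I would presumably use ffmpeg; a minimal sketch, assuming the screencast is `screencast.mp4` and the edited track is exported as `aligned.wav` (both names are placeholders):

```sh
# Extract the voice track from the screencast as a WAV for editing in Audacity
ffmpeg -i screencast.mp4 -vn -acodec pcm_s16le voice.wav

# ...edit in Audacity until the TTS audio lines up, export as aligned.wav...

# Put the edited TTS track back, copying the video stream untouched
ffmpeg -i screencast.mp4 -i aligned.wav -map 0:v:0 -map 1:a:0 -c:v copy -c:a aac screencast_tts.mp4
```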
I'm looking for a way to automate that process, because honestly it seems very involved and I would get tired of it really quickly.
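The closest I can imagine scripting myself is generating one TTS clip per sentence, so at least the pieces are pre-cut; a rough sketch, assuming the `tts` CLI accepts `--out_path` (which I believe it does) and that each sentence of the script sits on its own line:

```sh
# Generate one TTS clip per line of the script, so each sentence
# can be dragged into place individually instead of slicing one long file
i=0
while IFS= read -r line; do
    i=$((i + 1))
    tts --text "$line" --out_path "$(printf 'clip_%03d.wav' "$i")"
done < tts_input.txt
```

But that still leaves the actual placement to me, which is exactly the part I'd like to avoid.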