I need to use YouTube's automatic captions to precisely isolate excerpts from a video, align them with their text, and generate a dataset to train a model in French.
I've already written the script, but when I compare the audio with the matching text, I notice that the text is often offset (positively or negatively). For example, the text reads "1 2 3 4" while the audio says "0 1 2 3" (the "0" comes from the previous clip).
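For reference, here is a simplified sketch of the approach. The library choices (youtube_transcript_api and pydub), the video ID, and the file names are only illustrative; my actual script may differ, but the idea is the same: cut the audio exactly at the caption timestamps.

```python
# Sketch: fetch the French auto-captions and cut clips at the caption
# timestamps. Library choices and identifiers are illustrative.
from youtube_transcript_api import YouTubeTranscriptApi
from pydub import AudioSegment

VIDEO_ID = "XXXXXXXXXXX"   # placeholder video ID
AUDIO_FILE = "audio.mp3"   # audio track downloaded separately

# Each caption entry has 'text', 'start' (seconds) and 'duration' (seconds)
captions = YouTubeTranscriptApi.get_transcript(VIDEO_ID, languages=["fr"])
audio = AudioSegment.from_file(AUDIO_FILE)

for i, cap in enumerate(captions):
    start_ms = int(cap["start"] * 1000)
    end_ms = int((cap["start"] + cap["duration"]) * 1000)
    clip = audio[start_ms:end_ms]            # cut exactly at the caption bounds
    clip.export(f"clip_{i:05d}.wav", format="wav")
    with open(f"clip_{i:05d}.txt", "w", encoding="utf-8") as f:
        f.write(cap["text"])
```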
If you open a YouTube video in French and click on "Open transcript", you can notice the same delay.
Here is an example that is very noticeable on short clips: the audio says "conditions de travail" whereas the transcript reads only "de travail".
I measured the delay in Audacity and it is not consistent across the clips. Please note that it does not seem to happen in English videos.
If I use Google Speech Recognition in Python (recognize_google) on the audio clips, there is no such delay (partly because the clips are already separated), but the transcription lacks punctuation, which is not good for training my model.
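For clarity, this is roughly how I call it, via the speech_recognition package (the clip file name is just an example):

```python
# Transcribe one clip with Google Speech Recognition.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("clip_00001.wav") as source:
    audio = recognizer.record(source)        # read the whole clip

# Returns the words correctly, but with no punctuation or casing
text = recognizer.recognize_google(audio, language="fr-FR")
print(text)
```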
Why can't Google align the audio and the caption text more accurately?
Can you suggest a better way of aligning audio with text?