
I just asked the more general question, "How to create an AI voice generator for a fantasy language?" After asking ChatGPT for some details on how that works, I'm still unclear on how you would actually go about creating the "database" of sound snippets from recordings of someone's voice (for example, me recording my own voice). How does that work exactly, and what are all the pieces required to make a great-sounding voice like Siri or the Google Maps voice? Maybe it doesn't have to be that polished, but I want something better than a simple old-school voice generator that sounds mechanical.

This is for a "fantasy" language. It has 27 letters, each of which represents a single phoneme (22 consonants, 5 vowels). I don't yet have every rule written down for which letters can sit next to each other to form consonant clusters (no diphthongs or "vowel clusters" are allowed), but I can put a list together for that if necessary.
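
To make that concrete, here is a minimal sketch (in Python, purely for illustration) of how I imagine that inventory and cluster list could look once written down; every symbol in it is a placeholder, not my actual alphabet:

```python
# Placeholder phoneme inventory for the fantasy language.
# The real alphabet has 22 consonants and 5 vowels; these symbols are stand-ins.
CONSONANTS = {"p", "t", "k", "m", "n", "s", "l", "r"}   # ...22 in total
VOWELS = {"a", "e", "i", "o", "u"}                       # no diphthongs/vowel clusters

# The explicit whitelist of allowed consonant clusters I would need to compile.
ALLOWED_CLUSTERS = {"pr", "tr", "kl", "sn", "st"}        # ...roughly 100 entries

def is_valid_cluster(cluster: str) -> bool:
    """A lone consonant is always allowed; longer groups must be on the whitelist."""
    if len(cluster) == 1:
        return cluster in CONSONANTS
    return cluster in ALLOWED_CLUSTERS
```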

ChatGPT beats around the bush when it comes to explaining exactly how I should create the sound recordings, so I want to clarify how that part works in this question:

  • Do I need to record each sound individually, and also as part of a cluster/syllable? How many variants of each sound do I need so that the result sounds good in the end? Is 2 versions per sound enough? If at least 100 consonant clusters are allowed, and there are about 2000 possible syllables, roughly how many recordings am I looking at? (A back-of-envelope tally is sketched after this list.)
  • How do I "splice" up the recordings? Is that done in some automated way, or is this part all manual? I don't know how to splice audio efficiently, so I would need someone else to help with that; what should I provide them, and what should the end result of the splicing be? (I'm not asking how to splice audio, which is easy for someone skilled; I'm asking what I should splice the recordings into.) Also, how do I label the spliced pieces? If thousands of splices are required, is there a way to automate some of this?
  • Now say I have a set of labelled splices. How do I weave them together and blend them so they sound natural? Is this where the AI training comes in? What do I need to be aware of and try to implement at this stage?
  • Given that I have a way to blend/weave the sounds together, how do I go from text to audio? (Each of my letters maps to exactly one sound.) If there are multiple variants of each sound (which sounds like a lot of work), how does the system select which variant to use?
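
To put a number on the first bullet, here is my back-of-envelope tally, assuming 2 takes per unit and the unit counts I quoted above (which units actually need separate recordings is exactly what I'm asking about):

```python
# Rough count of recordings, assuming 2 takes (variants) per recorded unit.
# The unit counts are the ones quoted above; the breakdown is only a guess
# at what would need to be recorded separately.
phonemes  = 27      # 22 consonants + 5 vowels
clusters  = 100     # allowed consonant clusters (at least)
syllables = 2000    # possible syllables
variants  = 2       # takes per unit

total = (phonemes + clusters + syllables) * variants
print(total)        # 4254 recordings if every unit is captured separately
```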

I'm just looking for some high-level details on the "audio snippet generation" part. I don't need actual code, as I suspect I will have to write my own, but knowing in detail what I need to do is the first step.

Lance
