Model for direct audio-to-audio speech re-encoding

Question

There are many resources available for text-to-speech synthesis (and for the reverse, speech-to-text), for example Google's WaveNet.

These tools do not offer fine control over how much inflection and tonality is retained in the output. An example would be changing the vocal characteristics (implied ethnicity or sex, for example) of a dubbed voice-over while retaining the original tonality (shouting vs. calm).

A speech-to-text 'and back' round trip seems a suboptimal approach, because information such as tonality is lost before reconstruction.

Re-encoding audio-to-audio might allow characteristics to be altered in ways not available via standard audio processing methods, while retaining more of the desired tonality.
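One rough way to make 'tonality' concrete in signal terms is as a per-frame pitch contour and loudness contour, which a re-encoding model would ideally carry over unchanged while altering the rest of the voice. The sketch below is only an illustration under that assumption: the function names `frame_signal` and `prosody_features` are hypothetical, the pitch estimate is a plain autocorrelation peak pick (not a robust pitch tracker), and nothing here is taken from an existing voice-conversion tool.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames of length frame_len."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def prosody_features(x, sr, frame_len=1024, hop=256, fmin=60.0, fmax=400.0):
    """Return per-frame (f0 in Hz, RMS energy) as a crude 'tonality' summary.

    Assumption: pitch lies between fmin and fmax and the frame is voiced.
    Silent frames get f0 = 0.
    """
    frames = frame_signal(x, frame_len, hop)
    energy = np.sqrt(np.mean(frames ** 2, axis=1))

    lag_min = int(sr / fmax)  # shortest pitch period considered, in samples
    lag_max = int(sr / fmin)  # longest pitch period considered, in samples
    f0 = np.zeros(len(frames))
    for i, frame in enumerate(frames):
        frame = frame - frame.mean()
        # Autocorrelation for non-negative lags only.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        if ac[0] <= 0:
            continue  # silent frame, leave f0 at 0
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
        f0[i] = sr / lag
    return f0, energy
```

Comparing these contours between the input and the re-encoded output would give one crude check of whether shouting vs. calm delivery survived the conversion, independently of who the voice sounds like.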

Can a model distinguish between vocal characteristics and tonality as described above, and is such a speech-to-speech re-encoding tool available, ideally open source?

Hi, I have a side project on this: I am recording with a bad mic and a good mic concurrently and training a model to map low-quality audio to the high-quality space! Let me know if this is of interest and I will update once I have results. Note that this is much simpler than what I understand you are asking. – NeuroEng, Dec 23 '21 at 14:40

0 Answers