How to prepare audio data for deep learning?

Question

Audio data is typically an array with the waveform represented by values from -1 to 1. There are two issues with that:

1. If all values are inverted, so that -1 becomes 1 and 1 becomes -1, the audio sounds the same. But if, for example, I need to measure the difference between two audio files, a per-element difference will say that two inverted audio arrays are very different. Realistically, two sine waves can easily be shifted relative to each other so that one is the inverse of the other.
2. A related issue is that a wave, for example a 1000 Hz sound wave, often sounds like a "flat" sound. In the array, however, it is literally a sine, and two sines can be shifted relative to each other, which makes the difference inconsistent. Ideally a sine would be a sequence of the same number, which is obviously hard to do because audio is usually far more complex than a sine wave.
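
The two issues can be reproduced with a few lines of NumPy. This is only a sketch: the 16 kHz sample rate, the one-second duration, and the helper names `waveform_mse` and `magnitude_mse` are assumptions made for illustration and are not part of the original setup. It compares a 1000 Hz sine with its inverted and phase-shifted copies, first sample by sample and then through the magnitude of the Fourier transform, which discards phase.

```python
import numpy as np

sr = 16000                       # assumed sample rate in Hz
t = np.arange(sr) / sr           # one second of sample times

x = np.sin(2 * np.pi * 1000 * t)                       # 1000 Hz sine
x_inverted = -x                                        # polarity flipped
x_shifted = np.sin(2 * np.pi * 1000 * t + np.pi / 2)   # quarter-cycle shift


def waveform_mse(a, b):
    """Mean squared per-sample difference between two waveforms."""
    return float(np.mean((a - b) ** 2))


def magnitude_mse(a, b):
    """Mean squared difference between FFT magnitudes (phase is discarded)."""
    mag_a = np.abs(np.fft.rfft(a))
    mag_b = np.abs(np.fft.rfft(b))
    return float(np.mean((mag_a - mag_b) ** 2))


print("waveform, inverted: ", waveform_mse(x, x_inverted))
print("waveform, shifted:  ", waveform_mse(x, x_shifted))
print("magnitude, inverted:", magnitude_mse(x, x_inverted))
print("magnitude, shifted: ", magnitude_mse(x, x_shifted))
```

The per-sample error is large for both copies even though they sound identical to the original, while the magnitude comparison is essentially zero for both. A whole-signal FFT is used here only because the test signal is a steady tone.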

So what I tried was this: I make a copy of the audio files, calculate a gradient, which hopefully reduces the shifting issues, and then take the absolute value of the array (-1 becomes 1). I then use that copy when comparing audio arrays. When I used it to evaluate how close generated audio is to the original, it caused a lot of low frequencies in the generated audio. Looking at the waveform, this is because the gradient makes low frequencies very quiet, since they have a small rate of change, so my model doesn't see them. To be clear, I am not sure the gradient is a better match at all. Ideally I'd want something like the gradient that doesn't reduce low frequencies.
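
The low-frequency attenuation can be checked directly. The question does not say how the gradient is computed, so the sketch below assumes `np.gradient` (a central difference along the sample axis) followed by `np.abs`. The 50 Hz and 2000 Hz test tones, the 16 kHz sample rate, and the helper name `abs_gradient` are likewise assumptions for illustration.

```python
import numpy as np

sr = 16000                       # assumed sample rate in Hz
t = np.arange(sr) / sr           # one second of sample times

# Two tones with the same peak amplitude but different frequencies.
low = np.sin(2 * np.pi * 50 * t)
high = np.sin(2 * np.pi * 2000 * t)


def abs_gradient(x):
    """Absolute value of the sample-to-sample gradient (assumed preprocessing)."""
    return np.abs(np.gradient(x))


low_level = float(np.mean(abs_gradient(low)))
high_level = float(np.mean(abs_gradient(high)))

print("mean |gradient|, 50 Hz:  ", low_level)
print("mean |gradient|, 2000 Hz:", high_level)
print("ratio high / low:        ", high_level / low_level)
```

Both tones are equally loud in the waveform, but after the gradient the 50 Hz tone is tens of times quieter than the 2000 Hz tone. Differentiation scales each frequency component roughly in proportion to its frequency, so errors in the low frequencies contribute very little to a comparison made on the gradient.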

There is also the spectrogram. Admittedly I haven't looked into it much, but the one I tried, librosa's spectrogram functionality, takes quite a long time to convert back into audio. If there is no quick way to do this with 1D arrays, I can use that.
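
If the spectrogram is only used to compare two clips, it never has to be converted back into audio. The following is a minimal NumPy-only sketch of that idea, not librosa's implementation: the function names `stft_magnitude` and `spectral_distance`, the frame length of 512, the hop of 128, and the `eps` constant are all assumptions chosen for illustration.

```python
import numpy as np


def stft_magnitude(x, frame_len=512, hop=128):
    """Magnitude spectrogram from Hann-windowed frames, shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames * window, axis=1))


def spectral_distance(a, b, eps=1e-6):
    """Mean absolute difference between log-magnitude spectrograms."""
    log_a = np.log(stft_magnitude(a) + eps)
    log_b = np.log(stft_magnitude(b) + eps)
    return float(np.mean(np.abs(log_a - log_b)))


sr = 16000                       # assumed sample rate in Hz
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
other = np.sin(2 * np.pi * 300 * t)

print("tone vs inverted tone:", spectral_distance(tone, -tone))
print("tone vs 300 Hz tone:  ", spectral_distance(tone, other))
```

An inverted copy gets a distance of zero because the magnitude ignores sign, while a tone at a different frequency gets a clearly larger distance. The log scaling is one common way to keep quiet components from being swamped by loud ones; whether it suits a particular model is a separate question.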

@Rob I am asking what format will be better for an AI to process — nikishev., Feb 16 '23 at 17:46
It still seems [orthogonal](https://en.wiktionary.org/wiki/orthogonal) (definition 4) to artificial intelligence, and [more like signal processing](https://ketanhdoshi.github.io/Audio-Mel/). That is where I'd suggest you would get a better answer, at https://dsp.stackexchange.com/ - they have an [artificial intelligence tag](https://dsp.stackexchange.com/questions/tagged/artificial-intelligence) available for such questions. --- It seems like ***before*** *AI* rather than ***during*** *AI*. For example: https://dsp.stackexchange.com/q/75296/37400 gives **5** answers. — Rob, Feb 17 '23 at 05:48
I noticed that **we** do have a couple of questions with answers: https://ai.stackexchange.com/a/31721/17742 https://ai.stackexchange.com/a/27147/17742 — Rob, Feb 17 '23 at 05:48
Can you please explain how you calculate what you call "gradient"? Gradient of what? — nbro, Feb 22 '23 at 23:16
