I'm trying to train an RNN on a chunk of audio data, where X and Y are two audio channels loaded into numpy arrays. The objective is to experiment with different NN designs to train them to transform single-channel (mono) audio into two-channel (stereo) audio.
My questions are:
- Do I need a stateful network type, like an LSTM? (I think yes.)
- How should I organize the data, considering that there are millions of samples and I can't load a matrix of every window of data into memory in a reasonable time-span?
For example, if I have an array like [0, 0.5, 0.75, 1, -0.5, 0.22, -0.30, ...] and I want to take a window of 3 samples, I guess I need to create a matrix with every sample shift, like this, right?
[[0.00, 0.50, 0.75]
[0.50, 0.75, 1.00]
[0.75, 1.00,-0.50]
[1.00,-0.50, 0.22]]
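For what it's worth, here is one way such a window matrix can be built in NumPy without copying the data (this assumes NumPy >= 1.20, where `sliding_window_view` was added; it returns a view, so it doesn't blow up memory by itself):

```python
import numpy as np

x = np.array([0.0, 0.5, 0.75, 1.0, -0.5, 0.22, -0.30])

# Each row is the window starting at one sample shift; this is a
# zero-copy view into x, not a new matrix.
windows = np.lib.stride_tricks.sliding_window_view(x, window_shape=3)

print(windows.shape)   # (5, 3): len(x) - window + 1 rows
print(windows[0])      # [0.   0.5  0.75]
print(windows[1])      # [0.5  0.75 1.  ]
```

The memory problem only appears once something copies this view (e.g. shuffling, or a framework converting it to its own tensor), which is where batching comes in.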
Where is my batch_size? Should I make a matrix like this for each sample shift? For each window? This could be very memory-consuming if I intend to load a 4-minute song.
Is this example matrix a single batch? A single sample?
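To make the question concrete, this is the kind of thing I imagine to avoid materializing every window at once: a generator that builds one batch of windows at a time (`window_batches` is a hypothetical helper name, and it assumes each target sample is aligned to the last sample of its window):

```python
import numpy as np

def window_batches(x, y, window=3, batch_size=32):
    """Yield (X_batch, y_batch) pairs lazily, so the full window
    matrix never exists in memory at once."""
    n_windows = len(x) - window + 1
    for start in range(0, n_windows, batch_size):
        stop = min(start + batch_size, n_windows)
        # Build only this batch's windows; each row is one sample shift.
        X_batch = np.stack([x[i:i + window] for i in range(start, stop)])
        # Target for each window is the sample at the window's end.
        y_batch = y[start + window - 1:stop + window - 1]
        yield X_batch, y_batch

mono = np.arange(10, dtype=float)       # stand-in for one channel
stereo_l = mono * 2.0                   # stand-in for a target channel
for X_batch, y_batch in window_batches(mono, stereo_l, window=3, batch_size=4):
    print(X_batch.shape, y_batch.shape)  # e.g. (4, 3) (4,)
```

Under this framing, each yielded matrix is one batch, each row of it is one sample (one window), and batch_size is just how many rows I choose to stack per step.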