
I'm trying to train an RNN on a chunk of audio data, where X and Y are two audio channels loaded into numpy arrays. The objective is to experiment with different NN designs and train them to transform single-channel (mono) audio into two-channel (stereo) audio.

My questions are:

  1. Do I need a stateful network type, like LSTM? (I think yes.)
  2. How should I organize the data, given that there are millions of samples and I can't load a matrix of every window of data into memory in a reasonable time?

For example, say I have an array [0, 0.5, 0.75, 1, -0.5, 0.22, -0.30, ...] and I want to take a window of 3 samples. I guess I need to create a matrix with every single-sample shift, like this, right?

[[0.00, 0.50, 0.75]
 [0.50, 0.75, 1.00]
 [0.75, 1.00,-0.50]
 [1.00,-0.50, 0.22]]

Where does batch_size fit in? Should I build a matrix like this for every sample shift? For every window? That could be very memory-consuming if I intend to load a 4-minute song.

Is this example matrix a single batch? A single sample?
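For what it's worth, I can at least build the single-sample-shift matrix as a view in numpy, so the shifts themselves don't copy memory (this assumes numpy >= 1.20 for `sliding_window_view`); I'm just not sure it's the right shape for training:

```python
import numpy as np

# The example signal from above, with a window of 3 samples.
x = np.array([0.0, 0.5, 0.75, 1.0, -0.5, 0.22, -0.30])

# sliding_window_view returns a *view* into x, so no extra memory
# is allocated for the shifted rows.
windows = np.lib.stride_tricks.sliding_window_view(x, window_shape=3)

print(windows.shape)  # (5, 3): one row per single-sample shift
print(windows[0])     # first row: [0.   0.5  0.75]
```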

Douglas Daseeco
Dmitry

1 Answer

  1. Yes, intuition says that RNNs like LSTM or GRU should work better in your case, because predicted values may depend on input patterns from much earlier time intervals.
  2. There is no reason to create windows shifted by a single sample, because consecutive windows would contain almost the same information for your model. A viable approach is to shift by the full window size (non-overlapping windows); keeping some overlap between windows works as well.
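As a minimal sketch of the second point, non-overlapping windows can be cut and served in batches with a plain generator, so the full shifted matrix never has to exist in memory at once (the function name and sizes here are illustrative, not from the question):

```python
import numpy as np

def window_batches(mono, window, batch_size):
    """Yield batches of non-overlapping windows shaped for an RNN:
    (batch_size, window, 1). A trailing partial window is dropped."""
    n_windows = len(mono) // window
    windows = mono[:n_windows * window].reshape(n_windows, window, 1)
    for start in range(0, n_windows, batch_size):
        yield windows[start:start + batch_size]

# Illustrative: one second of fake mono audio at 8 kHz.
mono = np.random.randn(8000).astype(np.float32)
batches = list(window_batches(mono, window=256, batch_size=8))
print(batches[0].shape)  # (8, 256, 1)
```

The same generator pattern extends naturally to streaming a long file from disk chunk by chunk instead of holding it all in an array.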

Keep in mind that, as a general rule when processing audio, it makes sense to convert the raw waveform into vectors representing the audio spectrum before feeding it into an LSTM RNN (see e.g. this video: https://www.coursera.org/learn/nlp-sequence-models/lecture/sjiUm/speech-recognition).
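A rough sketch of that preprocessing step, as a plain numpy magnitude spectrogram (the frame and hop sizes are illustrative; in practice you would likely use a library STFT or mel-spectrogram):

```python
import numpy as np

def spectrogram_frames(mono, n_fft=256, hop=128):
    """Magnitude-spectrum frames of shape (time, freq) from raw audio --
    a common preprocessing step before feeding an LSTM."""
    n_frames = 1 + (len(mono) - n_fft) // hop
    frames = np.stack([mono[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hanning(n_fft)                  # taper each frame
    return np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, n_fft // 2 + 1)

mono = np.random.randn(4096)
spec = spectrogram_frames(mono)
print(spec.shape)  # (31, 129)
```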

Batch size is separate from the sample/window size in your case: the window size sets how many time steps the network sees at once, while the batch size is how many such windows are processed per gradient update.
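Concretely, in a Keras-style LSTM these are separate axes of the input tensor (the numbers below are only illustrative):

```python
import numpy as np

batch_size = 32   # windows processed per gradient update
timesteps  = 256  # samples per window (the window size)
features   = 1    # one value per time step for mono audio

# One training batch for an LSTM has shape (batch_size, timesteps, features).
batch = np.zeros((batch_size, timesteps, features), dtype=np.float32)
print(batch.shape)  # (32, 256, 1)
```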

Roman