
I am wondering about the following concept:

A neural network takes two audio inputs (preferably music) and outputs a real number between 0 and 1 that describes the "similarity" between the second track and the first.

As far as my understanding of neural networks goes, the problem fits the concept of NNs, since pattern recognition in music can help determine similarities and discrepancies in audio (compare voice recognition).

However, because the inputs are long and complex and the training labels are inherently vague (how similar exactly are, for instance, Diana Ross's "It's Your Move" and the vaporwave legend "Floral Shoppe"? 0.9? 0.6? Something else?), such a network would be extremely slow and convoluted.

Is it possible today to build and train such a model? If yes, what would it look like?

malioboro
Zoltán Schmidt
    i don't think sound recognition in humans is well-understood enough to be able to construct a model that does what you naively want to do (these two sounds are similar) – k.c. sayz 'k.c sayz' Aug 23 '17 at 14:56

1 Answer


Yes, it is possible, although the best approach might not be a neural network. Either way, you should first extract some significant features from the audio (energy, onsets, root frequencies, and others). Usually more features are extracted than are really needed, and the set is then reduced to the most significant ones with some algorithm (e.g. PCA). This gives you an array of features (say, between 10 and 100) that you can use to train your NN.
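To make the pipeline concrete, here is a minimal sketch of the feature-extraction and reduction steps using only NumPy. Everything in it is an assumption for illustration: the function names `extract_features`, `pca_reduce` and `similarity` are hypothetical, the five hand-picked features are crude stand-ins (spectral flux for onsets, the strongest FFT bin for a root frequency), and the cosine score at the end is only a baseline placeholder for the trained NN, not a replacement for it.

```python
import numpy as np

def extract_features(signal, sr, frame=1024, hop=512):
    """Return a 10-element hand-crafted feature vector for one mono track."""
    # Slice the signal into overlapping frames.
    n_frames = 1 + (len(signal) - frame) // hop
    frames = np.stack([signal[i * hop:i * hop + frame] for i in range(n_frames)])
    # Short-time energy per frame.
    energy = np.mean(frames ** 2, axis=1)
    # Magnitude spectrum per frame (Hann window to reduce leakage).
    mag = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
    # Spectral centroid: the "centre of mass" of each spectrum.
    centroid = (mag @ freqs) / (mag.sum(axis=1) + 1e-12)
    # Strongest FFT bin: a crude stand-in for a root frequency estimate.
    dominant = freqs[np.argmax(mag, axis=1)]
    # Zero-crossing rate per frame.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # Spectral flux between consecutive frames: a crude onset-strength proxy.
    flux = np.sqrt(np.sum(np.diff(mag, axis=0) ** 2, axis=1))
    # Summarise every per-frame series by its mean and standard deviation.
    series = [energy, centroid, dominant, zcr, flux]
    return np.array([stat for s in series for stat in (s.mean(), s.std())])

def pca_reduce(X, k):
    """Standardise the rows of X and project them onto the top-k principal components."""
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ vt[:k].T

def similarity(u, v):
    """Baseline score in [0, 1]: cosine similarity rescaled from [-1, 1]."""
    cos = float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return 0.5 * (1.0 + cos)

# Toy data: two nearby sine tones and one burst of white noise (1 s at 8 kHz).
rng = np.random.default_rng(0)
sr = 8000
t = np.arange(sr) / sr
tracks = [np.sin(2 * np.pi * 440 * t),
          np.sin(2 * np.pi * 445 * t),
          rng.standard_normal(sr)]

X = np.stack([extract_features(x, sr) for x in tracks])  # shape (3, 10)
Y = pca_reduce(X, k=2)                                   # shape (3, 2)
print("tone vs tone :", similarity(Y[0], Y[1]))
print("tone vs noise:", similarity(Y[0], Y[2]))
```

In a real setup you would fit the PCA on a large collection of tracks, and feed pairs of reduced feature vectors (with human-labelled similarity targets) to the network instead of calling `similarity` directly.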

Note that NNs do not tell you why two audio tracks are similar, only whether they are. This is a big disadvantage. Algorithms based on grey-box modeling, such as rule-based or case-based algorithms (possibly using fuzzy logic), could be more useful instead, provided that you have deeper knowledge of the problem.
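As a rough illustration of what "explainable" means here, the following sketch scores two tracks with a single fuzzy rule and reports which feature limited the score. The feature names, the tolerance values and the helper names (`closeness`, `explain_similarity`) are all hypothetical; a real system would need rules and tolerances chosen with domain knowledge.

```python
# Hypothetical tolerances: the difference at which two tracks stop
# counting as "close" on that feature at all.
TOLERANCES = {"tempo_bpm": 30.0, "loudness_db": 12.0, "brightness_hz": 1500.0}

def closeness(a, b, tolerance):
    """Triangular fuzzy membership: 1.0 when a == b, falling linearly to 0.0."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)

def explain_similarity(track_a, track_b, tolerances=TOLERANCES):
    """Return (score, per-feature degrees, name of the limiting feature)."""
    degrees = {name: closeness(track_a[name], track_b[name], tol)
               for name, tol in tolerances.items()}
    # Fuzzy AND with the minimum t-norm: the rule reads
    # "similar tempo AND similar loudness AND similar brightness".
    score = min(degrees.values())
    weakest = min(degrees, key=degrees.get)
    return score, degrees, weakest

# Two made-up tracks that differ only in tempo.
original = {"tempo_bpm": 120.0, "loudness_db": -10.0, "brightness_hz": 2000.0}
slowed = {"tempo_bpm": 105.0, "loudness_db": -10.0, "brightness_hz": 2000.0}

score, degrees, weakest = explain_similarity(original, slowed)
print(f"similarity = {score:.2f}, limited by {weakest}")
```

Unlike the NN output, the per-feature degrees tell you directly which property pulled the score down.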

References and further reading: teaching material from the SMC Lab at the University of Padua.

fortea