I am wondering about the following concept:
A neural network takes two audio inputs (preferably music) and outputs a real number between 0 and 1 that describes the "similarity" between the first and the second track.
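To make the input/output contract concrete, here is a minimal Python sketch of what I have in mind. It is not a trained model: `frame_energies`, `make_encoder` and `similarity` are hypothetical names, the encoder is a single untrained layer with random weights, and rescaling cosine similarity to [0, 1] is only one assumed way to get a score in that range.

```python
import math
import random


def frame_energies(samples, n_frames=8):
    """Crude stand-in for a spectrogram: mean energy in n_frames equal chunks."""
    size = max(1, len(samples) // n_frames)
    return [
        sum(x * x for x in samples[i * size:(i + 1) * size]) / size
        for i in range(n_frames)
    ]


def make_encoder(n_in=8, n_out=4, seed=0):
    """Hypothetical shared encoder: one untrained tanh layer, fixed random weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

    def encode(samples):
        feats = frame_energies(samples, n_in)
        return [math.tanh(sum(w * f for w, f in zip(row, feats))) for row in weights]

    return encode


def similarity(encode, track_a, track_b):
    """Encode both tracks with the same encoder and rescale cosine from [-1, 1] to [0, 1]."""
    a, b = encode(track_a), encode(track_b)
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.5  # cosine is undefined for a zero embedding; assumed neutral score
    cos = sum(x * y for x, y in zip(a, b)) / norm
    return min(1.0, max(0.0, (cos + 1.0) / 2.0))
```

With a trained encoder in place of the random one, `similarity(encode, track_a, track_b)` would return the number I am after.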
As far as my understanding of neural networks goes, the problem fits the concept of NNs, since pattern recognition in music can help determine similarities and discrepancies in audio, much as in voice recognition.
However, because the inputs are long and complex and the training labels are vague (how similar, for instance, are Diana Ross's "It's Your Move" and the vaporwave legend "Floral Shoppe", exactly? 0.9? 0.6? Something else?), such a network would be extremely slow and convoluted.
Is it possible today to build and train such a model? If so, what would it look like?