In the media there is a lot of talk about face recognition, mainly with respect to identifying faces (i.e., assigning them to persons). Less attention is paid to recognizing facially expressed emotions, although a lot of research is being done in that direction, too. Even less attention is paid to recognizing the facially expressed emotions of a single, specific person (which could be much more fine-grained), even though this would be a very interesting topic.
What holds for faces holds similarly for voices. With the help of artificial intelligence, voices can be identified (i.e., assigned to persons), and emotions expressed by the voice can be recognized, both on a general and on an individual level.
My general question goes in a different direction: since huge progress has been made in visual scene analysis ("What is seen in this scene?"), there has probably been some progress in auditory scene analysis as well: "What is heard in this scene?"
My specific question is: are there test cases and results where some AI software was given auditory data containing many voices and could tell how many voices there were?
As a rather easy specific test case, consider a Gregorian chant sung in perfect unison.
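To make the question concrete, here is a minimal sketch of how one might run such a test today with the open-source pyannote.audio library, whose speaker-diarization pipeline segments a recording by speaker; counting the distinct speaker labels then gives an estimate of the number of voices. The model identifier, the access-token placeholder, and the file name are assumptions for illustration, not a claim about what existing test cases used:

```python
# A minimal sketch, assuming pyannote.audio is installed and a Hugging Face
# access token is available. The model identifier below is an assumption;
# check the pyannote.audio documentation for the current release name.
from pyannote.audio import Pipeline

# Load a pretrained speaker-diarization pipeline.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # hypothetical placeholder
)

# Run it on a recording, e.g. the Gregorian chant test case.
diarization = pipeline("chant.wav")  # hypothetical file name

# The result is an Annotation whose labels are the speakers the pipeline
# found, so counting distinct labels answers "how many voices were there?".
print("estimated number of voices:", len(diarization.labels()))

# Inspect who sings/speaks when.
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.1f}s - {segment.end:.1f}s: {speaker}")
```

Whether a pipeline built for conversational speech generalizes to voices singing in unison is exactly the kind of published test result the question is after.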