"Deep Learning" neural networks are now successful at image-recognition tasks that I would not have expected say 10 years ago. I wonder if the current state of the art in machine learning could generally tell the difference between the sound of a dog or cat moving around a house, and a person walking in the same area, taking as input only the sound captured by a microphone. I think I could generally tell the difference, but it is hard to explain exactly how. But this is also true of some tasks that deep learning is now succeeding at. So, I suspect it is possible but it's not clear how you would go about it.
I have found algorithms for detecting human speech (Wikipedia: "Voice activity detection"), but separating animal footsteps from human footsteps seems more subtle.
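For what it's worth, here is a rough sketch of how I imagine the problem would be framed: a two-class audio classification task, with short clips converted to log-mel spectrograms and fed to a small convolutional network. Everything here is hypothetical; the `clips/human` and `clips/animal` directories, the `FootstepNet` name, and the toy training loop are just placeholders to make the question concrete, not a claimed solution.

```python
# Hypothetical sketch: classify short audio clips as human vs. animal movement.
# Assumes a labelled folder layout (clips/human/*.wav, clips/animal/*.wav).
import glob
import librosa
import numpy as np
import torch
import torch.nn as nn

def log_mel(path, sr=16000, n_mels=64, duration=2.0):
    """Load a clip and convert it to a fixed-size log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True, duration=duration)
    y = np.pad(y, (0, max(0, int(sr * duration) - len(y))))  # pad short clips
    m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(m, ref=np.max)

class FootstepNet(nn.Module):
    """Small CNN over log-mel 'images' with two outputs: human, animal."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

if __name__ == "__main__":
    # Hypothetical data: label 0 = human footsteps, label 1 = animal movement.
    paths = glob.glob("clips/human/*.wav") + glob.glob("clips/animal/*.wav")
    labels = [0 if "human" in p else 1 for p in paths]
    X = torch.tensor(np.stack([log_mel(p) for p in paths])).unsqueeze(1).float()
    y = torch.tensor(labels)

    model = FootstepNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(10):  # toy training loop, no train/test split shown
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
```

I assume the hard part would not be the model itself but collecting enough labelled recordings and making it robust to room acoustics, flooring, and microphone placement.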