"Deep Learning" neural networks are now successful at image-recognition tasks that I would not have expected say 10 years ago. I wonder if the current state of the art in machine learning could generally tell the difference between the sound of a dog or cat moving around a house, and a person walking in the same area, taking as input only the sound captured by a microphone. I think I could generally tell the difference, but it is hard to explain exactly how. But this is also true of some tasks that deep learning is now succeeding at. So, I suspect it is possible but it's not clear how you would go about it.
I have found algorithms for detecting human speech (Wikipedia: "Voice activity detection"), but separating animal footsteps from human footsteps seems more subtle.
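For what it's worth, here is a rough sketch of how I imagine the problem would be framed: a two-class audio classification task, with short clips converted to log-mel spectrograms and fed to a small convolutional network. Everything here is hypothetical; the `clips/human` and `clips/animal` directories, the `FootstepNet` name, and the toy training loop are just placeholders to make the question concrete, not a claimed solution.

```python
# Hypothetical sketch: classify short audio clips as human vs. animal movement.
# Assumes a labelled folder layout (clips/human/*.wav, clips/animal/*.wav).
import glob
import librosa
import numpy as np
import torch
import torch.nn as nn

def log_mel(path, sr=16000, n_mels=64, duration=2.0):
    """Load a clip and convert it to a fixed-size log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True, duration=duration)
    y = np.pad(y, (0, max(0, int(sr * duration) - len(y))))  # pad short clips
    m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(m, ref=np.max)

class FootstepNet(nn.Module):
    """Small CNN over log-mel 'images' with two outputs: human, animal."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

if __name__ == "__main__":
    # Hypothetical data: label 0 = human footsteps, label 1 = animal movement.
    paths = glob.glob("clips/human/*.wav") + glob.glob("clips/animal/*.wav")
    labels = [0 if "human" in p else 1 for p in paths]
    X = torch.tensor(np.stack([log_mel(p) for p in paths])).unsqueeze(1).float()
    y = torch.tensor(labels)

    model = FootstepNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(10):  # toy training loop, no train/test split shown
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
```

I assume the hard part would not be the model itself but collecting enough labelled recordings and making it robust to room acoustics, flooring, and microphone placement.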