I want to build a personal assistant that listens to me continuously.
The flow looks like this:
- continuously record voice
- stream it to the Google Speech API (rough sketch of the Node side below)
- get back the text in real time -> parse for intent, etc.
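On the Node side I was imagining something like this, based on the `@google-cloud/speech` streaming sample (not tested yet, and the config values are just my guesses for my setup):

```js
// Rough sketch of the Node side, based on the @google-cloud/speech docs sample.
// Not tested yet - encoding/sample rate are just what I'd guess for my setup.
const speech = require('@google-cloud/speech');

const client = new speech.SpeechClient();

const recognizeStream = client
  .streamingRecognize({
    config: {
      encoding: 'LINEAR16',      // raw 16-bit PCM coming from the browser
      sampleRateHertz: 16000,
      languageCode: 'en-US',
    },
    interimResults: true,        // partial transcripts in real time
  })
  .on('error', console.error)
  .on('data', (data) => {
    const result = data.results[0];
    if (result && result.alternatives[0]) {
      // this is where I'd parse the text for intent
      console.log(result.alternatives[0].transcript);
    }
  });

// Audio chunks coming from the Electron renderer would get written here:
// recognizeStream.write(pcmChunkBuffer);
```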
The problem is that the Google Speech API gets expensive if you record for hours. A better approach is to submit only the parts where I'm actually speaking. Then the cost of running this full-time (17 hours a day, every day) becomes very manageable. Now my question is:
How can I detect that a voice is present in the microphone stream?
I have a lot of background noise, so a dumb increase-in-volume check is not a good solution. I need something more intelligent. It doesn't need to be very accurate - just good enough not to break my cloud computing budget. I'm thinking the human voice sounds distinct enough that detecting when it's there shouldn't be too hard.
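Just to be clear about what I mean by a dumb volume check - something like this per-chunk RMS threshold (the threshold value is made up), which fires on any loud background noise, not just speech:

```js
// What I mean by a "dumb" volume check: a plain RMS threshold per audio chunk.
// It triggers on any loud background noise, not specifically on speech.
function looksLikeSpeech(samples, threshold = 0.02) { // threshold is a guess
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  const rms = Math.sqrt(sumSquares / samples.length);
  return rms > threshold;
}
```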
What would you recommend, given that this is a real-time stream, not an audio file?
The audio is captured in a Chromium browser (Electron) with the getUserMedia API, and I plan to handle the streaming logic in Node.js.
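Here is roughly how I'm capturing the audio in the renderer right now (`sendChunkToNode` is a placeholder for whatever IPC/WebSocket forwarding I still need to wire up):

```js
// Renderer process: grab the mic and hand raw PCM chunks to my own handler.
// sendChunkToNode is a placeholder for however I end up shipping chunks to Node.
async function startCapture() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioCtx = new AudioContext();
  const source = audioCtx.createMediaStreamSource(stream);

  // ScriptProcessorNode is deprecated but simple; 4096-sample mono buffers.
  const processor = audioCtx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    const samples = event.inputBuffer.getChannelData(0); // Float32Array
    sendChunkToNode(samples);
  };

  source.connect(processor);
  processor.connect(audioCtx.destination);
}
```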
Note: there is a built-in SpeechRecognition API in Electron, but in my experience it currently doesn't work (not even after I provide my API key), and even if it did work, I think it would have the same cost problem. That's why I'm trying to provide my own implementation.
I don't know what I'm doing, so any insight is welcome :) Thank you.