I want to build a personal assistant that listens to me continuously.

The flow looks like this:

  1. Continuously record voice.
  2. Stream it to the Google Speech API.
  3. Get back the text in real time -> parse it for intent, etc.
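The cost-saving idea below can be sketched as a "gate" in front of the API: frames reach the upload step only while a voice-activity detector says speech is present, with a short "hangover" so word endings aren't clipped. This is an illustrative sketch, not a real API; `isSpeechFrame` and `forward` are placeholders you would wire to your own VAD and your Google Speech streaming call.

```javascript
// Keep forwarding for a few frames after speech ends ("hangover"),
// so trailing syllables are not cut off.
const HANGOVER_FRAMES = 10;

// isSpeechFrame: your VAD decision for one audio frame (placeholder)
// forward: whatever sends a frame to the speech API (placeholder)
function createGate(isSpeechFrame, forward) {
  let hangover = 0;
  return function onFrame(frame) {
    if (isSpeechFrame(frame)) {
      hangover = HANGOVER_FRAMES;
    }
    if (hangover > 0) {
      hangover -= 1;
      forward(frame); // this frame is billed
      return true;
    }
    return false; // dropped: no API cost
  };
}
```

With this shape, silence between utterances never reaches the API at all, which is where the savings come from.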

The problem is that the Google Speech API gets expensive if you record for hours. A better approach is to submit only the parts where I'm actually speaking. Then the cost of running this full time (17 hours a day, every day) becomes much more manageable. Now my question is:

How can I detect that a voice is present in the microphone stream?

I have a lot of background noise, so a naive volume-threshold detector is not a good solution. I need something more intelligent. It doesn't need to be very accurate - just good enough not to break my cloud-computing budget. I'm thinking that the human voice sounds distinct enough that detecting when it is present shouldn't be too big a problem.

What do you recommend, given that this is a real-time stream - not an audio file?

The audio will be captured in a Chromium browser (Electron) with the getUserMedia API, and I plan to handle the streaming logic with Node.js.
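One practical detail with this setup: audio from the renderer arrives as arbitrarily sized sample blocks, while VAD decisions are easiest on fixed-duration frames (commonly 10-30 ms). A minimal Node-side re-framer, assuming mono Float32 samples at 16 kHz (both assumptions, adjust to your capture settings):

```javascript
const SAMPLE_RATE = 16000;
const FRAME_MS = 30;
const FRAME_SAMPLES = (SAMPLE_RATE * FRAME_MS) / 1000; // 480 samples

// onFrame receives exactly FRAME_SAMPLES samples per call
function createFramer(onFrame) {
  let pending = new Float32Array(0);
  return function push(block) {
    // append the incoming block to any leftover samples
    const buf = new Float32Array(pending.length + block.length);
    buf.set(pending, 0);
    buf.set(block, pending.length);
    let offset = 0;
    while (buf.length - offset >= FRAME_SAMPLES) {
      onFrame(buf.subarray(offset, offset + FRAME_SAMPLES));
      offset += FRAME_SAMPLES;
    }
    pending = buf.slice(offset); // keep the remainder for the next push
  };
}
```

The VAD then only ever sees uniform frames, which keeps its thresholds meaningful.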

Note: there is a built-in SpeechRecognition API in Electron, but in my experience it currently doesn't work (not even after I supply my API key), and even if it had worked, I think it would have the same cost problem. That is why I'm trying to provide my own implementation.

I don't know what I'm doing, so any insight is welcome :) Thank you.

DukeZhou
AIon

1 Answer

Your problem is an old one. There are many methods, referred to as Voice Activity Detection (VAD) methods, which detect speech from an audio signal.

The typical design of a VAD algorithm combines some or all of these three stages:

  1. A noise reduction stage, e.g. via spectral subtraction.

  2. Some features or quantities are calculated from a section of the input signal.

  3. A classification rule is applied to classify the section as speech or non-speech – often this classification rule finds when a value exceeds a threshold.

There may be some feedback in this sequence, where parameters (like the noise threshold or the classification threshold) are tuned to improve the estimate or to adaptively vary the threshold(s). These feedback operations improve the VAD performance in non-stationary noise (i.e. when the noise varies a lot).

Check out the Wikipedia page on Voice Activity Detection for more information.

Jaden Travnik