
I want to feed a sound signal of about 2-3 seconds to my neural network. I have trained the network on a single word: if I say "Hello", it should report that "Hello" was spoken, and if I say some other word, such as "World", it should report that "Hello" was not spoken. I only need to classify whether the sound is a specific command or word. What is the best way to do this? I am not very advanced in DNNs; I only know about NNs and CNNs. Is there a research paper or tutorial on this, or could someone explain the approach?

1 Answer


If your speech data has a fixed length, you can detect the content using a CNN alone. You can treat the problem as binary classification (1 if the target word is spoken, 0 otherwise).

But first, you need to make the input length fixed. For example, take 2 seconds as the fixed length: if the recorded speech is longer than 2 seconds, crop it, and if it is shorter, pad it with zeros.
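The crop-or-pad step can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `fix_length`, the 16 kHz sample rate in the usage note, and the choice to crop from the end and pad at the end are assumptions, not something the answer prescribes.

```python
import numpy as np

def fix_length(signal, sample_rate, seconds=2.0):
    """Crop or zero-pad a 1-D signal to exactly `seconds` of audio."""
    target = int(round(sample_rate * seconds))
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) >= target:
        # Too long (or exact): keep only the first `target` samples.
        return signal[:target]
    # Too short: copy the signal and leave trailing zeros as padding.
    padded = np.zeros(target, dtype=np.float32)
    padded[:len(signal)] = signal
    return padded
```

For example, `fix_length(recording, 16000)` always returns 32000 samples, whatever the length of `recording`.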

Next, you can either use the raw (time-domain) data or transform it with a feature extraction method (FFT, MFCC, or MFSC). Then use a CNN the same way you would to classify an image: you can treat the plot of the sound wave, or its time-frequency representation, as a 2D image.
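As a rough sketch of the FFT option, the snippet below turns a fixed-length signal into a log-magnitude spectrogram, a 2D array that a CNN can consume like a single-channel image. The function name `log_spectrogram` and the frame settings (400-sample frames with a 160-sample hop, i.e. 25 ms and 10 ms at an assumed 16 kHz) are illustrative assumptions; MFCC or MFSC features would need additional steps that are not shown here.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160):
    """Return a (num_frames, frame_len // 2 + 1) log-magnitude spectrogram."""
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # FFT of every frame
    return np.log(magnitude + 1e-6)  # small offset avoids log(0)
```

With 2 seconds of audio at 16 kHz (32000 samples), this yields an array of shape (198, 201): time frames along one axis and frequency bins along the other.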

However, if your recordings vary in length, you can use a CNN to detect each phoneme and then combine the phonemes into a sequence using an RNN or HMM. This method is also described in the mentioned papers.

malioboro
  • Yes, I need to crop. I have thought of holding the sound data in a deque: I would keep a 2 s window, pass those 2 seconds to the model, pop a sample from the front of the window, push the newly recorded sample onto the back, and repeat. – Nimit Bhardwaj May 09 '19 at 04:57
  • One more thing I want to ask: for recognizing the word "hello", roughly what dataset size would give reasonably good accuracy, if you know? Otherwise I understand the logic, and I will read the papers to implement it. Thanks – Nimit Bhardwaj May 09 '19 at 04:59
  • Yes, you can use your "window" method. If you use a small "shift", it also gives some variation in your training data. I can't give an exact number; try a small dataset first (~100 samples for each class) and increase it if the accuracy is still low. To improve accuracy you also need to include some data that sounds similar to "hello". – malioboro May 09 '19 at 07:35
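
The "window" and "shift" idea from the comments above can be sketched with `collections.deque`. This is a minimal illustration rather than a tuned implementation: the generator name `sliding_windows` and its parameters are hypothetical, and a real system would pass each window to the model instead of collecting windows in a list.

```python
from collections import deque

def sliding_windows(samples, window_size, shift):
    """Yield successive windows of `window_size` samples, advancing by `shift`."""
    if window_size < 1 or shift < 1:
        raise ValueError("window_size and shift must be at least 1")
    buffer = deque(maxlen=window_size)  # oldest samples fall off the front
    until_next = window_size            # samples still needed before the next window
    for sample in samples:
        buffer.append(sample)
        until_next -= 1
        if until_next == 0:
            yield list(buffer)
            until_next = shift
```

A shift of 1 reproduces the pop-one, push-one scheme exactly, while a larger shift produces fewer, less overlapping windows.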