If possible consider the relationship between implementation difficulty and accuracy in voice examples or simply chat conversations.
And currently, what are the directions on algorithms like Deep Learning or others to solve this.
Summarizing text is always going to be 'easier or more efficient' than voice simply because voice requires the additional step of converting to text. That doesn't tell you anything about accuracy.
From an article published on June 1, 2017, Google’s speech recognition is now almost as accurate as humans: "According to Mary Meeker’s annual Internet Trends Report, Google’s machine learning-backed voice recognition — as of May 2017 — has achieved a 95% word accuracy rate for the English language. That current rate also happens to be the threshold for human accuracy."
If you need this kind of accuracy check out Google's Cloud Speech API. There is even a speech to text feature on the web page.
Given a speech-to-text conversion accuracy of 95%, voice will be about 5% less accurate than text if everything else is equal, but it usually isn't. People generally write better than they speak, for example in documents or emails, unless of course they are giving a formal lecture or talking in a formal meeting. If you are analyzing text messages, Tweets, or threads from typical informal forums, you will find very poor grammar, spelling, vocabulary, and punctuation. The answer to your question will therefore depend on the source of your text.
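To make the "95% word accuracy" figure concrete: it corresponds to a word error rate (WER) of about 5%, which is just the word-level edit distance between the reference transcript and the recognizer's output, divided by the reference length. A minimal sketch in plain Python (no external dependencies):

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance between a reference transcript
    and a speech-to-text hypothesis, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# 1 substitution ("the" -> "a") out of 6 reference words, so WER is about 0.17
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Any errors the summarizer itself makes then compound on top of this recognition error, which is why the voice pipeline can only be as good as its weakest stage.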
In another article, dated November 13, 2017, Why 100% Accuracy Is Not Available With Speech Recognition Software Alone, the author gives some reasons, albeit for special-purpose transcription software, why there will always be some errors.
To address your last question about where the technology is going...
Four days ago a paper by Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria entitled Recent Trends in Deep Learning Based Natural Language Processing was published, which gives some of the answers.
From the 'Conclusion' section:

With distributed representation, various deep models have become the new state-of-the-art methods for NLP problems. Supervised learning is the most popular practice in recent deep learning research for NLP. In many real-world scenarios, however, we have unlabeled data which require advanced unsupervised or semi-supervised approaches. In cases where there is lack of labeled data for some particular classes or the appearance of a new class while testing the model, strategies like zero-shot learning should be employed. These learning schemes are still in their developing phase but we expect deep learning based NLP research to be driven in the direction of making better use of unlabeled data. We expect such trend to continue with more and better model designs. We expect to see more NLP applications that employ reinforcement learning methods, e.g., dialogue systems. We also expect to see more research on multimodal learning [167] as, in the real world, language is often grounded on (or correlated with) other signals.
Finally, we expect to see more deep learning models whose internal memory (bottom-up knowledge learned from the data) is enriched with an external memory (top-down knowledge inherited from a KB). Coupling symbolic and sub-symbolic AI will be key for stepping forward in the path from NLP to natural language understanding. Relying on machine learning, in fact, is good to make a ‘good guess’ based on past experience, because sub-symbolic methods encode correlation and their decision-making process is probabilistic.
You might want to take the Stanford course Natural Language Processing with Deep Learning, available on YouTube. It will give you insight into how different kinds of neural networks can be used for different kinds of NLP tasks.
In my opinion, you can use Gated Recurrent Units (GRUs) to encode and decode text. Text will be the easier input, because raw audio as stored on a computer is harder to interpret at test time. Another approach is extractive: pick the most impactful words and use them to form sentences that summarize the original text.
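To show what a GRU actually computes, here is a minimal sketch of a single GRU cell in NumPy with toy random weights. A real encoder-decoder summarizer would stack trained cells of this kind (the dimensions and weight initialization here are arbitrary, just for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    z = sigmoid(W_z @ x + U_z @ h_prev)              # update gate
    r = sigmoid(W_r @ x + U_r @ h_prev)              # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde            # new hidden state

# Encode a toy "sentence" of 3 random word vectors into one hidden state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
weights = [rng.normal(scale=0.1, size=shape)
           for shape in [(d_h, d_in), (d_h, d_h)] * 3]  # W_z, U_z, W_r, U_r, W_h, U_h
h = np.zeros(d_h)
for x in rng.normal(size=(3, d_in)):
    h = gru_cell(x, h, *weights)
print(h.shape)  # (3,)
```

The final hidden state h is the fixed-size encoding of the whole sequence; a decoder GRU would then unroll from it to generate the summary one word at a time.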
You can also look for publications on text summarization. For example, Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond is a good starting point. If you need to understand the basics of the underlying techniques, follow the references in that paper to find further useful resources.
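As a baseline before any deep learning, the "most impactful words" idea above can be sketched as a frequency-based extractive summarizer: score each sentence by the corpus frequency of its words and keep the top scorers. This scoring scheme is my own simplification for illustration, not the method of the cited paper:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Score each sentence by the total frequency of its words,
    then return the top-scoring sentences in original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    scored = sorted(range(len(sentences)),
                    key=lambda i: -sum(freq[w] for w in
                                       re.findall(r'[a-z]+', sentences[i].lower())))
    keep = sorted(scored[:n_sentences])  # restore original sentence order
    return ' '.join(sentences[i] for i in keep)

text = ("Deep learning models dominate NLP. "
        "Deep learning models need labeled data. "
        "The weather was nice yesterday.")
print(extractive_summary(text, 1))  # picks a "deep learning" sentence
```

This is purely extractive, so it can only reuse the author's own sentences; abstractive seq2seq models like the one in the paper above generate new sentences instead.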