5

I take a lot of memos by recording my voice with my Android smartphone. The recordings can be a quick note or a long dictation, so they vary a lot in size. Depending on the app I use, the audio is saved as either a WAV or an MP3 file.

What I want to do is take these voice memos and dictations and convert them into text files.

I found this method that theoretically uses pavucontrol to pipe audio playback into Google Chrome's voice-to-text API, but I can't get it to work. I followed the instructions and don't get any errors; I just don't see any text appear in the Chrome interface. In any case, it's not much better than holding my phone up to my laptop microphone. I was hoping for something where I wouldn't have to hear the audio while it is being converted to text, as I might do this on my laptop while I'm out at a coffee shop or somewhere similar.

Ideally, there would be software where I could load a batch of sound files, and it would output a batch of text files, one for each audio file.

Does any software or method for this exist on Ubuntu?

Questioner
  • 6,839
  • I haven't looked into this in a while, but it seems like the best (read: usable) speech-to-text algorithms are proprietary or patented, which makes this hard to do on Linux. Sphinx http://cmusphinx.sourceforge.net started along this road, but seemingly got derailed. Dragon makes a recorder app for Android, but the recordings then have to be fed into their proprietary app on Windows. – Joe Aug 11 '16 at 07:18
  • The method mentioned in the question works, but Google apparently has put a very very short limit (I could get only 22 English words transcribed) to ensure it's only used for demo purposes. In that case, it seems a mobile phone + pc combination is all there is at the moment... – Sadi Dec 14 '17 at 10:25
  • Did you use Chromium or Google Chrome? I could get the pavucontrol method to work in Chrome, but not Chromium. – dgo.a Dec 16 '17 at 17:53

4 Answers

2

Try Mozilla DeepSpeech, an open-source tool for automatic transcription. It needs a trained model: you can either download Mozilla's pre-trained model or use Mozilla's voice datasets to create your own. You can use it for recordings in English. For very clear recordings, the accuracy is relatively good. For my transcription projects, however, it was still not sufficient, because the recordings were of poor quality and had a lot of background noise. I used Transcribear instead. It is a web-based editor that offers automatic transcription, but you need to be online to upload recordings to the Transcribear server.
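As a rough sketch of what running DeepSpeech on a phone recording could look like: the pre-trained English model expects 16 kHz mono WAV input, so an MP3 has to be converted first, here with ffmpeg. The function name deepspeech_to_text and the MODEL and SCORER variables are made up for this example, and the default file names assume the 0.9.3 release files sit in the current directory; adjust them to whatever you actually downloaded.

```shell
# Hypothetical helper, not part of DeepSpeech itself.
# MODEL and SCORER are assumed paths to the pre-trained model files.
MODEL="${MODEL:-deepspeech-0.9.3-models.pbmm}"
SCORER="${SCORER:-deepspeech-0.9.3-models.scorer}"

deepspeech_to_text() {
    src="$1"
    wav="${src%.*}.16k.wav"   # converted copy next to the original
    txt="${src%.*}.txt"       # transcript next to the original

    # Resample to 16 kHz mono WAV, then write the transcript
    # (which deepspeech prints on standard output) to the text file.
    ffmpeg -loglevel error -y -i "$src" -ar 16000 -ac 1 "$wav" &&
        deepspeech --model "$MODEL" --scorer "$SCORER" --audio "$wav" > "$txt"
}

# Example: deepspeech_to_text memo.mp3   # writes memo.16k.wav and memo.txt
```

This assumes both ffmpeg and the deepspeech command-line client are installed and on your PATH.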

karel
  • 114,770
John
  • 71
  • 1
    Mozilla DeepSpeech hasn't seen any development ever since Mozilla fired the DeepSpeech team. See this issue for more details: https://github.com/mozilla/DeepSpeech/issues/3693 – Flimm Aug 23 '23 at 13:05
2

You can use OpenAI Whisper.

A volunteer named Gael LeGoff has packaged OpenAI Whisper for Snap. To install OpenAI Whisper using Snap, run:

sudo snap install whisper-gael

Now, to convert an audio file named audio.mp3 to text, run:

whisper-gael.whisper --model small --output_format txt --task transcribe audio.mp3

For better results, you can use the bigger models. The models to choose from are: tiny, base, small, medium, large, as well as tiny.en, base.en, small.en, and medium.en.
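To get the batch behaviour asked for in the question (one text file per recording), the command above can be wrapped in a loop. This is only a sketch: transcribe_dir and WHISPER_CMD are names invented for this example, and it assumes the Snap build accepts --output_dir the same way the upstream whisper command does.

```shell
# Illustrative names: WHISPER_CMD and transcribe_dir are not part of Whisper.
WHISPER_CMD="${WHISPER_CMD:-whisper-gael.whisper}"

# Transcribe every .wav and .mp3 file in a directory (default: current one),
# writing one .txt file per recording into the same directory.
transcribe_dir() {
    dir="${1:-.}"
    for f in "$dir"/*.wav "$dir"/*.mp3; do
        [ -e "$f" ] || continue   # skip a pattern that matched no files
        "$WHISPER_CMD" --model small --output_format txt --task transcribe \
            --output_dir "$dir" "$f"
    done
}

# Example: transcribe_dir "$HOME/memos"
```

The upstream whisper command also accepts several audio files in a single call, so whisper-gael.whisper --model small --output_format txt *.mp3 may be enough on its own and avoids reloading the model for every file.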

Flimm
  • 41,766
0

AutoSub is an open-source Python script that generates subtitle files (.srt and .vtt) and a .txt transcript for any video file, using either Mozilla DeepSpeech or Coqui STT. It runs inference on audio segments with open-source models and uses pyAudioAnalysis to split the initial audio at silent segments, producing multiple smaller files, which makes inference easier.

The main developer has also published an article about his work called: Generate Subtitles for any video file using Mozilla DeepSpeech.

Bob Ortiz
  • 101
0

You can use Speech Note.

Note taking, reading and translating with Speech to Text, Text to Speech and Machine Translation

Speech Note lets you take, read and translate notes in multiple languages. It uses Speech to Text, Text to Speech and Machine Translation to do so. Text and voice processing take place entirely offline, locally on your computer, without using a network connection. Your privacy is always respected. No data is sent to the Internet.

Installation

Probably the easiest way to install it on Ubuntu is to get it from Flathub:

Download on Flathub

Flimm
  • 41,766