For questions about audio processing tasks in the context of artificial intelligence.
Questions tagged [audio-processing]
38 questions
9
votes
1 answer
Is it possible to clean up an audio recording of a lecture using some type of AI system?
Is it possible to clean up an audio recording of a lecture from a smartphone (i.e. remove the background noise) using some type of AI system?

Thibault Molleman
5
votes
1 answer
How can I find a specific word in an audio file?
I'm trying to train and use a neural network to detect a specific word in an audio file. The input of the neural network is an audio clip 2-3 seconds long, and the neural network must determine whether the input audio (the voice of a person)…

Ali.kavari76
3
votes
2 answers
Can AI be used to reverse engineer a black box?
A while back I posted on the Reverse Engineering site about an audio DSP system whose designer had passed away and whose manufacturer no longer had source code (but the question was deleted). Basically, the audio filter settings are passed from a…

chmedly
2
votes
1 answer
Can I filter barking sounds on the television?
My dog goes bonkers every time the sound of a barking dog is heard on a television program. I never noticed this before but literally every movie or show with an outdoors setting eventually includes the sound of a barking dog.
Is it possible to…

AlanD
2
votes
0 answers
How to prepare audio data for deep learning?
Audio data is typically an array with the waveform represented by values from -1 to 1. There are two issues with that:
if all values are inverted, e.g. -1 becomes 1 and 1 becomes -1, the audio doesn't change. But if for example I need to find…
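
The polarity point in this excerpt can be checked numerically. The following is a minimal sketch, assuming a synthetic sine wave and a naive DFT written for illustration (`magnitude_spectrum`, `wave`, and `inverted` are hypothetical names, not from the question). It shows that flipping the sign of every sample changes the raw array but leaves the magnitude spectrum untouched, which is one possible polarity-insensitive representation and not necessarily what the asker had in mind.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Return |DFT| of a short real-valued signal (naive O(n^2) DFT)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


# Hypothetical test tone: 64 samples of a sine wave scaled into [-1, 1].
wave = [0.8 * math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
inverted = [-x for x in wave]  # flip polarity: -1 <-> 1

original_mag = magnitude_spectrum(wave)
inverted_mag = magnitude_spectrum(inverted)

# The sample arrays differ, yet the magnitude spectra agree up to rounding.
max_diff = max(abs(a - b) for a, b in zip(original_mag, inverted_mag))
peak_bin = original_mag.index(max(original_mag))
```

A model fed `original_mag` instead of `wave` would therefore see identical inputs for a recording and its inverted copy, at the cost of discarding phase.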

nikishev.
2
votes
2 answers
Is it realistic to train a transformer-based model (e.g. GPT) in a self-supervised way directly on the Mel spectrogram?
In music information retrieval, one usually converts an audio signal into some kind of "sequence of frequency-vectors", such as an STFT or a Mel spectrogram.
I'm wondering if it is a good idea to use the transformer architecture in a self-supervised manner…
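
To make the "sequence of frequency-vectors" idea concrete, here is a minimal sketch of a short-time Fourier transform using only the standard library. The frame length, hop size, and the function name `stft_magnitudes` are assumptions chosen for illustration. The sketch stops at STFT magnitudes; a Mel spectrogram would additionally pass each vector through a Mel filterbank, which is omitted here.

```python
import cmath
import math


def stft_magnitudes(samples, frame_len=32, hop=16):
    """Split a signal into overlapping Hann-windowed frames and return one
    magnitude vector (frame_len // 2 + 1 bins) per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / frame_len)
                    for i, x in enumerate(frame)))
            for k in range(frame_len // 2 + 1)
        ])
    return frames


# Hypothetical input: a pure tone sitting exactly on DFT bin 4 of each frame.
tone = [math.sin(2 * math.pi * 4 * t / 32) for t in range(128)]
sequence = stft_magnitudes(tone)

# Index of the strongest frequency bin in every frame.
peak_bins = [vec.index(max(vec)) for vec in sequence]
```

Each element of `sequence` is one time step, so the result has the same shape as a token sequence: a list of fixed-size vectors that a sequence model could consume after a suitable projection.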

Peter Franek
2
votes
0 answers
Model for direct audio-to-audio speech re-encoding
There are many resources available for text-to-audio (or vice versa) synthesis, for example Google's 'WaveNet'.
These tools do not allow the finer degree of control that may be required regarding the degree of inflections / tonality retained in…

NeverWasMyRealName
2
votes
1 answer
I want to determine how similar a given song is to Queen's songs. Am I headed in the right direction?
I've asked this question before (@ Reddit) and people suggested CNNs on a mel spectrogram more than anything else. This is great.
But I'm sort of stuck at: label some music data as "queen" and "not queen" and have this be the training set. Like,…

Mike Johnson Jr
2
votes
1 answer
How to get more accuracy of the logistic regression model?
I am working on a Baby Crying Detection model using logistic regression.
Out of $581$ audio clips, $222$ are of a baby crying. Each clip is $5$ seconds long.
What I have done is convert each audio clip into numbers, and those numbers go into a .csv file. So…
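
The counts given in this excerpt already fix a useful reference point for any accuracy figure. The short calculation below is a sketch using only the three numbers stated in the question; the variable names are hypothetical, and the "always predict not crying" baseline is an added illustration, not something the asker describes.

```python
total_clips = 581    # from the question
crying_clips = 222   # from the question
clip_seconds = 5     # from the question

other_clips = total_clips - crying_clips
positive_rate = crying_clips / total_clips

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(crying_clips, other_clips) / total_clips
total_minutes = total_clips * clip_seconds / 60

print(f"not-crying clips: {other_clips}")
print(f"positive rate: {positive_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"total audio: {total_minutes:.1f} min")
```

Because the classes are imbalanced, a model's accuracy is only informative once it is compared against `majority_baseline`.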

Muhammad Waqar Anwar
2
votes
0 answers
How do I train a multiple-speaker model (speech synthesis) based on Tacotron 2 and espnet?
I'm new to Speech Synthesis & Deep Learning. Recently, I got a task as described below:
I have a problem training a multi-speaker model that should be built with Tacotron 2. I was told I can get some ideas from espnet, which is an end-to-end…

Envelo Lee
2
votes
0 answers
State of the art in voice recognition
In the media there's a lot of talk about face recognition, mainly with respect to identifying faces (= assigning to persons). Less attention is paid to the recognition of facially expressed emotions but there's a lot of research done into this…

Hans-Peter Stricker
2
votes
1 answer
How to use AI for language recognition?
Given an audio track, I'm trying to find a way to recognize the language being spoken, but only within a small set (e.g. English vs Spanish). Is there a simple solution to detect the language of speech?

Tina J
1
vote
1 answer
How to combine input from different types of data sources?
I have to train a neural network using microphone data (wav files), accelerometer sensor data and light sensor data.
Right now, the approach I have thought of is to convert all data into images and combine them into a single image and train my neural…

Aravind
1
vote
1 answer
Difficulty understanding Keras LSTM fitting data
I'm trying to train an RNN with a chunk of audio data, where X and Y are two audio channels loaded into numpy arrays. The objective is to experiment with different NN designs to train them to transform single channel (mono) audio into a two channel…

Dmitry
1
vote
1 answer
What type of neural network architecture allows filtering out of unwanted sounds?
I have a use case where I will be inputting audio to a model, and the output of the model will be the same audio except with certain sounds removed (volume set to zero). The dataset is generated by taking an audio file, duplicating it, and then…

HonestMath