
In most machine/deep learning approaches to speech compression that I have seen, the audio file is first converted into a mel spectrogram, and this spectrogram is then analyzed.

Is the compression applied to this mel spectrogram? How is it possible to compress audio using an image representation of it? Is the idea that the audio is represented in the frequency domain, with higher frequencies shown in brighter colors in the image, so that, as in JPEG compression, we can remove the high-frequency components that humans can't hear?

nbro
Nervous Hero

1 Answer


A mel spectrogram is an array of numbers, not light or photons. The image is only a plot of that array; the essence of the information is the numbers.

Compression implies that the data can be arranged so that you retain the meaningful parts and throw away the bulk of the meaningless parts. This suggests that, in the mel spectrogram domain, truncating by intensity or reducing the domain lets a decent signal be stored in far fewer bits.
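
As a rough illustration of that idea, here is a minimal sketch that treats a mel spectrogram as a plain grid of decibel values and shrinks it in two ways: dropping the highest bands (reducing the domain) and clipping and quantizing the intensities. The function names, the toy values, the -60 dB floor, and the 4-bit depth are all assumptions chosen for the example. This is simple lossy quantization, not how a learned speech codec works.

```python
def compress_mel(mel_db, floor_db=-60.0, keep_bands=None, bits=4):
    """Truncate and quantize a mel spectrogram.

    mel_db: list of mel bands (low to high), each a list of per-frame
            values in dB relative to the peak (so values are <= 0).
    """
    levels = (1 << bits) - 1          # largest integer code, e.g. 15 for 4 bits
    span = -floor_db                  # dynamic range kept, assuming a 0 dB peak
    bands = mel_db if keep_bands is None else mel_db[:keep_bands]
    codes = []
    for band in bands:
        row = []
        for value in band:
            # Intensity truncation: anything quieter than the floor is clipped.
            clipped = min(0.0, max(floor_db, value))
            row.append(round((clipped - floor_db) * levels / span))
        codes.append(row)
    return codes


def decompress_mel(codes, floor_db=-60.0, bits=4):
    """Map integer codes back to approximate dB values."""
    levels = (1 << bits) - 1
    span = -floor_db
    return [[floor_db + c * span / levels for c in row] for row in codes]


# Toy spectrogram: 4 mel bands x 3 frames (hypothetical values).
mel_db = [
    [-3.0, -12.0, -20.0],
    [-15.0, -32.0, -45.0],
    [-52.0, -70.0, -90.0],
    [-80.0, -85.0, -95.0],
]

codes = compress_mel(mel_db, keep_bands=3, bits=4)
recon = decompress_mel(codes, bits=4)

original_bits = len(mel_db) * len(mel_db[0]) * 32    # 32-bit floats
compressed_bits = len(codes) * len(codes[0]) * 4     # 4-bit codes
print(codes)
print(original_bits, "->", compressed_bits, "bits")
```

The reconstruction is only approximate: quiet values collapse to the floor and the dropped band is gone, which is the trade a lossy scheme makes for the smaller size.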

If you only intend to do analysis within the mel spectrogram domain, then retaining only a portion of the data in that format sounds useful.

EngrStudent