In most of the machine/deep-learning speech compression work I have seen, the audio file is first converted into a mel spectrogram, and this spectrogram is then what gets analyzed.
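For context, this is roughly the kind of representation I mean. A minimal numpy-only sketch (a plain STFT magnitude spectrogram rather than a true mel spectrogram, which would additionally apply a mel filterbank, e.g. via librosa):

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=512, hop=128):
    # Slice the signal into overlapping frames and apply a Hann window.
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    # FFT each frame; keep magnitudes of the positive frequencies only.
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

sr = 16000
t = np.arange(sr) / sr            # 1 second of audio
x = np.sin(2 * np.pi * 440 * t)   # a pure 440 Hz tone as a toy signal
S = magnitude_spectrogram(x)
print(S.shape)                    # (freq_bins, time_frames) — a 2-D "image"
```

The result is a 2-D array (frequency bins by time frames), which is why it is often treated and visualized like an image.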
Do we apply the compression process to this mel spectrogram itself? How is it possible to compress audio using an image representation of it? Is the idea that we represent the audio in the frequency domain, with higher frequencies shown as brighter pixels in the image? And then, as in JPEG compression, can we discard the high-frequency components that humans can't hear?
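To make my hypothesis concrete, here is a toy illustration of the scheme I am imagining (this is just my guess at how it might work, not an actual codec): treat the spectrogram as a 2-D array and simply drop the highest-frequency rows, the way JPEG discards high spatial frequencies.

```python
import numpy as np

# Toy spectrogram: 128 frequency bins (rows) x 200 time frames (columns).
rng = np.random.default_rng(0)
S = rng.random((128, 200))

# Hypothetical "compression": keep only the lowest 96 frequency bins
# and throw away the top 32, on the assumption they are inaudible.
keep = 96
S_compressed = S[:keep, :]

print(S_compressed.shape)   # fewer rows -> fewer values to store
```

Is something along these lines what actually happens, or do learned codecs compress the spectrogram in a fundamentally different way?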