Mel Spectrogram
A Mel spectrogram is a 2D representation of audio frequencies on the Mel scale, which approximates human hearing – the standard input for modern speech and audio AI models, from Whisper to TTS and music generation.
Explanation
Audio is decomposed into frequency bins via the short-time Fourier transform (STFT), projected onto the Mel scale (which approximates human pitch perception), and log-scaled. The result is a 2D "image" that CNNs or Transformers can process.
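A minimal sketch of that pipeline using librosa (an assumption – any STFT/Mel implementation works); the parameter values are illustrative, not tied to a specific model:

```python
import librosa
import numpy as np

# Load a bundled example clip (mono, resampled to 16 kHz).
y, sr = librosa.load(librosa.example("trumpet"), sr=16000)

# STFT -> Mel projection -> power spectrogram, in one call.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=400,        # STFT window length
    hop_length=160,   # step between frames (10 ms at 16 kHz)
    n_mels=80,        # number of Mel frequency bands
)

# Log scaling (dB) compresses the dynamic range toward human loudness perception.
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (80, n_frames) -- a 2D "image" of the audio
```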
Marketing Relevance
Virtually every modern audio ML system (Whisper, TTS, music generation) uses Mel spectrograms as an intermediate representation.
Common Pitfalls
Information is lost along the way: the magnitude spectrogram discards phase, and the Mel projection discards fine frequency detail. Parameters (n_mels, hop_length, n_fft) must match those the model was trained with. Converting back to audio requires a vocoder (or an approximation such as Griffin-Lim).
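As a rough illustration of both pitfalls, the sketch below reuses one hypothetical parameter set for extraction and back-conversion, and falls back to librosa's Griffin-Lim-based inversion where a production system would use a neural vocoder:

```python
import librosa

# Hypothetical parameter set of a target model -- reuse exactly these values
# at inference time, otherwise the input distribution shifts.
MODEL_PARAMS = dict(sr=16000, n_fft=400, hop_length=160, n_mels=80)

y, _ = librosa.load(librosa.example("trumpet"), sr=MODEL_PARAMS["sr"])
mel = librosa.feature.melspectrogram(
    y=y,
    sr=MODEL_PARAMS["sr"],
    n_fft=MODEL_PARAMS["n_fft"],
    hop_length=MODEL_PARAMS["hop_length"],
    n_mels=MODEL_PARAMS["n_mels"],
)

# Back-conversion: Griffin-Lim can only estimate the lost phase, so the result
# sounds degraded; TTS systems use a neural vocoder (e.g. HiFi-GAN) instead.
y_approx = librosa.feature.inverse.mel_to_audio(
    mel,
    sr=MODEL_PARAMS["sr"],
    n_fft=MODEL_PARAMS["n_fft"],
    hop_length=MODEL_PARAMS["hop_length"],
)
```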
Origin & History
The Mel scale was proposed in 1937 by Stevens, Volkmann & Newman. MFCCs dominated speech recognition from roughly 1980 to 2015; Mel spectrograms replaced them as the standard deep learning input from around 2016 (Tacotron, WaveNet).
Comparisons & Differences
Mel Spectrogram vs. MFCC
MFCCs further compress Mel spectrograms via a discrete cosine transform (DCT) and keep only the first few coefficients; deep learning models generally work better with the full Mel spectrogram, as sketched below.
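The relationship can be made concrete with librosa (assumed here): its MFCCs are, for identical parameters, a truncated DCT of the log-Mel spectrogram.

```python
import librosa
import numpy as np
import scipy.fftpack

y, sr = librosa.load(librosa.example("trumpet"), sr=16000)

# Log-Mel spectrogram (librosa defaults except n_mels).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)

# Path 1: librosa computes MFCCs from the log-Mel spectrogram internally...
mfcc_direct = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=80)

# ...Path 2: apply the DCT to log_mel ourselves and keep the first 13 rows.
mfcc_manual = scipy.fftpack.dct(log_mel, axis=0, type=2, norm="ortho")[:13]

print(np.allclose(mfcc_direct, mfcc_manual))  # should print True for matching parameters
```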
Mel Spectrogram vs. Raw Waveform
Raw waveforms are 1D signals; Mel spectrograms are 2D representations that make frequency patterns visible.
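A small sketch of the difference in shape (librosa assumed; the ratio between sample count and frame count is roughly the hop length):

```python
import librosa

y, sr = librosa.load(librosa.example("trumpet"), sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, hop_length=160)

print(y.shape)    # 1D waveform: (n_samples,), one amplitude value per sample
print(mel.shape)  # 2D Mel spectrogram: (80, n_frames), frequency x time
```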