Vocoder
A vocoder converts Mel spectrograms or other acoustic features into audible audio waveforms – the final step in TTS pipelines.
HiFi-GAN and BigVGAN are widely used neural vocoders for natural-sounding speech synthesis.
Explanation
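Neural vocoders such as HiFi-GAN (GAN-based), WaveGlow (flow-based), and BigVGAN (GAN-based) generate high-quality audio from Mel spectrograms. A Mel spectrogram stores only magnitudes on a compressed frequency scale, so the vocoder has to learn to reconstruct the missing phase information.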
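To make the "missing phase" point concrete, here is a minimal sketch using only the Python standard library. It is not a vocoder: it takes a plain DFT of a single short frame (no Mel filterbank), keeps only the magnitudes, and compares a reconstruction with the true phase against one with the phase set to zero. All names (`dft`, `idft`, `frame`, and so on) are illustrative assumptions.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

n = 64
# Toy frame: two sinusoids, the second with a phase offset of 1 radian.
frame = [math.sin(2 * math.pi * 4 * t / n) + 0.5 * math.sin(2 * math.pi * 9 * t / n + 1.0)
         for t in range(n)]

spectrum = dft(frame)
magnitude = [abs(c) for c in spectrum]  # what a spectrogram keeps

# Reconstruction with the true phase recovers the frame (up to rounding).
err_exact = max_abs_error(frame, idft(spectrum))

# Reconstruction from magnitudes alone (phase assumed zero) does not.
err_zero_phase = max_abs_error(frame, idft([complex(m, 0.0) for m in magnitude]))

print(f"error with true phase: {err_exact:.2e}")
print(f"error with zero phase: {err_zero_phase:.2f}")
```

The magnitudes alone describe which frequencies are present but not how they line up in time, which is the gap a neural vocoder learns to fill.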
Marketing Relevance
Because the vocoder produces the final waveform, its quality directly determines how natural a TTS voice sounds. HiFi-GAN is widely regarded as the de facto standard for real-time synthesis.
Common Pitfalls
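Out-of-distribution input (for example, unseen speakers or recording conditions) can cause audible artifacts. The Mel spectrograms fed to the vocoder must match the format used in its training data, including sample rate, FFT size, hop length, and number of Mel bins. Real-time synthesis usually requires a GPU, especially for larger models.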
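The format-mismatch pitfall can be caught before synthesis with a simple configuration check. The sketch below is an assumption-laden illustration: `MelConfig`, `config_mismatches`, and the parameter values are hypothetical and do not come from any particular vocoder library.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class MelConfig:
    """Hypothetical container for the settings that define a Mel spectrogram."""
    sample_rate: int
    n_fft: int
    hop_length: int
    n_mels: int
    fmin: float
    fmax: float

def config_mismatches(trained, inference):
    """List every field where the inference-time config differs from training."""
    return [
        f"{f.name}: vocoder expects {getattr(trained, f.name)}, got {getattr(inference, f.name)}"
        for f in fields(trained)
        if getattr(trained, f.name) != getattr(inference, f.name)
    ]

# Illustrative values only: the acoustic model uses a different upper frequency.
vocoder_cfg = MelConfig(22050, 1024, 256, 80, 0.0, 8000.0)
acoustic_cfg = MelConfig(22050, 1024, 256, 80, 0.0, 11025.0)

problems = config_mismatches(vocoder_cfg, acoustic_cfg)
for line in problems:
    print(line)
```

Failing fast on a non-empty `problems` list is cheaper than debugging the muffled or noisy audio a silent mismatch tends to produce.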
Origin & History
The vocoder was invented in 1938 by Homer Dudley at Bell Labs. WaveNet (DeepMind, 2016) marked the start of neural vocoders. WaveRNN (2018) and HiFi-GAN (2020) made them real-time capable, and BigVGAN (2023) improved robustness to unseen speakers and recording conditions.
Comparisons & Differences
Vocoder vs. WaveNet
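WaveNet was the first neural vocoder. It is autoregressive, generating one audio sample at a time, which makes it slow. HiFi-GAN is a GAN-based vocoder that generates all samples in parallel, which enables real-time synthesis.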
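"Real-time" in comparisons like this one is commonly quantified with the real-time factor (RTF): synthesis time divided by the duration of the audio produced. A value below 1 means the vocoder is faster than playback. The sketch below is a minimal illustration; the function name and the timing numbers are assumptions, not measurements of any model.

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """Return synthesis time divided by the duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# Hypothetical timings: 2 seconds of audio at 22,050 Hz.
num_samples = 44100
sample_rate = 22050

rtf_fast = real_time_factor(0.5, num_samples, sample_rate)   # took 0.5 s
rtf_slow = real_time_factor(60.0, num_samples, sample_rate)  # took 60 s

print(f"fast model RTF: {rtf_fast}")  # below 1: faster than real time
print(f"slow model RTF: {rtf_slow}")  # above 1: slower than real time
```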
Vocoder vs. Diffusion-based TTS
Diffusion-based TTS models such as Grad-TTS generate Mel spectrograms from text; a vocoder then converts those Mel spectrograms into audio as a separate step.