    Artificial Intelligence

    Vocoder

    Also known as:
    Neural Vocoder
    Waveform Generator
    Updated: 2/10/2026

    A vocoder converts Mel spectrograms or other acoustic features into audio waveforms. It is typically the final stage of a text-to-speech (TTS) pipeline.

    Quick Summary

    Vocoders convert Mel spectrograms into audio waveforms. HiFi-GAN and BigVGAN are widely used choices for natural-sounding speech synthesis.

    Explanation

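    Neural vocoders (HiFi-GAN, WaveGlow, BigVGAN) generate high-quality audio from Mel spectrograms. A Mel spectrogram stores only magnitudes in Mel-scaled frequency bands, so the phase of the original signal is discarded. The vocoder learns to reconstruct this missing phase information.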
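
    The loss of phase information can be illustrated with a small sketch. It uses a naive DFT from the Python standard library rather than a real Mel pipeline, and the test frame is an arbitrary made-up signal. A frame and its time-reversed copy are clearly different waveforms, yet their magnitude spectra are the same, which is why magnitudes alone do not determine the audio.

```python
import cmath
import math

def dft(signal):
    """Naive O(N^2) discrete Fourier transform; fine for tiny demo inputs."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def magnitudes(signal):
    """Keep only the magnitude of each frequency bin, discarding phase."""
    return [abs(c) for c in dft(signal)]

# Arbitrary illustrative frame: a decaying, chirp-like signal (an assumption,
# not data from any real recording).
frame = [math.exp(-0.2 * t) * math.sin(0.3 * t * t) for t in range(32)]
reversed_frame = frame[::-1]

mag_a = magnitudes(frame)
mag_b = magnitudes(reversed_frame)

# Largest difference between the two magnitude spectra (numerically ~0).
max_mag_diff = max(abs(a - b) for a, b in zip(mag_a, mag_b))
# Largest difference between the two waveforms (clearly non-zero).
max_wave_diff = max(abs(a - b) for a, b in zip(frame, reversed_frame))

print(f"magnitude difference: {max_mag_diff:.2e}")
print(f"waveform difference:  {max_wave_diff:.3f}")
```

    Classical methods such as Griffin-Lim estimate a plausible phase iteratively; neural vocoders instead learn the mapping from data.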

    Marketing Relevance

    Vocoder quality strongly affects how natural TTS output sounds. HiFi-GAN is the de facto standard for real-time synthesis.

    Common Pitfalls

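    Vocoders can produce audible artifacts on out-of-distribution input, such as unseen speakers or recording conditions. The Mel spectrograms fed to the vocoder must use the same format as its training data (sample rate, number of Mel bands, hop length, frequency range). Larger models generally need a GPU for real-time synthesis.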
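
    A format mismatch is easy to check before synthesis. The sketch below is a minimal example, assuming a hypothetical `MelConfig` container and `config_mismatches` helper (neither comes from a specific library) and made-up parameter values; it compares the settings of an acoustic model with those the vocoder was trained on.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class MelConfig:
    """Hypothetical container for the settings that define a Mel spectrogram."""
    sample_rate: int
    n_fft: int
    hop_length: int
    win_length: int
    n_mels: int
    fmin: float
    fmax: float

def config_mismatches(acoustic: MelConfig, vocoder: MelConfig) -> dict:
    """Return {field: (acoustic_value, vocoder_value)} for every differing field."""
    return {
        f.name: (getattr(acoustic, f.name), getattr(vocoder, f.name))
        for f in fields(MelConfig)
        if getattr(acoustic, f.name) != getattr(vocoder, f.name)
    }

# Illustrative values only; real models document their own settings.
acoustic_cfg = MelConfig(22050, 1024, 256, 1024, 80, 0.0, 8000.0)
vocoder_cfg = MelConfig(22050, 1024, 256, 1024, 80, 0.0, 11025.0)

mismatches = config_mismatches(acoustic_cfg, vocoder_cfg)
print(mismatches)  # {'fmax': (8000.0, 11025.0)}
```

    An empty result means the two stages agree on the listed settings; any entry points to a setting that should be aligned before the vocoder is used.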

    Origin & History

    The vocoder was invented in 1938 by Homer Dudley at Bell Labs. WaveNet (DeepMind, 2016) marked the start of neural vocoders. Later models such as WaveRNN (2018), HiFi-GAN (2020), and BigVGAN (2023) made them real-time capable.

    Comparisons & Differences

    Vocoder vs. WaveNet

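    WaveNet was the first neural vocoder. It is autoregressive, generating one audio sample at a time, which makes inference slow. HiFi-GAN is a GAN-based model that generates samples in parallel, enabling real-time synthesis.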
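
    The cost of autoregressive generation and the meaning of "real-time" can be made concrete with a little arithmetic. This is a rough sketch: the 22,050 Hz sample rate and the timing figure are assumed example values, not measurements of any model, and the function names are illustrative.

```python
def sequential_steps_autoregressive(duration_s: float, sample_rate: int) -> int:
    """An autoregressive model emits one sample per step, each depending on the last."""
    return int(round(duration_s * sample_rate))

def real_time_factor(synthesis_time_s: float, audio_duration_s: float) -> float:
    """RTF below 1 means audio is produced faster than it plays back."""
    return synthesis_time_s / audio_duration_s

# One second of audio at an assumed 22,050 Hz needs 22,050 dependent steps.
steps = sequential_steps_autoregressive(duration_s=1.0, sample_rate=22050)

# Hypothetical timing: 0.25 s of compute for 1 s of audio.
rtf = real_time_factor(synthesis_time_s=0.25, audio_duration_s=1.0)

print(steps)  # 22050
print(rtf)    # 0.25
```

    Because each autoregressive step waits for the previous sample, those steps cannot be parallelized across time, whereas a non-autoregressive generator can produce all samples of an utterance in a single forward pass.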

    Vocoder vs. Diffusion-based TTS

    Diffusion-based TTS models such as Grad-TTS generate Mel spectrograms from text. A vocoder then converts those Mel spectrograms into audio as a separate step.
