
    Neural Audio Codec

    Also known as:
    EnCodec
    Audio Tokenizer
    SoundStream
    Updated: 2/10/2026

    Neural Audio Codecs compress audio into discrete tokens – the bridge between audio and language models that enables music and speech generation.

    Quick Summary

    Neural Audio Codecs (EnCodec, SoundStream) convert audio into discrete tokens – the foundation for LLM-based music and speech generation.

    Explanation

    EnCodec (Meta) and SoundStream (Google) use an encoder-decoder architecture with Residual Vector Quantization (RVQ): the encoder compresses the waveform into latent frames, a stack of quantizers maps each frame to discrete codebook indices, and the decoder reconstructs audio from those indices. The resulting token sequences can be processed by LLMs much like text.
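    The quantization step can be illustrated with a minimal NumPy sketch (not the actual EnCodec or SoundStream code; codebook sizes and stage count are arbitrary illustration values). Each RVQ stage picks the nearest codebook entry and passes the leftover residual to the next stage:

    ```python
    import numpy as np

    def rvq_encode(x, codebooks):
        """Residual Vector Quantization: each codebook quantizes
        the residual left over by the previous stage."""
        residual = x.copy()
        indices = []
        for cb in codebooks:                          # cb: (K, D) codebook
            dists = np.linalg.norm(residual[None, :] - cb, axis=1)
            idx = int(np.argmin(dists))               # nearest code in this stage
            indices.append(idx)
            residual = residual - cb[idx]             # next stage sees the residual
        return indices

    def rvq_decode(indices, codebooks):
        """Reconstruction is the sum of the selected codes."""
        return sum(cb[i] for i, cb in zip(indices, codebooks))

    rng = np.random.default_rng(0)
    D, K, n_stages = 8, 16, 4                         # toy sizes for illustration
    codebooks = [rng.normal(size=(K, D)) for _ in range(n_stages)]
    x = rng.normal(size=D)                            # one latent frame
    tokens = rvq_encode(x, codebooks)                 # e.g. 4 discrete indices
    x_hat = rvq_decode(tokens, codebooks)             # approximate reconstruction
    ```

    In a real codec the codebooks are learned jointly with the encoder and decoder, and each latent frame becomes a short stack of integer tokens rather than a vector of floats.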

    Marketing Relevance

    Enables audio language models: without audio tokenization, LLMs could not generate music or speech at all. Neural codecs are the foundation for MusicGen, VALL-E, and AudioPaLM.

    Common Pitfalls

    Very low bitrates cause audible quality loss; adding RVQ codebooks improves fidelity but increases token count and latency; and poorly trained quantizers suffer codebook collapse, where only a few codes are ever used.
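    The bitrate/depth tradeoff is simple arithmetic: bits per second = codebooks per frame × bits per codebook index × frames per second. A quick sketch, using figures in the style of EnCodec's 24 kHz model (75 frames/s, 1024-entry codebooks):

    ```python
    import math

    def codec_bitrate(n_codebooks, codebook_size, frame_rate_hz):
        """Bits per second for an RVQ codec: each frame emits one
        log2(codebook_size)-bit index per codebook."""
        return n_codebooks * math.log2(codebook_size) * frame_rate_hz

    # Doubling the RVQ depth doubles the bitrate (and the token count an LLM must model):
    print(codec_bitrate(4, 1024, 75))   # 3000.0 bps -> 3 kbps
    print(codec_bitrate(8, 1024, 75))   # 6000.0 bps -> 6 kbps
    ```

    This is why codec design is a balancing act: more codebooks mean better fidelity but longer token sequences for the downstream language model.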

    Origin & History

    SoundStream (Google, 2021) and EnCodec (Meta, 2022) established neural audio compression. These codecs enabled AudioLM (2022), MusicGen (2023), and VALL-E (2023), the first generation of LLM-based audio models.

    Comparisons & Differences

    Neural Audio Codec vs. Traditional Codec (MP3, AAC)

    Traditional codecs compress using hand-designed psychoacoustic rules and output a bitstream meant only for playback; neural codecs learn the compression end to end and emit discrete tokens that downstream models can consume.

    Neural Audio Codec vs. Mel Spectrogram

    Mel spectrograms are continuous 2D representations; neural codec tokens are discrete and processable by LLMs.
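    The difference is visible in the data types alone. A toy sketch with made-up shapes (80 mel bins and 8 codebooks are common illustration values, not a specific model's configuration):

    ```python
    import numpy as np

    frames = 100
    # Mel spectrogram: continuous floats, no finite vocabulary to embed.
    mel = np.random.rand(80, frames).astype(np.float32)
    # Codec output: small integers, each drawn from a fixed codebook,
    # so every column maps directly onto LLM token embeddings.
    tokens = np.random.randint(0, 1024, size=(8, frames))
    ```

    A vocoder is needed to turn a mel spectrogram back into audio, whereas codec tokens decode through the codec's own decoder.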

