
    Wav2Vec

    Also known as:
    Wav2Vec 2.0
    Self-Supervised Speech
    Meta Speech Model
    Updated: 2/10/2026

    Wav2Vec is a self-supervised learning framework from Meta that learns speech representations directly from raw audio and achieves state-of-the-art ASR with minimal labeled data.

    Quick Summary

    Wav2Vec learns speech representations from raw audio without supervision, enabling ASR with minimal labeled data and making it ideal for low-resource languages.

    Explanation

    Wav2Vec 2.0 masks spans of the latent speech representations and learns contextualized vectors via a contrastive loss. The pre-trained model is then fine-tuned with a CTC loss on labeled data; as little as 10 minutes of labeled audio can suffice for usable ASR.
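    To make this concrete, here is a minimal inference sketch using the Hugging Face transformers library. The checkpoint name, the synthetic input, and the 16 kHz mono format are illustrative assumptions, not part of Wav2Vec itself.

    ```python
    import numpy as np
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # Assumed checkpoint: Wav2Vec 2.0 base, already fine-tuned with CTC
    # on 960 hours of LibriSpeech.
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # Stand-in input: one second of fake 16 kHz mono audio;
    # in practice this would be a real speech waveform.
    audio = np.random.randn(16_000).astype(np.float32)
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (batch, frames, vocab)

    # Greedy CTC decoding: most likely token per frame, then collapse
    # repeats and drop blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids)[0])
    ```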

    Marketing Relevance

    Wav2Vec democratizes ASR for low-resource languages: companies can build transcription for rare languages and dialects with minimal labeling effort.

    Example

    A company pre-trains Wav2Vec 2.0 on 1,000 hours of unlabeled Swiss German audio and fine-tunes it on just one hour of labeled data to obtain a dialect ASR system.
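    A hedged sketch of what that fine-tuning step could look like with transformers is shown below. For readability it reuses an English CTC checkpoint and a toy one-utterance batch; a real Swiss German setup would pre-train its own encoder and build a tokenizer from the dialect transcripts.

    ```python
    import numpy as np
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # Common practice with tiny labeled sets: freeze the convolutional
    # feature encoder and adapt only the transformer and the CTC head.
    model.freeze_feature_encoder()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Toy labeled pair; real fine-tuning iterates over ~1 hour of such pairs.
    audio = np.random.randn(16_000).astype(np.float32)
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    labels = processor(text="HELLO WORLD", return_tensors="pt").input_ids

    # Passing labels makes the model compute the CTC loss internally.
    loss = model(inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    ```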

    Common Pitfalls

    Pre-training requires substantial GPU resources. CTC decoding without a language model produces frequent errors (see the sketch below). Wav2Vec is also less robust to background noise than Whisper.
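    The decoding pitfall is commonly mitigated with beam search over an n-gram language model. Below is a sketch using the pyctcdecode library; the vocabulary, logits, and KenLM file path are all placeholders.

    ```python
    import numpy as np
    from pyctcdecode import build_ctcdecoder

    # Stand-in alphabet; in practice use the tokenizer's symbols in id
    # order, with the CTC blank represented as "".
    vocab = ["", " ", "a", "b", "c"]

    # Stand-in per-frame CTC log-probabilities of shape (frames, vocab);
    # in practice: model(...).logits[0].numpy() from a Wav2Vec 2.0 model.
    logits = np.log(np.full((50, len(vocab)), 1.0 / len(vocab), dtype=np.float32))

    decoder = build_ctcdecoder(
        labels=vocab,
        kenlm_model_path="swiss_german_4gram.arpa",  # placeholder; None disables the LM
    )

    # LM-rescored beam search markedly reduces the spelling errors that
    # plain greedy CTC decoding produces.
    print(decoder.decode(logits))
    ```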

    Origin & History

    Meta AI released Wav2Vec in 2019 and Wav2Vec 2.0 (Baevski et al., 2020). Wav2Vec 2.0 was the first to show that self-supervised pre-training is as effective for audio as BERT-style pre-training is for text. HuBERT (2021) and data2vec followed.

    Comparisons & Differences

    Wav2Vec vs. Whisper

    Wav2Vec is self-supervised (few labels needed); Whisper is supervised, trained on 680k hours of labeled audio.

    Wav2Vec vs. HuBERT

    Both are self-supervised; HuBERT uses offline clustering instead of contrastive loss and often achieves slightly better results.
