Wav2Vec
Wav2Vec is a self-supervised learning framework from Meta AI that learns speech representations directly from raw audio, enabling state-of-the-art ASR with minimal labeled data. This makes it especially attractive for low-resource languages and dialects.
Explanation
Wav2Vec 2.0 masks spans of the latent audio representation and trains a Transformer to identify the true quantized latent for each masked step via a contrastive loss. The pre-trained model is then fine-tuned with a CTC loss on labeled transcripts; in the original paper, as little as 10 minutes of labeled audio yielded usable ASR.
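The contrastive objective can be sketched in a few lines: for each masked time step, the context vector must pick out the true quantized target among sampled distractors. This is a minimal NumPy sketch, not the actual implementation; all function and parameter names here are hypothetical, and the real model uses learned quantization and samples distractors within an utterance.

```python
import numpy as np

def contrastive_loss(context, quantized, masked_idx,
                     num_distractors=5, temperature=0.1, seed=0):
    """Simplified InfoNCE-style loss as used in Wav2Vec 2.0 pre-training.

    For each masked step t, the context vector context[t] must identify the
    true quantized latent quantized[t] among distractors drawn from other
    masked steps. (Sketch only; hypothetical signature.)
    """
    rng = np.random.default_rng(seed)

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    losses = []
    for t in masked_idx:
        # candidate targets: the true latent plus K distractors
        others = [i for i in masked_idx if i != t]
        distractors = rng.choice(others, size=min(num_distractors, len(others)),
                                 replace=False)
        candidates = [t] + list(distractors)
        sims = np.array([cos(context[t], quantized[i]) / temperature
                         for i in candidates])
        # cross-entropy with the true target at index 0 (stable log-softmax)
        log_probs = sims - sims.max() - np.log(np.exp(sims - sims.max()).sum())
        losses.append(-log_probs[0])
    return float(np.mean(losses))

# Toy check: when context vectors nearly match their targets, the loss is small.
T, D = 50, 16
rng = np.random.default_rng(1)
ctx = rng.normal(size=(T, D))
qt = ctx + 0.01 * rng.normal(size=(T, D))   # targets close to context
masked = list(range(0, T, 2))
loss = contrastive_loss(ctx, qt, masked)
```

Minimizing this loss forces the Transformer to infer masked speech content from surrounding context, which is what makes the representations useful before any transcript is seen.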
Marketing Relevance
Democratizes ASR for low-resource languages: companies can build transcription for rare languages/dialects with minimal labeling.
Example
A company pre-trains Wav2Vec 2.0 on 1,000 hours of unlabeled Swiss German audio, then fine-tunes it with just one hour of labeled data to build a dialect ASR system.
Common Pitfalls
Pre-training requires substantial GPU resources. CTC decoding without a language model produces spelling and word errors. The model is also less robust to background noise than Whisper.
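The CTC decoding pitfall is easy to see in a sketch: greedy decoding takes each frame's argmax independently, collapses repeats, and drops blanks, with no language model to veto implausible character sequences. The vocabulary and token ids below are toy assumptions for illustration.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC decoding: collapse consecutive repeats, then drop blanks.

    Each frame is decoded independently, which is exactly why errors creep in
    without a language model to rescore the output.
    """
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# Toy vocabulary (hypothetical ids); 0 is the CTC blank token.
vocab = {8: "h", 5: "e", 12: "y"}

frames = [0, 8, 8, 0, 5, 5, 5, 0, 0, 12, 12]  # per-frame argmax ids
ids = ctc_greedy_decode(frames)
text = "".join(vocab[i] for i in ids)  # -> "hey"
```

Note that a blank between repeated ids keeps them distinct (e.g. "ll" survives), while repeats without a blank are merged; a misplaced blank therefore directly changes the transcript.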
Origin & History
Meta AI (then Facebook AI) released Wav2Vec in 2019 and Wav2Vec 2.0 (Baevski et al., 2020). Wav2Vec 2.0 was the first to show that self-supervised pre-training works for speech much as BERT does for text. HuBERT (2021) and data2vec followed.
Comparisons & Differences
Wav2Vec vs. Whisper
Wav2Vec is self-supervised and needs only a small amount of labeled data for fine-tuning; Whisper is trained with (weak) supervision on 680,000 hours of labeled audio.
Wav2Vec vs. HuBERT
Both are self-supervised; HuBERT uses offline clustering instead of contrastive loss and often achieves slightly better results.