HuBERT
HuBERT (Hidden-Unit BERT) is a self-supervised speech model from Meta AI that learns high-quality speech representations by predicting discrete cluster labels for masked audio frames.
HuBERT learns general-purpose audio representations through cluster prediction, forming a foundation for voice conversion, emotion detection, and other speech-processing tasks.
Explanation
HuBERT masks spans of audio frames and trains the model to predict discrete cluster labels for the masked positions, analogous to BERT's masked-token prediction for text. The targets are created iteratively: first by running K-Means on MFCC features, then by re-clustering the hidden representations of an earlier training iteration, so the pseudo-labels improve from round to round.
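The target-creation and masking steps above can be sketched in a few lines. This is a minimal illustration, not HuBERT's actual pipeline: the random array stands in for real MFCC features, and the cluster count and mask parameters are small stand-ins (the paper uses 100 clusters over 39-dim MFCCs in the first iteration and masks spans of 10 frames).

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for MFCC features of one utterance,
# shape (num_frames, num_coeffs).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 39))

# Step 1: K-Means over all frames yields discrete pseudo-labels
# that serve as prediction targets.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(features)
targets = kmeans.labels_  # shape (200,), values in [0, 10)

# Step 2: choose random span starts and mask a fixed-length span at each;
# the model is trained to predict `targets` only at masked positions.
mask = np.zeros(len(features), dtype=bool)
starts = rng.choice(len(features), size=int(0.08 * len(features)), replace=False)
for start in starts:
    mask[start:start + 10] = True
```

In the real model, the masked frames are replaced by a learned mask embedding before being fed to the Transformer, and the clustering is re-run on the model's own hidden states in later iterations.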
Marketing Relevance
Foundation for voice conversion, emotion recognition, and speaker verification. HuBERT features are often used as universal audio embeddings.
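As a sketch of the embedding use case: once frame-level HuBERT features are extracted (e.g., via a pretrained checkpoint such as torchaudio's HUBERT_BASE bundle), a simple speaker-similarity score can be computed by mean-pooling and comparing utterances. The arrays below are random stand-ins for real features, so the score itself is meaningless; the shapes and the pooling step are what the example illustrates.

```python
import numpy as np

# Hypothetical stand-ins for frame-level HuBERT features of two utterances,
# shape (num_frames, hidden_dim); hidden_dim is 768 for HuBERT-Base.
rng = np.random.default_rng(1)
emb_a = rng.normal(size=(120, 768))
emb_b = rng.normal(size=(150, 768))

def utterance_embedding(frames: np.ndarray) -> np.ndarray:
    """Mean-pool frame features into one fixed-size, L2-normalized vector."""
    v = frames.mean(axis=0)
    return v / np.linalg.norm(v)

# Cosine similarity of the pooled embeddings; in a real verification system
# the accept/reject threshold would be tuned on held-out data.
score = float(utterance_embedding(emb_a) @ utterance_embedding(emb_b))
```

Mean-pooling is the simplest readout; downstream systems often instead feed the frame-level features into a task-specific classifier.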
Common Pitfalls
Iterative clustering increases training cost, since targets must be regenerated between training rounds. HuBERT is less robust to noisy audio than Whisper, which was trained on large-scale labeled data. And because HuBERT is an encoder-only model, a task-specific head (e.g., a CTC layer for ASR) must be fine-tuned separately.
Origin & History
Hsu et al. (Meta AI, 2021) introduced HuBERT, which matched or surpassed wav2vec 2.0 on several speech benchmarks. HuBERT-Soft and ContentVec later extended it for voice conversion systems such as RVC and so-vits-svc.
Comparisons & Differences
HuBERT vs. Wav2Vec 2.0
wav2vec 2.0 learns with a contrastive loss over quantized latents, while HuBERT predicts fixed K-Means cluster labels with a cross-entropy loss. Because the targets are computed offline rather than jointly with the model, HuBERT's training is often more stable.
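The HuBERT side of this comparison can be written down compactly: a cross-entropy loss against precomputed cluster labels, evaluated only at masked frames. The logits and targets below are random stand-ins; in training they would come from the Transformer output and the K-Means step, respectively.

```python
import numpy as np

rng = np.random.default_rng(2)
num_frames, num_clusters = 50, 100

# Stand-ins: model logits over the cluster codebook, K-Means targets,
# and a boolean mask marking which frames were hidden from the model.
logits = rng.normal(size=(num_frames, num_clusters))
targets = rng.integers(0, num_clusters, size=num_frames)
mask = rng.random(num_frames) < 0.5

def masked_cluster_loss(logits, targets, mask):
    """HuBERT-style objective: cross-entropy vs. cluster labels at masked frames."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]      # per-frame negative log-likelihood
    return nll[mask].mean()                                  # average over masked frames only

loss = masked_cluster_loss(logits, targets, mask)
```

A contrastive loss, by contrast, must sample negatives and learn its quantizer jointly, which is one source of the training instability mentioned above.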
HuBERT vs. Whisper
Whisper is an end-to-end ASR model trained with supervision on large-scale labeled audio; HuBERT is self-supervised and provides general-purpose features for many downstream tasks.