ZeRO (Zero Redundancy Optimizer)
A memory optimization for distributed training that shards optimizer states, gradients, and parameters across GPUs instead of replicating them, enabling training of trillion-parameter models.
By partitioning training state rather than duplicating it on every data-parallel rank, ZeRO eliminates redundancy and makes models trainable that would otherwise not fit in GPU memory.
Explanation
ZeRO comes in three stages: ZeRO-1 shards the optimizer states (up to 4x memory reduction with Adam in mixed precision), ZeRO-2 additionally shards gradients (up to 8x), and ZeRO-3 also shards the parameters themselves, so per-GPU memory shrinks roughly linearly with the number of GPUs. ZeRO-Infinity extends this with offloading to CPU memory and NVMe. With N GPUs, each GPU holds only 1/N of the sharded state; see the sketch below.
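As a minimal sketch, a DeepSpeed configuration selecting a ZeRO stage; the config keys are real DeepSpeed options, but the toy model and batch size are illustrative, and a real run would be launched with the `deepspeed` launcher:

```python
import torch
import deepspeed

# Tiny stand-in model; a real run would use an actual network and the
# `deepspeed` launcher (e.g. `deepspeed train.py`) to set up distribution.
model = torch.nn.Linear(1024, 1024)

# Minimal config sketch: "stage" selects the ZeRO stage.
#   1 = shard optimizer states, 2 = + gradients, 3 = + parameters
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```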
Marketing Relevance
ZeRO reshaped LLM training: without ZeRO-style sharding, training models with 100B+ parameters on standard GPU clusters would be impractical. It is the core of Microsoft's DeepSpeed library and the concept behind PyTorch FSDP.
Example
Training a 13B-parameter model with Adam in mixed precision takes roughly 16 bytes per parameter for weights, gradients, and optimizer states, about 208 GB in total, far beyond any single GPU. With ZeRO-3 on 8 GPUs, that state is sharded so each GPU holds only ~26 GB (activations aside), an 8x reduction.
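The arithmetic can be checked with a back-of-the-envelope script, assuming the ZeRO paper's accounting of ~16 bytes per parameter for mixed-precision Adam (fp16 weights + fp16 gradients + fp32 master weights, momentum, and variance); activations and buffers are ignored:

```python
# Per-GPU memory for model/gradient/optimizer state under the
# ZeRO paper's ~16 bytes/parameter for mixed-precision Adam.
PARAMS = 13e9          # 13B-parameter model
BYTES_PER_PARAM = 16   # 2 (fp16 weights) + 2 (fp16 grads) + 12 (Adam states)
GPUS = 8

total_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"Replicated (plain DDP): {total_gb:.0f} GB per GPU")         # ~208 GB
print(f"ZeRO-3 on {GPUS} GPUs:   {total_gb / GPUS:.0f} GB per GPU") # ~26 GB
```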
Common Pitfalls
ZeRO-3 incurs higher communication overhead than ZeRO-1/2 because parameters must be gathered for every forward and backward pass. ZeRO-Infinity is slow, since CPU and NVMe bandwidth become the bottleneck. Configuration is non-trivial: stage choice and offloading options must match the model size and hardware (see the sketch below).
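As an illustration of the offloading options, a hedged config sketch with ZeRO-Infinity-style CPU offloading; the keys are real DeepSpeed options, the values illustrative:

```python
# ZeRO-3 with optimizer and parameter offload to CPU (ZeRO-Infinity style).
# Offloading trades step time for GPU memory; NVMe ("device": "nvme")
# pushes memory savings further and is slower still.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}
```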
Origin & History
Rajbhandari et al. (Microsoft, 2020) published ZeRO as part of DeepSpeed. ZeRO-Infinity (2021) added CPU/NVMe offloading. PyTorch FSDP (2022) implemented ZeRO-3-like sharding natively. Today, ZeRO-style sharding is standard practice in large-scale LLM training.
Comparisons & Differences
ZeRO (Zero Redundancy Optimizer) vs. FSDP
ZeRO is DeepSpeed's implementation; FSDP is PyTorch's native implementation of the same concept (sharding parameters, gradients, and optimizer states).
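For comparison, a minimal FSDP sketch, assuming launch via `torchrun`; `ShardingStrategy.FULL_SHARD` corresponds roughly to ZeRO-3 and `SHARD_GRAD_OP` to ZeRO-2:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Assumes launch via `torchrun`, which sets the env vars init expects.
dist.init_process_group("nccl")

model = torch.nn.Linear(1024, 1024).cuda()

# FULL_SHARD ≈ ZeRO-3 (params + grads + optimizer state sharded);
# SHARD_GRAD_OP ≈ ZeRO-2 (grads + optimizer state sharded).
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
```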
ZeRO (Zero Redundancy Optimizer) vs. Data Parallelism (DDP)
DDP replicates the full model, gradients, and optimizer state on every GPU; ZeRO shards them and gathers parameters on demand, using dramatically less memory per GPU.
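For contrast, a plain DDP wrap under the same assumed `torchrun` setup keeps a full replica on every rank:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")

# Every rank holds the complete model, gradients, and optimizer state;
# only the gradient all-reduce is communicated. Per-GPU memory does
# not shrink as more GPUs are added.
model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```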