ZeRO (Zero Redundancy Optimizer)
A memory optimization for distributed training that shards optimizer states, gradients, and parameters across GPUs instead of replicating them, enabling training of trillion-parameter models.
By partitioning training state rather than duplicating it on every data-parallel rank, ZeRO eliminates redundancy and makes models trainable that would otherwise not fit in GPU memory.
Explanation
ZeRO comes in three stages: ZeRO-1 shards the optimizer states (up to 4x memory reduction with Adam in mixed precision), ZeRO-2 additionally shards gradients (up to 8x), and ZeRO-3 also shards the parameters themselves, so per-GPU memory shrinks roughly linearly with the number of GPUs. ZeRO-Infinity extends this with offloading to CPU memory and NVMe. With N GPUs, each GPU holds only 1/N of the sharded state; see the sketch below.
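As a minimal sketch, a DeepSpeed configuration selecting a ZeRO stage; the config keys are real DeepSpeed options, but the toy model and batch size are illustrative, and a real run would be launched with the `deepspeed` launcher:

```python
import torch
import deepspeed

# Tiny stand-in model; a real run would use an actual network and the
# `deepspeed` launcher (e.g. `deepspeed train.py`) to set up distribution.
model = torch.nn.Linear(1024, 1024)

# Minimal config sketch: "stage" selects the ZeRO stage.
#   1 = shard optimizer states, 2 = + gradients, 3 = + parameters
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```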
Marketing Relevance
ZeRO reshaped LLM training: without ZeRO-style sharding, training models with 100B+ parameters on standard GPU clusters would be impractical. It is the core of Microsoft's DeepSpeed library and the concept behind PyTorch FSDP.
Example
Training a 13B-parameter model with Adam in mixed precision takes roughly 16 bytes per parameter for weights, gradients, and optimizer states, about 208 GB in total, far beyond any single GPU. With ZeRO-3 on 8 GPUs, that state is sharded so each GPU holds only ~26 GB (activations aside), an 8x reduction.
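The arithmetic can be checked with a back-of-the-envelope script, assuming the ZeRO paper's accounting of ~16 bytes per parameter for mixed-precision Adam (fp16 weights + fp16 gradients + fp32 master weights, momentum, and variance); activations and buffers are ignored:

```python
# Per-GPU memory for model/gradient/optimizer state under the
# ZeRO paper's ~16 bytes/parameter for mixed-precision Adam.
PARAMS = 13e9          # 13B-parameter model
BYTES_PER_PARAM = 16   # 2 (fp16 weights) + 2 (fp16 grads) + 12 (Adam states)
GPUS = 8

total_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"Replicated (plain DDP): {total_gb:.0f} GB per GPU")         # ~208 GB
print(f"ZeRO-3 on {GPUS} GPUs:   {total_gb / GPUS:.0f} GB per GPU") # ~26 GB
```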
Common Pitfalls
ZeRO-3 incurs higher communication overhead than ZeRO-1/2 because parameters must be gathered for every forward and backward pass. ZeRO-Infinity is slow, since CPU and NVMe bandwidth become the bottleneck. Configuration is non-trivial: stage choice and offloading options must match the model size and hardware (see the sketch below).
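As an illustration of the offloading options, a hedged config sketch with ZeRO-Infinity-style CPU offloading; the keys are real DeepSpeed options, the values illustrative:

```python
# ZeRO-3 with optimizer and parameter offload to CPU (ZeRO-Infinity style).
# Offloading trades step time for GPU memory; NVMe ("device": "nvme")
# pushes memory savings further and is slower still.
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}
```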
Origin & History
Rajbhandari et al. (Microsoft, 2020) published ZeRO as part of DeepSpeed. ZeRO-Infinity (2021) added CPU/NVMe offloading. PyTorch FSDP (2022) implemented ZeRO-3-like sharding natively. Today, ZeRO-style sharding is standard practice in large-scale LLM training.
Comparisons & Differences
ZeRO (Zero Redundancy Optimizer) vs. FSDP
ZeRO is DeepSpeed's implementation; FSDP is PyTorch's native implementation of the same concept (sharding parameters, gradients, and optimizer states).
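For comparison, a minimal FSDP sketch, assuming launch via `torchrun`; `ShardingStrategy.FULL_SHARD` corresponds roughly to ZeRO-3 and `SHARD_GRAD_OP` to ZeRO-2:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Assumes launch via `torchrun`, which sets the env vars init expects.
dist.init_process_group("nccl")

model = torch.nn.Linear(1024, 1024).cuda()

# FULL_SHARD ≈ ZeRO-3 (params + grads + optimizer state sharded);
# SHARD_GRAD_OP ≈ ZeRO-2 (grads + optimizer state sharded).
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
```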
ZeRO (Zero Redundancy Optimizer) vs. Data Parallelism (DDP)
DDP replicates the full model, gradients, and optimizer state on every GPU; ZeRO shards them and gathers parameters on demand, using dramatically less memory per GPU.
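For contrast, a plain DDP wrap under the same assumed `torchrun` setup keeps a full replica on every rank:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")

# Every rank holds the complete model, gradients, and optimizer state;
# only the gradient all-reduce is communicated. Per-GPU memory does
# not shrink as more GPUs are added.
model = torch.nn.Linear(1024, 1024).cuda()
model = DDP(model)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```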