FSDP (Fully Sharded Data Parallel)
PyTorch's native implementation of parameter sharding – distributes model parameters, gradients, and optimizer states across GPUs for memory-efficient training.
FSDP is PyTorch's native parameter sharding solution – each GPU holds only 1/N of the parameters, enabling training of massive models without DeepSpeed.
Explanation
FSDP shards all model parameters: each GPU holds only 1/N of them. Before each forward and backward pass, the parameters that are needed are gathered via AllGather and released again after computation. Conceptually this is identical to DeepSpeed ZeRO-3, but native to PyTorch.
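A minimal sketch of this flow using PyTorch's FSDP1 API (FullyShardedDataParallel). The toy model, dimensions, and the assumption of a `torchrun --nproc_per_node=N` launch are illustrative choices, not taken from the text above:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Assumes launch via torchrun, which sets rank/world-size environment variables.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model for illustration only.
model = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)
).cuda()

# FULL_SHARD shards parameters, gradients, and optimizer states
# (the ZeRO-3-equivalent behavior described above).
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()   # parameters are all-gathered just in time, then freed
loss.backward()         # gradients are reduce-scattered back to the shards
optimizer.step()        # each rank updates only its own parameter shard
```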
Marketing Relevance
FSDP is the new standard for LLM training in PyTorch – replaces DDP for large models and provides memory efficiency without external libraries.
Example
Llama-2 training uses FSDP: a 70B model is sharded across 512 GPUs, so each GPU holds only ~280 MB of parameter shards (at 2 bytes per parameter) instead of the full 140 GB copy. Training scales nearly linearly.
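A back-of-the-envelope check of these numbers (70B parameters, 512 GPUs, 2 bytes per parameter as implied by the 140 GB figure):

```python
params = 70e9          # 70B parameters
gpus = 512
bytes_per_param = 2    # bf16/fp16

full_copy_gb = params * bytes_per_param / 1e9       # ~140 GB for an unsharded copy
shard_mb = params / gpus * bytes_per_param / 1e6     # ~273 MB per GPU shard
print(f"full copy: {full_copy_gb:.0f} GB, per-GPU shard: {shard_mb:.0f} MB")
```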
Common Pitfalls
Configuration is complex (sharding strategy, mixed precision, CPU offloading; see the sketch below). Debugging is harder than with DDP. Not all custom layers are FSDP-compatible. For small models, the communication overhead outweighs the memory savings.
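A hedged sketch of the configuration surface mentioned above, again with the FSDP1 API. The `Block` class is a hypothetical stand-in for a transformer layer, and the process group is assumed to be initialized as in the earlier sketch:

```python
import functools
import torch
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    MixedPrecision,
    CPUOffload,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class Block(nn.Module):
    """Hypothetical transformer-style block, for illustration only."""
    def __init__(self, dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

model = nn.Sequential(*[Block() for _ in range(4)]).cuda()

mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,    # gather/compute parameters in bf16
    reduce_dtype=torch.bfloat16,   # reduce-scatter gradients in bf16
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,     # shard params, grads, optimizer state
    mixed_precision=mp_policy,
    cpu_offload=CPUOffload(offload_params=True),       # optional: park idle shards on CPU
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={Block},                 # wrap each block as its own FSDP unit
    ),
)
```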
Origin & History
FairScale (Meta, 2021) shipped the first FSDP implementation. PyTorch integrated FSDP natively in v1.12 (2022). FSDP2 (2024) simplified the API and improved performance. Meta uses FSDP for all Llama training runs.
Comparisons & Differences
FSDP (Fully Sharded Data Parallel) vs. DeepSpeed ZeRO
FSDP: PyTorch-native, simpler integration. ZeRO: More features (ZeRO-Infinity, Expert Parallelism), better scaling beyond 1000 GPUs.