
    FSDP (Fully Sharded Data Parallel)

    Also known as:
    Fully Sharded Data Parallel
    FSDP
    PyTorch FSDP
    FairScale FSDP
    Updated: 2/11/2026

    PyTorch's native implementation of parameter sharding – distributes model parameters, gradients, and optimizer states across GPUs for memory-efficient training.

    Quick Summary

FSDP is PyTorch's native parameter sharding solution – each GPU holds only 1/N of the parameters, enabling training of massive models without DeepSpeed.

    Explanation

FSDP shards all model parameters: each GPU holds only 1/N of them. Before each forward and backward pass, the parameters a layer needs are gathered via AllGather and released again after computation; gradients are reduce-scattered back to their owning shards. Conceptually this is equivalent to DeepSpeed ZeRO-3, but native to PyTorch.
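A minimal sketch of what this looks like in code. The wrapper class comes from torch.distributed.fsdp; the toy model, dimensions, and single training step are placeholders, and a real run would be launched with torchrun so the process group can initialize from environment variables.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via torchrun, which sets the rank/world-size env vars.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model; in practice this would be a transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# Wrapping shards parameters, gradients, and optimizer state across ranks.
# Before each forward/backward the needed shards are AllGathered, then freed.
model = FSDP(model)

# Create the optimizer after wrapping so it sees the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()   # gradients are reduce-scattered back to their shards
optimizer.step()  # each rank updates only its own parameter shard

dist.destroy_process_group()
```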

    Marketing Relevance

FSDP is the new standard for LLM training in PyTorch – it replaces DDP for large models and provides memory efficiency without external libraries.

    Example

Llama-2 training uses FSDP: a 70B-parameter model is sharded across 512 GPUs, so each GPU holds only ~280 MB of parameters (fp16) instead of the full 140 GB. Training scales nearly linearly.
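The per-GPU figure follows directly from the shard arithmetic; the snippet below just reproduces it, assuming fp16 parameters (2 bytes each) and ignoring gradients, optimizer states, and activations.

```python
params = 70e9           # 70B parameters
bytes_per_param = 2     # fp16
world_size = 512        # number of GPUs

full_model_gb = params * bytes_per_param / 1e9             # ~140 GB unsharded
per_gpu_mb = params * bytes_per_param / world_size / 1e6   # ~273 MB per GPU shard
print(f"{full_model_gb:.0f} GB total, {per_gpu_mb:.0f} MB per GPU")
```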

    Common Pitfalls

Configuration is complex (sharding strategy, mixed precision, CPU offloading – see the sketch below). Debugging is harder than with DDP. Not all custom layers are FSDP-compatible. For small models, the communication overhead can outweigh the memory savings.
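A sketch of the main configuration knobs mentioned above, using the torch.distributed.fsdp API; the dtype and offload choices are illustrative, not recommendations, and the process group is assumed to be initialized as in the earlier sketch.

```python
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    MixedPrecision,
    CPUOffload,
)

# Placeholder model; torch.distributed must already be initialized.
model = torch.nn.Linear(1024, 1024).cuda()

model = FSDP(
    model,
    # FULL_SHARD is ZeRO-3-style; SHARD_GRAD_OP (ZeRO-2-style) and
    # NO_SHARD (DDP-like) trade memory savings for less communication.
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    # Run compute and gradient communication in bf16.
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    # Offloading parameters to CPU saves GPU memory at the cost of speed.
    cpu_offload=CPUOffload(offload_params=False),
)
```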

    Origin & History

FairScale (Meta, 2021) shipped the first FSDP implementation. PyTorch integrated FSDP natively in v1.12 (2022). FSDP2 (2024) simplified the API and improved performance. Meta uses FSDP for all Llama training runs.

    Comparisons & Differences

    FSDP (Fully Sharded Data Parallel) vs. DeepSpeed ZeRO

FSDP: PyTorch-native, simpler integration. ZeRO: more features (ZeRO-Infinity, expert parallelism), better scaling beyond 1,000 GPUs.
