Data Parallelism
The simplest form of distributed training: each GPU holds a complete copy of the model and processes a different batch of data – the gradients are synchronized across GPUs.
Data parallelism replicates the model on every GPU and distributes the data across them – the simplest multi-GPU strategy, with near-linear speedup.
Explanation
Each GPU processes its own mini-batch and computes gradients locally; the gradients are then averaged via AllReduce and every model copy is updated synchronously, so all replicas stay identical. Scaling is near-linear until gradient communication becomes the bottleneck. PyTorch DistributedDataParallel (DDP) is the standard implementation.
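A minimal sketch of this pattern with PyTorch DDP, assuming a single-node launch via torchrun (the toy model, dataset, hyperparameters, and script name are illustrative placeholders, not a specific fine-tuning setup):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])          # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 10).cuda(local_rank)   # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])          # full model copy on every GPU

    dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)                # each rank sees a different shard
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                         # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()     # DDP averages gradients via AllReduce during backward
            optimizer.step()    # identical update on every rank

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like torchrun --nproc_per_node=4 train_ddp.py (script name hypothetical), this starts four processes that each hold a full model copy and train on a quarter of the data.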
Marketing Relevance
Data parallelism is the default for multi-GPU training whenever the model fits on a single GPU – simple to set up, efficient, and with near-linear speedup.
Example
Fine-tuning a 7B LLM on 4 A100 GPUs: each GPU holds the full model weights (14 GB in FP16) and processes a per-GPU batch size of 8, giving an effective batch size of 32. Training is close to 4x faster than on a single GPU.
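The numbers in this example follow directly from the parameter count and the per-GPU batch size; a quick back-of-the-envelope check (weights only – gradients, optimizer states, and activations add further memory per GPU):

```python
# Rough arithmetic behind the 7B / 4x A100 example above.
params = 7e9                     # 7B parameters
bytes_per_param_fp16 = 2         # FP16 = 2 bytes per parameter
weights_gb = params * bytes_per_param_fp16 / 1e9
print(weights_gb)                # 14.0 GB of weights, replicated on every GPU

per_gpu_batch = 8
num_gpus = 4
print(per_gpu_batch * num_gpus)  # 32 = effective (global) batch size
```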
Common Pitfalls
The model must fit entirely on each GPU, and memory is used redundantly (N full copies). Communication overhead grows with the number of GPUs. For very large models, FSDP/ZeRO is needed instead.
Origin & History
Data-parallel training has existed since the 1990s. PyTorch DataParallel (DP) was the first, simple single-process implementation in PyTorch. PyTorch DDP (2019) improved efficiency by bucketing gradients and overlapping AllReduce with the backward pass. Horovod (Uber, 2018) popularized ring AllReduce for efficient gradient synchronization.
Comparisons & Differences
Data Parallelism vs. Model Parallelism
Data parallel: the whole model on each GPU, data distributed across GPUs. Model parallel: the model is split across GPUs – needed when the model no longer fits on a single GPU.
Data Parallelism vs. FSDP / ZeRO
DDP holds a complete model copy on every GPU; FSDP/ZeRO shard the model parameters (plus gradients and optimizer states) across GPUs – saving memory at comparable speedup.
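A minimal sketch of that difference at the wrapping level, assuming the same torchrun launch as in the DDP example above (auto-wrap policies, mixed precision, and the training loop are omitted):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 10).cuda(local_rank)   # stand-in for a large model

# DDP would be: DistributedDataParallel(model, device_ids=[local_rank]) -> full copy per GPU
model = FSDP(model)   # parameters, gradients, and optimizer state are sharded across ranks

dist.destroy_process_group()
```

The training loop itself stays the same; in simple cases only the wrapper changes, which is what makes FSDP a near drop-in option once the model no longer fits as a full copy.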