Gradient Accumulation
Gradient accumulation sums gradients over several mini-batches before applying an optimizer step, simulating a larger batch size without requiring more GPU memory. This makes it possible to train models whose target batch size would otherwise not fit on the available hardware.
Explanation
Instead of running a batch of 32 on one GPU, you accumulate the gradients of 4 mini-batches of 8 and then perform a single weight update. The result is effectively equivalent to a batch of 32, but only the memory for a batch of 8 is needed at any point. This is a standard technique for fine-tuning on consumer GPUs.
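A minimal PyTorch sketch of this loop, assuming a placeholder model, optimizer, and synthetic data (all names and sizes are illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative setup: a tiny model and synthetic data stand in for a real task.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=8)  # a mini-batch of 8 fits in memory

accumulation_steps = 4  # 4 x 8 = effective batch size 32

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = nn.functional.cross_entropy(model(inputs), targets)
    # Divide by the number of accumulation steps so the summed gradient
    # matches the mean-loss gradient of a single batch of 32.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one optimizer update per accumulated batch
        optimizer.zero_grad()  # clear gradients before the next accumulation
```

Dividing the loss by the number of accumulation steps keeps the accumulated gradient on the same scale as a single large-batch update, so the learning rate does not need to be retuned.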
Marketing Relevance
Enables training large models on small GPUs – essential for LoRA fine-tuning and edge ML.
Origin & History
The technique has existed since the early days of GPU training. It became increasingly important with the trend toward ever-larger models and limited consumer GPU memory (2020+). Tools like HuggingFace Trainer and DeepSpeed integrate gradient accumulation as a standard feature.
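In the HuggingFace Trainer, for instance, accumulation is exposed as a single configuration field. A minimal sketch (other arguments omitted; `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

# Per-device batch of 8, accumulated over 4 steps -> effective batch size 32.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
)
```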
Comparisons & Differences
Gradient Accumulation vs. Gradient Checkpointing
Accumulation avoids the memory cost of a large batch by splitting it into smaller mini-batches; checkpointing saves memory by recomputing intermediate activations during the backward pass instead of storing them. The two are complementary and are often combined.
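To make the contrast concrete, a minimal PyTorch sketch of checkpointing (the module and tensor shapes are placeholders):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Activations inside `block` are not stored during the forward pass;
# they are recomputed when the backward pass reaches this block.
block = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
x = torch.randn(8, 16, requires_grad=True)

out = checkpoint(block, x, use_reentrant=False)  # saves activation memory
out.sum().backward()                             # activations recomputed here
```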