Gradient Checkpointing
Gradient checkpointing saves GPU memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass, trading roughly 30% extra compute for roughly 60-70% less activation memory.
Explanation
Normally, training stores every intermediate activation for the backward pass, which costs O(n) memory for a network with n layers. Checkpointing stores activations only at selected layers and recomputes the missing ones segment by segment during the backward pass; with about √n evenly spaced checkpoints, activation memory drops to O(√n). In practice this saves roughly 60-70% of activation memory at the cost of roughly 30% extra compute (essentially one additional forward pass).
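A minimal PyTorch sketch using torch.utils.checkpoint.checkpoint_sequential is shown below; the layer sizes, depth, and segment count are illustrative assumptions, not a recommended configuration.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack whose activations would normally all be kept for backward.
layers = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                         for _ in range(16)])

x = torch.randn(32, 1024, requires_grad=True)

# Split the stack into 4 segments: only segment-boundary activations are kept;
# each segment is re-run during backward to rebuild its discarded activations.
out = checkpoint_sequential(layers, 4, x, use_reentrant=False)
loss = out.sum()
loss.backward()  # recomputation happens here, one checkpointed segment at a time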
Marketing Relevance
Enables training models roughly twice as large on the same GPU; a standard technique in LLM training and fine-tuning.
Origin & History
Chen et al. (2016) formalized gradient checkpointing for deep networks, showing that activation memory can be made sublinear in depth. The technique became essential for training models that would otherwise not fit in GPU memory. PyTorch and TensorFlow ship it as a standard feature (torch.utils.checkpoint and tf.recompute_grad), and most modern LLM training runs use it.
Comparisons & Differences
Gradient Checkpointing vs. Gradient Accumulation
Checkpointing reduces activation memory by recomputing activations (more compute per step); gradient accumulation reduces per-step batch memory by splitting a large batch into micro-batches and summing their gradients (more optimizer-free steps, same compute per sample). A sketch of accumulation follows below.
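For contrast, a minimal gradient accumulation loop is sketched below; model, loss_fn, optimizer, and loader are assumed placeholders, and the accumulation step count is illustrative.

accumulation_steps = 4  # illustrative value, not a recommendation

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    # Scale so the summed gradient matches one large-batch gradient.
    loss = loss_fn(model(inputs), targets) / accumulation_steps
    loss.backward()                      # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                 # one update per effective large batch
        optimizer.zero_grad()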
Gradient Checkpointing vs. Mixed Precision Training
Checkpointing discards activations and recomputes them; mixed precision keeps them but stores activations (and runs most math) in FP16/BF16 instead of FP32, roughly halving their memory footprint. The two techniques are complementary and often combined, as in the sketch below.
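A minimal mixed precision training step in PyTorch, assuming placeholder model, optimizer, loss_fn, and loader; the device and dtype choices are illustrative.

import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow

for inputs, targets in loader:
    optimizer.zero_grad()
    # Forward pass runs in FP16 where safe; activations are stored in FP16.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()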