
    Gradient Checkpointing

    Also known as:
    Activation Checkpointing
    Rematerialization
    Memory-Efficient Training
    Updated: 2/9/2026

Gradient checkpointing saves GPU memory by discarding intermediate activations during the forward pass and recomputing them during the backward pass, trading compute for memory.

    Quick Summary

Gradient checkpointing discards activations and recomputes them during the backward pass, saving roughly 60% of GPU memory at the cost of roughly 30% more compute.

    Explanation

Standard training stores every intermediate activation for the backward pass, which costs O(n) memory for a network of n layers. Checkpointing instead stores activations only at selected layers (the checkpoints) and recomputes everything between them on demand during backpropagation. With checkpoints placed roughly every √n layers, activation memory drops to O(√n) at the cost of about one extra forward pass; in practice this means roughly 60-70% less activation memory for about 30% more compute.
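Here is a minimal PyTorch sketch using torch.utils.checkpoint.checkpoint; the CheckpointedMLP module, its dimensions, and the per-block checkpoint granularity are illustrative choices, not a canonical recipe.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """A deep MLP whose hidden blocks are checkpointed: activations inside
    each block are freed after the forward pass and recomputed on backward."""

    def __init__(self, dim=1024, n_blocks=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            for _ in range(n_blocks)
        )

    def forward(self, x):
        for block in self.blocks:
            # checkpoint() stores only the block's input; the block's
            # intermediate activations are rematerialized during backward.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
x = torch.randn(32, 1024, requires_grad=True)
loss = model(x).sum()
loss.backward()  # re-runs each block's forward before computing its gradients
```

Coarser blocks (more layers per checkpoint) save more memory but recompute more; the checkpoint granularity is the main tuning knob.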

    Marketing Relevance

Checkpointing makes it possible to train models roughly twice as large on the same GPU, and it is standard practice in LLM training and fine-tuning.

    Origin & History

Chen et al. (2016) formalized gradient checkpointing for deep networks in "Training Deep Nets with Sublinear Memory Cost". The technique became essential for training models that otherwise would not fit in GPU memory. Both PyTorch and TensorFlow ship it as a standard feature, and virtually all modern LLM training runs rely on it.

    Comparisons & Differences

    Gradient Checkpointing vs. Gradient Accumulation

Checkpointing reduces activation memory by recomputing activations (more compute per step). Gradient accumulation reduces batch memory by splitting one large batch into several micro-batches whose gradients are summed before the optimizer step (more steps, same compute per sample); see the sketch below.
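As a point of contrast, here is a minimal gradient-accumulation sketch in PyTorch; the model, the synthetic data, and the accum_steps value are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Illustrative setup: a tiny model and synthetic micro-batches.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
micro_batches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

accum_steps = 4  # gradients from 4 micro-batches emulate one 4x-larger batch

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    # Scale each loss so the accumulated gradient matches the big-batch mean.
    loss = criterion(model(x), y) / accum_steps
    loss.backward()  # gradients sum into .grad; activation memory is unchanged
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note that accumulation only changes the effective batch size; unlike checkpointing, it does nothing for the per-micro-batch activation footprint.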

    Gradient Checkpointing vs. Mixed Precision Training

Checkpointing discards activations and recomputes them; mixed precision keeps them, but stores activations (and gradients) in FP16/BF16 instead of FP32, roughly halving their memory. The two techniques are complementary and frequently combined; a sketch follows below.
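For comparison, a minimal mixed-precision training step using torch.autocast and a gradient scaler; it assumes a CUDA GPU is available, and the model and data are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid FP16 underflow
x, y = torch.randn(8, 16, device="cuda"), torch.randn(8, 1, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    # Activations inside this region are computed and stored in FP16,
    # roughly halving their memory versus FP32; no recomputation involved.
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```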


    Related Terms

Gradient Accumulation
Mixed Precision
Memory Optimization
Backpropagation