
    Layer Dropping

    Also known as:
    Depth Pruning
    Layer Pruning
    Layer Removal
    Layer Skipping
    Updated: 2/11/2026

    A compression technique that removes entire transformer layers from a trained model – the simplest way to make an LLM smaller and faster.

    Quick Summary

    Layer Dropping removes entire transformer layers – the simplest way to speed up LLMs by 20-30% at only 2-5% quality loss.

    Explanation

    Studies show that many of the middle transformer layers are largely redundant and can be removed with less than 5% quality loss, while the first and last layers are far more critical. Layer dropping can be applied without any retraining, or the lost quality can be partly recovered with a short fine-tuning run, as in the sketch below.
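
    A minimal sketch of layer dropping with Hugging Face transformers, assuming a Llama-style decoder-only model; the model name and the block of layers to drop are illustrative assumptions, not a recommendation.

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",  # assumption: any Llama-style causal LM with model.model.layers
        torch_dtype=torch.float16,
    )

    layers = model.model.layers   # decoder blocks live here for Llama-style models
    drop = set(range(20, 28))     # assumption: drop 8 middle layers (~25% of 32)

    # Keep everything except the chosen middle block; first and last layers stay untouched.
    kept = [layer for i, layer in enumerate(layers) if i not in drop]
    model.model.layers = torch.nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)

    # The pruned model can be served as-is (zero-shot) or briefly fine-tuned to recover quality.

    In practice the layers to drop would be chosen from an importance score (see Common Pitfalls below) rather than a fixed index range.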

    Marketing Relevance

    Layer dropping is the "brute force" method of LLM compression: remove 25% of the layers, lose 2-5% quality, save about 25% of inference cost. Ideal as a quick first optimization.

    Example

    Men et al. (2024) showed that Llama-2 70B with 20% fewer layers (80→64) loses only about 3% quality – immediately 20% faster and cheaper.

    Common Pitfalls

    Not all layers are equally removable – the first and last layers are critical. Reasoning and math tasks suffer more than other tasks. Without fine-tuning, quality losses can be unpredictable, so it pays to score layer importance before dropping anything, as in the sketch below.
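
    A minimal sketch of scoring layers before removal, loosely following the ShortGPT idea: rank layers by how much they change the hidden states, since layers whose output is nearly identical to their input are the safest candidates to drop. The model name and probe text are assumptions; a real run would use a few hundred representative samples.

    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM exposing hidden states
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).eval()

    inputs = tokenizer("Layer dropping removes redundant transformer layers.", return_tensors="pt")

    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states

    # hidden[i] is the input to decoder layer i, hidden[i + 1] is its output.
    scores = []
    for i in range(len(hidden) - 1):
        sim = F.cosine_similarity(hidden[i], hidden[i + 1], dim=-1).mean().item()
        scores.append((sim, i))

    # High similarity = the layer barely changes the representation = safest to remove.
    for sim, i in sorted(scores, reverse=True)[:8]:
        print(f"layer {i:2d}: cosine similarity {sim:.3f}")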

    Origin & History

    Fan et al. (2019) studied layer dropping for efficient transformer training. Sajjad et al. (2023) showed BERT layers can be systematically removed. Men et al. (2024, "ShortGPT") demonstrated this for LLMs.

    Comparisons & Differences

    Layer Dropping vs. Structured Pruning

    Structured pruning removes attention heads or FFN dimensions; layer dropping removes entire layers – coarser but simpler to implement.
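
    To make the contrast concrete, here is an illustrative sketch using BERT, for which transformers ships built-in attention-head pruning; the layer and head indices are arbitrary examples.

    import torch
    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-uncased")

    # Structured pruning: remove individual attention heads inside chosen layers (fine-grained).
    model.prune_heads({2: [0, 3], 5: [7]})  # drop heads 0 and 3 in layer 2, head 7 in layer 5

    # Layer dropping: remove whole encoder layers (coarse, but trivially simple).
    kept = [layer for i, layer in enumerate(model.encoder.layer) if i not in {5, 6, 7}]
    model.encoder.layer = torch.nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)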

    Layer Dropping vs. Knowledge Distillation

    Distillation trains a new model; layer dropping modifies the existing model by removing layers.
