Layer Dropping
A compression technique that removes entire transformer layers from a trained model – the simplest way to make an LLM smaller and faster.
In short: removing entire transformer layers can speed up LLM inference by roughly 20-30% at a typical quality loss of only 2-5%.
Explanation
Studies show that many middle transformer layers are largely redundant and can be removed with less than 5% quality loss, while the first and last layers are more critical. Layer dropping can work without any retraining, and results improve further with a short fine-tuning phase.
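A minimal sketch of what this looks like in practice, assuming a Hugging Face transformers model with a Llama-style layout (decoder layers in model.model.layers); the model name and the choice of which layers to drop are purely illustrative:

```python
# Minimal sketch (assumes Hugging Face transformers and a Llama-style model);
# the checkpoint name and the dropped block are illustrative, not a recommendation.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Decoder layers live in model.model.layers for Llama-style architectures.
layers = model.model.layers
n = len(layers)  # e.g. 32 for the 7B model

# Drop a contiguous block of middle layers (~25% here), keeping the
# first and last layers, which are typically the most critical.
drop = set(range(n // 2 - n // 8, n // 2 + n // 8))
model.model.layers = nn.ModuleList(
    layer for i, layer in enumerate(layers) if i not in drop
)
model.config.num_hidden_layers = len(model.model.layers)
```

After removing layers, perplexity or downstream accuracy should be re-measured; a short fine-tuning run typically recovers part of the loss.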
Marketing Relevance
Layer dropping is the "brute force" method of LLM compression: Remove 25% of layers, lose 2-5% quality, save 25% inference cost. Ideal for quick first optimizations.
Example
Men et al. (2024) showed that a Llama-2 model with roughly 20% fewer layers loses only about 3% quality – and is immediately around 20% faster and cheaper to run.
Common Pitfalls
Not all layers are equally removable – the first and last layers are critical. Reasoning and math tasks are affected more strongly than others. Without fine-tuning, quality losses can be unpredictable.
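One way to decide which layers are safe to remove is to score how little each layer changes the hidden state. The sketch below uses input/output cosine similarity as a simple redundancy proxy – a simplification inspired by, but not identical to, ShortGPT's layer-importance scoring; the function name and calibration texts are hypothetical:

```python
# Hedged sketch: rank decoder layers by how little they change the hidden state
# (high input/output cosine similarity ≈ more redundant). Simplified proxy,
# not the exact metric from the ShortGPT paper.
import torch
import torch.nn.functional as F

@torch.no_grad()
def layer_redundancy(model, tokenizer, texts):
    scores = None
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        hs = out.hidden_states  # tuple: embedding output + one entry per layer
        # sims[i] measures how much decoder layer i changes its input.
        sims = torch.stack([
            F.cosine_similarity(hs[i], hs[i + 1], dim=-1).mean()
            for i in range(len(hs) - 1)
        ])
        scores = sims if scores is None else scores + sims
    # Higher score = layer changes the hidden state less = better drop candidate.
    return scores / len(texts)
```

Layers with the highest scores are natural first candidates to drop; the ranking should still be validated on the actual target tasks before committing.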
Origin & History
Fan et al. (2019) studied layer dropping for efficient transformer training. Sajjad et al. (2023) showed BERT layers can be systematically removed. Men et al. (2024, "ShortGPT") demonstrated this for LLMs.
Comparisons & Differences
Layer Dropping vs. Structured Pruning
Structured pruning removes attention heads or FFN dimensions; layer dropping removes entire layers – coarser but simpler to implement.
Layer Dropping vs. Knowledge Distillation
Distillation trains a new, smaller student model to imitate the original; layer dropping directly modifies the existing model by removing layers.