Structured Pruning
A pruning variant that removes entire structures (neurons, filters, attention heads, or layers) instead of individual weights, delivering real speedups on standard hardware without specialized sparse support.
Explanation
Unlike unstructured pruning (zeroing individual weights), structured pruning removes contiguous blocks: entire convolutional filters, attention heads, or even whole layers. The result is a genuinely smaller dense model that needs no sparse representation.
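As a hedged illustration, here is a minimal PyTorch-style sketch. The helper prune_conv_filters and the toy two-layer CNN are hypothetical; the sketch assumes filters are ranked by L1 norm and that the following layer's input channels are sliced to match, producing a smaller dense model rather than a masked one.

```python
# Minimal sketch of structured filter pruning (assumption: L1-norm ranking).
# Filters with the lowest L1 norm are dropped and the next layer's input
# channels are sliced to match, so the result is a genuinely smaller dense model.
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, next_conv: nn.Conv2d, keep_ratio: float = 0.5):
    # Rank output filters of `conv` by the L1 norm of their weights.
    importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = torch.argsort(importance, descending=True)[:n_keep]

    # Build a smaller conv layer containing only the kept filters.
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()

    # The following layer must drop the matching input channels.
    next_pruned = nn.Conv2d(n_keep, next_conv.out_channels, next_conv.kernel_size,
                            stride=next_conv.stride, padding=next_conv.padding,
                            bias=next_conv.bias is not None)
    next_pruned.weight.data = next_conv.weight.data[:, keep].clone()
    if next_conv.bias is not None:
        next_pruned.bias.data = next_conv.bias.data.clone()
    return pruned, next_pruned

# Example: halve the filters of the first layer in a toy two-layer CNN.
conv1, conv2 = nn.Conv2d(3, 16, 3, padding=1), nn.Conv2d(16, 32, 3, padding=1)
conv1, conv2 = prune_conv_filters(conv1, conv2, keep_ratio=0.5)
x = torch.randn(1, 3, 32, 32)
print(conv2(conv1(x)).shape)  # torch.Size([1, 32, 32, 32])
```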
Marketing Relevance
Structured pruning is the most practically relevant pruning variant because standard hardware (GPUs, CPUs) benefits directly from a smaller dense model; no sparse-computation support is needed.
Example
LLM-Shearing (2023) selectively removes attention heads and FFN dimensions from Llama-2 7B, producing a 1.3B model that outperforms 1.3B models trained from scratch.
Common Pitfalls
Coarser granularity than unstructured pruning, so achievable compression may be lower. Deciding which structures can be removed is a harder optimization problem. Retraining or fine-tuning is usually required after pruning.
Origin & History
Li et al. (2016) introduced filter pruning for CNNs. For transformers, Michel et al. (2019) studied attention-head pruning and showed that many heads can be removed with little loss in performance. LLM-Shearing (2023) scaled structured pruning to LLMs.
Comparisons & Differences
Structured Pruning vs. Unstructured Pruning
Unstructured pruning removes individual weights (higher compression possible); structured pruning removes entire blocks (real speedups on standard hardware). See the sketch below for the difference in practice.
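To make the contrast concrete, here is a small sketch using PyTorch's torch.nn.utils.prune utilities. Both calls only zero out weights via masks; the structured variant zeroes whole filters, which can then be physically removed (as in the earlier sketch) to realize the speedup.

```python
# Sketch: unstructured pruning zeroes scattered individual weights, while
# structured pruning zeroes whole output filters (slices along dim 0).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

unstructured = nn.Conv2d(3, 16, 3)
prune.l1_unstructured(unstructured, name="weight", amount=0.5)          # 50% of weights -> 0

structured = nn.Conv2d(3, 16, 3)
prune.ln_structured(structured, name="weight", amount=0.5, n=1, dim=0)  # 50% of filters -> 0

# Both layers keep their original tensor shape; only the sparsity pattern differs.
# Structured sparsity maps directly to droppable filters, hence real speedups.
per_filter = structured.weight.abs().sum(dim=(1, 2, 3))
print((unstructured.weight == 0).float().mean())  # ~0.5, scattered zeros
print((per_filter == 0).sum())                    # 8 entirely zeroed filters
```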
Structured Pruning vs. Knowledge Distillation
Structured pruning trims an existing model; knowledge distillation trains a new, smaller model from scratch to mimic the original.