Gradient Clipping
Gradient clipping limits the norm or value of gradients during training to prevent exploding gradients – a standard technique for stable LLM and transformer training.
Explanation
Two variants exist: clipping by value, where each gradient component is clamped to a fixed range, and clipping by norm, where the global gradient norm is computed and, if it exceeds a threshold max_norm, all gradients are scaled by max_norm / norm so the resulting norm equals the threshold. Norm clipping is standard in LLM training, typically with max_norm = 1.0.
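A minimal sketch of one training step in PyTorch illustrating both variants; the toy linear model and synthetic batch are placeholders, while clip_grad_norm_ and clip_grad_value_ are PyTorch's built-in utilities for norm- and value-based clipping.

    import torch
    import torch.nn as nn

    # Toy model and synthetic batch (illustration only)
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)

    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

    # Clip by norm: if the global L2 norm of all gradients exceeds 1.0,
    # scale every gradient by max_norm / norm so the total norm equals 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Clip by value (alternative): clamp each gradient element to [-1.0, 1.0]
    # torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

    optimizer.step()
    optimizer.zero_grad()

Norm-based clipping rescales the magnitude but preserves the gradient direction, which is one reason it is generally preferred over value clipping for large models.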
Practical Relevance
Essential for stable training of RNNs, transformers, and LLMs – without gradient clipping, training often diverges.
Origin & History
Pascanu et al. (2013) formalized gradient clipping for RNNs. With the rise of transformers and LLMs, norm-based clipping (typically max_norm = 1.0) became a standard component of large training runs such as GPT and LLaMA.
Comparisons & Differences
Gradient Clipping vs. Vanishing Gradient
Gradient clipping addresses exploding gradients (gradients that grow too large); vanishing gradients (gradients that shrink toward zero) require other remedies, such as skip connections or normalization layers.