Sparse Training
Training with sparsity from the start: instead of "train dense, then prune," the model stays sparse from the beginning, with connections dynamically added and removed.
Because the network is sparse throughout training, FLOPs are saved during training itself, not just at inference.
Explanation
Methods like SET (Mocanu et al., 2018) and RigL (Evci et al., 2020) keep the overall sparsity level fixed during training but regularly swap connections: low-magnitude weights are dropped and new connections are grown – at random in SET, by largest gradient magnitude in RigL. Because the network is never dense, this saves FLOPs during training itself.
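A minimal sketch of one such drop-and-grow step, assuming PyTorch and a per-tensor binary mask; the function and parameter names (`rigl_step`, `swap_frac`) are illustrative, not the authors' implementation:

```python
import torch

def rigl_step(weight: torch.Tensor, mask: torch.Tensor, dense_grad: torch.Tensor,
              swap_frac: float = 0.3) -> torch.Tensor:
    """One drop-and-grow update in the spirit of RigL (illustrative sketch only).

    The number of active connections stays constant: the lowest-magnitude active
    weights are dropped, and the same number of currently inactive connections
    with the largest dense-gradient magnitude are regrown.
    """
    n_swap = int(swap_frac * mask.sum().item())

    # Drop: among currently active weights, pick the smallest magnitudes.
    drop_scores = torch.where(mask.bool(), weight.abs(),
                              torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(drop_scores.view(-1), n_swap, largest=False).indices
    mask.view(-1)[drop_idx] = 0.0

    # Grow: among currently inactive weights, pick the largest gradient magnitudes
    # (RigL periodically computes a dense gradient just for this selection).
    grow_scores = torch.where(mask.bool(), torch.full_like(dense_grad, -float("inf")),
                              dense_grad.abs())
    grow_idx = torch.topk(grow_scores.view(-1), n_swap, largest=True).indices
    mask.view(-1)[grow_idx] = 1.0
    weight.data.view(-1)[grow_idx] = 0.0   # regrown connections start at zero

    return mask
```

Between these updates, training proceeds as usual with the mask applied to the weights after every optimizer step; SET works the same way except that regrown connections are chosen at random rather than by gradient magnitude.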
Marketing Relevance
Sparse training promises efficiency not just at inference but also during training – at 90% sparsity this could in principle mean up to roughly 10x cheaper LLM pre-training, but only if the hardware can actually exploit the sparsity.
Example
In the RigL paper, ResNet-50 trained at 90% sparsity reaches roughly 75% top-1 accuracy on ImageNet – close to the dense baseline – while using about 5x fewer FLOPs during training.
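A rough back-of-envelope view of where the savings come from (a simplification that assumes FLOPs scale with the fraction of active weights, which is why the ideal factor is larger than the ~5x reported in practice):

```python
sparsity = 0.90                      # 90% of connections are inactive
active_fraction = 1.0 - sparsity     # 10% of connections carry FLOPs

ideal_speedup = 1.0 / active_fraction
print(f"ideal FLOP reduction at {sparsity:.0%} sparsity: {ideal_speedup:.0f}x")  # 10x
# In practice the reduction is smaller (~5x in the RigL example) because, e.g.,
# some layers or operations stay dense and sparsity is spread unevenly across layers.
```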
Common Pitfalls
Current GPUs are poorly optimized for unstructured sparsity, so theoretical FLOP savings rarely translate into wall-clock speedups (see the sketch below). Dynamic connection routing adds its own overhead, and for transformers/LLMs the approach is still early-stage research.
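A quick way to see the hardware pitfall: applying an unstructured 90% sparsity mask to a weight matrix does not make the GPU matmul faster, because the kernel is still dense. An illustrative benchmark, assuming PyTorch; results will vary by hardware:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)
w = torch.randn(4096, 4096, device=device)
mask = (torch.rand_like(w) > 0.9).float()   # keep ~10% of weights (90% sparsity)

def bench(fn, iters=50):
    for _ in range(5):                       # warm-up
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

dense_ms = bench(lambda: x @ w) * 1e3
masked_ms = bench(lambda: x @ (w * mask)) * 1e3   # mask applied, kernel still dense
print(f"dense matmul:  {dense_ms:.2f} ms")
print(f"masked matmul: {masked_ms:.2f} ms  (no speedup; the extra multiply adds overhead)")
```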
Origin & History
Mocanu et al. introduced SET (Sparse Evolutionary Training) in 2018. Evci et al. (Google, 2020) published RigL, which comes close to dense accuracy at 90% sparsity. NVIDIA is exploring hardware support for sparsity; the Ampere Sparse Tensor Cores accelerate 2:4 structured sparsity.
Comparisons & Differences
Sparse Training vs. Post-Training Pruning
Post-training pruning removes weights after dense training; Sparse Training keeps the model sparse from the start.
Sparse Training vs. Lottery Ticket Hypothesis
The Lottery Ticket Hypothesis finds sparse subnetworks through iterative prune-and-retrain cycles; Sparse Training discovers them dynamically during a single training run.