Progressive Shrinking
A training technique that progressively shrinks a large network – first kernel size, then depth, then width – so that a single supernet supports many subnetworks.
Progressive Shrinking gradually reduces a network along the kernel, depth, and width dimensions; it is the key technique that makes Once-for-All supernets trainable in a single run.
Explanation
Progressive Shrinking first trains the full model at maximum capacity, then progressively co-trains smaller variants in three phases: Phase 1 (Elastic Kernel) shrinks kernel sizes, Phase 2 (Elastic Depth) allows shallower subnetworks, and Phase 3 (Elastic Width) reduces channel counts. In each phase, sampled subnetworks are trained with knowledge distillation from the full model, which stabilizes training.
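The phase schedule above can be sketched in a few lines. This is a minimal, hypothetical simulation of the training loop structure – the names (`FULL`, `PHASES`, `sample_subnet_config`) are illustrative, not OFA's actual API, and the real implementation trains PyTorch modules rather than plain configs:

```python
import random

# Each phase unlocks smaller values along one more dimension.
FULL = {"kernel": 7, "depth": 4, "width": 6}
PHASES = [
    ("elastic_kernel", {"kernel": [7, 5, 3], "depth": [4],       "width": [6]}),
    ("elastic_depth",  {"kernel": [7, 5, 3], "depth": [4, 3, 2], "width": [6]}),
    ("elastic_width",  {"kernel": [7, 5, 3], "depth": [4, 3, 2], "width": [6, 4, 3]}),
]

def sample_subnet_config(space, rng):
    """Sample one subnetwork from the currently unlocked space."""
    return {dim: rng.choice(choices) for dim, choices in space.items()}

def train_progressive_shrinking(train_step, steps_per_phase=100, seed=0):
    rng = random.Random(seed)
    # Phase 0: train the full network, which later acts as the teacher.
    for _ in range(steps_per_phase):
        train_step(FULL, teacher=None)
    # Phases 1-3: co-train sampled subnets, distilling from the full model.
    for name, space in PHASES:
        for _ in range(steps_per_phase):
            cfg = sample_subnet_config(space, rng)
            train_step(cfg, teacher=FULL)  # teacher logits guide the subnet
```

Note how the sampling space only ever grows: earlier-unlocked dimensions keep being sampled in later phases, so large subnets are not forgotten while small ones are introduced.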
Marketing Relevance
Central technique behind Once-for-All networks – enables training supernets that dynamically adapt to hardware constraints.
Example
In OFA, an ImageNet model is progressively shrunk: smaller kernels (7→5→3) are trained first, then shallower depths, and finally reduced channel widths. The result: one trained model, many deployment options.
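The 7→5→3 kernel step works by reusing weights of the larger kernel: a smaller kernel is taken from the center of the larger one. The sketch below shows this center-crop idea with NumPy; in OFA the cropped weights additionally pass through a small learned transformation matrix per shrink step, which is omitted here:

```python
import numpy as np

def center_crop_kernel(weight, k):
    """Derive a k x k kernel from the center of a larger conv kernel.

    weight: array of shape (out_ch, in_ch, K, K) with K >= k.
    OFA also applies a learned kernel transformation after cropping,
    left out of this sketch for brevity.
    """
    K = weight.shape[-1]
    start = (K - k) // 2
    return weight[..., start:start + k, start:start + k]

# Usage: shrink a 7x7 kernel to 5x5, then to 3x3.
w7 = np.random.randn(8, 3, 7, 7)
w5 = center_crop_kernel(w7, 5)
w3 = center_crop_kernel(w5, 3)
```

Because the small kernels share the center weights of the large one, training them does not start from scratch – which is exactly what makes the elastic-kernel phase cheap.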
Common Pitfalls
The multi-phase training pipeline is complex; the order in which dimensions are shrunk matters; and each phase requires careful hyperparameter tuning.
Origin & History
Introduced by Cai et al. (2020) as the core method of the Once-for-All framework. Inspired by curriculum learning and gradual pruning (Zhu & Gupta, 2017).
Comparisons & Differences
Progressive Shrinking vs. One-Shot NAS
One-Shot NAS trains all subnetworks simultaneously from the start; Progressive Shrinking introduces smaller subnetworks gradually, which makes supernet training more stable.
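The contrast comes down to how subnetworks are sampled during training. A hedged sketch of the two sampling strategies (the function names and the `phase` parameter are illustrative, not taken from any library):

```python
import random

KERNELS, DEPTHS, WIDTHS = [7, 5, 3], [4, 3, 2], [6, 4, 3]

def one_shot_sample(rng):
    # One-Shot NAS: the entire search space is active from step one.
    return (rng.choice(KERNELS), rng.choice(DEPTHS), rng.choice(WIDTHS))

def progressive_sample(phase, rng):
    # Progressive Shrinking: each dimension unlocks one phase at a time;
    # locked dimensions stay at their largest (full-network) value.
    k = rng.choice(KERNELS) if phase >= 1 else KERNELS[0]
    d = rng.choice(DEPTHS)  if phase >= 2 else DEPTHS[0]
    w = rng.choice(WIDTHS)  if phase >= 3 else WIDTHS[0]
    return (k, d, w)
```

Early in progressive training the weight updates target a narrow set of large subnets, so small subnets never interfere with the not-yet-converged full model – the source of the stability advantage.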