Online Distillation
A distillation variant where multiple models train simultaneously and serve as teachers to each other – no pre-trained teacher needed.
Explanation
Deep Mutual Learning (Zhang et al., 2018): two or more networks train in parallel, each learning from the ground-truth labels and from the soft predictions of the others. No network needs to be pre-trained, and all of them improve together (see the sketch below).
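A minimal training-loop sketch of this idea, assuming PyTorch; the tiny peer networks, dummy data, and loss weighting here are illustrative assumptions, not details from the paper:

```python
# Deep Mutual Learning sketch: two peers train in parallel, each minimizing
# cross-entropy on the labels plus a KL term toward the other peer's soft outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

def dml_losses(logits_a, logits_b, targets):
    """Per-peer loss: hard-label cross-entropy + KL toward the other peer."""
    ce_a = F.cross_entropy(logits_a, targets)
    ce_b = F.cross_entropy(logits_b, targets)
    # Peer A mimics peer B's (detached) soft predictions, and vice versa.
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=1),
                    F.softmax(logits_b.detach(), dim=1), reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=1),
                    F.softmax(logits_a.detach(), dim=1), reduction="batchmean")
    return ce_a + kl_a, ce_b + kl_b

# Toy peers and data; in the paper these would be e.g. two ResNet-32 classifiers.
peer_a = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
peer_b = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt_a = torch.optim.SGD(peer_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(peer_b.parameters(), lr=0.1)

x = torch.randn(16, 32)            # dummy batch of 16 feature vectors
y = torch.randint(0, 10, (16,))    # dummy class labels

for step in range(100):
    loss_a, loss_b = dml_losses(peer_a(x), peer_b(x), y)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
```

Note that neither network is ever frozen: each update uses the other peer's current predictions as a moving teaching signal.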
Marketing Relevance
Online distillation eliminates the need for large pre-trained teacher models – ideal for scenarios where no strong teacher model exists.
Example
Two ResNet-32 models trained in parallel with mutual learning each outperform an individually trained ResNet-32 – the exchange of soft predictions benefits both networks.
Common Pitfalls
Higher training compute (N models train in parallel). Convergence can be unstable. Works best with 2-4 models; beyond that, the returns diminish.
Origin & History
Zhang et al. (2018) introduced deep mutual learning. Anil et al. (Google, 2018) showed co-distillation for distributed training. The approach was further developed for federated learning and privacy-preserving scenarios.
Comparisons & Differences
Online Distillation vs. Knowledge Distillation
Standard KD: a single pre-trained teacher transfers its knowledge to one student in a separate, later training stage. Online distillation: all models train from scratch and teach each other simultaneously.
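One way to make the contrast concrete is to compare the loss each student minimizes. The following is a sketch in standard notation; the temperature τ, weighting α, and peer count K are assumptions for illustration, not details from this entry:

```latex
% Standard knowledge distillation: a fixed, pre-trained teacher T guides student S.
\mathcal{L}_{\mathrm{KD}} = (1-\alpha)\,\mathrm{CE}(y, p_S)
  + \alpha\,\tau^{2}\,\mathrm{KL}\!\left(p_T^{\tau}\,\|\,p_S^{\tau}\right)

% Online (mutual) distillation: each peer k is pulled toward every other peer's predictions.
\mathcal{L}_{k} = \mathrm{CE}(y, p_k)
  + \frac{1}{K-1}\sum_{l \neq k} \mathrm{KL}\!\left(p_l\,\|\,p_k\right)
```

In the first case the teacher term is fixed throughout training; in the second, every term changes as the peers learn.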