Self-Distillation
A variant of knowledge distillation in which a model acts as its own teacher: a model of the same architecture (or the identical model) provides the training targets for a new training run.
Self-distillation uses a model as its own teacher. It improves quality without requiring a larger teacher model and is the basis for DINO and modern vision foundation models.
Explanation
Born-Again Networks (Furlanello et al., 2018) showed that a student with the same architecture as its teacher can surpass that teacher. DINO (Caron et al., 2021) uses self-distillation with a momentum teacher, an exponential moving average of the student, for self-supervised vision learning.
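A minimal PyTorch sketch of born-again style self-distillation, assuming a previously trained teacher with the same architecture. The backbone (resnet18), the temperature, and the dummy batch are illustrative choices, not details from the paper:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Born-again style self-distillation (minimal sketch).
# The teacher stands in for a previously trained model; the student has the
# *same* architecture and is trained from scratch on the teacher's soft targets.
teacher = resnet18(num_classes=10).eval()   # placeholder for a trained model
student = resnet18(num_classes=10)          # identical architecture, fresh weights
optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)

T = 4.0  # softening temperature (illustrative value)

# Dummy batch; in practice this comes from the original training data.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))

with torch.no_grad():
    teacher_logits = teacher(images)
student_logits = student(images)

# Soft-target loss: KL divergence between temperature-softened distributions.
soft_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)

# Hard-label loss on the original targets, as in standard distillation.
hard_loss = F.cross_entropy(student_logits, labels)

loss = soft_loss + hard_loss
loss.backward()
optimizer.step()
```

The (T * T) scaling is a common convention in distillation implementations: it keeps the gradient magnitude of the soft-target loss comparable across different temperatures.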
Marketing Relevance
Self-distillation improves a model without requiring a larger teacher, which makes it ideal when no stronger model is available. It is the basis for DINO, DINOv2, and modern vision foundation models.
Example
DINO trains a Vision Transformer with self-distillation: the student sees small local crops of an image, while the teacher (an exponential moving average of the student) sees larger global crops. The result is state-of-the-art visual features learned without any labels.
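A compact sketch of a DINO-style update, assuming student and teacher share one architecture. A ResNet backbone stands in for DINO's Vision Transformer, and the crop sizes, temperatures, and centering update are simplified, illustrative values rather than the paper's exact recipe:

```python
import copy
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Student and momentum teacher start as identical networks.
student = resnet18(num_classes=256)        # 256-dim output head, illustrative
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                # teacher is never trained by gradients

optimizer = torch.optim.AdamW(student.parameters(), lr=5e-4)

t_student, t_teacher = 0.1, 0.04           # sharpening temperatures (illustrative)
momentum = 0.996                           # EMA momentum for the teacher
center = torch.zeros(256)                  # running center of teacher outputs

# Dummy views; in DINO the teacher sees large "global" crops and the student
# additionally sees small "local" crops of the same image.
global_views = torch.randn(8, 3, 224, 224)
local_views = torch.randn(8, 3, 96, 96)

with torch.no_grad():
    teacher_out = teacher(global_views)
    teacher_probs = F.softmax((teacher_out - center) / t_teacher, dim=-1)

student_out = student(local_views)
student_logprobs = F.log_softmax(student_out / t_student, dim=-1)

# Cross-entropy between teacher and student distributions (no labels involved).
loss = -(teacher_probs * student_logprobs).sum(dim=-1).mean()
loss.backward()
optimizer.step()

# EMA update: the teacher slowly follows the student.
with torch.no_grad():
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
    center = 0.9 * center + 0.1 * teacher_out.mean(dim=0)
```

Because the teacher only follows the student through the EMA update, gradients flow exclusively through the student; centering and sharpening of the teacher output are what keep this label-free setup from collapsing to a trivial solution.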
Common Pitfalls
Gains are typically smaller than in teacher-student distillation with a larger teacher. The student can overfit to its own teacher's mistakes. The momentum hyperparameter is critical for training stability.
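As an illustration of that momentum sensitivity: DINO-style training typically ramps the teacher momentum toward 1.0 over the course of training with a cosine schedule. The function below is a hedged sketch of such a schedule with illustrative endpoint values:

```python
import math

def teacher_momentum(step: int, total_steps: int,
                     base: float = 0.996, final: float = 1.0) -> float:
    """Cosine schedule for the EMA momentum, rising from `base` to `final`.

    Too low a momentum makes the teacher noisy; too high makes it lag far
    behind the student, so the schedule endpoints are a key stability knob.
    """
    progress = step / max(1, total_steps)
    return final - (final - base) * (math.cos(math.pi * progress) + 1) / 2

# Momentum at the start, middle, and end of a 10,000-step run.
print(teacher_momentum(0, 10_000))       # ~0.996
print(teacher_momentum(5_000, 10_000))   # ~0.998
print(teacher_momentum(10_000, 10_000))  # ~1.0
```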
Origin & History
Furlanello et al. (2018) showed with "Born-Again Networks" that a self-distilled student can surpass its teacher. Caron et al. (2021) revolutionized self-supervised learning with DINO. DINOv2 (2023) scaled the approach into one of the strongest vision foundation models.
Comparisons & Differences
Self-Distillation vs. Knowledge Distillation
Standard distillation uses a larger teacher model; self-distillation uses an equally sized or identical model as teacher.