
    Self-Distillation

    Also known as:
    Born-Again Networks
    Self-Training Distillation
    Internal Distillation
    Updated: 2/11/2026

    A variant of knowledge distillation in which a model serves as its own teacher: the same model, or an identically sized copy, provides the training targets for a new training run.

    Quick Summary

    Self-distillation uses a model as its own teacher. It improves quality without requiring a larger teacher model and is the basis for DINO and modern vision foundation models.

    Explanation

    Born-Again Networks (Furlanello et al., 2018) showed that a student with the same architecture as its teacher can surpass that teacher. DINO (Caron et al., 2021) applies self-distillation with a momentum teacher to self-supervised vision learning.
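    The core training signal is the same as in standard knowledge distillation; only the source of the teacher logits changes. A minimal sketch of the temperature-softened distillation loss, assuming the teacher logits come from a previous generation (or an EMA copy) of the same architecture:

    ```python
    import numpy as np

    def softmax(z, T=1.0):
        """Temperature-scaled softmax over the last axis."""
        z = np.asarray(z, dtype=float) / T
        z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def distillation_loss(student_logits, teacher_logits, T=2.0):
        """KL(teacher || student) on temperature-softened distributions.

        In self-distillation, teacher_logits come from the *same*
        architecture (an earlier generation or an EMA copy), not
        from a larger model.
        """
        p_t = softmax(teacher_logits, T)          # soft targets from the teacher
        log_p_s = np.log(softmax(student_logits, T))
        # T^2 scaling keeps gradient magnitudes comparable to the hard-label loss
        kl = np.sum(p_t * (np.log(p_t) - log_p_s), axis=-1)
        return float(T * T * np.mean(kl))
    ```

    When student and teacher logits are identical the loss is zero; training minimizes this term (often alongside a standard cross-entropy loss on the labels, where labels exist).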

    Marketing Relevance

    Self-distillation improves models without requiring a larger teacher model, making it ideal when no stronger model is available. It is the basis for DINO, DINOv2, and modern vision foundation models.

    Example

    DINO trains a Vision Transformer with self-distillation: the student sees small local crops of an image, while the teacher (an exponential moving average of the student) sees larger global crops of the same image. The result: state-of-the-art features learned without labels.
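    The momentum teacher mentioned above is an exponential moving average (EMA) of the student's weights. A minimal sketch of that update, with an illustrative momentum value rather than DINO's exact schedule:

    ```python
    import numpy as np

    def ema_update(teacher_params, student_params, momentum=0.996):
        """DINO-style momentum teacher: teacher <- m * teacher + (1 - m) * student.

        teacher_params and student_params are parallel lists of weight
        arrays; the teacher receives no gradients and is updated only
        through this moving average after each student step.
        """
        return [momentum * t + (1.0 - momentum) * s
                for t, s in zip(teacher_params, student_params)]
    ```

    Because the teacher lags behind the student, it provides a smoothed, more stable target; the momentum value controls how slowly it tracks the student.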

    Common Pitfalls

    Improvements are typically smaller than in teacher-student distillation with a larger teacher. The model can overfit to its own mistakes. Momentum hyperparameters are critical for training stability.

    Origin & History

    Furlanello et al. (2018) showed with "Born-Again Networks" that self-distillation can surpass the teacher. Caron et al. (2021) revolutionized self-supervised learning with DINO. DINOv2 (2023) scaled the approach to one of the best vision foundation models.

    Comparisons & Differences

    Self-Distillation vs. Knowledge Distillation

    Standard distillation uses a larger teacher model; self-distillation uses an equally sized or identical model as teacher.
