
    Weight Sharing

    Also known as:
    Parameter Sharing
    Shared Weights
    Tied Weights
    Updated: 2/9/2026

    A technique where multiple parts of a neural network use the same weights – significantly reducing parameter count and memory usage.

    Quick Summary

Weight Sharing lets multiple parts of a network use the same weights; ALBERT, for example, reaches BERT-level quality with up to 18x fewer parameters.

    Explanation

Weight sharing is fundamental in CNNs, where each filter's weights are reused at every spatial position, and common in transformers, where the input embedding and output projection often share one weight matrix (tied embeddings). ALBERT goes further and shares a single set of weights across all transformer layers, yielding up to 18x smaller models.
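
A minimal PyTorch sketch of both transformer-style patterns; the class name, dimensions, and layer choices are illustrative assumptions, not ALBERT's actual implementation:

```python
import torch
import torch.nn as nn

class TiedLanguageModel(nn.Module):
    """Toy model showing two common forms of weight sharing:
    tied input/output embeddings, and one transformer layer
    reused across the whole depth of the stack (ALBERT-style)."""

    def __init__(self, vocab_size=30000, d_model=512, n_layers=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # A single layer object applied n_layers times: its weights
        # are shared across every level of the network.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        self.n_layers = n_layers
        self.output = nn.Linear(d_model, vocab_size, bias=False)
        # Tied embeddings: the output projection reuses the embedding
        # matrix instead of allocating a second vocab_size x d_model tensor.
        self.output.weight = self.embedding.weight

    def forward(self, token_ids):
        x = self.embedding(token_ids)
        for _ in range(self.n_layers):  # same weights on every pass
            x = self.shared_layer(x)
        return self.output(x)

model = TiedLanguageModel()
logits = model(torch.randint(0, 30000, (2, 16)))  # (batch=2, seq=16, vocab)
```

Because the output projection and the embedding are literally the same tensor, gradients from both uses accumulate into one parameter, which is part of why sharing acts as a regularizer as well as a memory saving.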

    Marketing Relevance

Weight sharing enables more compact models and reduces overfitting risk, since fewer unique parameters must be fit to the data. ALBERT showed that cross-layer sharing can reach BERT-level quality with up to 18x fewer parameters.

    Example

ALBERT shares one set of weights across all 12 transformer layers: ALBERT-base has 12M parameters instead of BERT-base's 110M, with comparable quality (the 18x figure compares the large variants, 18M vs. 334M parameters).
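
A rough way to see where the layer savings come from, assuming hypothetical dimensions (this ignores ALBERT's factorized embedding trick, which contributes part of the overall reduction):

```python
import torch.nn as nn

def unique_params(module):
    # .parameters() yields each tensor once, so shared
    # weights are counted a single time.
    return sum(p.numel() for p in module.parameters())

d_model, n_layers = 768, 12
def make_layer():
    return nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True)

shared_stack = nn.ModuleList([make_layer()])  # one layer, applied 12 times
separate_stack = nn.ModuleList([make_layer() for _ in range(n_layers)])

print(unique_params(shared_stack))    # one layer's worth of weights
print(unique_params(separate_stack))  # 12x as many unique parameters
```

Both stacks do the same amount of computation at inference time; only the number of unique tensors that must be stored and trained differs.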

    Common Pitfalls

Overly aggressive weight sharing limits model capacity. Not all architectures benefit equally, and sharing can destabilize training.

    Origin & History

    Weight sharing in CNNs was used by LeCun for LeNet in 1989. In the transformer context, Press & Wolf (2017) popularized tied embeddings. ALBERT (Google, 2019) demonstrated cross-layer sharing.

    Comparisons & Differences

    Weight Sharing vs. Pruning

    Pruning removes weights; Weight Sharing reduces the number of unique weights through reuse.
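
A tiny PyTorch contrast (purely illustrative; real pruning and sharing schemes are more involved):

```python
import torch
import torch.nn as nn

# Pruning: weights are removed (here, zeroed out). The tensor keeps
# its shape; the zeros carry no information and can be compressed
# or skipped by sparse kernels.
pruned = nn.Linear(256, 256)
with torch.no_grad():
    mask = torch.rand_like(pruned.weight) > 0.5  # drop ~50% of weights
    pruned.weight.mul_(mask)

# Weight sharing: two modules point at one tensor, so the number of
# unique trainable weights is halved while both layers stay dense.
a = nn.Linear(256, 256)
b = nn.Linear(256, 256)
b.weight = a.weight  # b reuses a's weight matrix
```

After pruning, the layer still stores a full 256x256 tensor, now mostly zeros; after sharing, `a` and `b` hold one dense tensor between them.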

    Weight Sharing vs. Knowledge Distillation

Distillation trains a new, smaller model to imitate a larger one; Weight Sharing makes the existing model more compact through weight reuse.
