Weight Sharing
A technique where multiple parts of a neural network use the same weights – significantly reducing parameter count and memory usage.
Weight sharing lets multiple parts of a network reuse one set of weights; ALBERT used it to reach near-BERT quality with up to 18x fewer parameters.
Explanation
Weight sharing is fundamental in CNNs, where each filter's weights are reused at every position of the image, and in transformers, where the input embedding matrix is often tied to the output projection. ALBERT goes further and shares one set of weights across all transformer layers, shrinking models by up to 18x.
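As a concrete illustration, here is a minimal PyTorch sketch of embedding/output weight tying; the class name and sizes are invented for the example, not taken from any real model:

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy language-model head with tied input/output embeddings."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie the weights: the output projection reuses the embedding
        # matrix, so both refer to the same Parameter and train together.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embed(token_ids)   # (batch, seq, d_model)
        return self.lm_head(hidden)      # (batch, seq, vocab_size)

model = TiedLM(vocab_size=30000, d_model=128)
# parameters() deduplicates shared tensors: one 30000 x 128 matrix
# is stored even though it is used in two places.
print(sum(p.numel() for p in model.parameters()))  # 3840000
```

This works because an nn.Embedding of shape (vocab_size, d_model) and an nn.Linear mapping d_model to vocab_size store their weight matrices in the same shape, so one tensor can serve both roles.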
Marketing Relevance
Weight sharing enables more compact models with lower overfitting risk. ALBERT demonstrated that cross-layer sharing can reach near-BERT quality with up to 18x fewer parameters.
Example
ALBERT-base shares one set of weights across all 12 transformer layers: roughly 12M parameters instead of BERT-base's 110M (about 9x fewer) with comparable quality. The often-cited 18x figure comes from the large configurations (ALBERT-large's 18M vs. BERT-large's 334M).
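A hedged sketch of what ALBERT-style cross-layer sharing looks like in PyTorch: one encoder layer is instantiated once and applied at every depth, so the encoder's parameter count no longer grows with the number of layers (the dimensions below are illustrative, not ALBERT's real configuration):

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Applies one encoder layer num_layers times (ALBERT-style)."""
    def __init__(self, d_model: int, nhead: int, num_layers: int):
        super().__init__()
        # One layer object; calling it repeatedly reuses its weights.
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_layers):  # same weights at every depth
            x = self.layer(x)
        return x

shared = SharedLayerEncoder(d_model=64, nhead=4, num_layers=12)
# nn.TransformerEncoder deep-copies its layer, so nothing is shared.
unshared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=12,
)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(shared), count(unshared))  # unshared is ~12x larger
```

Note that sharing reduces parameters but not compute: the shared layer is still executed 12 times per forward pass.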
Common Pitfalls
Overly aggressive weight sharing limits model capacity, not all architectures benefit equally, and sharing can destabilize training.
Origin & History
Weight sharing in CNNs dates back to LeCun's LeNet in 1989. In the transformer context, Press & Wolf (2017) popularized tied input/output embeddings, and ALBERT (Google, 2019) demonstrated cross-layer sharing at scale.
Comparisons & Differences
Weight Sharing vs. Pruning
Pruning removes weights from a trained network; weight sharing reduces the number of unique weights by reusing the same parameters in multiple places (see the sketch after these comparisons).
Weight Sharing vs. Knowledge Distillation
Distillation trains a new, smaller student model to mimic a larger teacher; weight sharing makes the existing model more compact through weight reuse.
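To make the pruning contrast concrete, a small sketch (using PyTorch's torch.nn.utils.prune; the layer sizes are arbitrary) shows that pruning zeroes entries within a weight tensor, while sharing collapses two layers onto one unique tensor:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Pruning: the weight tensor keeps its shape, but entries are zeroed.
pruned = nn.Linear(8, 8)
prune.l1_unstructured(pruned, name="weight", amount=0.5)
print(float((pruned.weight == 0).float().mean()))  # 0.5

# Weight sharing: two layers, one unique weight tensor.
a = nn.Linear(8, 8, bias=False)
b = nn.Linear(8, 8, bias=False)
b.weight = a.weight
unique = {id(p) for m in (a, b) for p in m.parameters()}
print(len(unique))  # 1
```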