Residual Connection
Residual connections add a layer's input to its output, allowing gradients to flow directly through deep networks.
Residual connections add input to output (y = f(x) + x) – the trick that makes training deep networks from ResNet to GPT possible.
Explanation
Formula: y = f(x) + x, where f(x) is the layer's transformation (e.g., attention or a feed-forward block). The addition creates a gradient "shortcut": during backpropagation the gradient flows through the identity path unchanged, so it does not have to pass through every weight matrix. Without residual connections, deep networks (50+ layers) suffer from vanishing gradients. In Transformers, a residual connection follows every attention and feed-forward (FFN) sub-layer, combined with layer normalization ("Add & Norm").
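As an illustration, here is a minimal sketch of the "Add & Norm" pattern in PyTorch (the framework and the class name AddAndNorm are assumptions for this example, not code from the Transformer paper):

```python
import torch
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Post-LN residual wrapper: LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        # The "+ x" is the residual connection: gradients flow through it unchanged.
        return self.norm(x + sublayer(x))

# Usage: wrap a feed-forward block the way a Transformer layer would.
d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
block = AddAndNorm(d_model)
x = torch.randn(8, 16, d_model)   # (batch, sequence, features)
y = block(x, ffn)                 # same shape as x: (8, 16, 512)
```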
Marketing Relevance
Without residual connections, neither deep CNNs (ResNet) nor Transformers with 100+ layers would be trainable.
Common Pitfalls
Input and output dimensions must match, otherwise a projection (e.g., a linear layer or 1x1 convolution) is needed on the skip path. The interplay with normalization is critical: Pre-LN (normalize before the sub-layer) trains more stably in deep Transformers than the original Post-LN (normalize after the addition). Plain addition can also limit feature reuse compared to concatenation. A sketch of the first two points follows below.
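A hedged sketch of both pitfalls (PyTorch assumed; ResidualBlock is an illustrative name): a projection on the skip path when dimensions differ, and a switch between Pre-LN and Post-LN placement:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block with an optional projection on the skip path and a Pre-LN/Post-LN switch."""
    def __init__(self, d_in: int, d_out: int, pre_ln: bool = True):
        super().__init__()
        self.pre_ln = pre_ln
        self.norm = nn.LayerNorm(d_in if pre_ln else d_out)
        self.layer = nn.Linear(d_in, d_out)
        # Projection so x and the layer output can be added when dimensions differ.
        self.proj = nn.Identity() if d_in == d_out else nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pre_ln:
            # Pre-LN: normalize inside the residual branch; the skip path stays untouched.
            return self.proj(x) + self.layer(self.norm(x))
        # Post-LN: normalize after the addition (original Transformer style).
        return self.norm(self.proj(x) + self.layer(x))

x = torch.randn(4, 256)
print(ResidualBlock(256, 256)(x).shape)   # torch.Size([4, 256])
print(ResidualBlock(256, 512)(x).shape)   # torch.Size([4, 512]) -- skip path projected
```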
Origin & History
He et al. (Microsoft Research, 2015) introduced residual connections in ResNet and won the ImageNet (ILSVRC 2015) competition with them. The Transformer paper (2017) adopted the concept as "Add & Norm" after each sub-layer. Today it is standard in virtually every deep learning architecture.
Comparisons & Differences
Residual Connection vs. Dense Connections (DenseNet)
Residual adds input once; DenseNet concatenates outputs from all previous layers – more feature reuse but significantly more memory.
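A small sketch (PyTorch assumed, not prescribed by the source) of the structural difference: addition keeps the feature width constant, while concatenation grows it with every layer:

```python
import torch

x = torch.randn(1, 64)               # input features
f_x = torch.randn(1, 64)             # stand-in for a layer's output

residual = x + f_x                   # ResNet-style: shape stays (1, 64)
dense = torch.cat([x, f_x], dim=1)   # DenseNet-style: shape grows to (1, 128)

print(residual.shape, dense.shape)   # torch.Size([1, 64]) torch.Size([1, 128])
```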