Cross-Attention
Cross-attention computes attention between two different sequences, e.g. between text conditioning and image generation in diffusion models. It is the mechanism linking text prompts with image generation and a key building block of multimodal AI.
Explanation
Queries come from one sequence, keys and values from another. In encoder-decoder models, the decoder attends to the encoder output; in Stable Diffusion, the image latents (queries) attend to the text embeddings (keys/values). This is the difference from self-attention, where Q, K, and V all come from the same sequence.
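A minimal sketch of this setup using PyTorch's torch.nn.MultiheadAttention; the tensor names and shapes are illustrative, loosely following the Stable Diffusion pattern of image latents attending to text embeddings.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Illustrative shapes: a batch of 2 "images" flattened to 64 latent tokens,
# conditioned on 77 text-embedding tokens (e.g. a prompt encoding).
image_latents = torch.randn(2, 64, embed_dim)    # queries
text_embeddings = torch.randn(2, 77, embed_dim)  # keys and values

# Cross-attention: Q from the image latents, K/V from the text embeddings.
out, weights = attn(query=image_latents, key=text_embeddings, value=text_embeddings)

print(out.shape)      # (2, 64, 512) -- one updated vector per latent token
print(weights.shape)  # (2, 64, 77)  -- attention from each latent to each text token
```

In real models the conditioning sequence often has a different embedding size than the queries; nn.MultiheadAttention supports this via its kdim and vdim arguments, which project keys and values into the query dimension.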
Marketing Relevance
Key mechanism for multimodal AI: it connects text with images, audio with text, and instructions with code.
Origin & History
Cross-attention was already part of the original Transformer (Vaswani et al., 2017) as encoder-decoder attention. Stable Diffusion (2022) used cross-attention for text-to-image conditioning and made the concept central to generative AI. ControlNet and IP-Adapter build on cross-attention.
Comparisons & Differences
Cross-Attention vs. Self-Attention
Self-attention: Q, K, V from same sequence (internal context); cross-attention: Q from one sequence, K/V from another (external information).
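The contrast shows up directly in code: the attention module is the same, only the tensors feeding keys and values change. A short sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 64, 512)    # the sequence being updated
ctx = torch.randn(2, 77, 512)  # an external conditioning sequence

self_out, _ = attn(query=x, key=x, value=x)        # self-attention: Q, K, V from the same sequence
cross_out, _ = attn(query=x, key=ctx, value=ctx)   # cross-attention: K/V from the other sequence
```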