Causal Masking
Causal masking prevents each token from attending to future positions: a lower-triangular mask over the attention scores is what enables autoregressive text generation in decoder-only models such as GPT and LLaMA.
Explanation
A lower-triangular matrix masks the attention scores so that position t can only see positions 1...t. Without causal masking, the model could "cheat" during training and read the answer from future tokens. It is active in all decoder-only (GPT-like) models.
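A minimal sketch of how this works in practice, assuming single-head attention without batch dimensions (the real implementations in GPT or LLaMA add batching, multiple heads, and key-value caching): future positions are set to negative infinity before the softmax, so they receive zero attention weight.

import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Scaled dot-product attention with a causal (lower-triangular) mask.

    q, k, v: tensors of shape (seq_len, d_head).
    """
    seq_len, d_head = q.shape
    # Raw attention scores, shape (seq_len, seq_len).
    scores = q @ k.T / d_head ** 0.5
    # Lower-triangular mask: position t may attend to positions 0..t only.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Masked (future) positions get -inf, so softmax assigns them zero weight.
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Example: 4 tokens, head dimension 8
q, k, v = (torch.randn(4, 8) for _ in range(3))
out = causal_self_attention(q, k, v)  # shape (4, 8)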
Marketing Relevance
A fundamental concept behind every generative LLM: without causal masking, autoregressive text generation would be impossible.
Origin & History
Masked self-attention was introduced for the decoder in the original Transformer (Vaswani et al., 2017). GPT-1 (2018) used causal masking exclusively (decoder-only architecture). BERT, in contrast, uses bidirectional attention without a causal mask.
Comparisons & Differences
Causal Masking vs. Bidirectional Attention (BERT)
Causal masking: only previous tokens are visible (generation); bidirectional attention: all tokens are visible (understanding, but no generation). The sketch below contrasts the two masks.
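A small illustration of the difference, assuming a 4-token sequence: the causal (decoder) mask is lower-triangular, while the bidirectional (encoder, BERT-style) mask allows every token to see every other token.

import torch

seq_len = 4

# Causal mask (GPT-style decoder): token t sees tokens 0..t.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional mask (BERT-style encoder): every token sees every token.
bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])

print(bidirectional.int())
# tensor([[1, 1, 1, 1],
#         [1, 1, 1, 1],
#         [1, 1, 1, 1],
#         [1, 1, 1, 1]])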