
    Causal Masking

    Also known as:
    Causal Attention
    Autoregressive Masking
    Causal Attention Mask
    Triangle Mask
    Updated: 2/9/2026

    Causal masking prevents tokens from attending to future positions – the technique enabling autoregressive generation in decoders like GPT.

    Quick Summary

Causal masking blocks attention to future tokens via a lower-triangular mask – the mechanism enabling autoregressive text generation in GPT, LLaMA, and other decoder-only models.

    Explanation

A lower-triangular matrix masks the attention scores: position t can attend only to positions 1...t. Without causal masking, the model could "cheat" during training by reading the answer from future tokens. It is active in all GPT-like (decoder-only) models.
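The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of any particular model: masked (future) positions are set to negative infinity before the softmax, so they receive zero attention weight.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores and softmax row-wise.

    scores: (seq_len, seq_len) array of query-key scores.
    """
    seq_len = scores.shape[0]
    # Lower-triangular mask: position t may attend to positions 0..t only.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Future positions get -inf, so exp(-inf) = 0 after the softmax.
    masked = np.where(mask, scores, -np.inf)
    # Numerically stable row-wise softmax.
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# With uniform (zero) scores, each token attends equally to itself
# and all earlier tokens, and never to later ones.
weights = causal_attention_weights(np.zeros((4, 4)))
print(np.round(weights, 2))
```

For a 4-token sequence with uniform scores, row t of the output spreads its weight evenly over positions 0..t, and every entry above the diagonal is exactly zero.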

    Marketing Relevance

    Fundamental concept behind every LLM: without causal masking, autoregressive text generation would be impossible.

    Origin & History

Masked self-attention was introduced in the decoder of the original Transformer (Vaswani et al., 2017). GPT-1 (2018) used causal masking exclusively (decoder-only architecture). BERT, in contrast, uses bidirectional attention without a causal mask.

    Comparisons & Differences

Causal Masking vs. Bidirectional Attention (BERT)

Causal masking: only previous tokens are visible (suited to generation); bidirectional attention: all tokens are visible (suited to understanding, but not to generation).
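The structural difference between the two attention patterns can be made concrete by printing the masks side by side. A sketch, assuming a 1 marks a visible position:

```python
import numpy as np

seq_len = 4

# Decoder-style (causal): lower-triangular — token t sees tokens 0..t.
causal = np.tril(np.ones((seq_len, seq_len), dtype=int))

# Encoder-style (bidirectional, as in BERT): every token sees every token.
bidirectional = np.ones((seq_len, seq_len), dtype=int)

print("causal:\n", causal)
print("bidirectional:\n", bidirectional)
```

The zeros above the diagonal of the causal mask are exactly the future positions blocked during generation; the bidirectional mask has no such zeros, which is why BERT-style models can read whole sentences but cannot generate them left to right.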

