Pre-LN vs. Post-LN
Refers to the placement of layer normalization within a Transformer block: Pre-LN normalizes before the attention/FFN sublayers, Post-LN after them.
Pre-LN normalizes before attention (more stable, simpler to train); Post-LN normalizes after (potentially better final quality). This single architecture decision can stabilize or crash LLM training.
Explanation
Post-LN (original Transformer): x → Attention → Add(x) → LN. Pre-LN (GPT-2 and later): x → LN → Attention → Add(x). Because Pre-LN leaves an unnormalized identity path through the residual stream, gradients flow directly to early layers and training is stable even without learning-rate warmup; Post-LN often converges to slightly better final quality, but only with careful warmup and tuning. Almost all modern LLMs use Pre-LN, typically combined with RMSNorm. Both variants are sketched below.
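A minimal PyTorch sketch of the two block variants. The module names, dimensions, and the GELU/4x FFN are illustrative assumptions, not taken from any specific model:

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Original Transformer (2017): sublayer first, then Add & Norm."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention -> Add -> LN, then FFN -> Add -> LN: the norm sits ON the residual path.
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.ln2(x + self.ffn(x))
        return x


class PreLNBlock(nn.Module):
    """GPT-2 style: normalize first; the residual path stays an untouched identity."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # LN -> Attention -> Add, then LN -> FFN -> Add: gradients skip the norms via the residual.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))
        return x
```

The only difference between the two classes is where the LayerNorm calls sit in `forward`; the parameters are identical.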
Marketing Relevance
The choice between Pre-LN and Post-LN fundamentally affects training stability, the required learning-rate schedule, and final model quality.
Common Pitfalls
Pre-LN can lead to representation collapse in very deep models, where later layers contribute little beyond the residual stream. Post-LN requires learning-rate warmup to avoid divergence early in training (a minimal warmup sketch follows). Switching the normalization placement on an existing checkpoint, or mid-training, can destabilize training.
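A minimal sketch of the linear warmup commonly paired with Post-LN training. The model, base learning rate, and step count are hypothetical placeholders (the original Transformer paper used 4000 warmup steps, with an inverse-square-root decay afterwards that is omitted here):

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for a Post-LN Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 4000  # illustrative value

def warmup_lambda(step: int) -> float:
    # Scale the base LR linearly from ~0 up to 1.0 over warmup_steps, then hold.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)
# Per training step: optimizer.step(); scheduler.step()
```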
Origin & History
The original Transformer (Vaswani et al., 2017) used Post-LN. Xiong et al. (2020) showed that Pre-LN trains more stably. GPT-2 (OpenAI, 2019) was one of the first large models to use Pre-LN. Today, LLaMA, Mistral, and Gemma use Pre-RMSNorm.
Comparisons & Differences
Pre-LN vs. Post-LN vs. RMSNorm
Pre-LN/Post-LN describes where normalization sits in the block; RMSNorm simplifies the normalization itself, rescaling by the root mean square alone instead of standardizing by mean and variance. The two decisions are orthogonal: either placement can be combined with either norm, as the sketch below illustrates.
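A side-by-side sketch of the two normalizations, to make the orthogonality concrete. Learnable gain/bias parameters are omitted for brevity, and the tensor shapes are illustrative:

```python
import torch

def layer_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # LayerNorm: center by the mean, then scale by the standard deviation.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

def rms_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # RMSNorm: no mean subtraction; rescale by the root mean square only.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

x = torch.randn(2, 8)
print(layer_norm(x).shape, rms_norm(x).shape)  # both (2, 8); where the norm is applied in the block is a separate choice
```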