Pre-Training
The first training phase of an LLM, in which the model learns to understand and generate language from massive amounts of text (often trillions of tokens), before any specialized fine-tuning follows.
Pre-Training is the initial training of LLMs on trillions of tokens that builds world knowledge and language understanding – the most expensive and important phase.
Explanation
Pre-training uses self-supervised learning: the model learns to predict the next token (GPT-style) or to reconstruct masked tokens (BERT-style). This produces a "foundation model" with broad world knowledge that can be adapted to many tasks.
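A minimal sketch of the GPT-style next-token objective, assuming PyTorch. The tiny model, vocabulary size, and random token batch are illustrative stand-ins for a real Transformer and corpus, not a production setup:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; real pre-training uses billions of parameters
# and trillions of tokens.
vocab_size, d_model, seq_len = 1000, 64, 32

model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),  # stand-in for a full Transformer stack
)

tokens = torch.randint(0, vocab_size, (8, seq_len))  # a "batch of text"
inputs, targets = tokens[:, :-1], tokens[:, 1:]      # shift by one position

logits = model(inputs)                               # (batch, seq-1, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # self-supervision: the text itself provides the labels
```

The key point is that no human labels are required: shifting the text by one position turns every token into a training target, which is what makes learning from trillions of tokens feasible.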
Marketing Relevance
Pre-training explains why LLMs know so much: they have "read" large parts of the internet. For marketing, two consequences matter: the knowledge cutoff date (the model only knows what existed at training time) and the frequent need to fine-tune on your own data.
Example
LLaMA 3 was pre-trained on 15 trillion tokens – equivalent to about 150 million books. This pre-training cost an estimated $100+ million in compute. The resulting base model can then be fine-tuned for specific tasks.
Common Pitfalls
Pre-training is extremely expensive and resource-intensive. Model quality depends heavily on the training data, and biases in that data are learned along with the facts. The cutoff date limits how current the model's knowledge can be.
Origin & History
Pre-training was established through Word2Vec (Mikolov et al., 2013), followed by ELMo (2018) and BERT (Google, 2018). GPT-3 (2020) then showed that massive pre-training unlocks emergent capabilities.
Comparisons & Differences
Pre-Training vs. Fine-Tuning
Pre-Training builds general knowledge (trillions of tokens); Fine-Tuning specializes for tasks (thousands of examples).
Pre-Training vs. Continual Pre-Training
Standard Pre-Training is one-time; Continual Pre-Training updates models with new data without full retraining.