Prefix Caching
Prefix caching stores the KV-cache computations for frequently reused prompt prefixes (e.g., system prompts) and shares them across requests. For recurring prompts this can cut compute and API costs by up to 90%.
Explanation
When 100 requests share the same system prompt, its KV cache is computed only once and reused by all of them; the compute saved grows with the length of the shared prefix. Claude, GPT-4, and Gemini all expose prompt caching as an API feature.
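A minimal, purely illustrative Python sketch of the mechanism (not any engine's actual implementation): KV tensors for a prompt prefix are computed once, stored under a hash of the prefix tokens, and reused by later requests that start with the same tokens. All names and token IDs here are hypothetical.

```python
from hashlib import sha256

kv_cache: dict[str, list[float]] = {}  # prefix hash -> (toy) KV tensors


def prefix_key(token_ids: list[int]) -> str:
    # Key the cache on the exact token sequence of the prefix.
    return sha256(str(token_ids).encode()).hexdigest()


def compute_kv(token_ids: list[int]) -> list[float]:
    # Stand-in for the expensive attention pass over the prefix.
    return [float(t) for t in token_ids]


def get_prefix_kv(token_ids: list[int]) -> list[float]:
    key = prefix_key(token_ids)
    if key not in kv_cache:          # first request pays the full prefill cost
        kv_cache[key] = compute_kv(token_ids)
    return kv_cache[key]             # later requests reuse the cached KV


system_prompt_tokens = [101, 7592, 2088]        # hypothetical token IDs
kv = get_prefix_kv(system_prompt_tokens)        # computed once
kv_again = get_prefix_kv(system_prompt_tokens)  # cache hit: no recompute
```

Real inference engines cache KV data in fixed-size blocks and must also handle eviction and memory limits, but the core idea is the same lookup-before-recompute pattern.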
Marketing Relevance
Drastically reduces API costs and latency for repeated system prompts – especially valuable for chatbots, RAG pipelines, and agentic workflows.
Origin & History
vLLM implemented prefix caching in 2023 as Automatic Prefix Caching (APC). Anthropic introduced prompt caching for Claude in August 2024, Google offers context caching for Gemini, and OpenAI added automatic prompt caching for its GPT-4-series models. By 2025, prefix caching had become standard across the major LLM APIs.
Comparisons & Differences
Prefix Caching vs. Standard KV Cache
A standard KV cache is scoped to a single request and discarded when that request finishes; prefix caching persists cache entries and shares them between requests that begin with the same prefix.
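To make the difference concrete, here is a hedged usage sketch based on vLLM's Automatic Prefix Caching mentioned above, assuming its offline `LLM` API and the `enable_prefix_caching` engine argument; the model name and prompts are illustrative, and exact arguments may vary by vLLM version.

```python
from vllm import LLM, SamplingParams

# Enable Automatic Prefix Caching so KV blocks for shared prefixes are reused.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

# A long, reused system prompt forms the shared prefix.
system_prompt = "You are a helpful support assistant for ExampleCorp. ..."
params = SamplingParams(max_tokens=128)

# Both prompts start with the same system prompt; with APC enabled, the second
# request reuses the cached KV blocks for that prefix instead of recomputing them.
out1 = llm.generate(system_prompt + "\nUser: How do I reset my password?", params)
out2 = llm.generate(system_prompt + "\nUser: What are your support hours?", params)
```

Without `enable_prefix_caching=True`, each request would run its own prefill over the full system prompt, which is the per-request isolation described above.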