Prompt Caching in LLMs: How Reusing Context Cuts Cost and Latency
Prompt caching is a technique where LLM providers store the computed representation (KV cache) of repeated prompt prefixes so they don't need to be reprocessed on each request. When consecutive requests share the same system prompt, instructions, or context, the cached computation is reused — reducing cost (Anthropic charges 90% less for cached tokens) and latency (skipping the prefill computation for cached portions). This is why tools like Claude Code that send large, stable system prompts benefit from staying on the same provider's caching infrastructure.
## How It Works

When an LLM processes a prompt, it computes a KV cache (key-value cache): the internal attention state for each token. This computation is the expensive "prefill" step of inference. If the next request starts with the same prefix (e.g., the same system prompt and context), the provider can skip recomputing that portion and load the cached KV state directly.

## Cost and Latency Impact

**Anthropic's implementation:** Cached input tokens are charged at 90% less than uncached tokens, and prefill latency for cached portions drops to near zero. For applications like Claude Code that send large, stable system prompts (often 10,000+ tokens of instructions and context) on every request, caching can reduce effective costs by 50-80%.

**Cache TTL:** Anthropic's prompt cache has a 5-minute time-to-live (TTL). If more than 5 minutes pass between requests with the same prefix, the cache expires and the full computation must be repeated. This is why maintaining a steady request cadence matters for cost efficiency.

## Why It Matters for Agent Harnesses

The Anthropic vs OpenClaw controversy in April 2026 was partly about caching: Anthropic argued that Claude Code and Claude Co-Work were engineered to maximize prompt cache hit rates, while third-party harnesses like OpenClaw bypassed or underutilized the caching mechanisms, making the same workloads dramatically more expensive for Anthropic to serve. When a third-party tool structures prompts differently than the provider's own tools, cache hit rates drop toward zero and every request incurs full prefill cost. This is a real economic difference, not just a pricing dispute.
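The mechanics above can be sketched as a toy provider-side cache. This is illustrative only: real providers cache attention KV state, not text, and the class, function names, and cost model here are assumptions for the sketch, not Anthropic's API. It models the three behaviors described above: prefix matching, the 5-minute TTL, and the 90% discount on cached tokens.

```python
import hashlib

# Toy sketch of provider-side prompt caching (illustrative assumptions,
# not a real implementation: providers cache KV state, not text).
CACHE_TTL_SECONDS = 5 * 60   # Anthropic's documented 5-minute TTL
CACHED_DISCOUNT = 0.10       # cached tokens billed at 10% of the base rate

class PromptCache:
    def __init__(self):
        self._store = {}  # prefix hash -> expiry timestamp

    def _key(self, prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def lookup(self, prefix: str, now: float) -> bool:
        """True on a cache hit: the prefix was seen within the TTL."""
        expiry = self._store.get(self._key(prefix))
        return expiry is not None and now < expiry

    def store(self, prefix: str, now: float) -> None:
        self._store[self._key(prefix)] = now + CACHE_TTL_SECONDS

def request_cost(cache: PromptCache, prefix_tokens: int, suffix_tokens: int,
                 prefix: str, now: float, rate: float = 1.0) -> float:
    """Bill cached prefix tokens at the discounted rate; refresh the TTL."""
    hit = cache.lookup(prefix, now)
    cache.store(prefix, now)  # both a hit and a miss (re)warm the cache
    prefix_rate = rate * CACHED_DISCOUNT if hit else rate
    return prefix_tokens * prefix_rate + suffix_tokens * rate

cache = PromptCache()
system_prompt = "You are a coding agent..."  # large, stable prefix
first = request_cost(cache, 10_000, 200, system_prompt, now=0.0)
second = request_cost(cache, 10_000, 200, system_prompt, now=60.0)   # within TTL: hit
stale = request_cost(cache, 10_000, 200, system_prompt, now=460.0)   # TTL expired: miss
print(first, second, stale)  # 10200.0 1200.0 10200.0
```

Note how the second request pays full price only for the 200 dynamic tokens plus 10% on the stable prefix, while letting the cache expire puts the request back at full prefill cost.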
## Broader Context

Prompt caching is one instance of a general principle: LLM inference costs are dominated by the prefill step (processing input tokens), not the generation step (producing output tokens). Any technique that reduces redundant prefill, whether caching, KV cache compression, or simply writing shorter prompts, directly reduces cost and latency.
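To make the cost impact concrete, here is a back-of-envelope calculation for an agent session that resends a large stable system prompt on every request. The workload numbers (prompt sizes, request count) are assumptions chosen for illustration, not measured data; only the 90% cached-token discount comes from the text above.

```python
# Back-of-envelope sketch: effective session cost with and without prompt
# caching. Workload numbers below are illustrative assumptions.
SYSTEM_TOKENS = 10_000    # stable prefix, identical on every request
DYNAMIC_TOKENS = 2_000    # per-request user/context tokens (never cached)
REQUESTS = 50             # one agent session, steady cadence (cache stays warm)
CACHED_DISCOUNT = 0.10    # cached input billed at 10% of the base rate

def session_cost(cached: bool, rate_per_token: float = 1.0) -> float:
    total = 0.0
    for i in range(REQUESTS):
        # The first request always pays full prefill; later ones hit the cache.
        prefix_rate = rate_per_token * (CACHED_DISCOUNT if cached and i > 0 else 1.0)
        total += SYSTEM_TOKENS * prefix_rate + DYNAMIC_TOKENS * rate_per_token
    return total

uncached = session_cost(cached=False)  # 50 * 12,000 = 600,000 token-units
cached = session_cost(cached=True)     # 12,000 + 49 * 3,000 = 159,000
print(f"effective savings: {1 - cached / uncached:.1%}")
```

With these assumed numbers the session costs about 73% less with caching, inside the 50-80% range cited above; the savings shrink as the dynamic (uncacheable) share of each request grows.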