Don’t Break the Cache: What the Prompt Caching Research Means for Codex CLI Cost and Latency Optimisation

Every turn of a Codex CLI session resends the full conversation history to the API. Without mitigation, cumulative input cost grows quadratically with session length and per-turn latency climbs with the size of the prompt, which makes long-horizon agentic work financially and operationally painful. Prompt caching is the primary engineering lever that keeps the agent loop closer to linear. A January 2026 research paper, Don’t Break the Cache (Lumer et al., arXiv:2601.06007), systematically evaluated caching strategies across OpenAI, Anthropic, and Google, producing the first provider-comparative dataset for multi-turn agentic workflows ¹. This article maps its findings to Codex CLI configuration and practice.

How Prompt Caching Works in the Agent Loop

OpenAI’s prompt caching stores the computed key-value tensors behind a repeated prompt prefix so that the static portion of every API request — system prompt, tool definitions, AGENTS.md instructions, and reference documents — bills at a fraction of the normal input rate ². Caching is automatic for any prompt containing 1,024 tokens or more; there is no opt-in flag and no additional fee for cache writes ³.

The core constraint is exact prefix matching: the cached portion of a previous request must be an identical prefix of the new request ³. In practice, this means Codex CLI’s agent loop naturally benefits because each turn appends new messages (tool results, assistant responses) to the end of an ever-growing conversation array, preserving the prefix.
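The prefix rule can be illustrated with a toy model. The sketch below treats each message as one opaque item and counts how many leading items two requests share. Real caching operates on tokens rather than whole messages, and `shared_prefix_len` and the sample messages are hypothetical names chosen for illustration.

```python
from typing import Sequence


def shared_prefix_len(previous: Sequence[str], current: Sequence[str]) -> int:
    """Count the leading items that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


turn_1 = ["system", "tools", "user: fix the bug"]

# Appending keeps the whole earlier request as a prefix of the new one.
turn_2_appended = turn_1 + [
    "assistant: reading file",
    "tool_result: ...",
    "user: now add a test",
]

# Editing an early item (here the system prompt) breaks the match at index 0.
turn_2_edited = ["system (edited)"] + turn_2_appended[1:]

appended_overlap = shared_prefix_len(turn_1, turn_2_appended)
edited_overlap = shared_prefix_len(turn_1, turn_2_edited)
print(appended_overlap, edited_overlap)  # 3 0
```

In this model an append-only history reuses everything sent before, while a single change near the start leaves nothing to reuse.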

sequenceDiagram
    participant CLI as Codex CLI
    participant API as OpenAI API
    CLI->>API: Turn 1 [system + tools + user] (1,800 tokens)
    Note right of API: Cache MISS → store prefix
    API-->>CLI: Response
    CLI->>API: Turn 2 [system + tools + user + asst + tool_result + user] (3,200 tokens)
    Note right of API: Cache HIT on first 1,800 tokens
    API-->>CLI: Response (lower latency, 50% off cached portion)
    CLI->>API: Turn 3 [prefix grows to 5,100 tokens]
    Note right of API: Cache HIT on first 3,200 tokens
    API-->>CLI: Response

Cached input tokens appear in the API response under usage.prompt_tokens_details.cached_tokens in the Chat Completions API; the Responses API reports the same count as usage.input_tokens_details.cached_tokens ³. Codex CLI v0.140.0 introduced /usage views that surface daily and weekly token activity, making it easier to spot whether your sessions are actually hitting the cache ⁴.
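As a rough way to read that field, the sketch below computes a cache hit ratio from a usage payload shaped like a Chat Completions response. It works on a plain dictionary and makes no API call. The helper name `cache_hit_ratio` and the sample numbers (taken from turn 2 of the diagram above) are illustrative assumptions.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Return cached prompt tokens divided by total prompt tokens (0.0 if unknown)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


# Hypothetical usage payload: 3,200 prompt tokens, 1,800 served from cache.
example_usage = {
    "prompt_tokens": 3200,
    "completion_tokens": 150,
    "prompt_tokens_details": {"cached_tokens": 1800},
}

ratio = cache_hit_ratio(example_usage)
print(f"{ratio:.0%}")  # 56%
```

A ratio that stays near zero across several consecutive turns suggests that something early in the prompt is changing between requests.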

What the Research Found

Lumer et al. evaluated three caching strategies across over 500 agent sessions on DeepResearchBench, a multi-turn benchmark where agents autonomously execute web-search tool calls to answer complex research questions ¹:

Strategy	Cost Reduction	TTFT Improvement
Full context caching	41–80%	13–31%
System-prompt-only caching	41–80% (most consistent)	13–31%
Caching excluding dynamic tool results	41–80%	13–31%

The headline numbers — 41–80% cost reduction and 13–31% time-to-first-token improvement — held across all three providers ¹. But the paper’s most useful finding was a warning: naively caching everything, including tool-call results, paradoxically increased latency in some conditions ¹. The reason is that dynamic content injected into the middle of the cached prefix invalidates the cache from that point forward, forcing re-computation of all subsequent tokens.

The recommended strategy: cache only the static prefix (system prompt, tool definitions, reference documents) and append dynamic content (tool results, user messages) strictly at the end ¹.

Mapping to Codex CLI Configuration

Codex CLI’s architecture already follows this pattern by default — the system prompt and tool schema form a stable prefix, and conversation turns append to the end. But several configuration levers can either preserve or destroy cache efficiency.

1. Keep AGENTS.md Stable Within a Session

The contents of your project’s AGENTS.md file are injected into the system prompt ⁵. If you edit AGENTS.md mid-session, the system prompt changes, invalidating the cached prefix for every subsequent turn. For long-horizon work, treat AGENTS.md as immutable during a session.

2. Tune `model_auto_compact_token_limit` Carefully

When the conversation history exceeds this threshold, Codex CLI triggers automatic compaction — summarising older turns to free context space ⁶. Compaction rewrites the conversation history, which destroys the cached prefix. Setting this value too low forces frequent compactions and frequent cache invalidations. Setting it too high risks hitting the model’s context window limit.

# config.toml: balance cache preservation against context pressure
# (top-level key; `model` is a string setting, so it cannot also be a [model] table)
model_auto_compact_token_limit = 100000  # default varies by model

A practical heuristic: set the threshold at roughly 70–80% of the model’s context window. This gives enough headroom for several turns before compaction triggers, maximising the number of cache-hit turns between compaction events.
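The heuristic is simple arithmetic. A minimal sketch follows, assuming a hypothetical 200,000-token context window; `compact_threshold` is an illustrative helper, not part of Codex CLI.

```python
def compact_threshold(context_window: int, percent: int = 75) -> int:
    """Return the compaction threshold as a whole-number percentage of the window."""
    if not 0 < percent < 100:
        raise ValueError("percent must be between 1 and 99")
    return context_window * percent // 100


# Assumed context window of 200,000 tokens; check your model's actual limit.
low = compact_threshold(200_000, 70)
high = compact_threshold(200_000, 80)
print(low, high)  # 140000 160000
```

Any value in that band could then be written to model_auto_compact_token_limit.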

3. Control `tool_output_token_limit`

Large tool outputs (e.g., a 50,000-token file read) inflate the conversation history rapidly, accelerating the path to compaction ⁶. Bounding individual tool outputs keeps the history growth linear and delays cache-breaking compaction events.

# top-level key in config.toml
tool_output_token_limit = 16000  # cap individual tool outputs
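To make the idea of bounding an output concrete, here is a hedged sketch of one possible truncation policy: keep the head and tail of an oversized output and drop the middle. It is not a description of how Codex CLI applies the limit internally. The function name, the marker text, and the four-characters-per-token estimate are all assumptions.

```python
def truncate_output(text: str, token_limit: int, chars_per_token: int = 4) -> str:
    """Keep the head and tail of an oversized tool output within a rough budget."""
    budget = token_limit * chars_per_token  # crude character budget
    if len(text) <= budget:
        return text
    marker = "\n[... output truncated ...]\n"
    keep = max(budget - len(marker), 0)
    head = keep // 2
    tail = keep - head
    # Guard the tail slice: text[-0:] would return the whole string.
    return text[:head] + marker + (text[len(text) - tail:] if tail else "")


large_output = "x" * 200_000
bounded = truncate_output(large_output, token_limit=16_000)
print(len(bounded))  # 64000
```

Keeping both ends tends to preserve the command header and the final error or summary lines, which are often the parts the agent needs.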

4. Use `compact_prompt` to Preserve Key Context

When compaction does fire, the default summarisation prompt may discard information you need. A custom compact_prompt can instruct the summariser to preserve critical architectural context while still reducing token count ⁶:

# top-level key in config.toml
compact_prompt = "Summarise the conversation history, preserving: (1) all file paths modified, (2) test results, (3) architectural decisions. Discard verbose tool outputs."

5. Stabilise Tool Definitions Across Turns

The Don’t Break the Cache paper found that varying tool definitions between requests breaks the cache ¹. In Codex CLI, MCP server configurations define additional tools. If an MCP server’s tool schema changes mid-session (e.g., a server restart with updated capabilities), the tool-definition portion of the prefix changes, invalidating the cache. Pin MCP server versions and avoid hot-reloading tool schemas during long sessions.
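One inexpensive way to notice schema drift is to serialise the tool definitions in a canonical order and hash the result at the start of a session, then compare later. The sketch below assumes each tool is a dictionary with a "name" key. That shape and the helper `tool_fingerprint` are hypothetical and do not represent Codex CLI's internal format.

```python
import hashlib
import json


def tool_fingerprint(tools: list[dict]) -> str:
    """Hash tool definitions after canonical ordering of tools and keys."""
    canonical = json.dumps(
        sorted(tools, key=lambda tool: tool["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


tools_at_start = [
    {"name": "search", "description": "Web search", "parameters": {"type": "object"}},
    {"name": "read_file", "description": "Read a file", "parameters": {"type": "object"}},
]

# Same definitions, different list order and key order: fingerprint is unchanged.
tools_reordered = [
    {"parameters": {"type": "object"}, "description": "Read a file", "name": "read_file"},
    {"description": "Web search", "name": "search", "parameters": {"type": "object"}},
]

# A changed description alters the fingerprint.
tools_changed = [dict(tools_at_start[0], description="Web search v2"), tools_at_start[1]]

unchanged = tool_fingerprint(tools_at_start) == tool_fingerprint(tools_reordered)
drifted = tool_fingerprint(tools_at_start) != tool_fingerprint(tools_changed)
print(unchanged, drifted)  # True True
```

The fingerprint only tells you that the definitions are semantically the same. The request itself still has to serialise them in the same order every turn for the prefix to match.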

The Compaction–Cache Trade-off

graph LR
    A[Session starts] --> B[Turns accumulate]
    B --> C{Context near limit?}
    C -- No --> D[Cache HIT on growing prefix]
    D --> B
    C -- Yes --> E[Compaction fires]
    E --> F[History rewritten]
    F --> G[Cache MISS — new prefix]
    G --> B

Every compaction event resets the cache. The research implies that the optimal strategy is to delay compaction as long as possible while keeping individual turn sizes small ¹. This is where tool_output_token_limit and bounded tool calls pay dividends — they slow context growth without rewriting the prefix.
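The trade-off can be explored with a small simulation. The sketch below counts how many compactions fire over a session for a given per-turn growth. It is a simplified model: the summary size, the fixed per-turn growth, and the function name `count_compactions` are assumptions, and real sessions vary turn by turn.

```python
def count_compactions(
    turns: int,
    static_tokens: int,
    tokens_per_turn: int,
    compact_limit: int,
    summary_tokens: int,
) -> int:
    """Count compaction events when history grows by a fixed amount per turn."""
    history = 0
    compactions = 0
    for _ in range(turns):
        if static_tokens + history > compact_limit:
            # History is rewritten as a summary; the next request misses the cache.
            history = summary_tokens
            compactions += 1
        history += tokens_per_turn
    return compactions


small_outputs = count_compactions(40, 3_000, 2_000, 100_000, 5_000)
large_outputs = count_compactions(40, 3_000, 8_000, 100_000, 5_000)
print(small_outputs, large_outputs)  # 0 3
```

Under these assumptions, a 40-turn session with 2,000-token turns never compacts, while 8,000-token turns trigger three cache resets.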

Codex CLI v0.141.0 reinforced this pattern by “caching tool search and eliminating repeated request and history copies,” reducing both latency and memory consumption in tool-heavy sessions ⁷.

Named Profiles for Cache-Aware Routing

Different task types have different cache profiles. A quick code review might complete in 3–5 turns, barely warming the cache. A multi-hour refactoring session might run 50+ turns, where cache efficiency dominates cost.

# Named profile for long-horizon work — maximise cache hits
[profiles.marathon]
model = "o3"
model_auto_compact_token_limit = 120000
tool_output_token_limit = 12000

# Named profile for quick tasks — cache less important
[profiles.sprint]
model = "gpt-5.5"
model_auto_compact_token_limit = 60000
tool_output_token_limit = 24000

The marathon profile delays compaction aggressively on a high-context model, preserving the cache across many turns. The sprint profile prioritises richer tool outputs for short, intensive sessions where cache hit rates will be low regardless.

Quantifying the Impact

The paper’s 41–80% cost reduction is in line with reports from production agents. ProjectDiscovery’s Neo agent documented a 59% cumulative cost drop from prompt caching alone, reaching over 90% on fully optimised paths ⁸. OpenAI’s pricing structure amplifies the effect: cached input tokens cost 50% less than standard input tokens, and for extended caching (available on gpt-5.5, gpt-5.4, and later models), cached tokens persist for up to 24 hours in GPU-local storage ³.

For a concrete example: take a 40-turn Codex CLI session with a 3,000-token system prompt in which each turn adds about 2,000 tokens of tool output. Turn k resends 3,000 + 2,000 × (k − 1) tokens, so the session sends roughly 1.7 million input tokens in total without caching. If prefix caching hits on 70% of those cumulative input tokens at the 50% cached rate, the billable input cost drops by approximately 35% (0.70 × 0.50), and latency drops measurably on every cache-hit turn.
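The arithmetic behind this example can be checked with a few lines. The sketch assumes the simplest possible model, in which every turn resends the system prompt plus all earlier tool outputs and each turn adds exactly 2,000 tokens. Under that model the total comes to about 1.68 million tokens, in the same range as the figure quoted above.

```python
def cumulative_input_tokens(turns: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Sum the input tokens sent across all turns of an append-only session."""
    # Turn k (0-based) resends the system prompt plus k earlier tool outputs.
    return sum(system_tokens + tokens_per_turn * k for k in range(turns))


total_tokens = cumulative_input_tokens(turns=40, system_tokens=3_000, tokens_per_turn=2_000)

cached_share = 0.70  # assumed fraction of input tokens served from cache
discount = 0.50      # cached tokens billed at half the standard input rate

saving = cached_share * discount
billable_equivalent = round(total_tokens * (1 - saving))

print(total_tokens)         # 1680000
print(f"{saving:.0%}")      # 35%
print(billable_equivalent)  # 1092000
```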

Cache Lifetime and Session Pacing

OpenAI’s cache evicts prefixes after 5–10 minutes of inactivity, with a maximum lifetime of roughly one hour during off-peak periods ³. Extended caching on newer models stretches this to 24 hours ³. The practical implication: if you pause a Codex CLI session for lunch, expect a cache miss on the first turn back. The /usage command can confirm whether cached tokens are appearing in your session’s billing breakdown ⁴.

The paper also noted a rate-limit concern: keeping request frequency below approximately 15 requests per minute per unique prefix-key combination avoids cache overflow onto additional inference engines ¹ ³. Codex CLI’s natural pacing — with human think-time between turns — rarely hits this threshold in interactive mode, but automated codex exec pipelines running tight loops should be aware.
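For automated loops, a minimum gap between requests is one simple way to stay under that rate. The sketch below enforces at most 15 requests per minute by waiting before each call. `PrefixPacer` is a hypothetical helper, and the injectable `clock` and `sleep` arguments exist only to make it testable without real delays.

```python
import time
from typing import Callable, Optional


class PrefixPacer:
    """Space out requests so one prefix sees at most max_per_minute calls."""

    def __init__(
        self,
        max_per_minute: int = 15,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

A pipeline would create one pacer per prefix and call `pacer.wait()` immediately before each codex exec invocation or API request. With the default setting, back-to-back calls are spaced four seconds apart.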

Practical Checklist

Do not edit AGENTS.md mid-session — it changes the system prompt prefix
Set model_auto_compact_token_limit to 70–80% of context window — delays cache-breaking compaction
Bound tool_output_token_limit — slows context growth, extends cache lifetime
Pin MCP server versions — prevents tool-schema changes that break the prefix
Use /usage to monitor cached_tokens — verify cache hits are occurring
Avoid codex exec loops faster than 15 req/min on the same prefix — prevents cache overflow
Accept the post-compaction cache miss — it is the cost of staying within context limits; structure work to minimise compaction frequency

Citations

¹ Lumer, E., Nizar, F., Jangiti, A., Frank, K., Gulati, A., Phadate, M., & Subbiah, V. K. (2026). “Don’t Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks.” arXiv:2601.06007. https://arxiv.org/abs/2601.06007
² OpenAI. “Prompt Caching.” OpenAI Blog. https://openai.com/index/api-prompt-caching/
³ OpenAI. “Prompt Caching — API Guide.” OpenAI Developers. https://developers.openai.com/api/docs/guides/prompt-caching
⁴ OpenAI. “Codex CLI Changelog — v0.140.0.” OpenAI Developers. https://developers.openai.com/codex/changelog?type=codex-cli
⁵ OpenAI. “Best Practices — Codex.” OpenAI Developers. https://developers.openai.com/codex/learn/best-practices
⁶ OpenAI. “Configuration Reference — Codex.” OpenAI Developers. https://developers.openai.com/codex/config-reference
⁷ OpenAI. “Codex CLI Changelog — v0.141.0.” OpenAI Developers. https://developers.openai.com/codex/changelog?type=codex-cli
⁸ “Prompt Caching in 2026: Cut LLM Costs, Keep Quality.” Digital Applied. https://www.digitalapplied.com/blog/prompt-caching-2026-cut-llm-costs-engineering-guide
