Less Context, Better Agents: What the Microsoft Context Engineering Study Means for Codex CLI Session Strategy
The instinct is always to keep everything: every tool output, every file read, every shell result. Surely the model needs the full picture to stay on track. A June 2026 study from Microsoft Research contradicts that instinct, at least for long-horizon tool-using agents, and the implications for Codex CLI configuration are immediate.
The Study: Pruning Beats Hoarding
Lodha et al.’s Less Context, Better Agents (arXiv:2606.10209) 1 evaluated four context-management strategies across a 50-task enterprise automation benchmark in Microsoft Dynamics 365, averaged over five independent runs per configuration:
| Strategy | Completion Rate | Tokens Used | Runtime |
|---|---|---|---|
| C1: No user model | 8.0% | 532,600 | 3.08 hrs |
| C2: Full history | 71.0% | 1,480,996 | 14.56 hrs |
| C3: Last-5 pruning | 79.0% | 535,274 | 5.39 hrs |
| C4: Pruning + summary | 91.6% | 553,374 | 5.79 hrs |
The headline: retaining full conversation history is actively harmful, not merely wasteful. C3 outperformed C2 by 8 percentage points whilst consuming 63.9% fewer tokens 1. Adding a lightweight summarisation layer (C4) recovered the awareness lost by pruning, lifting completion a further 12.6 points for just 3.4% additional token overhead 1.
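The percentages quoted above can be re-derived from the results table. The following is a quick arithmetic check, assuming the token counts in the table are directly comparable totals per configuration; the helper names are illustrative only.

```python
# Figures copied from the results table (C2, C3 and C4 rows).
results = {
    "C2": {"completion": 71.0, "tokens": 1_480_996},
    "C3": {"completion": 79.0, "tokens": 535_274},
    "C4": {"completion": 91.6, "tokens": 553_374},
}


def point_gain(a: str, b: str) -> float:
    """Completion-rate difference in percentage points, b relative to a."""
    return round(results[b]["completion"] - results[a]["completion"], 1)


def token_change_pct(a: str, b: str) -> float:
    """Relative change in tokens used, b relative to a, in percent."""
    ta, tb = results[a]["tokens"], results[b]["tokens"]
    return round((tb - ta) / ta * 100, 1)


print(point_gain("C2", "C3"))        # 8.0
print(token_change_pct("C2", "C3"))  # -63.9
print(point_gain("C3", "C4"))        # 12.6
print(token_change_pct("C3", "C4"))  # 3.4
```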
Why Full History Hurts
The paper’s failure-mode taxonomy across 147 non-completions explains the mechanism:
- Stale-state references — the model acts on form fields or tool states that no longer exist. Full history (C2) produced 34 stale-state failures; pruning (C3) cut this to 6; summarisation (C4) reduced it to 4 1.
- Premature termination — the model loses track of what remains and declares victory early. Pruning alone increased this from 9 (C2) to 18 (C3), but summarisation collapsed it to 3 (C4) 1.
graph LR
A[Full History] -->|34 stale-state<br/>9 premature| B[High failure rate]
C[Recency Pruning] -->|6 stale-state<br/>18 premature| D[Moderate failure rate]
E[Pruning + Summary] -->|4 stale-state<br/>3 premature| F[Low failure rate]
style E fill:#2d6a4f,color:#fff
style F fill:#2d6a4f,color:#fff
The pattern is clear: stale context poisons reasoning more than missing context does — and a concise running summary is enough to prevent the awareness gaps that pure pruning introduces.
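To make the two winning strategies concrete, here is a minimal sketch of recency pruning combined with a running summary. It is not the paper's implementation: the `PrunedContext` class, the incremental summarisation step, and the `naive_summarise` stand-in (which replaces a real LLM call) are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, List


class PrunedContext:
    """Keep the last `keep_last` turns verbatim; fold older turns into a summary."""

    def __init__(self, summarise: Callable[[str, str], str], keep_last: int = 5):
        # Hypothetical summariser signature: (old_summary, evicted_turn) -> new_summary
        self.summarise = summarise
        self.keep_last = keep_last
        self.recent: Deque[str] = deque()
        self.summary = ""

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        while len(self.recent) > self.keep_last:
            evicted = self.recent.popleft()
            self.summary = self.summarise(self.summary, evicted)

    def build_prompt(self) -> List[str]:
        header = [f"Summary so far: {self.summary}"] if self.summary else []
        return header + list(self.recent)


def naive_summarise(summary: str, turn: str) -> str:
    # Stand-in for an LLM call: keep only the first line of each evicted turn.
    first_line = turn.splitlines()[0] if turn else ""
    return f"{summary}; {first_line}" if summary else first_line


ctx = PrunedContext(naive_summarise, keep_last=2)
for turn in ["open form A\n<raw tool dump>", "fill field X", "save form A"]:
    ctx.add_turn(turn)
print(ctx.build_prompt())
# ['Summary so far: open form A', 'fill field X', 'save form A']
```

Dropping the summary header from `build_prompt` gives the pure-pruning behaviour (C3); keeping it corresponds loosely to pruning plus summary (C4).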
Cross-Model Validation
The study cross-validated with Claude Sonnet 4.5, which achieved 88.0% completion without any context engineering, rising to 94.5% with pruning plus summarisation 1. The delta is smaller (6.5 points) because Sonnet does not exhibit GPT-5’s stalling behaviour under empty contexts, but the direction is the same in both model families tested: less context, better results.
This matters because Codex CLI supports multiple model providers, including Amazon Bedrock, and switches between them through named profiles 2. With two model families tested, the result suggests that the finding carries across providers, although it does not establish it for every model.
Converging Evidence
The Microsoft study does not stand alone. Three complementary 2026 papers form a converging evidence base:
- AdaCoM (arXiv:2605.30785) 3 trains an external context manager via reinforcement learning and demonstrates a Fidelity–Reliability trade-off: stronger agents benefit from high-fidelity context preservation, weaker agents need aggressive compression. This maps directly to Codex CLI’s named-profile architecture, where different models should receive different compaction thresholds.
- SWE-Pruner (arXiv:2601.16746) 4 applies attention-guided context pruning to coding agents on SWE-bench, showing that selective pruning of repository context improves resolve rates whilst cutting token consumption.
- SlopCodeBench (arXiv:2603.24755) 5 demonstrates that structural erosion rises in 77% of long-horizon coding trajectories, with agent code 2.3× more verbose than human-written equivalents; both symptoms are consistent with stale-context accumulation.
Mapping to Codex CLI Configuration
Codex CLI already implements the core primitives that the study validates. The gap is not tooling — it is tuning.
1. Recency-Based Pruning via tool_output_token_limit
The tool_output_token_limit key in config.toml caps the tokens stored per individual tool output 6. The default is 16,000 tokens. For long-horizon sessions with verbose tool responses (test suites, large file reads, API responses), lowering this to 8,000–12,000 keeps less raw output in history, so stale results have less room to accumulate:
# ~/.codex/config.toml — aggressive tool output pruning
tool_output_token_limit = 10000
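This is the closest Codex CLI analogue to the study’s “last-5 pruning” strategy. The mechanism differs, since the key caps the size of each stored output rather than dropping older turns, but the effect is similar: less raw tool state persists in the context window 6.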
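The effect of a per-output cap can be illustrated with a small sketch. This is not Codex CLI's truncation logic, which the source does not describe; the head-and-tail strategy, the `cap_tool_output` name, and the four-characters-per-token estimate are assumptions chosen for the example.

```python
def cap_tool_output(text: str, token_limit: int, chars_per_token: int = 4) -> str:
    """Trim one tool output to roughly `token_limit` tokens, keeping head and tail.

    The token count is approximated as len(text) / chars_per_token, which is a
    rough heuristic rather than a real tokenizer.
    """
    budget = token_limit * chars_per_token
    if len(text) <= budget:
        return text
    marker = "\n[... output truncated ...]\n"
    keep = max(budget - len(marker), 0)
    head = keep // 2
    tail = keep - head
    # Keep the start (command echo, first errors) and the end (final status).
    return text[:head] + marker + (text[-tail:] if tail else "")


log = "line\n" * 50_000                  # a verbose test-suite log, 250,000 characters
capped = cap_tool_output(log, token_limit=10_000)
print(len(log), len(capped))             # 250000 40000
```

Keeping both ends is a deliberate choice for test and build logs, where the failing assertion often appears near the top and the summary line at the bottom.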
2. Compaction Threshold Tuning
The model_auto_compact_token_limit key controls when Codex CLI fires automatic history compaction 6. The default triggers at approximately 200,000 tokens. The study’s finding — that earlier, more aggressive context reduction improves outcomes — suggests lowering this:
# Trigger compaction earlier to prevent stale-state accumulation
model_auto_compact_token_limit = 120000
Codex CLI’s compaction fires at two points in the agent loop: a pre-turn trigger before sending a new user message, and a post-turn trigger after receiving the model response 7. Both checkpoints align with the study’s recommendation to prevent stale state from reaching the model.
3. Custom Compaction Prompts as Running Summaries
The study’s C4 strategy uses “a single free-form LLM pass” that condenses prior interactions into a running summary capturing forms opened, controls interacted with, and data entered 1. Codex CLI’s compact_prompt key (an inline prompt) and experimental_compact_prompt_file key (a prompt loaded from a file) 6 let you approximate this:
# Point compaction at a custom summary-preserving prompt
experimental_compact_prompt_file = ".codex/compact-summary.md"
A well-crafted compaction prompt should instruct the model to preserve:
- Cumulative progress — what has been completed, what remains
- Active state — current file paths, branch names, test status
- Design commitments — architectural decisions made during the session
This mirrors the study’s summary agent output format: structured, factual, and free of raw tool dumps.
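One way to keep the compaction prompt consistent across repositories is to generate it. The script below is a sketch under stated assumptions: the prompt wording is invented for illustration and has not been evaluated, and the default path simply matches the experimental_compact_prompt_file value used in this article.

```python
from pathlib import Path

# The three categories listed above, each with the detail the summary should carry.
SECTIONS = {
    "Cumulative progress": "what has been completed and what remains",
    "Active state": "current file paths, branch names, and test status",
    "Design commitments": "architectural decisions made during the session",
}


def render_compact_prompt() -> str:
    """Build a compaction prompt with one heading per category."""
    lines = [
        "Summarise the session so far for a fresh context window.",
        "Report facts only and omit raw tool output.",
        "",
    ]
    for title, detail in SECTIONS.items():
        lines.append(f"## {title}")
        lines.append(f"List {detail}.")
        lines.append("")
    return "\n".join(lines)


def write_compact_prompt(path: str = ".codex/compact-summary.md") -> Path:
    """Write the prompt to `path`, creating the parent directory if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_compact_prompt(), encoding="utf-8")
    return target


if __name__ == "__main__":
    print(render_compact_prompt())
```

Calling `write_compact_prompt()` from the repository root produces the file that the configuration line above points at.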
4. Named Profiles for Model-Specific Context Policies
The cross-model validation shows that context engineering gains vary by model: GPT-5 benefits enormously (8.0% in C1 rising to 91.6% in C4), whilst Claude Sonnet 4.5 gains modestly (88.0% rising to 94.5%) 1. AdaCoM’s Fidelity–Reliability trade-off 3 reinforces this: weaker models need more aggressive compression.
Codex CLI named profiles let you encode model-specific context policies:
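# Profiles live under the [profiles.<name>] table; select one with: codex --profile <name>
[profiles.frontier]
model = "gpt-5.5"
model_auto_compact_token_limit = 150000
tool_output_token_limit = 16000

# Smaller model: compact earlier and store less per tool output
[profiles.efficient]
model = "gpt-5-codex-mini"
model_auto_compact_token_limit = 80000
tool_output_token_limit = 8000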
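The efficient profile applies more aggressive pruning to the smaller model, which is where AdaCoM’s trade-off suggests it should pay off most. Neither model in these profiles was part of the study, so treat the thresholds as starting points to tune rather than measured optima.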
5. Session Forking as Architectural Pruning
The study’s strongest finding, that full history is actively detrimental, supports a session strategy already available in Codex CLI: forking. The codex resume <id> subcommand reopens a saved session and the /fork slash command branches it into a new one, which is described as inheriting summary context without dragging stale tool output forward 8.
graph TD
S1[Session: scaffold project] -->|fork| S2[Session: implement feature A]
S1 -->|fork| S3[Session: implement feature B]
S2 -->|fork| S4[Session: integration tests]
S3 -->|fork| S4
style S1 fill:#1b4332,color:#fff
style S4 fill:#1b4332,color:#fff
Each fork is a natural pruning boundary. When the forked session starts from a compacted summary of the parent, it follows the same pruning-plus-summarisation pattern that reached 91.6% completion in the study.
The Anti-Pattern: Maximising Context Window
The study directly challenges a common Codex CLI configuration pattern: setting model_context_window to the maximum supported value and model_auto_compact_token_limit as high as possible to “give the model more to work with.” The data shows this produces more stale-state failures, higher token costs, and longer runtimes — the worst of all worlds 1.
The correct default is the opposite: compact early, summarise aggressively, and fork sessions at natural task boundaries.
Practical Configuration Recipe
For teams adopting these findings immediately:
# ~/.codex/config.toml — evidence-based context engineering
# Compact before the window fills with stale state
model_auto_compact_token_limit = 120000
# Cap verbose tool outputs
tool_output_token_limit = 10000
# Use a summary-preserving compaction prompt
experimental_compact_prompt_file = ".codex/compact-summary.md"
# Track what this costs
# Use /usage in-session for daily/weekly/cumulative token views
Complement this with AGENTS.md guidance that instructs the agent to fork sessions at natural task boundaries rather than accumulating unbounded history:
## Session Hygiene
- Fork a new session after completing each discrete task
- Do not accumulate more than 3-4 tool-heavy operations in a single session
- When resuming work, fork from the last clean checkpoint rather than continuing
What Remains Open
The study benchmarked enterprise form automation, not software engineering. The failure modes (stale form states, duplicate field entries) are analogous to but not identical with coding failures (stale file contents, duplicate edits). The directional finding — pruning plus summarisation beats full history — aligns with SlopCodeBench and SWE-Pruner results in coding-specific contexts 4 5, but precise threshold values (last-5, window-3) will differ for code-generation workflows.
The 8.4% residual failure rate in C4 also signals that context engineering alone cannot reach 100% autonomy. Codex CLI’s approval modes and human-in-the-loop checkpoints remain necessary for production-critical workflows 9.
Citations
1. Lodha, A., Pahlavikhah Varnosfaderani, M., Chakraborty, A., & Mithal, A. (2026). Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents. arXiv:2606.10209. https://arxiv.org/abs/2606.10209
2. OpenAI. (2026). Configuration Reference — Codex. OpenAI Developers. https://developers.openai.com/codex/config-reference
3. Learning Agent-Compatible Context Management for Long-Horizon Tasks. (2026). arXiv:2605.30785. https://arxiv.org/abs/2605.30785
4. SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents. (2026). arXiv:2601.16746. https://arxiv.org/abs/2601.16746
5. Orlanski, G. et al. (2026). SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks. arXiv:2603.24755. https://arxiv.org/abs/2603.24755
6. OpenAI. (2026). Configuration Reference — Codex CLI. OpenAI Developers. https://developers.openai.com/codex/config-reference
7. OpenAI. (2026). Codex CLI Context Compaction: Architecture, Configuration, and Managing Long Sessions. Codex Knowledge Base. https://codex.danielvaughan.com/2026/03/31/codex-cli-context-compaction-architecture/
8. OpenAI. (2026). Session Persistence, Resume, Fork, and Analytics — Codex CLI. Codex Knowledge Base. https://codex.danielvaughan.com/2026/04/13/codex-cli-session-persistence-resume-fork-analytics/
9. OpenAI. (2026). Features — Codex CLI. OpenAI Developers. https://developers.openai.com/codex/cli/features