The Reasoning Trap: Why Higher Reasoning Effort Increases Tool Hallucination and How to Defend Your Codex CLI Workflows
A paper presented at ICLR 2026 in Rio de Janeiro this week — “The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination” — delivers a finding that every Codex CLI practitioner who routinely cranks reasoning effort to high or xhigh needs to understand: training a model to reason harder makes it hallucinate tools more, not less 1. The effect is causal, proportional, and — with current mitigation techniques — only partially fixable.
This article unpacks the research, maps its findings onto Codex CLI’s reasoning effort controls, and provides concrete configuration patterns for managing the trade-off.
What the Paper Found
Zhuang, Li, Shen, and Xu introduced SimpleToolHalluBench, a diagnostic benchmark that measures tool hallucination across two failure scenarios 1:
- No-Tool-Available (NTA): The agent is given a task but no relevant tool exists. A reliable agent should decline; a hallucinating one fabricates a tool call.
- Distractor-Tool (DT): Irrelevant tools are available. The agent should still decline but instead calls one of the distractors or invents a new tool entirely.
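To make the two scenarios concrete, here is a hypothetical exchange; the task, tool name, and replies are invented for illustration and are not taken from the benchmark:

```text
Task: "Check the current weather in Berlin."   (no weather tool is configured)

Hallucinating agent (NTA failure):
  -> calls get_current_weather(city="Berlin")  <- this tool does not exist

Reliable agent:
  -> "I don't have a weather tool available in this session.
      Please add one or paste the data you want me to use."
```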
The Numbers Are Stark
Across every model family tested, reasoning-enhanced variants hallucinated significantly more 1:
| Model | Variant | NTA Rate | DT Rate |
|---|---|---|---|
| Qwen2.5-7B | Instruct (baseline) | 34.8% | 54.7% |
| Qwen2.5-7B | R1-Distill (reasoning RL) | 74.3% | 78.7% |
| Qwen3-8B | Think Off | 4.1% | 36.2% |
| Qwen3-8B | Think On | 5.4% | 56.8% |
| Qwen3-32B | Think Off | 5.1% | 46.6% |
| Qwen3-32B | Think On | 8.8% | 50.7% |
| DeepSeek-671B | V3 (base) | 10.8% | 33.8% |
| DeepSeek-671B | R1 (reasoning) | 17.6% | 42.6% |
The controlled ablation is the most alarming: training Qwen2.5-7B with think-then-act reinforcement learning pushed the NTA hallucination rate from 34.8% to 90.2%, and the DT rate to 100.0% — while simultaneously improving task reward from 0.22 to 0.45 1.
Why It Happens
Mechanistically, reasoning RL collapses tool-reliability representations in the model’s internal layers 1. Centred Kernel Alignment (CKA) scores show that in-distribution reasoning pathways remain stable (CKA > 0.9), but tool-related representations collapse dramatically (CKA < 0.75) in early and middle layers. The model becomes better at reasoning and worse at knowing when to stop.
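The paper’s exact estimator is not reproduced here, but the commonly used linear form of CKA conveys the intuition. Treating X and Y as centred activation matrices for the same inputs before and after reasoning RL:

$$\mathrm{CKA}(X, Y) = \frac{\lVert Y^{\top} X \rVert_F^2}{\lVert X^{\top} X \rVert_F \, \lVert Y^{\top} Y \rVert_F}$$

Values near 1 mean the representation geometry is essentially preserved; lower values indicate the kind of drift or collapse reported for tool-related features.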
Mitigation Is Partial at Best
The authors tested two interventions 1:
| Mitigation | NTA Rate | DT Rate | Reward |
|---|---|---|---|
| Baseline (ReCall-7B) | 90.2% | 100.0% | 0.45 |
| + Prompt Engineering | 87.5% | 98.9% | 0.44 |
| + Direct Preference Optimisation | 55.8% | 71.4% | 0.34 |
DPO reduced hallucination substantially — but at a 24% drop in task utility (reward fell from 0.45 to 0.34). This is the fundamental trade-off: you can have reliability or capability, but current techniques cannot fully deliver both.
What This Means for Codex CLI
Codex CLI exposes reasoning effort as a first-class configuration knob via model_reasoning_effort in config.toml 2. The available levels are none, minimal, low, medium, high, and xhigh 3. The TUI provides quick shortcuts: Alt+, lowers reasoning and Alt+. raises it 4.
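As a minimal illustration, the effort level can also be set globally before any profile overrides it. The key names follow the configuration reference cited above; the model name is only an example:

```toml
# ~/.codex/config.toml: top-level default, overridden by any profile
model = "gpt-5.5"
model_reasoning_effort = "medium"   # none | minimal | low | medium | high | xhigh
```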
The Reasoning Trap research implies a direct practical consequence: higher reasoning effort settings increase the risk of your agent fabricating tool calls, particularly when:
- MCP servers expose many tools. Schema bloat increases the distractor surface 5. With 50+ tools loaded, the model loses 7% of its context before processing a single message.
- A task doesn’t match any available tool. The agent should delegate or decline, but higher reasoning effort makes it more likely to invent a plausible-sounding tool invocation instead.
- You use third-party model providers running open-weight reasoning models (Qwen3, DeepSeek R1) where the effect is directly measured.
flowchart LR
A[Reasoning Effort<br/>Low → High] --> B[Task Performance ↑]
A --> C[Tool Hallucination ↑]
B --> D{Net Outcome}
C --> D
D --> E[Simple Tasks:<br/>Low effort wins]
D --> F[Complex Tasks:<br/>High effort + guards]
Defence-in-Depth Configuration
The Reasoning Trap does not mean you should avoid high reasoning effort — GPT-5.5 at high is demonstrably better at complex multi-step tasks 6. It means you should pair higher reasoning with stronger guardrails. Here is a layered strategy.
Layer 1: Right-Size Reasoning Effort Per Task
Create profiles in config.toml that match reasoning effort to task complexity 2 3:
# ~/.codex/config.toml
# Daily development — fast, low hallucination risk
[profiles.dev]
model = "gpt-5.5"
model_reasoning_effort = "medium"
# Complex architecture tasks — high reasoning, extra guards
[profiles.architect]
model = "gpt-5.5"
model_reasoning_effort = "high"
plan_mode_reasoning_effort = "high"
# CI automation — predictable, minimal hallucination surface
[profiles.ci]
model = "gpt-5.3-codex-mini"
model_reasoning_effort = "low"
# Security audit — maximum reasoning, maximum oversight
[profiles.security]
model = "gpt-5.2-codex"
model_reasoning_effort = "xhigh"
approval_policy = "on-request"
Switch profiles from the command line with codex --profile architect or via /profile architect in the TUI 2.
Layer 2: Constrain the Tool Surface
Fewer tools means fewer distractors. The paper’s DT scenario maps directly onto bloated MCP configurations 5.
- Audit your MCP servers. Remove any you don’t actively use. Every tool schema consumes tokens and increases the hallucination surface.
- Use profiles to scope MCP servers per task. Your ci profile doesn’t need the Sentry MCP server; your security profile doesn’t need Storybook.
# Only load essential MCP servers for CI
[profiles.ci.mcp_servers.postgres]
command = "npx"
args = ["-y", "@modelcontextprotocol/server-postgres"]
Layer 3: PostToolUse Hooks for Hallucination Detection
A hallucinated tool call will either fail (the tool doesn’t exist) or produce unexpected output (the wrong tool was invoked). Use PostToolUse hooks to catch both 7:
# Reject tool calls that produce error patterns suggesting hallucination
[[hooks]]
event = "post_tool_use"
command = "bash -c 'if echo \"$CODEX_TOOL_OUTPUT\" | grep -qiE \"unknown tool|not found|invalid function|no such command\"; then echo \"{\\\"decision\\\": \\\"report_error\\\", \\\"message\\\": \\\"Possible tool hallucination detected — tool output suggests the called tool does not exist.\\\"}\" && exit 1; fi'"
Layer 4: AGENTS.md Anti-Hallucination Policy
Prompt engineering showed limited effect in the paper (90.2% → 87.5% is a marginal reduction), but Codex CLI’s AGENTS.md provides persistent, session-wide context — a stronger delivery mechanism than a one-off prompt 8.
## Tool Use Policy
- NEVER fabricate tool calls. If no available tool matches the task, say so
explicitly and ask the user for guidance.
- Before calling any MCP tool, verify it exists in the current session's tool
list. If uncertain, use the MCP tool discovery mechanism first.
- Prefer built-in tools (read_file, apply_patch, rg, git) over MCP tools
when both can accomplish the task.
- When reasoning effort is set to high or xhigh, double-check tool names
against the schema before invocation.
Layer 5: Approval Mode as the Final Gate
For high-reasoning-effort profiles working with extensive MCP tooling, use on-request approval to catch hallucinated tool calls before execution 9:
[profiles.architect]
model = "gpt-5.5"
model_reasoning_effort = "high"
approval_policy = "on-request"
This introduces friction but provides a human checkpoint exactly where the research predicts the highest hallucination risk.
The Reasoning Effort Decision Framework
| Scenario | Recommended Effort | Rationale |
|---|---|---|
| Routine file edits, simple bug fixes | low or medium | Low complexity, low hallucination risk, fast iteration |
| Multi-file refactoring, feature implementation | medium | Good balance; built-in tools sufficient |
| Architecture decisions, complex debugging | high | Needs deep reasoning; pair with approval mode |
| Security audits, long-horizon tasks | high or xhigh | Maximum capability; full guardrails mandatory |
| CI/CD automation (codex exec) | low | Predictability matters more than creativity |
| MCP-heavy workflows (5+ servers) | medium max | High tool surface amplifies hallucination risk |
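One low-friction way to apply the framework is a set of shell aliases that map each scenario onto the profiles defined in Layer 1. The profile names below are the ones from the config examples earlier; adjust them to your own setup:

```bash
# Illustrative aliases pairing task types with the profiles from Layer 1
alias cx-dev='codex --profile dev'           # routine edits, multi-file refactoring
alias cx-arch='codex --profile architect'    # architecture decisions, complex debugging
alias cx-sec='codex --profile security'      # security audits, long-horizon tasks
alias cx-ci='codex exec --profile ci'        # CI/CD automation
```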
Implications for Third-Party Providers
Teams running Codex CLI with open-weight models via Ollama or custom providers should pay particular attention. The paper measured the effect directly on Qwen and DeepSeek families 1:
- Qwen3-8B with thinking enabled: DT hallucination jumps from 36.2% to 56.8%
- DeepSeek R1-671B vs DeepSeek V3: NTA rate rises from 10.8% to 17.6%
If you use these models locally, the defence layers above become even more critical. OpenAI’s proprietary models (GPT-5.5, GPT-5.2-Codex) likely have internal mitigations not present in open-weight alternatives, but the underlying mechanism — reasoning RL collapsing tool-reliability representations — is model-architecture agnostic 1.
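If you do run one of the measured model families locally, a profile sketch along these lines keeps the effort low and the oversight high. It assumes the model_providers table from the Codex configuration reference and an OpenAI-compatible local endpoint such as Ollama; verify the key names and URL against your installation:

```toml
# Sketch: open-weight reasoning model behind a local, OpenAI-compatible endpoint
[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
wire_api = "chat"

[profiles.local-qwen]
model = "qwen3:8b"
model_provider = "ollama"
model_reasoning_effort = "low"    # measured DT hallucination rises sharply with thinking enabled
approval_policy = "on-request"
```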
Verifying Your Setup
Run a quick smoke test to check whether your configuration is resilient to tool hallucination:
# Ask the agent to perform a task with no matching tool
codex exec --profile architect \
"List all Kubernetes pods in the staging namespace" \
2>/dev/null
# Expected: agent should say it cannot perform this without kubectl/K8s MCP
# Red flag: agent fabricates a tool call like 'kubectl_list_pods'
If your agent invents a tool, tighten the layers above — particularly the AGENTS.md policy and approval mode.
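For repeated checks, the smoke test can be wrapped in a small heuristic script. The pattern matching is deliberately crude, and the assumption that codex exec prints the agent’s final message to stdout should be verified against your CLI version; treat a "REVIEW" result as a prompt to read the transcript, not as a verdict:

```bash
#!/usr/bin/env bash
# hallucination-smoke-test.sh: ask for a task with no matching tool and check
# whether the agent clearly declines. Heuristic only.
set -euo pipefail

profile="${1:-architect}"
prompt="List all Kubernetes pods in the staging namespace"

out="$(codex exec --profile "$profile" "$prompt" 2>/dev/null || true)"

if echo "$out" | grep -qiE 'cannot|can.t|not available|no such tool|not configured'; then
  echo "OK: the agent appears to have declined the task"
else
  echo "REVIEW: no clear decline found; inspect the transcript for a fabricated tool call"
fi
```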
Key Takeaways
- Higher reasoning effort improves task performance but increases tool hallucination. This is a causal, measured effect — not speculation.
- The distractor-tool scenario is worst. Bloated MCP configurations directly map onto this failure mode. Audit your tool surface.
- Prompt engineering alone is insufficient. The paper showed only a 2.7 percentage point improvement. Use structural guardrails: hooks, approval modes, and scoped profiles.
- Right-size reasoning per task. Don’t run xhigh for everything. Match effort to complexity, and pair high effort with proportionally stronger oversight.
- Open-weight models are more exposed. If you’re running Qwen3 or DeepSeek R1 via a custom provider, the hallucination rates are directly measured and significant.
The Reasoning Trap is not a reason to avoid powerful reasoning models — it’s a reason to deploy them thoughtfully. Codex CLI’s layered configuration system provides every tool you need; the discipline is in using them.
Citations
1. Zhuang, Y., Li, Z., Shen, Y., & Xu, C. (2025). “The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination.” ICLR 2026. arXiv:2510.22977
2. OpenAI. “Config basics — Codex.” developers.openai.com/codex/config-basic
3. OpenAI. “Configuration Reference — Codex.” developers.openai.com/codex/config-reference
4. OpenAI. “Features — Codex CLI.” developers.openai.com/codex/cli/features
5. LeanIX Engineering. “Why Your AI Agent Is Drowning in Tools (And How Code Mode Saves It).” engineering.leanix.net/blog/code-mode/
6. OpenAI. “Models — Codex.” developers.openai.com/codex/models
7. OpenAI. “Hooks — Codex.” developers.openai.com/codex/hooks
8. OpenAI. “Custom instructions with AGENTS.md — Codex.” developers.openai.com/codex/guides/agents-md
9. OpenAI. “Agent approvals & security — Codex.” developers.openai.com/codex/agent-approvals-security