Codex CLI Performance Optimisation: Token Overhead, Hidden Costs and Tuning Tactics

Every Codex CLI session burns tokens. Most developers have a rough sense of the cost—prompts in, completions out—but the reality is more nuanced. System prompts, tool definitions, MCP server schemas, reasoning traces and context compaction all consume tokens invisibly. This article dissects where your tokens actually go and provides concrete tactics for cutting waste without sacrificing output quality.

Understanding the Token Budget

Each model exposes a different context window, and understanding the effective capacity—after overhead—is the first step to optimisation.

Model	Input Window	Total Context	Typical Use Case
GPT-5.4	1M	1M	Flagship; recommended default ¹
GPT-5.4-mini	400K	400K	Subagent work; 30% of GPT-5.4 quota ¹
GPT-5.3-Codex	272K	400K	Coding specialist ¹
GPT-5.3-Codex-Spark	128K	128K	Near-instant iteration; 1,000+ tok/s ²
GPT-5.1-Codex-Mini	272K	400K	Cost-sensitive tasks ¹

The input window is what you can send; total context includes the model’s reasoning and output tokens. Your usable capacity is the input window minus all overhead.

Where Your Tokens Go

System Prompt and AGENTS.md

Every API call includes a system prompt composed from Codex’s built-in instructions plus any AGENTS.md tiers you’ve defined. This typically costs 2–5K tokens per turn ³. The good news: OpenAI’s prompt caching means this payload is loaded once and cached thereafter, so subsequent turns in the same session pay a fraction of the original cost ³.

Tool Definitions

Each registered tool—whether built-in or from an MCP server—adds its JSON schema to every request. Built-in tools are relatively lean, but the cost compounds quickly:

~500 tokens per registered tool for built-in Codex tools ³
~200–500 tokens per connected MCP server for server-level overhead ³
~550–1,400 tokens per individual MCP tool definition ⁴

A typical four-server MCP setup adds roughly 7,000 tokens of overhead per turn ⁵. Heavy configurations with five or more servers can burn 50,000+ tokens before you type your first prompt ⁴.

File Reads and Attachments

When Codex reads a file—via @file references or tool calls—the full content enters the context window. A 500-line module might cost 15K tokens of input ³. There is no automatic truncation; if you point Codex at a large file, every byte counts.

Reasoning Traces

Reasoning effort has a dramatic impact on token consumption. At xhigh, Codex uses 3–5× more tokens than medium for the same prompt ³. The reasoning tokens are consumed server-side (they don’t appear in your visible output), but they count against your quota and billing.

graph LR
    A[Your Prompt] --> B[System Prompt\n2-5K tokens]
    B --> C[Tool Definitions\n500-50K tokens]
    C --> D[File Reads\nvariable]
    D --> E[Conversation History\ngrowing]
    E --> F[Reasoning Trace\n1-5x multiplier]
    F --> G[Model Output]

    style B fill:#f9d71c,stroke:#333
    style C fill:#ff6b6b,stroke:#333
    style F fill:#ff6b6b,stroke:#333

Background Retries and Compaction Triggers

When a request fails (rate limit, transient error), Codex retries with exponential backoff ⁶. Each retry re-sends the full context. If you’re near the compaction threshold, a failed request followed by a retry can trigger compaction mid-conversation—consuming additional tokens for the summarisation call itself.

Monitoring: Know What You’re Spending

The `/status` Command

The /status slash command displays your current model, token usage, git branch and sandbox mode ³. Critically, the token count includes all overhead, not just your visible messages ⁵. If costs surprise you, /status is your first diagnostic.

The Status Bar

Codex CLI’s TUI shows a persistent status line with context usage percentage. Watch this during long sessions—when it climbs past 80%, you’re approaching compaction territory.

Token Usage Breakdown (Requested Feature)

As of April 2026, there is no built-in breakdown showing tokens by source (system prompt vs tools vs history vs reasoning). This has been requested as a feature in GitHub issue #13222 ⁷. Until it ships, you’ll need to estimate overhead from your configuration.

Context Compaction: How It Works

When token usage exceeds model_auto_compact_token_limit, Codex triggers automatic history compaction ⁶:

The entire conversation history is sent with a dedicated summarisation prompt
The model produces a compressed summary
A new history is constructed: initial context + recent user messages (up to ~20K tokens) + summary
The old history is replaced

flowchart TD
    A[Token usage exceeds threshold] --> B[Send history + compaction prompt to model]
    B --> C[Model generates summary]
    C --> D[Construct new history]
    D --> E[Initial context + Recent messages ~20K + Summary]
    E --> F[Replace session history]
    F --> G[Continue conversation with freed capacity]

    style A fill:#ff6b6b,stroke:#333
    style G fill:#4ecdc4,stroke:#333

The Quality Trade-Off

Multiple compactions cause cumulative information loss ⁶. Each round discards detail from earlier exchanges. After two or three compactions, the model may lose track of decisions made early in the session. The official documentation explicitly warns that “long conversations and multiple compactions can cause the model to be less accurate” ⁶.

Manual Compaction with `/compact`

You can trigger compaction manually at any time with /compact. This is useful when you’ve finished a phase of work (investigation, say) and want to free tokens before starting implementation. Manual compaction at ~60% context usage gives better results than waiting for the automatic 95% trigger ⁸.

Configuration for Token Control

All settings live in ~/.codex/config.toml (user-level) or .codex/config.toml (project-level).

Essential Token Management Keys

# Override model context window (tokens)
model_context_window = 272000

# Trigger auto-compaction at this token count
model_auto_compact_token_limit = 64000

# Budget per tool output (prevents runaway file reads)
tool_output_token_limit = 12000

# Custom compaction prompt for domain-specific summarisation
compact_prompt = "Summarise focusing on architectural decisions and file paths modified"

# Or load from file (experimental)
experimental_compact_prompt_file = "/path/to/compact_prompt.txt"

# Cap history file size
[history]
max_bytes = 5242880

Setting model_auto_compact_token_limit lower than the default forces earlier compaction. This trades a small summarisation cost for consistently lower context pressure ⁹.

Reasoning Effort Tuning

Match reasoning depth to task complexity rather than using a uniform setting:

# Default: medium for routine tasks
model_reasoning_effort = "medium"

# Higher effort only in plan mode where it matters
plan_mode_reasoning_effort = "high"

# Suppress reasoning summaries when you don't need explanations
model_reasoning_summary = "none"

The minimal reasoning level is only available for GPT-5 models ³. Reserve xhigh for genuinely hard problems—architectural decisions, complex refactors, security audits—where the extra thinking demonstrably improves output.

Verbosity Control

# Reduce output verbosity (GPT-5 Responses API)
model_verbosity = "low"

Lower verbosity produces terser responses, saving output tokens without affecting the quality of code generation ⁹.

Practical Tuning Tactics

1. Audit Your MCP Servers

Every connected MCP server injects its full tool catalogue into every request. Run /status and count your active servers. If you have servers you use infrequently, disable them in your configuration and enable them only when needed ⁵.

For context: a GitHub MCP server with 93 tools consumes approximately 55,000 tokens per turn ⁴. If you only need basic git operations, the built-in tools or a shell command costs a fraction of that.

2. Use Profiles for Different Tasks

Create separate profiles that match task complexity to model and reasoning settings:

# In a [profile.quick] section
[profile.quick]
model = "gpt-5.4-mini"
model_reasoning_effort = "low"
model_verbosity = "low"

# In a [profile.deep] section
[profile.deep]
model = "gpt-5.4"
model_reasoning_effort = "high"
plan_mode_reasoning_effort = "xhigh"

Switch profiles with codex --profile quick for routine tasks and codex --profile deep for complex work ⁹.

3. Start Fresh Threads Between Phases

Don’t drag investigation context into implementation. After finishing research or debugging, start a new session or use /clear to reset. This prevents stale context from consuming tokens during the productive phase ⁸.

4. Prefer `codex exec` for Automation

The codex exec command skips the TUI entirely, reducing overhead in CI/CD pipelines and scripted workflows. Combined with --model gpt-5.4-mini, this is the most token-efficient way to run batch operations ³.

5. Leverage Prompt Caching

Prompt caching provides substantial savings on repeated requests. Cached input tokens are billed at roughly 10% of the uncached rate (e.g., 6.25 credits/1M vs 62.50 credits/1M for GPT-5.4) ¹⁰. To maximise cache hits:

Keep your system prompt and AGENTS.md stable across sessions
Avoid unnecessary changes to tool configurations mid-session
Use --resume to continue sessions rather than starting fresh for related work (noting that a cache regression in v2.1.69 has since been fixed) ⁵

6. Set `tool_output_token_limit`

Large file reads can blow out your context window in a single turn. Setting tool_output_token_limit = 12000 caps the tokens stored per tool output ⁹, forcing Codex to work with summaries rather than entire files. This is particularly valuable when the model reads log files or large generated outputs.

Cost Estimation Reference

For API key users on token-based billing, here are representative costs per task at GPT-5.4 rates ³:

Task	Input Tokens	Output Tokens	Approximate Cost
Explain a 500-line module	~15K	~2K	~$0.25
Refactor auth module (10 files)	~120K	~30K	~$2.25
Full repository audit	~200K	~20K	~$3.00

These figures include overhead. Subscription users (Plus at $20/month, Pro at $200/month) pay through message limits rather than per-token, but the same optimisation tactics extend your effective messages per five-hour window ¹⁰.

The Optimisation Checklist

Before your next long Codex CLI session, run through this:

Check MCP server count — disable any you won’t use
Set model_reasoning_effort appropriate to the task
Configure model_auto_compact_token_limit for early compaction
Set tool_output_token_limit to prevent runaway file reads
Use /status periodically to monitor context usage
Start fresh threads between distinct work phases
Use codex exec for scripted/batch operations

Token optimisation in Codex CLI isn’t about penny-pinching—it’s about maintaining model accuracy by keeping context clean and relevant. A session running at 95% context capacity with three compactions behind it will produce measurably worse output than a fresh session with focused context. Treat your token budget as a quality indicator, not just a cost metric.

Codex CLI Performance Optimisation: Token Overhead, Hidden Costs and Tuning Tactics

Understanding the Token Budget

Where Your Tokens Go

System Prompt and AGENTS.md

Tool Definitions

File Reads and Attachments

Reasoning Traces

Background Retries and Compaction Triggers

Monitoring: Know What You’re Spending

The /status Command

The Status Bar

Token Usage Breakdown (Requested Feature)

Context Compaction: How It Works

The Quality Trade-Off

Manual Compaction with /compact

Configuration for Token Control

Essential Token Management Keys

Reasoning Effort Tuning

Verbosity Control

Practical Tuning Tactics

1. Audit Your MCP Servers

2. Use Profiles for Different Tasks

3. Start Fresh Threads Between Phases

4. Prefer codex exec for Automation

5. Leverage Prompt Caching

6. Set tool_output_token_limit

Cost Estimation Reference

The Optimisation Checklist

Citations

The `/status` Command

Manual Compaction with `/compact`

4. Prefer `codex exec` for Automation

6. Set `tool_output_token_limit`