Dynamic Model Routing in Codex CLI: Mid-Session Switching, /fast Mode, and Service Tier Workflows
Not every turn in a Codex CLI session demands the same model, the same speed, or the same reasoning depth. A planning pass benefits from deep deliberation; a batch of file renames does not. Since v0.117.0, Codex CLI lets you change the model, reasoning effort, fast mode, and service tier within a running session — without losing context or restarting. This article explains every lever, how they interact, and how to compose them into cost-effective workflow patterns.
The Four Routing Levers
Codex CLI exposes four independent controls that together determine what model runs, how hard it thinks, how quickly it responds, and how much it costs per turn.
```mermaid
graph LR
    A["/model"] --> B["Model Identity"]
    C["/fast"] --> D["Speed Tier"]
    E["model_reasoning_effort"] --> F["Thinking Depth"]
    G["service_tier"] --> H["Priority Queue"]
    B --> I["Turn Cost & Quality"]
    D --> I
    F --> I
    H --> I
```
| Lever | Slash Command | Config Key | Values |
|---|---|---|---|
| Model | `/model` | `model` | `gpt-5.4`, `gpt-5.4-mini`, `gpt-5.3-codex`, `gpt-5.3-codex-spark` |
| Fast mode | `/fast on/off/status` | `features.fast_mode`, `service_tier = "fast"` | on / off |
| Reasoning effort | via `/model` picker | `model_reasoning_effort` | `minimal`, `low`, `medium`, `high`, `xhigh` |
| Service tier | CLI flag only | `service_tier` | `fast`, `flex` |
Switching Models Mid-Session with /model
The `/model` command opens a picker listing every model available to your account[1]. Selecting a new model applies it to all subsequent turns in the session — the full conversation history carries forward without truncation or re-encoding[2].
```shell
# Start a session with the default model
codex

# After initial planning, switch to a cheaper model for implementation
/model
# → Select gpt-5.4-mini from the picker
```
When the active provider is a local server (Ollama or LM Studio), `/model` fetches the available model list directly from the running endpoint and presents a searchable picker[3]. Subsequent selections of the same model switch instantly without re-prompting.
What Happens to Context on Switch
The conversation transcript, plan history, and approval decisions persist across model switches[4]. The new model receives the same input context — it simply produces its next response using different weights. This means you can:
- Plan with `gpt-5.4` at `high` reasoning effort
- Switch to `gpt-5.4-mini` for rapid implementation
- Switch back to `gpt-5.4` for a final review pass
Each model sees the full thread. The only constraint is context window size — if you switch to a model with a smaller window, compaction may trigger earlier.
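That window constraint is easy to reason about numerically. A minimal sketch, using the context sizes quoted later in this article (1M for GPT-5.4, 128K for Spark) and a hypothetical transcript token count; this is illustrative logic, not a Codex CLI API:

```python
# Context windows per model, as quoted in this article (illustrative only).
WINDOWS = {"gpt-5.4": 1_000_000, "gpt-5.3-codex-spark": 128_000}

def compaction_likely(transcript_tokens: int, target_model: str) -> bool:
    """True if the running transcript already exceeds the target model's window."""
    return transcript_tokens > WINDOWS[target_model]

print(compaction_likely(300_000, "gpt-5.3-codex-spark"))  # True
print(compaction_likely(300_000, "gpt-5.4"))              # False
```

A 300K-token session fits comfortably in GPT-5.4's window but overflows Spark's, so switching to Spark at that point would compact immediately.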
/fast Mode: 1.5× Speed at 2× Credits
Fast mode is currently exclusive to GPT-5.4[5]. When enabled, the model runs 1.5× faster with identical intelligence and reasoning capability — the speed-up comes from infrastructure prioritisation, not model degradation[6].
Toggling Fast Mode
```shell
/fast on       # Enable fast mode
/fast off      # Disable fast mode
/fast status   # Check current setting
```
Persistent Configuration
To default to fast mode for all sessions:
```toml
# ~/.codex/config.toml
service_tier = "fast"

[features]
fast_mode = true
```
When to Use Fast Mode
Fast mode excels during interactive development where maintaining flow state matters more than credit efficiency[7]:
- Live debugging sessions — rapid iteration on failing tests
- Interactive prototyping — quick feedback on UI changes
- Steer mode corrections — mid-turn steering benefits from faster response
Fast mode is wasteful for background tasks, batch processing, or overnight codex exec runs where latency is irrelevant.
Availability note: Fast mode requires a ChatGPT subscription (Plus or Pro). API key users cannot access fast mode credits and instead use standard API pricing[5].
Service Tiers: fast vs flex
Beyond the `/fast` toggle, Codex CLI supports a `service_tier` configuration that controls the priority queue for API requests[8].
| Tier | Speed | Cost | Best For |
|---|---|---|---|
| `fast` | 1.5× baseline | 2× credits | Interactive sessions, live debugging |
| (default) | Baseline | 1× credits | General development |
| `flex` | Variable (higher latency) | ~50% cheaper | CI/CD pipelines, batch processing, overnight tasks |
The flex tier accepts additional latency in exchange for significant cost savings[9]. It is ideal for non-time-sensitive workloads, for example a CI pipeline profile:
```toml
# Profile for CI pipeline runs
[profiles.ci]
model = "gpt-5.4-mini"
service_tier = "flex"
model_reasoning_effort = "medium"
```
Reasoning Effort: Five Levels of Thinking Depth
The `model_reasoning_effort` parameter controls how many reasoning tokens the model generates before producing its response[10]. Lower effort favours speed and token efficiency; higher effort favours deeper analysis.
| Level | Use Case | Relative Cost |
|---|---|---|
| `minimal` | Boilerplate, simple renames | Lowest |
| `low` | Straightforward CRUD, formatting | Low |
| `medium` | General development (recommended default) | Moderate |
| `high` | Complex refactoring, architecture decisions | High |
| `xhigh` | Deep debugging, security review, benchmark-grade quality | Highest |
GPT-5.4 achieved 57.7% on SWE-Bench Pro in `xhigh` mode[11] — but for routine file operations, `medium` produces equivalent results at a fraction of the cost.
Plan Mode Reasoning Override
Codex CLI supports a separate reasoning effort for plan mode via `plan_mode_reasoning_effort`[12]. When unset, plan mode uses its built-in preset default (currently `medium`). When explicitly set — including to `none` — it overrides the preset:
```toml
# Daily driver config
model_reasoning_effort = "medium"
plan_mode_reasoning_effort = "high"
```
This pattern uses deeper reasoning only during planning, then reverts to standard effort during execution.
Additional Reasoning Controls
Two lesser-known config keys fine-tune reasoning output:
```toml
# Control reasoning summary verbosity
model_reasoning_summary = "concise"  # auto | concise | detailed | none

# Force reasoning metadata for custom providers
model_supports_reasoning_summaries = true
```
Composing the Levers: Workflow Patterns
Pattern 1: The Cost-Conscious Sprint
Start cheap, escalate only when needed.
```mermaid
sequenceDiagram
    participant Dev as Developer
    participant C as Codex CLI
    Dev->>C: /model → gpt-5.4-mini (medium effort)
    Note over C: Implement feature scaffolding
    Dev->>C: /model → gpt-5.4 (high effort)
    Note over C: Review and refactor complex logic
    Dev->>C: /model → gpt-5.4-mini (medium effort)
    Note over C: Write tests and documentation
```
Estimated savings: GPT-5.4-mini costs 18.75 credits/M input tokens vs 62.50 for GPT-5.4[13] — a 70% reduction on implementation and testing turns.
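That 70% figure follows directly from the rate table. A quick check in Python, with rates copied from the credit pricing table later in this article (nothing here is a Codex API):

```python
# Credit rates per million tokens, from this article's pricing table.
RATES = {
    "gpt-5.4":      {"input": 62.50, "output": 375.00},
    "gpt-5.4-mini": {"input": 18.75, "output": 113.00},
}

def saving(kind: str) -> float:
    """Fractional saving from routing a turn to gpt-5.4-mini instead of gpt-5.4."""
    return 1 - RATES["gpt-5.4-mini"][kind] / RATES["gpt-5.4"][kind]

print(f"input saving:  {saving('input'):.0%}")   # 70%
print(f"output saving: {saving('output'):.0%}")  # 70%
```

Conveniently, the output-side saving also lands at roughly 70%, so the headline figure holds whether a turn is input- or output-heavy.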
Pattern 2: The Interactive Deep Session
Maximise speed for flow state, then review.
```toml
# ~/.codex/config.toml
# Start with fast mode for rapid iteration
[features]
fast_mode = true
```

```shell
# Session workflow (switch off mid-session with /fast off for the final review):
codex
# → Rapid prototyping with /fast on, gpt-5.4, medium effort
# → When stuck: /fast off, increase to high effort
# → Final review: /model gpt-5.4, xhigh effort, /fast off
```
Pattern 3: The CI Pipeline Optimiser
Use profiles to hard-code the cheapest viable configuration:
```toml
[profiles.ci]
model = "gpt-5.4-mini"
model_reasoning_effort = "low"
service_tier = "flex"

[profiles.deep-review]
model = "gpt-5.4"
model_reasoning_effort = "xhigh"
```

```shell
# CI runs use the cheap profile
codex --profile ci exec "Run the test suite and fix failures"

# Code review uses the expensive profile
codex --profile deep-review exec review --base main
```
Pattern 4: Spark for Drafts, Flagship for Final
GPT-5.3-Codex-Spark delivers 1,000+ tokens per second on Cerebras WSE-3 hardware[14], making it ideal for rapid draft generation:
```shell
codex
/model
# → Select gpt-5.3-codex-spark
# Generate 3-4 implementation approaches rapidly
/model
# → Select gpt-5.4
# Review and select the best approach
```
Access note: Spark is restricted to ChatGPT Pro subscribers and is text-only with a 128K context window[15].
The Model Routing Decision Matrix
```mermaid
flowchart TD
    A["New Turn"] --> B{"Interactive session?"}
    B -->|Yes| C{"Stuck or complex?"}
    B -->|No: CI/batch| D["gpt-5.4-mini + flex + low"]
    C -->|Yes| E["gpt-5.4 + high/xhigh"]
    C -->|No: routine| F{"Need speed?"}
    F -->|Yes| G["gpt-5.4 + /fast on + medium"]
    F -->|No| H["gpt-5.4-mini + medium"]
    E --> I{"Resolved?"}
    I -->|Yes| F
    I -->|No| J["gpt-5.4 + xhigh + /fast off"]
```
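The same matrix can be written as a small routing helper. This is a sketch of the decision logic only: the function name, parameters, and dict keys are illustrative and mirror the flowchart, not any real Codex CLI interface:

```python
def route(interactive: bool, complex_turn: bool = False,
          need_speed: bool = False, still_stuck: bool = False) -> dict:
    """Map a turn's characteristics to a model/tier/effort choice."""
    if not interactive:                          # CI or batch: cheapest viable path
        return {"model": "gpt-5.4-mini", "tier": "flex", "effort": "low"}
    if complex_turn:                             # stuck or architecturally hard
        effort = "xhigh" if still_stuck else "high"
        return {"model": "gpt-5.4", "effort": effort, "fast": False}
    if need_speed:                               # routine but latency-sensitive
        return {"model": "gpt-5.4", "effort": "medium", "fast": True}
    return {"model": "gpt-5.4-mini", "effort": "medium", "fast": False}

print(route(interactive=False))
# {'model': 'gpt-5.4-mini', 'tier': 'flex', 'effort': 'low'}
```

Encoding the matrix this way makes the escalation path explicit: a turn only reaches the expensive `xhigh` branch after a `high`-effort attempt has failed to resolve it.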
Subagent Model Routing
When using multi-agent workflows, each subagent can specify its own model in its TOML definition file under `.codex/agents/`[16]:
```toml
# .codex/agents/implementer.toml
model = "gpt-5.4-mini"
model_reasoning_effort = "medium"
```

```toml
# .codex/agents/reviewer.toml
model = "gpt-5.4"
model_reasoning_effort = "high"
```
This creates a natural cost hierarchy: cheap models for implementation workers, expensive models for review and orchestration. The `[agents]` config section controls parallelism:
```toml
[agents]
max_threads = 4
max_depth = 1
```
Combined with model routing, a four-subagent swarm using `gpt-5.4-mini` costs roughly the same per million input tokens as a single `gpt-5.4` agent (75 vs 62.50 credits) — but produces four parallel work streams.
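A back-of-the-envelope check of that parity claim, using the input rates from the pricing table below (the real cost depends on how many tokens each subagent actually consumes, so this is a per-million-token comparison only):

```python
MINI_INPUT = 18.75      # gpt-5.4-mini, credits per million input tokens
FLAGSHIP_INPUT = 62.50  # gpt-5.4, credits per million input tokens

swarm_cost = 4 * MINI_INPUT              # four gpt-5.4-mini subagents
print(swarm_cost, FLAGSHIP_INPUT)        # 75.0 62.5
print(f"{swarm_cost / FLAGSHIP_INPUT:.2f}x")  # 1.20x
```

So "roughly the same" means about 20% more input credits per million tokens, in exchange for four parallel work streams.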
Token Economics of Dynamic Routing
The April 2026 credit-based pricing makes the cost difference between routing strategies concrete[13]:
| Model | Input (credits/M) | Output (credits/M) | Cached (credits/M) |
|---|---|---|---|
| GPT-5.4 | 62.50 | 375.00 | 6.25 |
| GPT-5.4-mini | 18.75 | 113.00 | 1.875 |
| GPT-5.3-Codex | 43.75 | 350.00 | 4.375 |
With `/fast` on, all rates double. With `service_tier = "flex"`, rates drop by approximately 50%[9].
Worked example: a 20-turn session producing 200K output tokens (10K per turn), where 14 turns are routine implementation and 6 turns are complex review:
- All GPT-5.4: 200K × 375/M ≈ 75 output credits
- Routed (14×mini + 6×5.4): 140K × 113/M + 60K × 375/M ≈ 38 output credits, a saving of roughly 49%
- Routed + flex on routine turns: 140K × 56.5/M + 60K × 375/M ≈ 30 output credits, a saving of roughly 59%
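The arithmetic can be checked mechanically. A sketch assuming output tokens are spread evenly across the 20 turns, with rates taken from the pricing table above:

```python
OUT = {"gpt-5.4": 375.00, "gpt-5.4-mini": 113.00}  # output credits per million tokens
FLEX_DISCOUNT = 0.5                                 # flex tier: ~50% cheaper
TOKENS_PER_TURN = 10_000                            # 200K tokens / 20 turns

def cost(turns: int, rate: float) -> float:
    """Output credits for `turns` turns at `rate` credits per million tokens."""
    return turns * TOKENS_PER_TURN * rate / 1_000_000

all_flagship = cost(20, OUT["gpt-5.4"])
routed       = cost(14, OUT["gpt-5.4-mini"]) + cost(6, OUT["gpt-5.4"])
routed_flex  = cost(14, OUT["gpt-5.4-mini"] * FLEX_DISCOUNT) + cost(6, OUT["gpt-5.4"])

print(round(all_flagship), round(routed), round(routed_flex))  # 75 38 30
print(f"savings: {1 - routed/all_flagship:.0%}, {1 - routed_flex/all_flagship:.0%}")
# savings: 49%, 59%
```

If review turns consume more tokens than routine turns, the savings shrink accordingly; the even-split assumption is the optimistic case.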
Configuration Precedence
When multiple configuration sources set different models or tiers, Codex CLI resolves them in this order (highest priority first)[17]:
1. Mid-session `/model` or `/fast` command
2. CLI flags (`-m`, `--service-tier`, `-c key=value`)
3. Project config (`.codex/config.toml`)
4. User config (`~/.codex/config.toml`)
5. System/managed config
6. Built-in defaults
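First-match resolution over such a priority list can be sketched in a few lines. The dicts are illustrative stand-ins for already-parsed settings, not Codex internals:

```python
def resolve(key, *sources):
    """Return `key` from the first (highest-priority) source that sets it."""
    for source in sources:          # sources ordered highest priority first
        if source and key in source:
            return source[key]
    return None

session   = {"model": "gpt-5.4-mini"}   # set mid-session via /model
cli_flags = {"model": "gpt-5.4"}        # codex -m gpt-5.4
user_cfg  = {"service_tier": "flex"}    # ~/.codex/config.toml

print(resolve("model", session, cli_flags, user_cfg))         # gpt-5.4-mini
print(resolve("service_tier", session, cli_flags, user_cfg))  # flex
```

Note how the mid-session choice wins for `model` even though a CLI flag also sets it, while `service_tier` falls through to the user config because nothing higher-priority touches it.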
A `/model` switch mid-session overrides everything — including profile settings. This is intentional: the developer at the keyboard always has final say.
Known Limitations
- **No per-turn model in `codex exec`:** The `codex exec` non-interactive mode uses a single model for the entire run. Dynamic switching requires the interactive TUI[18].
- **Spark context ceiling:** GPT-5.3-Codex-Spark has a 128K context window vs 1M for GPT-5.4[15]. Switching to Spark late in a long session may trigger immediate compaction.
- **Fast mode is GPT-5.4 only:** The `/fast` toggle has no effect on other models[5].
- **No reasoning slash command yet:** While `/model` includes a reasoning effort picker, a dedicated `/reasoning` slash command was requested in issue #2106 but has not yet shipped[19].
- ⚠️ **Flex and fast interaction:** The interaction between `service_tier = "flex"` and `/fast on` within the same session is not well-documented. In testing, `/fast on` appears to override the flex tier, but this behaviour may change.