Dynamic Model Routing in Codex CLI: Mid-Session Switching, /fast Mode, and Service Tier Workflows
Not every turn in a Codex CLI session demands the same model, the same speed, or the same reasoning depth. A planning pass benefits from deep deliberation; a batch of file renames does not. Since v0.117.0, Codex CLI lets you change the model, reasoning effort, fast mode, and service tier within a running session — without losing context or restarting. This article explains every lever, how they interact, and how to compose them into cost-effective workflow patterns.
The Four Routing Levers
Codex CLI exposes four independent controls that together determine what model runs, how hard it thinks, how quickly it responds, and how much it costs per turn.
```mermaid
graph LR
    A["/model"] --> B["Model Identity"]
    C["/fast"] --> D["Speed Tier"]
    E["model_reasoning_effort"] --> F["Thinking Depth"]
    G["service_tier"] --> H["Priority Queue"]
    B --> I["Turn Cost & Quality"]
    D --> I
    F --> I
    H --> I
```
| Lever | Slash Command | Config Key | Values |
|---|---|---|---|
| Model | `/model` | `model` | `gpt-5.4`, `gpt-5.4-mini`, `gpt-5.3-codex`, `gpt-5.3-codex-spark` |
| Fast mode | `/fast on/off/status` | `features.fast_mode`, `service_tier = "fast"` | on / off |
| Reasoning effort | via `/model` picker | `model_reasoning_effort` | `minimal`, `low`, `medium`, `high`, `xhigh` |
| Service tier | CLI flag only | `service_tier` | `fast`, `flex` |
Switching Models Mid-Session with /model
The `/model` command opens a picker listing every model available to your account[1]. Selecting a new model applies it to all subsequent turns in the session — the full conversation history carries forward without truncation or re-encoding[2].
```shell
# Start a session with the default model
codex

# After initial planning, switch to a cheaper model for implementation
/model
# → Select gpt-5.4-mini from the picker
```
When the active provider is a local server (Ollama or LM Studio), `/model` fetches the available model list directly from the running endpoint and presents a searchable picker[3]. Subsequent selections of the same model switch instantly without re-prompting.
What Happens to Context on Switch
The conversation transcript, plan history, and approval decisions persist across model switches[4]. The new model receives the same input context — it simply produces its next response using different weights. This means you can:
- Plan with `gpt-5.4` at `high` reasoning effort
- Switch to `gpt-5.4-mini` for rapid implementation
- Switch back to `gpt-5.4` for a final review pass
Each model sees the full thread. The only constraint is context window size — if you switch to a model with a smaller window, compaction may trigger earlier.
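That window constraint is easy to reason about numerically. A minimal sketch, using the context sizes quoted later in this article (1M for GPT-5.4, 128K for Spark) and a hypothetical transcript token count; this is illustrative logic, not a Codex CLI API:

```python
# Context windows per model, as quoted in this article (illustrative only).
WINDOWS = {"gpt-5.4": 1_000_000, "gpt-5.3-codex-spark": 128_000}

def compaction_likely(transcript_tokens: int, target_model: str) -> bool:
    """True if the running transcript already exceeds the target model's window."""
    return transcript_tokens > WINDOWS[target_model]

print(compaction_likely(300_000, "gpt-5.3-codex-spark"))  # True
print(compaction_likely(300_000, "gpt-5.4"))              # False
```

A 300K-token session fits comfortably in GPT-5.4's window but overflows Spark's, so switching to Spark at that point would compact immediately.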
/fast Mode: 1.5× Speed at 2× Credits
Fast mode is currently exclusive to GPT-5.4[5]. When enabled, the model runs 1.5× faster with identical intelligence and reasoning capability — the speed-up comes from infrastructure prioritisation, not model degradation[6].
Toggling Fast Mode
```shell
/fast on       # Enable fast mode
/fast off      # Disable fast mode
/fast status   # Check current setting
```
Persistent Configuration
To default to fast mode for all sessions:
```toml
# ~/.codex/config.toml
service_tier = "fast"

[features]
fast_mode = true
```
When to Use Fast Mode
Fast mode excels during interactive development where maintaining flow state matters more than credit efficiency[7]:
- Live debugging sessions — rapid iteration on failing tests
- Interactive prototyping — quick feedback on UI changes
- Steer mode corrections — mid-turn steering benefits from faster response
Fast mode is wasteful for background tasks, batch processing, or overnight codex exec runs where latency is irrelevant.
Availability note: Fast mode requires a ChatGPT subscription (Plus or Pro). API key users cannot access fast mode credits and instead use standard API pricing[5].
Service Tiers: fast vs flex
Beyond the `/fast` toggle, Codex CLI supports a `service_tier` configuration that controls the priority queue for API requests[8].
| Tier | Speed | Cost | Best For |
|---|---|---|---|
| `fast` | 1.5× baseline | 2× credits | Interactive sessions, live debugging |
| (default) | Baseline | 1× credits | General development |
| `flex` | Variable (higher latency) | ~50% cheaper | CI/CD pipelines, batch processing, overnight tasks |
The flex tier accepts additional latency in exchange for significant cost savings[9]. It is ideal for non-time-sensitive workloads, for example a CI pipeline profile:
```toml
# Profile for CI pipeline runs
[profiles.ci]
model = "gpt-5.4-mini"
service_tier = "flex"
model_reasoning_effort = "medium"
```
Reasoning Effort: Five Levels of Thinking Depth
The `model_reasoning_effort` parameter controls how many reasoning tokens the model generates before producing its response[10]. Lower effort favours speed and token efficiency; higher effort favours deeper analysis.
| Level | Use Case | Relative Cost |
|---|---|---|
| `minimal` | Boilerplate, simple renames | Lowest |
| `low` | Straightforward CRUD, formatting | Low |
| `medium` | General development (recommended default) | Moderate |
| `high` | Complex refactoring, architecture decisions | High |
| `xhigh` | Deep debugging, security review, benchmark-grade quality | Highest |
GPT-5.4 achieved 57.7% on SWE-Bench Pro in `xhigh` mode[11] — but for routine file operations, `medium` produces equivalent results at a fraction of the cost.
Plan Mode Reasoning Override
Codex CLI supports a separate reasoning effort for plan mode via `plan_mode_reasoning_effort`[12]. When unset, plan mode uses its built-in preset default (currently `medium`). When explicitly set — including to `none` — it overrides the preset:
```toml
# Daily driver config
model_reasoning_effort = "medium"
plan_mode_reasoning_effort = "high"
```
This pattern uses deeper reasoning only during planning, then reverts to standard effort during execution.
Additional Reasoning Controls
Two lesser-known config keys fine-tune reasoning output:
```toml
# Control reasoning summary verbosity
model_reasoning_summary = "concise"  # auto | concise | detailed | none

# Force reasoning metadata for custom providers
model_supports_reasoning_summaries = true
```
Composing the Levers: Workflow Patterns
Pattern 1: The Cost-Conscious Sprint
Start cheap, escalate only when needed.
```mermaid
sequenceDiagram
    participant Dev as Developer
    participant C as Codex CLI
    Dev->>C: /model → gpt-5.4-mini (medium effort)
    Note over C: Implement feature scaffolding
    Dev->>C: /model → gpt-5.4 (high effort)
    Note over C: Review and refactor complex logic
    Dev->>C: /model → gpt-5.4-mini (medium effort)
    Note over C: Write tests and documentation
```
Estimated savings: GPT-5.4-mini costs 18.75 credits/M input tokens vs 62.50 for GPT-5.4[13] — a 70% reduction on implementation and testing turns.
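That 70% figure follows directly from the rate table. A quick check in Python, with rates copied from the credit pricing table later in this article (nothing here is a Codex API):

```python
# Credit rates per million tokens, from this article's pricing table.
RATES = {
    "gpt-5.4":      {"input": 62.50, "output": 375.00},
    "gpt-5.4-mini": {"input": 18.75, "output": 113.00},
}

def saving(kind: str) -> float:
    """Fractional saving from routing a turn to gpt-5.4-mini instead of gpt-5.4."""
    return 1 - RATES["gpt-5.4-mini"][kind] / RATES["gpt-5.4"][kind]

print(f"input saving:  {saving('input'):.0%}")   # 70%
print(f"output saving: {saving('output'):.0%}")  # 70%
```

Conveniently, the output-side saving also lands at roughly 70%, so the headline figure holds whether a turn is input- or output-heavy.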
Pattern 2: The Interactive Deep Session
Maximise speed for flow state, then review.
```toml
# ~/.codex/config.toml
# Start with fast mode for rapid iteration
[features]
fast_mode = true
```

```shell
# Session workflow (switch off mid-session with /fast off for the final review):
codex
# → Rapid prototyping with /fast on, gpt-5.4, medium effort
# → When stuck: /fast off, increase to high effort
# → Final review: /model gpt-5.4, xhigh effort, /fast off
```
Pattern 3: The CI Pipeline Optimiser
Use profiles to hard-code the cheapest viable configuration:
```toml
[profiles.ci]
model = "gpt-5.4-mini"
model_reasoning_effort = "low"
service_tier = "flex"

[profiles.deep-review]
model = "gpt-5.4"
model_reasoning_effort = "xhigh"
```

```shell
# CI runs use the cheap profile
codex --profile ci exec "Run the test suite and fix failures"

# Code review uses the expensive profile
codex --profile deep-review exec review --base main
```
Pattern 4: Spark for Drafts, Flagship for Final
GPT-5.3-Codex-Spark delivers 1,000+ tokens per second on Cerebras WSE-3 hardware[14], making it ideal for rapid draft generation:
```shell
codex
/model
# → Select gpt-5.3-codex-spark
# Generate 3-4 implementation approaches rapidly
/model
# → Select gpt-5.4
# Review and select the best approach
```
Access note: Spark is restricted to ChatGPT Pro subscribers and is text-only with a 128K context window[15].
The Model Routing Decision Matrix
```mermaid
flowchart TD
    A["New Turn"] --> B{"Interactive session?"}
    B -->|Yes| C{"Stuck or complex?"}
    B -->|No: CI/batch| D["gpt-5.4-mini + flex + low"]
    C -->|Yes| E["gpt-5.4 + high/xhigh"]
    C -->|No: routine| F{"Need speed?"}
    F -->|Yes| G["gpt-5.4 + /fast on + medium"]
    F -->|No| H["gpt-5.4-mini + medium"]
    E --> I{"Resolved?"}
    I -->|Yes| F
    I -->|No| J["gpt-5.4 + xhigh + /fast off"]
```
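The same matrix can be written as a small routing helper. This is a sketch of the decision logic only: the function name, parameters, and dict keys are illustrative and mirror the flowchart, not any real Codex CLI interface:

```python
def route(interactive: bool, complex_turn: bool = False,
          need_speed: bool = False, still_stuck: bool = False) -> dict:
    """Map a turn's characteristics to a model/tier/effort choice."""
    if not interactive:                          # CI or batch: cheapest viable path
        return {"model": "gpt-5.4-mini", "tier": "flex", "effort": "low"}
    if complex_turn:                             # stuck or architecturally hard
        effort = "xhigh" if still_stuck else "high"
        return {"model": "gpt-5.4", "effort": effort, "fast": False}
    if need_speed:                               # routine but latency-sensitive
        return {"model": "gpt-5.4", "effort": "medium", "fast": True}
    return {"model": "gpt-5.4-mini", "effort": "medium", "fast": False}

print(route(interactive=False))
# {'model': 'gpt-5.4-mini', 'tier': 'flex', 'effort': 'low'}
```

Encoding the matrix this way makes the escalation path explicit: a turn only reaches the expensive `xhigh` branch after a `high`-effort attempt has failed to resolve it.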
Subagent Model Routing
When using multi-agent workflows, each subagent can specify its own model in its TOML definition file under `.codex/agents/`[16]:
```toml
# .codex/agents/implementer.toml
model = "gpt-5.4-mini"
model_reasoning_effort = "medium"
```

```toml
# .codex/agents/reviewer.toml
model = "gpt-5.4"
model_reasoning_effort = "high"
```
This creates a natural cost hierarchy: cheap models for implementation workers, expensive models for review and orchestration. The `[agents]` config section controls parallelism:
```toml
[agents]
max_threads = 4
max_depth = 1
```
Combined with model routing, a four-subagent swarm using `gpt-5.4-mini` costs roughly the same per million input tokens as a single `gpt-5.4` agent (75 vs 62.50 credits) — but produces four parallel work streams.
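A back-of-the-envelope check of that parity claim, using the input rates from the pricing table below (the real cost depends on how many tokens each subagent actually consumes, so this is a per-million-token comparison only):

```python
MINI_INPUT = 18.75      # gpt-5.4-mini, credits per million input tokens
FLAGSHIP_INPUT = 62.50  # gpt-5.4, credits per million input tokens

swarm_cost = 4 * MINI_INPUT              # four gpt-5.4-mini subagents
print(swarm_cost, FLAGSHIP_INPUT)        # 75.0 62.5
print(f"{swarm_cost / FLAGSHIP_INPUT:.2f}x")  # 1.20x
```

So "roughly the same" means about 20% more input credits per million tokens, in exchange for four parallel work streams.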
Token Economics of Dynamic Routing
The April 2026 credit-based pricing makes the cost difference between routing strategies concrete[13]:
| Model | Input (credits/M) | Output (credits/M) | Cached (credits/M) |
|---|---|---|---|
| GPT-5.4 | 62.50 | 375.00 | 6.25 |
| GPT-5.4-mini | 18.75 | 113.00 | 1.875 |
| GPT-5.3-Codex | 43.75 | 350.00 | 4.375 |
With `/fast` on, all rates double. With `service_tier = "flex"`, rates drop by approximately 50%[9].
Worked example: a 20-turn session producing 200K output tokens (10K per turn), where 14 turns are routine implementation and 6 turns are complex review:
- All GPT-5.4: 200K × 375/M ≈ 75 output credits
- Routed (14×mini + 6×5.4): 140K × 113/M + 60K × 375/M ≈ 38 output credits, a saving of roughly 49%
- Routed + flex on routine turns: 140K × 56.5/M + 60K × 375/M ≈ 30 output credits, a saving of roughly 59%
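The arithmetic can be checked mechanically. A sketch assuming output tokens are spread evenly across the 20 turns, with rates taken from the pricing table above:

```python
OUT = {"gpt-5.4": 375.00, "gpt-5.4-mini": 113.00}  # output credits per million tokens
FLEX_DISCOUNT = 0.5                                 # flex tier: ~50% cheaper
TOKENS_PER_TURN = 10_000                            # 200K tokens / 20 turns

def cost(turns: int, rate: float) -> float:
    """Output credits for `turns` turns at `rate` credits per million tokens."""
    return turns * TOKENS_PER_TURN * rate / 1_000_000

all_flagship = cost(20, OUT["gpt-5.4"])
routed       = cost(14, OUT["gpt-5.4-mini"]) + cost(6, OUT["gpt-5.4"])
routed_flex  = cost(14, OUT["gpt-5.4-mini"] * FLEX_DISCOUNT) + cost(6, OUT["gpt-5.4"])

print(round(all_flagship), round(routed), round(routed_flex))  # 75 38 30
print(f"savings: {1 - routed/all_flagship:.0%}, {1 - routed_flex/all_flagship:.0%}")
# savings: 49%, 59%
```

If review turns consume more tokens than routine turns, the savings shrink accordingly; the even-split assumption is the optimistic case.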
Configuration Precedence
When multiple configuration sources set different models or tiers, Codex CLI resolves them in this order (highest priority first)[17]:
1. Mid-session `/model` or `/fast` command
2. CLI flags (`-m`, `--service-tier`, `-c key=value`)
3. Project config (`.codex/config.toml`)
4. User config (`~/.codex/config.toml`)
5. System/managed config
6. Built-in defaults
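First-match resolution over such a priority list can be sketched in a few lines. The dicts are illustrative stand-ins for already-parsed settings, not Codex internals:

```python
def resolve(key, *sources):
    """Return `key` from the first (highest-priority) source that sets it."""
    for source in sources:          # sources ordered highest priority first
        if source and key in source:
            return source[key]
    return None

session   = {"model": "gpt-5.4-mini"}   # set mid-session via /model
cli_flags = {"model": "gpt-5.4"}        # codex -m gpt-5.4
user_cfg  = {"service_tier": "flex"}    # ~/.codex/config.toml

print(resolve("model", session, cli_flags, user_cfg))         # gpt-5.4-mini
print(resolve("service_tier", session, cli_flags, user_cfg))  # flex
```

Note how the mid-session choice wins for `model` even though a CLI flag also sets it, while `service_tier` falls through to the user config because nothing higher-priority touches it.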
A `/model` switch mid-session overrides everything — including profile settings. This is intentional: the developer at the keyboard always has final say.
Known Limitations
- **No per-turn model in `codex exec`:** The `codex exec` non-interactive mode uses a single model for the entire run. Dynamic switching requires the interactive TUI[18].
- **Spark context ceiling:** GPT-5.3-Codex-Spark has a 128K context window vs 1M for GPT-5.4[15]. Switching to Spark late in a long session may trigger immediate compaction.
- **Fast mode is GPT-5.4 only:** The `/fast` toggle has no effect on other models[5].
- **No reasoning slash command yet:** While `/model` includes a reasoning effort picker, a dedicated `/reasoning` slash command was requested in issue #2106 but has not yet shipped[19].
- ⚠️ **Flex and fast interaction:** The interaction between `service_tier = "flex"` and `/fast on` within the same session is not well-documented. In testing, `/fast on` appears to override the flex tier, but this behaviour may change.