<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://codex.danielvaughan.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://codex.danielvaughan.com/" rel="alternate" type="text/html" /><updated>2026-04-07T19:03:31+01:00</updated><id>https://codex.danielvaughan.com/feed.xml</id><title type="html">Codex Blog</title><subtitle>Articles on agentic software engineering with Codex CLI</subtitle><author><name>Daniel Vaughan</name></author><entry><title type="html">AGENTS.md as an Open Standard: Cross-Tool Portability Under Linux Foundation Governance</title><link href="https://codex.danielvaughan.com/2026/04/07/agents-md-open-standard-cross-tool-portability/" rel="alternate" type="text/html" title="AGENTS.md as an Open Standard: Cross-Tool Portability Under Linux Foundation Governance" /><published>2026-04-07T00:00:00+01:00</published><updated>2026-04-07T00:00:00+01:00</updated><id>https://codex.danielvaughan.com/2026/04/07/agents-md-open-standard-cross-tool-portability</id><content type="html" xml:base="https://codex.danielvaughan.com/2026/04/07/agents-md-open-standard-cross-tool-portability/"><![CDATA[<h1 id="agentsmd-as-an-open-standard-cross-tool-portability-under-linux-foundation-governance">AGENTS.md as an Open Standard: Cross-Tool Portability Under Linux Foundation Governance</h1>

<hr />

<p>The AGENTS.md file that sits in your repository root has quietly become the most consequential configuration standard in agentic coding. What began as an OpenAI-originated convention for guiding Codex CLI is now a Linux Foundation project supported by over 25 tools and adopted by more than 60,000 open-source repositories<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. If you are maintaining separate instruction files for each AI coding tool, you are doing unnecessary work. Here is what has changed and how to consolidate.</p>

<h2 id="the-fragmentation-problem">The Fragmentation Problem</h2>

<p>By mid-2025, every major AI coding tool had invented its own instruction format:</p>

<ul>
  <li><strong>Codex CLI</strong>: <code class="language-plaintext highlighter-rouge">AGENTS.md</code></li>
  <li><strong>Claude Code</strong>: <code class="language-plaintext highlighter-rouge">CLAUDE.md</code></li>
  <li><strong>Cursor</strong>: <code class="language-plaintext highlighter-rouge">.cursorrules</code> and <code class="language-plaintext highlighter-rouge">.cursor/rules/</code></li>
  <li><strong>GitHub Copilot</strong>: <code class="language-plaintext highlighter-rouge">.github/copilot-instructions.md</code></li>
  <li><strong>Gemini CLI</strong>: <code class="language-plaintext highlighter-rouge">GEMINI.md</code></li>
  <li><strong>Windsurf</strong>: <code class="language-plaintext highlighter-rouge">.windsurfrules</code></li>
</ul>

<p>Teams using more than one tool — which is most teams — ended up maintaining multiple files with 80% overlapping content<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Worse, instructions would drift between files, producing inconsistent agent behaviour across tools.</p>

<h2 id="the-agentic-ai-foundation">The Agentic AI Foundation</h2>

<p>On 9 December 2025, the Linux Foundation announced the Agentic AI Foundation (AAIF), co-founded by OpenAI, Anthropic, and Block<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. Three founding projects were contributed:</p>

<ul>
  <li><strong>Model Context Protocol (MCP)</strong> — Anthropic’s universal tool integration standard</li>
  <li><strong>goose</strong> — Block’s open-source local-first AI agent framework</li>
  <li><strong>AGENTS.md</strong> — OpenAI’s specification for repository-level agent instructions</li>
</ul>

<p>By February 2026, AAIF had grown to 146 members including JPMorgan Chase, American Express, Red Hat, Autodesk, Huawei, and UiPath, with David Nalley (AWS) appointed as governing board chair<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>

<pre><code class="language-mermaid">graph TD
    A[Linux Foundation] --&gt; B[Agentic AI Foundation - AAIF]
    B --&gt; C[MCP&lt;br/&gt;Anthropic]
    B --&gt; D[AGENTS.md&lt;br/&gt;OpenAI]
    B --&gt; E[goose&lt;br/&gt;Block]
    B --&gt; F[146 Members&lt;br/&gt;Feb 2026]
    F --&gt; G[Gold: JPMorgan, Red Hat,&lt;br/&gt;Autodesk, UiPath, ...]
    F --&gt; H[Silver: 79 organisations]
</code></pre>

<h2 id="who-supports-agentsmd-today">Who Supports AGENTS.md Today</h2>

<p>As of April 2026, over 25 tools read AGENTS.md natively<sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Tool</th>
      <th style="text-align: center">Native Support</th>
      <th style="text-align: center">Auto-Loads</th>
      <th>Tool-Specific File</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Codex CLI</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td>—</td>
    </tr>
    <tr>
      <td><strong>GitHub Copilot</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">Config-dependent</td>
      <td><code class="language-plaintext highlighter-rouge">copilot-instructions.md</code></td>
    </tr>
    <tr>
      <td><strong>Cursor</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td><code class="language-plaintext highlighter-rouge">.cursor/rules/</code></td>
    </tr>
    <tr>
      <td><strong>Gemini CLI / Jules</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td><code class="language-plaintext highlighter-rouge">GEMINI.md</code></td>
    </tr>
    <tr>
      <td><strong>Windsurf</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td><code class="language-plaintext highlighter-rouge">.windsurfrules</code></td>
    </tr>
    <tr>
      <td><strong>Amp</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td>—</td>
    </tr>
    <tr>
      <td><strong>Devin</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td>—</td>
    </tr>
    <tr>
      <td><strong>Aider</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td>—</td>
    </tr>
    <tr>
      <td><strong>OpenCode</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td>—</td>
    </tr>
    <tr>
      <td><strong>goose</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td>—</td>
    </tr>
    <tr>
      <td><strong>JetBrains Junie</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td>—</td>
    </tr>
    <tr>
      <td><strong>Zed</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td>—</td>
    </tr>
    <tr>
      <td><strong>Warp</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td>—</td>
    </tr>
    <tr>
      <td><strong>Factory</strong></td>
      <td style="text-align: center">✅</td>
      <td style="text-align: center">✅</td>
      <td>—</td>
    </tr>
  </tbody>
</table>

<p>Claude Code remains the notable exception: it auto-loads <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> but does not natively read <code class="language-plaintext highlighter-rouge">AGENTS.md</code><sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. A symlink or include workaround is required (see below).</p>

<h2 id="the-specification">The Specification</h2>

<p>AGENTS.md is deliberately minimal. It is a plain Markdown file with no required YAML front matter, no version field, and no schema<sup id="fnref:1:2" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. This simplicity is the point — it lowers the adoption barrier and ensures every Markdown-capable tool can parse it.</p>

<h3 id="recommended-sections">Recommended Sections</h3>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Project Overview</span>
Brief description of the codebase and its architecture.

<span class="gu">## Build and Test Commands</span>
<span class="p">-</span> <span class="sb">`npm run build`</span> — production build
<span class="p">-</span> <span class="sb">`npm test`</span> — unit tests via Vitest
<span class="p">-</span> <span class="sb">`npm run lint`</span> — ESLint + Prettier

<span class="gu">## Code Style</span>
<span class="p">-</span> TypeScript strict mode, no <span class="sb">`any`</span>
<span class="p">-</span> Prefer <span class="sb">`interface`</span> over <span class="sb">`type`</span> for object shapes
<span class="p">-</span> Use named exports

<span class="gu">## Security Boundaries</span>
<span class="p">-</span> Never commit <span class="sb">`.env`</span> files
<span class="p">-</span> All SQL must use parameterised queries
<span class="p">-</span> No shell command construction from user input

<span class="gu">## Git Workflow</span>
<span class="p">-</span> Branch naming: <span class="sb">`feat/`</span>, <span class="sb">`fix/`</span>, <span class="sb">`chore/`</span>
<span class="p">-</span> Squash merges to main
<span class="p">-</span> Conventional Commits format
</code></pre></div></div>

<h3 id="hierarchy-rules">Hierarchy Rules</h3>

<p>AGENTS.md supports nested placement in monorepos. The agent reads the file closest to the file being edited, with explicit user prompts overriding everything<sup id="fnref:1:3" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>repo-root/
├── AGENTS.md                  # Global rules
├── packages/
│   ├── api/
│   │   └── AGENTS.md          # API-specific overrides
│   └── web/
│       └── AGENTS.md          # Frontend-specific overrides
</code></pre></div></div>

<p>Codex CLI additionally supports a global <code class="language-plaintext highlighter-rouge">~/.codex/AGENTS.md</code> for personal defaults that apply across all repositories<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>
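<p>The nearest-file-wins rule is straightforward to emulate in your own tooling. The following Python sketch is illustrative only (it is not drawn from any tool's actual implementation): it walks up from an edited file and collects every <code class="language-plaintext highlighter-rouge">AGENTS.md</code> between it and the repository root, most specific last:</p>

```python
from pathlib import Path

def collect_agents_files(edited_file: str, repo_root: str) -> list[Path]:
    """Return AGENTS.md files from the repo root down to the edited
    file's directory, so later (more specific) entries override
    earlier ones."""
    root = Path(repo_root).resolve()
    found = []
    directory = Path(edited_file).resolve().parent
    while True:
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            found.append(candidate)
        if directory == root or directory == directory.parent:
            break  # stop at the repo root (or the filesystem root)
        directory = directory.parent
    return list(reversed(found))  # root-level first, nearest last
```

<p>Applying instructions in that order reproduces the override behaviour shown in the tree above: global rules first, package-level overrides on top.</p>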

<h2 id="the-cross-tool-strategy">The Cross-Tool Strategy</h2>

<p>The practical recommendation from the community is the <strong>80/20 rule</strong>: write 80% of your instructions in <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, then maintain tool-specific files only for features that require them<sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<h3 id="what-goes-in-agentsmd">What Goes in AGENTS.md</h3>

<p>Everything that is tool-agnostic:</p>

<ul>
  <li>Build, test, and lint commands</li>
  <li>Code style and conventions</li>
  <li>Architecture overview</li>
  <li>Security boundaries</li>
  <li>Git workflow rules</li>
  <li>Domain vocabulary</li>
</ul>

<h3 id="what-stays-in-tool-specific-files">What Stays in Tool-Specific Files</h3>

<p>Features unique to a specific tool:</p>

<ul>
  <li><strong>CLAUDE.md</strong>: MCP server configuration, Claude-specific slash commands</li>
  <li><strong>.cursor/rules/</strong>: MDC-format files with YAML front matter for glob-based activation scoping<sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></li>
  <li><strong>Codex <code class="language-plaintext highlighter-rouge">config.toml</code></strong>: Sandbox modes, approval policies, model selection, hooks — these are runtime configuration, not agent instructions</li>
</ul>

<h3 id="claude-code-workaround">Claude Code Workaround</h3>

<p>Since Claude Code does not yet read <code class="language-plaintext highlighter-rouge">AGENTS.md</code> natively, the cleanest approach is a symlink:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">ln</span> <span class="nt">-s</span> AGENTS.md CLAUDE.md
</code></pre></div></div>

<p>Or, if you need Claude-specific additions, create a <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> that references the shared file:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Claude Code Instructions</span>

See AGENTS.md for project conventions. Additional Claude-specific notes below.

<span class="gu">## MCP Servers</span>
<span class="p">-</span> Use the <span class="sb">`filesystem`</span> MCP server for large directory traversals
</code></pre></div></div>

<h2 id="practical-migration-consolidating-your-files">Practical Migration: Consolidating Your Files</h2>

<p>If your repository currently maintains multiple instruction files, here is the migration path:</p>

<pre><code class="language-mermaid">flowchart LR
    A[.cursorrules] --&gt; D[Extract common&lt;br/&gt;instructions]
    B[CLAUDE.md] --&gt; D
    C[copilot-instructions.md] --&gt; D
    D --&gt; E[AGENTS.md&lt;br/&gt;Single source of truth]
    E --&gt; F[Symlink or&lt;br/&gt;thin wrappers]
    F --&gt; G[CLAUDE.md&lt;br/&gt;Claude-specific only]
    F --&gt; H[.cursor/rules/&lt;br/&gt;Glob-scoped only]
</code></pre>

<ol>
  <li><strong>Audit</strong> existing files and highlight overlapping instructions</li>
  <li><strong>Extract</strong> common content into a single <code class="language-plaintext highlighter-rouge">AGENTS.md</code> at the repository root</li>
  <li><strong>Reduce</strong> tool-specific files to only their unique features</li>
  <li><strong>Symlink</strong> where possible — <code class="language-plaintext highlighter-rouge">CLAUDE.md → AGENTS.md</code> works well when content is 90%+ shared</li>
  <li><strong>Test</strong> by running each tool against the same task and comparing output quality</li>
</ol>
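<p>Step 1 (<strong>Audit</strong>) is mechanical enough to script. This hedged Python sketch, using a <code class="language-plaintext highlighter-rouge">shared_lines</code> helper of its own invention, surfaces the lines that appear in every instruction file and are therefore candidates for extraction into <code class="language-plaintext highlighter-rouge">AGENTS.md</code>:</p>

```python
from pathlib import Path

def shared_lines(paths: list[str]) -> set[str]:
    """Non-blank lines (whitespace-normalised) present in every
    instruction file: prime candidates for moving into AGENTS.md."""
    sets = []
    for p in paths:
        lines = {ln.strip() for ln in Path(p).read_text().splitlines()}
        sets.append({ln for ln in lines if ln})
    return set.intersection(*sets) if sets else set()

# Example:
# shared_lines(["CLAUDE.md", ".cursorrules",
#               ".github/copilot-instructions.md"])
```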

<h3 id="size-guidelines">Size Guidelines</h3>

<p>Keep <code class="language-plaintext highlighter-rouge">AGENTS.md</code> under 500 lines<sup id="fnref:2:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Beyond that, the file starts to consume a meaningful share of the context window, since it is injected into every prompt. For larger projects, use the monorepo hierarchy pattern rather than a single enormous root file. This aligns with the findings in the ETH Zurich study on context pollution, where oversized instruction files degraded agent performance<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>.</p>

<h2 id="enterprise-considerations">Enterprise Considerations</h2>

<p>For organisations adopting AGENTS.md as a cross-team standard:</p>

<ul>
  <li><strong>Version control</strong>: <code class="language-plaintext highlighter-rouge">AGENTS.md</code> should be committed and never <code class="language-plaintext highlighter-rouge">.gitignore</code>d — it is a team artefact, not a personal preference file<sup id="fnref:2:3" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></li>
  <li><strong>CI validation</strong>: Lint your <code class="language-plaintext highlighter-rouge">AGENTS.md</code> in CI to catch formatting issues, overly long files, or missing sections. A simple <code class="language-plaintext highlighter-rouge">wc -l AGENTS.md | awk '{if ($1 &gt; 500) exit 1}'</code> check suffices</li>
  <li><strong>Signed manifests</strong>: GitHub Copilot Enterprise supports GPG-signed context files to prevent tampering<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> — ⚠️ this is an enterprise-only feature and not part of the AGENTS.md specification itself</li>
  <li><strong>Audit logging</strong>: Enterprise tools increasingly log which context files were loaded per session, useful for compliance</li>
</ul>
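<p>The line-count and missing-section checks above can be folded into one CI script. A minimal sketch, assuming a house convention for required sections (the section names below are this example's own choice, not part of the specification):</p>

```python
from pathlib import Path

MAX_LINES = 500
# Hypothetical house convention; adjust to your own template.
REQUIRED_SECTIONS = ["## Build and Test Commands", "## Code Style"]

def validate(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    text = Path(path).read_text()
    errors = []
    if len(text.splitlines()) > MAX_LINES:
        errors.append(f"{path}: exceeds {MAX_LINES} lines")
    for section in REQUIRED_SECTIONS:
        if section not in text:
            errors.append(f"{path}: missing section {section!r}")
    return errors
```

<p>Wire it into CI as a failing step whenever <code class="language-plaintext highlighter-rouge">validate("AGENTS.md")</code> returns a non-empty list.</p>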

<h2 id="what-this-means-for-codex-cli-users">What This Means for Codex CLI Users</h2>

<p>If you are already using Codex CLI with <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, you are already on the standard. The key takeaway is that your <code class="language-plaintext highlighter-rouge">AGENTS.md</code> now works across the entire ecosystem — there is no Codex-specific dialect. When a colleague opens the same repository in Cursor, Gemini CLI, or Copilot, they get the same baseline instructions.</p>

<p>The Codex-specific configuration — sandbox modes, approval policies, model selection, hooks, profiles — belongs in <code class="language-plaintext highlighter-rouge">config.toml</code> and <code class="language-plaintext highlighter-rouge">.codex/agents/</code> TOML files, not in <code class="language-plaintext highlighter-rouge">AGENTS.md</code>. This separation is clean: <code class="language-plaintext highlighter-rouge">AGENTS.md</code> is the <em>what</em> (project conventions), <code class="language-plaintext highlighter-rouge">config.toml</code> is the <em>how</em> (runtime behaviour).</p>

<h2 id="looking-ahead">Looking Ahead</h2>

<p>With 146 member organisations and adoption across 60,000+ repositories, AGENTS.md has crossed the threshold from convention to standard. The AAIF governance model — the same structure that governs Kubernetes, Node.js, and Linux itself — provides the stability that enterprise teams require before committing to a specification<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>

<p>The remaining gap is Claude Code’s lack of native support. Given that Anthropic co-founded AAIF, native <code class="language-plaintext highlighter-rouge">AGENTS.md</code> reading in Claude Code seems likely — but until it ships, the symlink workaround remains necessary.</p>

<p>For senior developers managing multi-tool workflows, the action is straightforward: consolidate into a single <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, keep it under 500 lines, and let the tools converge around you.</p>

<h2 id="citations">Citations</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://agents.md/">AGENTS.md — Official specification site</a>, accessed April 2026. Lists 25+ supported tools and 60,000+ project adoption. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:1:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://smartscope.blog/en/generative-ai/github-copilot/github-copilot-agents-md-guide/">AGENTS.md Cross-Tool Unified Management Guide — SmartScope</a>, February 2026. Recommends the 80/20 shared-base approach and 500-line maximum. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:2:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation">Linux Foundation Announces the Formation of the Agentic AI Foundation (AAIF)</a>, 9 December 2025. Co-founded by OpenAI, Anthropic, and Block with MCP, goose, and AGENTS.md as founding projects. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://www.linuxfoundation.org/press/agentic-ai-foundation-welcomes-97-new-members">Agentic AI Foundation Welcomes 97 New Members — Linux Foundation</a>, 24 February 2026. Total membership reached 146; David Nalley appointed governing board chair. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://thepromptshelf.dev/blog/agents-md-vs-claude-md/">AGENTS.md vs CLAUDE.md: A Practical Guide — The Prompt Shelf</a>, 2026. Details cross-tool behaviour differences and Claude Code’s AGENTS.md gap. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/guides/agents-md">Custom instructions with AGENTS.md — Codex CLI official documentation</a>, accessed April 2026. Documents hierarchy: global → repo root → subdirectory. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="https://danielvaughan.github.io/codex-resources/articles/2026-03-27-agents-md-bloat-problem/">The AGENTS.md Bloat Problem — Codex Resources</a>, 27 March 2026. ETH Zurich study findings on context pollution from oversized instruction files. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p><a href="https://www.msbiro.net/posts/ai-cli-standardization-guidelines/">AI CLI Standardization: From Tool Lock-in to Portability — msbiro.net</a>, 2026. Covers signed manifests, enterprise security features, and audit logging for context files. <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Daniel Vaughan</name></author><summary type="html"><![CDATA[AGENTS.md as an Open Standard: Cross-Tool Portability Under Linux Foundation Governance]]></summary></entry><entry><title type="html">How the Codex CLI Agentic Loop Works in Detail to the Code Level</title><link href="https://codex.danielvaughan.com/2026/04/07/codex-cli-agentic-loop-internals/" rel="alternate" type="text/html" title="How the Codex CLI Agentic Loop Works in Detail to the Code Level" /><published>2026-04-07T00:00:00+01:00</published><updated>2026-04-07T00:00:00+01:00</updated><id>https://codex.danielvaughan.com/2026/04/07/codex-cli-agentic-loop-internals</id><content type="html" xml:base="https://codex.danielvaughan.com/2026/04/07/codex-cli-agentic-loop-internals/"><![CDATA[<h1 id="how-the-codex-cli-agentic-loop-works-in-detail-to-the-code-level">How the Codex CLI Agentic Loop Works in Detail to the Code Level</h1>

<hr />

<p>Every time you type a prompt into Codex CLI, a carefully orchestrated machinery of Rust async tasks, streaming API calls, tool dispatchers, and OS-level sandboxes springs into action. This article traces the complete lifecycle of a single turn through the Codex CLI codebase — from keystroke to committed code — referencing the actual crate structure, key source files, and design decisions that make it work.</p>

<h2 id="the-cargo-workspace-at-a-glance">The Cargo Workspace at a Glance</h2>

<p>Codex CLI ships as a single binary compiled from a Cargo workspace of approximately 84 member crates<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. The crates that matter most for understanding the agentic loop are:</p>

<table>
  <thead>
    <tr>
      <th>Crate</th>
      <th>Responsibility</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">codex-core</code></td>
      <td>Session management, model API communication, tool orchestration</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">codex-protocol</code></td>
      <td>Shared wire types (<code class="language-plaintext highlighter-rouge">Op</code>, <code class="language-plaintext highlighter-rouge">EventMsg</code>, items)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">codex-tui</code></td>
      <td>Interactive terminal UI (Ratatui-based)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">codex-exec</code></td>
      <td>Headless non-interactive execution (<code class="language-plaintext highlighter-rouge">codex exec</code>)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">codex-cli</code></td>
      <td>Multitool dispatcher routing subcommands</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">codex-config</code></td>
      <td>Layered configuration with validation</td>
    </tr>
  </tbody>
</table>

<p>The binary entry point lives in <code class="language-plaintext highlighter-rouge">codex-cli</code>, which delegates to either <code class="language-plaintext highlighter-rouge">codex-tui</code> (interactive) or <code class="language-plaintext highlighter-rouge">codex-exec</code> (headless) after parsing arguments<sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<h2 id="the-submissionevent-architecture">The Submission/Event Architecture</h2>

<p>Codex decouples its user interface from the agent engine using an <strong>asynchronous submission/event queue pattern</strong><sup id="fnref:1:2" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Two primitives define the contract:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">Codex::submit(Op)</code></strong> — clients push operations (user turns, approvals, interrupts) wrapped in <code class="language-plaintext highlighter-rouge">Submission</code> envelopes carrying unique IDs and optional W3C trace context for distributed tracing.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">Codex::next_event()</code></strong> — the engine emits <code class="language-plaintext highlighter-rouge">EventMsg</code> notifications (message deltas, tool status updates, approval requests) back to the UI.</li>
</ul>

<p>This separation means the TUI, the exec harness, and the app-server for IDE integration all consume the same event stream. The <code class="language-plaintext highlighter-rouge">submission_loop</code> runs as a dedicated Tokio task, ensuring linearised state changes whilst supporting concurrent event processing across multiple client connections<sup id="fnref:1:3" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<pre><code class="language-mermaid">sequenceDiagram
    participant User as User / IDE
    participant Sub as Codex::submit()
    participant Loop as submission_loop (Tokio task)
    participant Ctx as ContextManager
    participant API as Responses API (SSE)
    participant Tools as ToolRouter
    participant Evt as Codex::next_event()

    User-&gt;&gt;Sub: Op::UserTurn(prompt)
    Sub-&gt;&gt;Loop: Submission { id, op, trace_ctx }
    Loop-&gt;&gt;Ctx: Record user input, build prompt
    Ctx-&gt;&gt;API: POST /v1/responses (streaming)
    API--&gt;&gt;Loop: SSE: response.output_text.delta
    Loop--&gt;&gt;Evt: EventMsg::TextDelta
    API--&gt;&gt;Loop: SSE: response.output_item.added (tool_call)
    Loop-&gt;&gt;Tools: Dispatch tool call
    Tools--&gt;&gt;Loop: Tool result
    Loop-&gt;&gt;Ctx: Append result to history
    Ctx-&gt;&gt;API: POST /v1/responses (continuation)
    API--&gt;&gt;Loop: SSE: response.completed
    Loop--&gt;&gt;Evt: EventMsg::TurnComplete
    Evt--&gt;&gt;User: Render final output
</code></pre>
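<p>Stripped of Tokio and the real wire types, the decoupling in the diagram reduces to two queues and a single worker. This Python sketch illustrates the pattern only (it is not the <code class="language-plaintext highlighter-rouge">codex-core</code> API), but it shows why several front-ends can share one engine:</p>

```python
import queue
import threading

class Engine:
    """Minimal submission/event engine: clients push ops onto a
    submission queue; one worker thread linearises state changes and
    emits events onto an event queue."""

    def __init__(self):
        self._subs = queue.Queue()
        self._events = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, op):
        self._subs.put(op)            # cf. Codex::submit(Op)

    def next_event(self):
        return self._events.get()     # cf. Codex::next_event()

    def _loop(self):
        while True:
            op = self._subs.get()     # one op at a time: linearised
            self._events.put(("TextDelta", f"echo: {op}"))
            self._events.put(("TurnComplete", None))
```

<p>Any client that can call <code class="language-plaintext highlighter-rouge">submit</code> and drain <code class="language-plaintext highlighter-rouge">next_event</code> (a TUI, a headless harness, an IDE server) gets identical behaviour, which is exactly the property the shared event stream provides.</p>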

<h2 id="thread-and-turn-semantics">Thread and Turn Semantics</h2>

<p>Codex models conversations as a hierarchy of <strong>Threads</strong> and <strong>Turns</strong><sup id="fnref:1:4" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

<ul>
  <li>A <strong>Thread</strong> is a persistent conversation backed by SQLite (<code class="language-plaintext highlighter-rouge">StateDB</code>). Threads survive process restarts and can be resumed, forked, archived, or rolled back.</li>
  <li>A <strong>Turn</strong> is one round-trip cycle: user input triggers model inference, which may produce tool calls whose results feed back into the model until a final assistant message appears.</li>
  <li><strong>Items</strong> are granular events within a turn — agent messages, shell output, file edits, reasoning traces.</li>
</ul>

<p>The <code class="language-plaintext highlighter-rouge">ThreadManager</code> orchestrates multiple <code class="language-plaintext highlighter-rouge">CodexThread</code> instances (a primary agent plus any sub-agents), each maintaining its own <code class="language-plaintext highlighter-rouge">ContextManager</code> for message history and token accounting<sup id="fnref:1:5" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>
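<p>As a rough mental model, the hierarchy can be sketched with plain data types. The class and field names below are this sketch's own, not those of <code class="language-plaintext highlighter-rouge">codex-core</code>:</p>

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    kind: str       # e.g. "agent_message", "shell_output", "file_edit"
    payload: str

@dataclass
class Turn:
    user_input: str
    items: list[Item] = field(default_factory=list)

@dataclass
class Thread:
    thread_id: str
    turns: list[Turn] = field(default_factory=list)

    def start_turn(self, prompt: str) -> Turn:
        """Open a new round-trip cycle within this conversation."""
        turn = Turn(user_input=prompt)
        self.turns.append(turn)
        return turn
```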

<h2 id="prompt-assembly-and-the-responses-api">Prompt Assembly and the Responses API</h2>

<p>Each turn begins with the <code class="language-plaintext highlighter-rouge">ContextManager</code> assembling a prompt for the OpenAI Responses API. The prompt structure follows a strict ordering to maximise cache hits<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>:</p>

<ol>
  <li><strong>System message</strong> — general rules, coding standards</li>
  <li><strong>Tools</strong> — conforming to the Responses API tool schema</li>
  <li><strong>Developer instructions</strong> — from <code class="language-plaintext highlighter-rouge">config.toml</code>, <code class="language-plaintext highlighter-rouge">AGENTS.md</code>, <code class="language-plaintext highlighter-rouge">AGENTS.override.md</code>, and skill-based instructions (subject to a 32 KiB default limit)<sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></li>
  <li><strong>Input sequence</strong> — the full conversation history (text, images, file inputs, tool results)</li>
</ol>

<p>Codex deliberately avoids the <code class="language-plaintext highlighter-rouge">previous_response_id</code> parameter despite the apparent inefficiency of resending the full history each time. This design choice ensures every request is <strong>stateless</strong>, enabling Zero Data Retention (ZDR) compliance for enterprise customers who reject server-side data storage<sup id="fnref:2:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
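<p>A simplified sketch of what statelessness implies for request assembly (the payload shape here is illustrative, not the exact Responses API schema): the stable parts lead, the history trails, and no server-side state is referenced.</p>

```python
def build_request(system_msg, tools, dev_instructions, history):
    """Assemble a self-contained request for one turn.  Stable parts
    (system message, tools, developer instructions) come first to
    maximise prefix-cache hits; the growing history goes last."""
    return {
        "instructions": system_msg,
        "tools": tools,
        "input": [
            {"role": "developer", "content": dev_instructions},
            *history,  # full conversation resent every turn
        ],
        # Deliberately no previous_response_id: the server holds no
        # conversation state, which is what makes ZDR possible.
    }
```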

<p>The API is called via one of three endpoints depending on authentication<sup id="fnref:2:3" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Auth Method</th>
      <th>Endpoint</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ChatGPT login</td>
      <td><code class="language-plaintext highlighter-rouge">chatgpt.com/backend-api/codex/responses</code></td>
    </tr>
    <tr>
      <td>API key</td>
      <td><code class="language-plaintext highlighter-rouge">api.openai.com/v1/responses</code></td>
    </tr>
    <tr>
      <td>Local/OSS models</td>
      <td><code class="language-plaintext highlighter-rouge">localhost:11434/v1/responses</code> (with <code class="language-plaintext highlighter-rouge">--oss</code>)</td>
    </tr>
  </tbody>
</table>

<p>Responses stream back as <strong>Server-Sent Events (SSE)</strong>: <code class="language-plaintext highlighter-rouge">response.output_text.delta</code> events drive incremental UI rendering, whilst <code class="language-plaintext highlighter-rouge">response.output_item.added</code> events signal tool call requests requiring dispatch<sup id="fnref:2:4" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
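<p>As an illustration, the event handling above can be sketched in Python — the two event names match the stream described here, but the payload shapes are simplified assumptions, not the full Responses API schema:</p>

```python
import json

def parse_sse_events(raw: str):
    """Split a raw SSE stream into (event_type, parsed_data) pairs."""
    events = []
    event_type, data_lines = None, []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and event_type:  # blank line terminates one event
            events.append((event_type, json.loads("\n".join(data_lines))))
            event_type, data_lines = None, []
    return events

# Illustrative stream: two text deltas, then a tool-call item.
stream = (
    "event: response.output_text.delta\n"
    'data: {"delta": "Hel"}\n\n'
    "event: response.output_text.delta\n"
    'data: {"delta": "lo"}\n\n'
    "event: response.output_item.added\n"
    'data: {"item": {"type": "function_call", "name": "shell"}}\n\n'
)

text, tool_calls = "", []
for event, payload in parse_sse_events(stream):
    if event == "response.output_text.delta":
        text += payload["delta"]            # drives incremental UI rendering
    elif event == "response.output_item.added":
        tool_calls.append(payload["item"])  # signals a tool call to dispatch

print(text)  # → Hello
```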

<h2 id="tool-dispatch-the-toolrouter">Tool Dispatch: The ToolRouter</h2>

<p>When the model emits a tool call, the <code class="language-plaintext highlighter-rouge">ToolRouter</code> (in <code class="language-plaintext highlighter-rouge">codex-core</code>) classifies and dispatches it to one of three execution backends<sup id="fnref:1:6" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

<h3 id="built-in-shell-tools">Built-in Shell Tools</h3>

<p>Shell commands route through the <code class="language-plaintext highlighter-rouge">UnifiedExecProcessManager</code>, which manages PTY allocation and long-running process lifecycle. The system prompt teaches a <strong>shell-first toolkit</strong> — <code class="language-plaintext highlighter-rouge">cat</code> for reading, <code class="language-plaintext highlighter-rouge">grep</code>/<code class="language-plaintext highlighter-rouge">find</code> for searching, test runners and linters for verification — reserving file mutation for the dedicated <code class="language-plaintext highlighter-rouge">apply_patch</code> envelope<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>

<h3 id="the-apply_patch-system">The apply_patch System</h3>

<p>File modifications use a structured patch format rather than raw shell writes. The binary supports a special invocation mode: when <code class="language-plaintext highlighter-rouge">arg1</code> is <code class="language-plaintext highlighter-rouge">--codex-run-as-apply-patch</code>, the process acts as a virtual patch CLI<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. This ensures all file edits pass through a validated, diffable pathway rather than unconstrained shell writes.</p>
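<p>A minimal sketch of what such a patch envelope looks like, and how a validator might guard it. The grammar shown here is a simplified illustration of the sentinel-delimited format, not the complete apply_patch specification:</p>

```python
# Simplified example of a structured patch envelope (illustrative only):
PATCH = """*** Begin Patch
*** Update File: src/auth.py
@@ def login(user):
-    return check(user)
+    return check_with_jwt(user)
*** End Patch"""

def validate_envelope(patch: str) -> list[str]:
    """Reject input lacking the sentinel lines; return the touched files."""
    lines = patch.splitlines()
    if lines[0] != "*** Begin Patch" or lines[-1] != "*** End Patch":
        raise ValueError("not a valid apply_patch envelope")
    prefixes = ("*** Update File: ", "*** Add File: ", "*** Delete File: ")
    return [l.split(": ", 1)[1] for l in lines if l.startswith(prefixes)]

print(validate_envelope(PATCH))  # → ['src/auth.py']
```

Because every edit arrives in this envelope, the harness can diff, validate, and audit file mutations in a way raw shell writes would not permit.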

<h3 id="mcp-server-integration">MCP Server Integration</h3>

<p>External tools (database queries, API calls, custom integrations) are accessed via the Model Context Protocol. The <code class="language-plaintext highlighter-rouge">McpConnectionManager</code> maintains lifecycle management for MCP servers over stdio or HTTP bridges, routing tool calls through the same approval and sandbox policy as built-in tools<sup id="fnref:1:7" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<pre><code class="language-mermaid">flowchart TD
    TC[Model emits tool_call] --&gt; TR{ToolRouter}
    TR --&gt;|Shell command| APR[Approval Gate]
    TR --&gt;|File edit| APR
    TR --&gt;|MCP tool| APR
    APR --&gt;|Approved| SB{Sandbox Policy}
    APR --&gt;|Denied| DENY[Return denial to model]
    SB --&gt;|DangerFullAccess| EXEC[Execute unrestricted]
    SB --&gt;|WorkspaceWrite| WS[Execute with write ACL]
    SB --&gt;|ReadOnly| RO[Execute read-only]
    EXEC --&gt; RES[Append result to history]
    WS --&gt; RES
    RO --&gt; RES
    RES --&gt; CTX[ContextManager continuation]
    CTX --&gt; API[Next Responses API call]
</code></pre>

<h2 id="the-approval-gate-state-machine">The Approval Gate State Machine</h2>

<p>Before any tool executes, it passes through an approval gate governed by the <code class="language-plaintext highlighter-rouge">AskForApproval</code> enum<sup id="fnref:1:8" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup><sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>Behaviour</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">UnlessTrusted</code></td>
      <td>Auto-approves safe read-only operations; prompts for writes and network access</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">OnRequest</code></td>
      <td>The model itself decides when to request user consent</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Never</code></td>
      <td>No prompts — used in non-interactive <code class="language-plaintext highlighter-rouge">codex exec</code> modes</td>
    </tr>
  </tbody>
</table>

<p>These map to the user-facing approval modes<sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>:</p>

<ul>
  <li><strong>Auto</strong> (default) — reads and workspace-scoped edits proceed; out-of-scope writes and network access require confirmation.</li>
  <li><strong>Read-only</strong> — consultative mode; all mutations require explicit approval.</li>
  <li><strong>Full Access</strong> — unrestricted; use sparingly with trusted repositories.</li>
</ul>
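<p>The gate's decision logic can be sketched as a Python analogue of the Rust enum — the variant names come from the table above, but the function signature and its inputs are hypothetical simplifications:</p>

```python
from enum import Enum

class AskForApproval(Enum):
    UNLESS_TRUSTED = "untrusted"
    ON_REQUEST = "on-request"
    NEVER = "never"

def needs_user_prompt(policy: AskForApproval, is_read_only: bool,
                      model_requested: bool = False) -> bool:
    """Hypothetical sketch: does this tool call require user consent?"""
    if policy is AskForApproval.NEVER:
        return False                # non-interactive `codex exec` modes
    if policy is AskForApproval.ON_REQUEST:
        return model_requested      # the model decides when to ask
    return not is_read_only         # UnlessTrusted: prompt for mutations

assert not needs_user_prompt(AskForApproval.UNLESS_TRUSTED, is_read_only=True)
assert needs_user_prompt(AskForApproval.UNLESS_TRUSTED, is_read_only=False)
assert not needs_user_prompt(AskForApproval.NEVER, is_read_only=False)
```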

<p>Approval state persists across session resumption via SQLite <code class="language-plaintext highlighter-rouge">StateDB</code>, so resuming a thread retains the user’s previous policy decisions<sup id="fnref:1:9" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<h2 id="sandbox-lifecycle-landlock-seatbelt-and-arg0-dispatch">Sandbox Lifecycle: Landlock, Seatbelt, and arg0 Dispatch</h2>

<p>The sandbox is Codex CLI’s most distinctive architectural feature — enforcement happens at the <strong>kernel level</strong>, not the application layer<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>

<h3 id="platform-specific-backends">Platform-Specific Backends</h3>

<table>
  <thead>
    <tr>
      <th>Platform</th>
      <th>Mechanism</th>
      <th>Implementation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Linux</td>
      <td>Landlock LSM (+ optional Bubblewrap pipeline)</td>
      <td><code class="language-plaintext highlighter-rouge">codex-linux-sandbox</code> binary alias</td>
    </tr>
    <tr>
      <td>macOS</td>
      <td>Seatbelt sandbox profiles</td>
      <td>Confined mode via <code class="language-plaintext highlighter-rouge">sandbox-exec</code></td>
    </tr>
    <tr>
      <td>Windows</td>
      <td>Restricted token elevation</td>
      <td>Via WSL2</td>
    </tr>
  </tbody>
</table>

<h3 id="the-arg0-dispatch-pattern">The arg0 Dispatch Pattern</h3>

<p>The entry point wraps the main function in <code class="language-plaintext highlighter-rouge">arg0_dispatch_or_else()</code><sup id="fnref:4:1" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. This function inspects the binary name at invocation time:</p>

<ul>
  <li>If invoked as <strong><code class="language-plaintext highlighter-rouge">codex-linux-sandbox</code></strong>, it immediately executes a sandboxed command using Landlock restrictions without parsing regular CLI arguments.</li>
  <li>Otherwise, it loads environment variables, patches <code class="language-plaintext highlighter-rouge">PATH</code>, and proceeds to normal CLI logic — but crucially, it passes the sandbox executable path downstream so <code class="language-plaintext highlighter-rouge">codex-core</code> can re-invoke itself with restrictions when executing tool calls.</li>
</ul>

<p>This self-referential dispatch pattern means the sandbox helper is embedded within the same binary rather than requiring a separate sidecar process<sup id="fnref:4:2" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>
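<p>The dispatch pattern translates naturally into a short Python sketch — the alias name is the one documented above; the rest is an illustrative analogue of the Rust implementation:</p>

```python
import os, sys

def arg0_dispatch_or_else(main):
    """Sketch of the pattern: behave as the sandbox helper when invoked
    under the alias name, otherwise run the normal CLI entry point."""
    invoked_as = os.path.basename(sys.argv[0])
    if invoked_as == "codex-linux-sandbox":
        # Skip normal CLI argument parsing entirely.
        return run_sandboxed(sys.argv[1:])
    return main()

def run_sandboxed(argv):
    # Stand-in for applying Landlock restrictions and exec'ing the command.
    return f"sandboxed: {' '.join(argv)}"

# Simulate both invocation paths by patching argv:
sys.argv = ["codex-linux-sandbox", "ls", "-la"]
print(arg0_dispatch_or_else(lambda: "normal CLI"))  # → sandboxed: ls -la
sys.argv = ["codex"]
print(arg0_dispatch_or_else(lambda: "normal CLI"))  # → normal CLI
```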

<h3 id="sandbox-policies">Sandbox Policies</h3>

<p>Three policy levels control what the sandbox permits<sup id="fnref:1:10" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">DangerFullAccess</code></strong> — unrestricted filesystem and network access.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">WorkspaceWrite</code></strong> — write access limited to the current working directory and explicitly specified roots.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">ReadOnly</code></strong> — filesystem access restricted to reads within the allowed directory roots; no writes anywhere.</li>
</ul>

<p>Every tool call flows through a centralised execution system in the <code class="language-plaintext highlighter-rouge">ToolOrchestrator</code> that selects the appropriate sandbox based on the current approval mode and the tool’s risk classification<sup id="fnref:4:3" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. You can test sandbox behaviour directly using <code class="language-plaintext highlighter-rouge">codex debug seatbelt</code> or <code class="language-plaintext highlighter-rouge">codex debug landlock</code><sup id="fnref:4:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>
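<p>A hypothetical sketch of how a write check under these policies might behave — the variant names are from the list above, but the path-containment logic is an assumption for illustration:</p>

```python
import os
from enum import Enum

class SandboxPolicy(Enum):
    DANGER_FULL_ACCESS = "danger-full-access"
    WORKSPACE_WRITE = "workspace-write"
    READ_ONLY = "read-only"

def write_allowed(policy: SandboxPolicy, path: str, workspace: str,
                  extra_roots: tuple[str, ...] = ()) -> bool:
    """Illustrative check mirroring the WorkspaceWrite semantics:
    writes only under the working directory or explicit extra roots."""
    if policy is SandboxPolicy.DANGER_FULL_ACCESS:
        return True
    if policy is SandboxPolicy.READ_ONLY:
        return False
    roots = (workspace,) + extra_roots
    target = os.path.abspath(path)
    return any(target.startswith(os.path.abspath(root) + os.sep)
               for root in roots)

assert write_allowed(SandboxPolicy.WORKSPACE_WRITE, "/repo/src/main.rs", "/repo")
assert not write_allowed(SandboxPolicy.WORKSPACE_WRITE, "/etc/passwd", "/repo")
assert not write_allowed(SandboxPolicy.READ_ONLY, "/repo/src/main.rs", "/repo")
```

The crucial difference in the real system is that these checks are enforced by the kernel (Landlock, Seatbelt), not by application code like this.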

<h2 id="context-window-management-and-compaction">Context Window Management and Compaction</h2>

<p>With GPT-5.4’s 1M token context window<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>, Codex can sustain long sessions — but history still grows, and the entire conversation is included in every request<sup id="fnref:2:5" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Two strategies keep this manageable:</p>

<h3 id="prompt-caching">Prompt Caching</h3>

<p>Codex structures prompts so that static content (system instructions, tool definitions) occupies the prefix and variable content (conversation history) appends to the end. With cache hits, sampling cost becomes <strong>linear rather than quadratic</strong><sup id="fnref:2:6" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Empirical measurements show<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Scenario</th>
      <th>Cache Hit Rate</th>
      <th>Median TTFT</th>
      <th>Cost per Request</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Stable prefixes</td>
      <td>85%</td>
      <td>953 ms</td>
      <td>$0.009</td>
    </tr>
    <tr>
      <td>Perturbed prefixes</td>
      <td>0%</td>
      <td>2,727 ms</td>
      <td>$0.033</td>
    </tr>
  </tbody>
</table>

<p>That is a <strong>65% latency reduction and 71% cost reduction</strong> from prefix consistency alone.</p>

<p>Cache misses are triggered by mid-conversation configuration changes: tool availability modifications, model switching, sandbox reconfiguration, approval mode changes, or working directory updates<sup id="fnref:2:7" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
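<p>The prefix discipline can be illustrated schematically — this is the ordering principle only, not the actual wire format:</p>

```python
def build_prompt(system: str, tools: list[str], history: list[str]) -> list[str]:
    """Cache-friendly ordering: static content first, variable content last.
    Any change to the prefix (tools, system text) invalidates the cache."""
    return [system, *sorted(tools), *history]  # sorted → deterministic prefix

turn1 = build_prompt("You are Codex.", ["shell", "apply_patch"], ["user: hi"])
turn2 = build_prompt("You are Codex.", ["shell", "apply_patch"],
                     ["user: hi", "assistant: hello", "user: fix the bug"])

def shared_prefix_len(a: list[str], b: list[str]) -> int:
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

# All of turn1 is a prefix of turn2, so every earlier block is a cache hit:
assert shared_prefix_len(turn1, turn2) == len(turn1)
```

Swapping a tool mid-conversation would change the sorted prefix, drop the shared-prefix length to near zero, and force full re-processing — the "perturbed prefixes" row in the table above.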

<h3 id="automatic-compaction">Automatic Compaction</h3>

<p>Token tracking lives in <code class="language-plaintext highlighter-rouge">codex-rs/core/src/context_manager/history.rs</code>. The <code class="language-plaintext highlighter-rouge">estimate_response_item_model_visible_bytes()</code> function serialises items and applies byte-to-token heuristics, with <code class="language-plaintext highlighter-rouge">Session::recompute_token_usage()</code> in <code class="language-plaintext highlighter-rouge">codex.rs</code> calling <code class="language-plaintext highlighter-rouge">ContextManager::estimate_token_count()</code> to maintain running totals<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>.</p>
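<p>A rough Python analogue of that heuristic — the 4-bytes-per-token constant is an assumption for illustration, not the value used in codex-rs:</p>

```python
import json

BYTES_PER_TOKEN = 4  # assumed heuristic, not the real internal constant

def estimate_tokens(items: list[dict]) -> int:
    """Serialise history items and apply a byte-to-token heuristic, loosely
    mirroring estimate_response_item_model_visible_bytes()."""
    visible_bytes = sum(len(json.dumps(item).encode("utf-8")) for item in items)
    return visible_bytes // BYTES_PER_TOKEN

history = [{"role": "user", "content": "run the tests"},
           {"role": "assistant", "content": "All 42 tests passed."}]
print(estimate_tokens(history))
```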

<p>When usage exceeds <code class="language-plaintext highlighter-rouge">model_auto_compact_token_limit</code> (approximately 95% of the effective window — around 180K–244K tokens depending on the model), auto-compaction triggers<sup id="fnref:9:1" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>. The process, implemented in <code class="language-plaintext highlighter-rouge">codex-rs/core/src/compact.rs</code><sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>:</p>

<ol>
  <li>The full conversation history is sent to the <code class="language-plaintext highlighter-rouge">/responses/compact</code> endpoint with a dedicated summarisation prompt.</li>
  <li>The server generates a structured summary and returns it <strong>AES-encrypted</strong><sup id="fnref:8:1" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>. The encryption keys remain server-side, preventing clients from inspecting or tampering with summaries.</li>
  <li>Write tools are <strong>blocked before compaction</strong> triggers to prevent mid-refactoring conflicts<sup id="fnref:8:2" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>.</li>
  <li>The session rebuilds context as: initial prompt + recent user messages (~20K tokens) + the encrypted summary blob.</li>
  <li>On subsequent requests, OpenAI’s servers decrypt the blob and inject it with a handoff prompt before feeding context to the model.</li>
</ol>

<p>The implementation includes retry logic with exponential backoff for failed compactions, and warns that “long conversations and multiple compactions can cause the model to be less accurate”<sup id="fnref:10:1" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>. Users can also trigger compaction manually via the <code class="language-plaintext highlighter-rouge">/compact</code> slash command.</p>

<pre><code class="language-mermaid">flowchart LR
    A[Token count exceeds threshold] --&gt; B[Block write tools]
    B --&gt; C[Send history to /responses/compact]
    C --&gt; D[Server generates AES-encrypted summary]
    D --&gt; E[Rebuild context: prefix + recent msgs + blob]
    E --&gt; F[Resume normal operation]
    F --&gt; G[Server decrypts blob on next request]
</code></pre>

<h2 id="the-app-server-json-rpc-for-ide-integration">The App Server: JSON-RPC for IDE Integration</h2>

<p>For IDE integration (VS Code, Cursor, JetBrains), the <code class="language-plaintext highlighter-rouge">codex-api</code> crate exposes a <strong>JSON-RPC 2.0 interface over stdio (JSONL)</strong><sup id="fnref:1:11" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup><sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>. The server comprises four main components:</p>

<ol>
  <li><strong>Stdio reader</strong> — parses incoming JSON-RPC calls</li>
  <li><strong>CodexMessageProcessor</strong> — translates between wire protocol and internal types</li>
  <li><strong>Thread manager</strong> — creates, resumes, and forks threads</li>
  <li><strong>Core threads</strong> — the actual <code class="language-plaintext highlighter-rouge">CodexThread</code> instances running the agentic loop</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">EventMsg</code> notifications from the core are translated into JSON-RPC notifications, enabling IDEs to render streaming output, display approval prompts, and show tool execution status in real time<sup id="fnref:11:1" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>.</p>
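<p>The wire format is plain JSON-RPC 2.0 with one message per line. The helper below shows the framing; the method names are hypothetical placeholders, not the actual codex-api surface:</p>

```python
import json

def rpc_request(req_id: int, method: str, params: dict) -> str:
    """One JSON-RPC 2.0 call, framed as a single JSONL line over stdio."""
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params})

def rpc_notification(method: str, params: dict) -> str:
    """Notifications carry no id — the server expects no response."""
    return json.dumps({"jsonrpc": "2.0", "method": method, "params": params})

# Hypothetical method names for illustration only:
line = rpc_request(1, "thread.resume", {"thread_id": "abc123"})
msg = json.loads(line)
assert msg["jsonrpc"] == "2.0" and "id" in msg
assert "id" not in json.loads(rpc_notification("event.output_delta",
                                               {"text": "hi"}))
```

Streaming output and approval prompts arrive as id-less notifications, which is why IDEs can render them incrementally without blocking on a response.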

<h2 id="session-persistence-and-rollout-files">Session Persistence and Rollout Files</h2>

<p>Every session is persisted as compressed JSONL (<code class="language-plaintext highlighter-rouge">.jsonl.zst</code>) files in <code class="language-plaintext highlighter-rouge">~/.codex/sessions/</code> organised by date<sup id="fnref:1:12" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. The <code class="language-plaintext highlighter-rouge">RolloutRecorder</code> filters events based on persistence mode and writes timestamped files enabling:</p>

<ul>
  <li><strong>Resumption</strong> — replay events to restore conversation state</li>
  <li><strong>Forking</strong> — branch a conversation at any point</li>
  <li><strong>Audit trail</strong> — complete operational history for compliance</li>
</ul>

<p>Each rollout file contains session metadata and serialised event items sufficient for full reconstruction<sup id="fnref:1:13" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<h2 id="error-recovery">Error Recovery</h2>

<p>When tool execution fails, the error output is appended to the conversation history and fed back to the model as a tool result. The model then reasons about the failure and decides whether to retry with a modified approach, try an alternative strategy, or report the failure to the user. This is not explicit retry logic in the orchestrator — rather, the model’s own reasoning drives recovery, consistent with the ReAct pattern<sup id="fnref:2:8" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<p>Compaction failures are the exception: <code class="language-plaintext highlighter-rouge">compact.rs</code> implements explicit retry with exponential backoff before falling back to continued operation with the uncompacted history<sup id="fnref:10:2" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>.</p>
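<p>The backoff pattern itself is standard. A generic sketch in the spirit of that retry logic — the attempt counts and delays here are arbitrary, not those used by compact.rs:</p>

```python
import time

def retry_with_backoff(op, max_attempts: int = 4, base_delay: float = 0.01):
    """Retry op() with exponentially increasing sleeps between failures,
    re-raising the final error once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 10 ms, 20 ms, 40 ms, ...

calls = {"n": 0}
def flaky_compaction():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("compaction failed")
    return "compacted"

assert retry_with_backoff(flaky_compaction) == "compacted"
assert calls["n"] == 3  # succeeded on the third attempt
```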

<h2 id="comparative-architecture-claude-code">Comparative Architecture: Claude Code</h2>

<p>For context, Claude Code takes a fundamentally different approach to several of these concerns<sup id="fnref:7:1" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>:</p>

<ul>
  <li><strong>Sandbox</strong>: Application-layer hooks with 17 lifecycle event types (e.g., <code class="language-plaintext highlighter-rouge">PreToolUse</code> on Bash) rather than kernel-level enforcement.</li>
  <li><strong>Context</strong>: 200K token window (vs. Codex’s 1M) compensated by codebase retrieval and cascading <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> hierarchy.</li>
  <li><strong>Multi-agent</strong>: Interactive subagent spawning via Task tool with real-time synthesis, versus Codex’s fire-and-forget cloud delegation supporting up to 6 concurrent threads.</li>
</ul>

<p>Both approaches are valid — Codex optimises for security-first isolation and large-context reasoning; Claude Code optimises for flexible programmable hooks and retrieval-augmented generation.</p>

<h2 id="key-takeaways">Key Takeaways</h2>

<p>The Codex CLI agentic loop is not a simple prompt-response cycle. It is a production-grade async runtime with kernel-level sandboxing, encrypted context compaction, stateless API design for ZDR compliance, and a self-referential binary that re-invokes itself to enforce sandbox restrictions. Understanding these internals is essential for anyone building custom harnesses, debugging unexpected behaviour, or extending Codex through MCP servers and skills.</p>

<hr />

<h2 id="citations">Citations</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://deepwiki.com/openai/codex">Architecture Overview — openai/codex — DeepWiki</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:1:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:1:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:1:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:1:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a> <a href="#fnref:1:7" class="reversefootnote" role="doc-backlink">&#8617;<sup>8</sup></a> <a href="#fnref:1:8" class="reversefootnote" role="doc-backlink">&#8617;<sup>9</sup></a> <a href="#fnref:1:9" class="reversefootnote" role="doc-backlink">&#8617;<sup>10</sup></a> <a href="#fnref:1:10" class="reversefootnote" role="doc-backlink">&#8617;<sup>11</sup></a> <a href="#fnref:1:11" class="reversefootnote" role="doc-backlink">&#8617;<sup>12</sup></a> <a href="#fnref:1:12" class="reversefootnote" role="doc-backlink">&#8617;<sup>13</sup></a> <a href="#fnref:1:13" class="reversefootnote" role="doc-backlink">&#8617;<sup>14</sup></a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://www.zenml.io/llmops-database/building-production-ready-ai-agents-openai-codex-cli-architecture-and-agent-loop-design">Building Production-Ready AI Agents: OpenAI Codex CLI Architecture and Agent Loop Design — ZenML</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:2:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:2:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:2:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:2:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a> <a href="#fnref:2:7" class="reversefootnote" role="doc-backlink">&#8617;<sup>8</sup></a> <a href="#fnref:2:8" class="reversefootnote" role="doc-backlink">&#8617;<sup>9</sup></a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://blakecrosley.com/guides/codex">Codex CLI: The Definitive Technical Reference — Blake Crosley</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://deepwiki.com/openai/codex/6.3-configuration-management">Sandboxing and Security Policies — openai/codex — DeepWiki</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:4:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:4:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:4:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security — Codex — OpenAI Developers</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://pierce.dev/notes/a-deep-dive-on-agent-sandboxes">A deep dive on agent sandboxes — Pierce Freeman</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="https://blakecrosley.com/blog/codex-vs-claude-code-2026">Codex CLI vs Claude Code in 2026: Architecture Deep Dive — Blake Crosley</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:7:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p><a href="https://tonylee.im/en/blog/codex-compaction-encrypted-summary-session-handover/">How Codex Solves the Compaction Problem Differently — Tony Lee</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:8:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:8:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:9" role="doc-endnote">
      <p><a href="https://gist.github.com/badlogic/cd2ef65b0697c4dbe2d13fbecb0a0a5f">Context Compaction Research: Claude Code, Codex CLI, OpenCode, Amp — GitHub Gist</a> <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:9:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:10" role="doc-endnote">
      <p><a href="https://community.openai.com/t/automatically-compacting-context/1376290">Automatically compacting context — OpenAI Developer Community</a> <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:10:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:10:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:11" role="doc-endnote">
      <p><a href="https://openai.com/index/unlocking-the-codex-harness/">Unlocking the Codex harness: how we built the App Server — OpenAI</a> <a href="#fnref:11" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:11:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
  </ol>
</div>]]></content><author><name>Daniel Vaughan</name></author><summary type="html"><![CDATA[How the Codex CLI Agentic Loop Works in Detail to the Code Level]]></summary></entry><entry><title type="html">Codex CLI Competitive Position April 2026: The Road to Parity with Claude Code</title><link href="https://codex.danielvaughan.com/2026/04/07/codex-cli-competitive-position-april-2026/" rel="alternate" type="text/html" title="Codex CLI Competitive Position April 2026: The Road to Parity with Claude Code" /><published>2026-04-07T00:00:00+01:00</published><updated>2026-04-07T00:00:00+01:00</updated><id>https://codex.danielvaughan.com/2026/04/07/codex-cli-competitive-position-april-2026</id><content type="html" xml:base="https://codex.danielvaughan.com/2026/04/07/codex-cli-competitive-position-april-2026/"><![CDATA[<h1 id="codex-cli-competitive-position-april-2026-the-road-to-parity-with-claude-code">Codex CLI Competitive Position April 2026: The Road to Parity with Claude Code</h1>

<hr />

<p>The AI coding agent market has consolidated rapidly. Three products — Claude Code, GitHub Copilot, and Cursor — now control over 70% of a market worth an estimated $4 billion annually<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Codex CLI, backed by GPT-5.3-Codex and a thriving open-source community, sits firmly in Tier 1 alongside Claude Code. This article examines where Codex CLI stands in April 2026, where it leads, where it trails, and whether the parity trajectory holds.</p>

<h2 id="market-landscape-the-april-2026-tier-list">Market Landscape: The April 2026 Tier List</h2>

<p>TokenCalculator’s April 2026 ranking divides the field into three tiers<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Tool</th>
      <th>Positioning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Tier 1 — Leaders</strong></td>
      <td>Claude Code (Anthropic)</td>
      <td>Best agentic reasoning, largest context window</td>
    </tr>
    <tr>
      <td> </td>
      <td>OpenAI Codex (CLI + App)</td>
      <td>Best sandbox, background agents, open-source CLI</td>
    </tr>
    <tr>
      <td><strong>Tier 2 — Strong Contenders</strong></td>
      <td>Cursor 3</td>
      <td>Best interactive IDE experience</td>
    </tr>
    <tr>
      <td> </td>
      <td>GitHub Copilot</td>
      <td>Enterprise distribution, Microsoft integration</td>
    </tr>
    <tr>
      <td><strong>Tier 3 — Falling Behind</strong></td>
      <td>Google Antigravity</td>
      <td>Promising launch, stalled roadmap</td>
    </tr>
    <tr>
      <td> </td>
      <td>Windsurf (Cognition)</td>
      <td>Niche positioning</td>
    </tr>
  </tbody>
</table>

<p>Claude Code dominates developer sentiment with a 46% “most loved” rating versus 19% for Cursor and just 9% for Copilot<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. It has captured 41% market share among professional developers, overtaking Copilot’s 38% in barely eight months since launch<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. In the agentic coding subcategory specifically, 71% of developers who regularly use AI agents use Claude Code<sup id="fnref:3:2" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>

<p>Codex, meanwhile, has grown to over 2 million weekly active users as of March 2026, with token throughput up fivefold since the GPT-5.3-Codex launch in February<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. Enterprise adoption includes Cisco, Nvidia, Ramp, Rakuten, and Harvey<sup id="fnref:4:1" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>

<h2 id="benchmark-comparison-specialisation-not-supremacy">Benchmark Comparison: Specialisation, Not Supremacy</h2>

<p>The benchmarks tell a nuanced story of specialisation rather than outright dominance by either tool<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Benchmark</th>
      <th>GPT-5.3-Codex</th>
      <th>Opus 4.6 (Claude)</th>
      <th>Winner</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SWE-Bench Pro</td>
      <td>56.8%</td>
      <td>—</td>
      <td>—</td>
    </tr>
    <tr>
      <td>SWE-Bench Verified</td>
      <td>80.0% (GPT-5.2)</td>
      <td>80.8%</td>
      <td>Claude (marginal)</td>
    </tr>
    <tr>
      <td>Terminal-Bench 2.0 (model)</td>
      <td>75.1%</td>
      <td>65.4%</td>
      <td><strong>Codex</strong></td>
    </tr>
    <tr>
      <td>Terminal-Bench 2.0 (framework)</td>
      <td>77.3%</td>
      <td>69.9%</td>
      <td><strong>Codex</strong></td>
    </tr>
    <tr>
      <td>OSWorld-Verified</td>
      <td>64.7%</td>
      <td>72.7%</td>
      <td>Claude</td>
    </tr>
    <tr>
      <td>GDPval-AA (knowledge work)</td>
      <td>—</td>
      <td>+144 Elo</td>
      <td>Claude</td>
    </tr>
  </tbody>
</table>

<p>GPT-5.3-Codex leads decisively on terminal and CLI automation tasks — the bread and butter of Codex CLI’s design philosophy<sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup><sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. Opus 4.6 leads on GUI automation, knowledge work, and the headline SWE-Bench Verified metric<sup id="fnref:5:2" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. The gap on SWE-Bench Verified is vanishingly small (0.8 percentage points), but Claude Code’s advantage on complex reasoning tasks remains meaningful.</p>

<p>Direct comparison is complicated by reporting differences: OpenAI publishes SWE-Bench Pro scores whilst Anthropic reports Verified scores, making like-for-like analysis difficult<sup id="fnref:5:3" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>.</p>

<h2 id="where-codex-cli-leads">Where Codex CLI Leads</h2>

<h3 id="kernel-level-sandboxing">Kernel-Level Sandboxing</h3>

<p>Codex CLI’s security model is architecturally distinct. On Linux, it uses bubblewrap with seccomp filters and Landlock LSM for filesystem isolation. On macOS, it enforces Seatbelt policies via <code class="language-plaintext highlighter-rouge">sandbox-exec</code><sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>. Network access is disabled by default, significantly reducing prompt injection and data exfiltration risks<sup id="fnref:7:1" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Full-auto mode with kernel sandbox — no approval gates</span>
codex <span class="nt">--full-auto</span> <span class="s2">"refactor auth module to use JWT"</span>

<span class="c"># The sandbox restricts:</span>
<span class="c"># - Network access (disabled by default)</span>
<span class="c"># - Filesystem access (workspace only)</span>
<span class="c"># - Process spawning (filtered by seccomp)</span>
</code></pre></div></div>

<p>Claude Code, by contrast, relies on application-layer hooks for security<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>. For regulated industries and CI/CD pipelines, Codex CLI’s OS-enforced isolation is a genuine differentiator.</p>

<h3 id="token-efficiency">Token Efficiency</h3>

<p>GPT-5.3-Codex consumes roughly a quarter of the tokens Claude Code needs for equivalent tasks<sup id="fnref:8:1" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>. At scale, this translates directly into cost savings. For the 80% of solo developers doing moderate daily work, Codex CLI at $20/month is the better value<sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
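
<p>To make that concrete, here is a quick back-of-envelope sketch. The task size and monthly task count are illustrative assumptions; only the roughly-4x ratio comes from the benchmark claim:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Back-of-envelope sketch: illustrative numbers, not published pricing.
# Assume one large task costs Claude Code ~200,000 tokens; the roughly
# 4x efficiency claim puts Codex at about a quarter of that.
claude_tokens=200000
codex_tokens=$((claude_tokens / 4))
echo "codex tokens per task: $codex_tokens"     # 50000

# Across a hypothetical 100 such tasks a month, the gap compounds:
echo "tokens saved monthly:  $(( (claude_tokens - codex_tokens) * 100 ))"   # 15000000
</code></pre></div></div>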

<h3 id="background-agents-and-cloud-execution">Background Agents and Cloud Execution</h3>

<p>Codex’s background agent model — define a task, hand it off, review the branch later — is a genuine workflow innovation<sup id="fnref:2:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. The sandboxed cloud execution environment produces polished, PR-ready output<sup id="fnref:2:3" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<h3 id="open-source-community">Open-Source Community</h3>

<p>Codex CLI is Apache 2.0 licensed with 67,000+ GitHub stars and 400+ contributors<sup id="fnref:8:2" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>. This has spawned a healthy fork ecosystem, most notably <strong>Every Code</strong> (<code class="language-plaintext highlighter-rouge">just-every/code</code>, 3,700+ stars), which adds multi-model orchestration across OpenAI, Claude, and Gemini providers, browser integration, Auto Drive multi-agent automation, and background auto-review via ghost-commit watchers<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>.</p>

<h2 id="where-claude-code-leads">Where Claude Code Leads</h2>

<h3 id="context-window-and-multi-file-reasoning">Context Window and Multi-File Reasoning</h3>

<p>Opus 4.6 offers a 200K standard context window with a 1M-token beta, compared to GPT-5.3-Codex’s 400K standard<sup id="fnref:5:4" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. ⚠️ Effective context utilisation varies by task, and raw window size is not always the binding constraint. However, for large monorepo refactoring — where changes cascade across frontend, backend, database, and test layers — Claude Code’s ability to hold more context and reason about complex interactions gives it a measurable edge<sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>.</p>

<h3 id="implicit-convention-understanding">Implicit Convention Understanding</h3>

<p>Claude Code demonstrates stronger understanding of implicit project conventions — coding styles, architectural patterns, and team-specific idioms that are not explicitly documented<sup id="fnref:2:4" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. This “naturalness” in tool usage patterns makes it feel more like a senior pair programmer and less like a script executor.</p>

<h3 id="agent-coordination">Agent Coordination</h3>

<p>Claude Code’s Agent Teams feature enables direct agent-to-agent communication for parallel task execution<sup id="fnref:10:1" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>. Codex CLI supports subagents for task parallelisation but lacks equivalent cross-agent coordination<sup id="fnref:10:2" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>. For orchestrating complex, multi-step workflows that require handoffs between specialised agents, Claude Code is ahead.</p>

<h2 id="the-cursor-3-factor">The Cursor 3 Factor</h2>

<p>Cursor 3 launched on 2 April 2026 with a fundamental architectural pivot from IDE-with-AI to agent-first workspace<sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>. The new Agents Window provides a centralised command hub for managing multi-step, autonomous tasks. Key capabilities include:</p>

<ul>
  <li>Parallel cloud agents for simultaneous task execution</li>
  <li>Multi-repo support with seamless local/cloud handoff</li>
  <li>Design Mode for visual development workflows</li>
  <li>Integrated browsing, plugin, and PR tooling<sup id="fnref:11:1" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup></li>
</ul>

<pre><code class="language-mermaid">graph LR
    A[Developer] --&gt; B{Primary Workflow}
    B --&gt;|Complex reasoning&lt;br/&gt;Multi-file refactors| C[Claude Code]
    B --&gt;|Autonomous batch work&lt;br/&gt;CI/CD, DevOps| D[Codex CLI]
    B --&gt;|Interactive editing&lt;br/&gt;Visual development| E[Cursor 3]
    C --&gt; F[Production Branch]
    D --&gt; F
    E --&gt; F
</code></pre>

<p>The strategic significance is that Cursor’s pivot validates the agentic model that Claude Code and Codex CLI pioneered. Cursor 3 comes as Claude Code reportedly holds 54% of the agentic coding market<sup id="fnref:12" role="doc-noteref"><a href="#fn:12" class="footnote" rel="footnote">12</a></sup>, suggesting Cursor is playing catch-up in this segment whilst leveraging its IDE-native advantage.</p>

<h2 id="the-parity-trajectory">The Parity Trajectory</h2>

<p>TokenCalculator’s analysis suggests Codex could pull even with Claude Code by mid-2026 if current trends continue<sup id="fnref:2:5" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Several factors support this:</p>

<ol>
  <li><strong>Model velocity</strong>: GPT-5.3-Codex is 25% faster than its predecessor with fewer tokens consumed<sup id="fnref:6:1" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. GPT-5.4 has already been announced<sup id="fnref:13" role="doc-noteref"><a href="#fn:13" class="footnote" rel="footnote">13</a></sup>, suggesting rapid iteration continues.</li>
  <li><strong>Adoption momentum</strong>: From 1 million downloads to 2 million weekly active users in under two months<sup id="fnref:4:2" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</li>
  <li><strong>Enterprise traction</strong>: Named enterprise deployments at Cisco, Nvidia, and others signal institutional confidence<sup id="fnref:4:3" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</li>
  <li><strong>Open-source moat</strong>: The fork ecosystem (Every Code, Open Codex, and others) creates a gravitational pull that proprietary tools cannot replicate.</li>
</ol>

<p>Against parity, several structural advantages favour Claude Code:</p>

<ol>
  <li><strong>Reasoning depth</strong>: The GDPval-AA Elo gap (+144) reflects genuine architectural differences in reasoning capability<sup id="fnref:5:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>.</li>
  <li><strong>Market momentum</strong>: 41% market share and $2.5 billion ARR provide resources for rapid iteration<sup id="fnref:3:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</li>
  <li><strong>Developer love</strong>: A 46% “most loved” rating creates retention that is difficult to overcome<sup id="fnref:3:4" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</li>
</ol>

<pre><code class="language-mermaid">graph TD
    A[Q1 2026: Claude Code leads] --&gt; B[Q2 2026: Projected convergence zone]
    B --&gt; C{Mid-2026 outcome}
    C --&gt;|Codex catches up| D[Parity: specialisation-based market split]
    C --&gt;|Claude maintains gap| E[Duopoly: Claude for quality, Codex for efficiency]
    C --&gt;|Cursor disrupts| F[Three-way race with IDE-native advantage]
</code></pre>

<h2 id="practical-guidance">Practical Guidance</h2>

<p>For teams choosing today, the data supports a multi-tool strategy:</p>

<table>
  <thead>
    <tr>
      <th>Workflow</th>
      <th>Recommended Tool</th>
      <th>Rationale</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Autonomous background tasks</td>
      <td>Codex CLI (<code class="language-plaintext highlighter-rouge">--full-auto</code>)</td>
      <td>Kernel sandbox, token efficiency, PR-ready output</td>
    </tr>
    <tr>
      <td>Complex multi-file refactors</td>
      <td>Claude Code</td>
      <td>Larger context, stronger cross-file reasoning</td>
    </tr>
    <tr>
      <td>Interactive development</td>
      <td>Cursor 3</td>
      <td>IDE-native experience, parallel agents</td>
    </tr>
    <tr>
      <td>CI/CD pipeline integration</td>
      <td>Codex CLI (<code class="language-plaintext highlighter-rouge">codex exec</code>)</td>
      <td>OS-level isolation, deterministic execution</td>
    </tr>
    <tr>
      <td>Enterprise with Microsoft stack</td>
      <td>GitHub Copilot</td>
      <td>Distribution, compliance, SSO integration</td>
    </tr>
  </tbody>
</table>

<p>The “best developers use both” pattern identified by multiple analysts<sup id="fnref:8:3" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> is not a hedge — it reflects genuine specialisation in the tools. Codex CLI’s Unix-philosophy approach (do one thing well, in a sandbox, with maximum efficiency) complements Claude Code’s deep-reasoning, convention-aware approach.</p>

<h2 id="what-to-watch">What to Watch</h2>

<ul>
  <li><strong>GPT-5.4’s coding benchmarks</strong>: Will the next model close the SWE-Bench Verified and OSWorld gaps?</li>
  <li><strong>Codex CLI Agent Teams equivalent</strong>: Cross-agent coordination is the most significant feature gap.</li>
  <li><strong>Every Code’s trajectory</strong>: If the fork ecosystem consolidates around multi-model orchestration, it could reshape the competitive dynamics entirely.</li>
  <li><strong>Google Antigravity</strong>: Three months of silence after a promising January launch. Either a pivot is coming or the product is being deprioritised.</li>
</ul>

<hr />

<h2 id="citations">Citations</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://sevenolives.com/blog/ai-coding-agents-4-billion-market-consolidation-2026">The $4 Billion Coding Agent Market Just Consolidated — Seven Olives</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://tokencalculator.com/blog/best-ai-ide-cli-tools-april-2026-claude-code-wins">Best AI IDE &amp; CLI Tools April 2026 — TokenCalculator</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:2:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:2:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:2:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://byteiota.com/claude-code-hits-41-share-overtakes-copilots-38/">Claude Code Hits 41% Share, Overtakes Copilot’s 38% — byteiota</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:3:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:3:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:3:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://fortune.com/2026/03/04/openai-codex-growth-enterprise-ai-agents/">OpenAI sees Codex users spike to 1.6 million — Fortune</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:4:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:4:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://smartscope.blog/en/generative-ai/chatgpt/codex-vs-claude-code-2026-benchmark/">Codex CLI vs Claude Code 2026: Opus 4.6 vs GPT-5.3-Codex Compared — SmartScope</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:5:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:5:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:5:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:5:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://openai.com/index/introducing-gpt-5-3-codex/">Introducing GPT-5.3-Codex — OpenAI</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/agent-approvals-security">Agent approvals &amp; security — Codex CLI OpenAI Developers</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:7:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p><a href="https://www.nxcode.io/resources/news/claude-code-vs-codex-cli-terminal-coding-comparison-2026">Claude Code vs Codex CLI 2026: Which Terminal AI Coding Agent Wins? — NxCode</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:8:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:8:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:8:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:9" role="doc-endnote">
      <p><a href="https://github.com/just-every/code">Every Code — GitHub</a> <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:10" role="doc-endnote">
      <p><a href="https://particula.tech/blog/codex-vs-claude-code-cli-agent-comparison">Codex vs Claude Code: Which CLI Agent Wins for Your Workflow — Particula</a> <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:10:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:10:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:11" role="doc-endnote">
      <p><a href="https://creati.ai/ai-news/2026-04-06/cursor-3-agent-first-interface-claude-code-codex/">Cursor Launches Agent-First Cursor 3 Interface — Creati.ai</a> <a href="#fnref:11" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:11:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:12" role="doc-endnote">
      <p><a href="https://www.implicator.ai/cursor-3-shifts-to-agent-orchestration-as-claude-code-claims-54-of-coding-market/">Cursor 3 Shifts to Agent Orchestration Amid Market Pressure — Implicator</a> <a href="#fnref:12" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:13" role="doc-endnote">
      <p><a href="https://openai.com/index/introducing-gpt-5-4/">Introducing GPT-5.4 — OpenAI</a> <a href="#fnref:13" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Daniel Vaughan</name></author><summary type="html"><![CDATA[Codex CLI Competitive Position April 2026: The Road to Parity with Claude Code]]></summary></entry><entry><title type="html">Codex CLI Diagnostic Toolkit: Tracing, Sandbox Testing, and the Built-In Debugging Commands</title><link href="https://codex.danielvaughan.com/2026/04/07/codex-cli-diagnostic-toolkit-tracing-sandbox-testing/" rel="alternate" type="text/html" title="Codex CLI Diagnostic Toolkit: Tracing, Sandbox Testing, and the Built-In Debugging Commands" /><published>2026-04-07T00:00:00+01:00</published><updated>2026-04-07T00:00:00+01:00</updated><id>https://codex.danielvaughan.com/2026/04/07/codex-cli-diagnostic-toolkit-tracing-sandbox-testing</id><content type="html" xml:base="https://codex.danielvaughan.com/2026/04/07/codex-cli-diagnostic-toolkit-tracing-sandbox-testing/"><![CDATA[<h1 id="codex-cli-diagnostic-toolkit-tracing-sandbox-testing-and-the-built-in-debugging-commands">Codex CLI Diagnostic Toolkit: Tracing, Sandbox Testing, and the Built-In Debugging Commands</h1>

<p>Codex CLI ships with a surprisingly deep set of diagnostic tools that most developers never discover. When an agent session stalls, a sandbox blocks a legitimate command, or a config key silently fails to take effect, knowing how to reach for <code class="language-plaintext highlighter-rouge">RUST_LOG</code>, <code class="language-plaintext highlighter-rouge">codex sandbox</code>, or <code class="language-plaintext highlighter-rouge">/debug-config</code> can save hours of guesswork. This article is a systematic reference to every built-in diagnostic surface in Codex CLI as of v0.118.0.</p>

<h2 id="the-diagnostic-surface-area">The Diagnostic Surface Area</h2>

<p>Codex CLI’s diagnostic capabilities span four layers: runtime tracing via environment variables, interactive slash commands inside the TUI, standalone CLI subcommands for offline testing, and post-session analysis via JSONL rollout files.</p>

<pre><code class="language-mermaid">graph TD
    A[Codex CLI Diagnostics] --&gt; B[Runtime Tracing]
    A --&gt; C[TUI Slash Commands]
    A --&gt; D[Standalone Subcommands]
    A --&gt; E[Post-Session Analysis]

    B --&gt; B1["RUST_LOG env var"]
    B --&gt; B2["LOG_FORMAT=json"]
    B --&gt; B3["OpenTelemetry export"]

    C --&gt; C1["/status"]
    C --&gt; C2["/debug-config"]
    C --&gt; C3["/feedback"]

    D --&gt; D1["codex sandbox"]
    D --&gt; D2["codex execpolicy check"]
    D --&gt; D3["codex debug"]
    D --&gt; D4["codex login status"]

    E --&gt; E1["JSONL rollout files"]
    E --&gt; E2["codex-tui.log"]
</code></pre>

<h2 id="runtime-tracing-with-rust_log">Runtime Tracing with RUST_LOG</h2>

<p>Since Codex CLI is built in Rust atop the standard <code class="language-plaintext highlighter-rouge">tracing</code> crate<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, the <code class="language-plaintext highlighter-rouge">RUST_LOG</code> environment variable controls verbosity at module granularity. The default level for Codex crates is <code class="language-plaintext highlighter-rouge">info</code><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<h3 id="basic-usage">Basic Usage</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Global debug logging</span>
<span class="nv">RUST_LOG</span><span class="o">=</span>debug codex

<span class="c"># Trace-level logging (extremely verbose)</span>
<span class="nv">RUST_LOG</span><span class="o">=</span>trace codex

<span class="c"># Debug logging in non-interactive mode</span>
<span class="nv">RUST_LOG</span><span class="o">=</span>debug codex <span class="nb">exec</span> <span class="s2">"refactor the auth module"</span>
</code></pre></div></div>

<h3 id="module-targeted-tracing">Module-Targeted Tracing</h3>

<p>The real power lies in per-module targeting. Codex’s Rust workspace exposes several key tracing targets<sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Debug the core agent loop while keeping everything else at info</span>
<span class="nv">RUST_LOG</span><span class="o">=</span>info,codex_core<span class="o">=</span>debug codex

<span class="c"># Trace shell command execution specifically</span>
<span class="nv">RUST_LOG</span><span class="o">=</span><span class="nv">codex_exec</span><span class="o">=</span>trace,codex_core<span class="o">=</span>debug codex

<span class="c"># Debug sandbox behaviour</span>
<span class="nv">RUST_LOG</span><span class="o">=</span><span class="nv">codex_sandbox</span><span class="o">=</span>debug,codex_process_hardening<span class="o">=</span>debug codex

<span class="c"># Trace API request/response details</span>
<span class="nv">RUST_LOG</span><span class="o">=</span>codex_core::api<span class="o">=</span>trace codex

<span class="c"># Debug MCP server connections</span>
<span class="nv">RUST_LOG</span><span class="o">=</span>codex_core::mcp<span class="o">=</span>debug codex

<span class="c"># Trace configuration resolution</span>
<span class="nv">RUST_LOG</span><span class="o">=</span>codex_core::config<span class="o">=</span>trace codex

<span class="c"># Trace authentication flows</span>
<span class="nv">RUST_LOG</span><span class="o">=</span>codex_core::auth<span class="o">=</span>trace codex
</code></pre></div></div>

<h3 id="structured-log-output">Structured Log Output</h3>

<p>For machine-parseable logs — useful when piping into log aggregation — set the format to JSON<sup id="fnref:2:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">RUST_LOG</span><span class="o">=</span>debug <span class="nv">LOG_FORMAT</span><span class="o">=</span>json codex <span class="nb">exec</span> <span class="s2">"run tests"</span> 2&gt;&amp;1 | <span class="nb">tee </span>codex-debug.log
</code></pre></div></div>

<p>The compact format is also available via <code class="language-plaintext highlighter-rouge">RUST_LOG_FORMAT=compact</code><sup id="fnref:2:3" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
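
<p>One practical payoff of JSON output is that ordinary text tools can triage a session log. A minimal sketch (it assumes one JSON object per line with a top-level <code class="language-plaintext highlighter-rouge">level</code> field; verify the shape against your own output first):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Surface only warnings and errors from a captured debug log.
# Assumes one JSON object per line with a "level" field, e.g.
#   {"timestamp":"...","level":"ERROR","fields":{"message":"..."}}
grep -E '"level":"(WARN|ERROR)"' codex-debug.log
</code></pre></div></div>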

<h3 id="log-file-locations">Log File Locations</h3>

<p>Codex writes TUI logs to <code class="language-plaintext highlighter-rouge">~/.codex/log/codex-tui.log</code>, with automatic rotation<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. In <code class="language-plaintext highlighter-rouge">codex exec</code> mode, timestamped log files appear at <code class="language-plaintext highlighter-rouge">~/.codex/logs/codex-tui-&lt;timestamp&gt;.log</code><sup id="fnref:2:4" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. These can be safely deleted when no longer needed, but they are invaluable for post-mortem debugging.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Monitor logs in real time during a session</span>
<span class="nb">tail</span> <span class="nt">-f</span> ~/.codex/logs/codex-tui-<span class="k">*</span>.log
</code></pre></div></div>

<p>⚠️ <strong>Performance warning</strong>: Debug and trace levels can reduce throughput by 10–50%<sup id="fnref:2:5" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Reserve them for active troubleshooting, not production workflows.</p>

<h2 id="tui-slash-commands-for-live-diagnostics">TUI Slash Commands for Live Diagnostics</h2>

<p>Three slash commands provide in-session diagnostic information without leaving the TUI.</p>

<h3 id="status--session-overview">/status — Session Overview</h3>

<p>The <code class="language-plaintext highlighter-rouge">/status</code> command displays the current session configuration and token usage<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. This is your first stop when something feels off — it confirms which model is active, the current reasoning effort level, token consumption, and the effective sandbox mode.</p>

<h3 id="debug-config--configuration-layer-diagnostics">/debug-config — Configuration Layer Diagnostics</h3>

<p>When a config key appears to have no effect, <code class="language-plaintext highlighter-rouge">/debug-config</code> reveals the full configuration resolution stack<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. It prints:</p>

<ul>
  <li>Layer order (lowest to highest precedence)</li>
  <li>The effective value of each key and which layer set it</li>
  <li>Policy details: <code class="language-plaintext highlighter-rouge">allowed_approval_policies</code>, <code class="language-plaintext highlighter-rouge">allowed_sandbox_modes</code>, <code class="language-plaintext highlighter-rouge">mcp_servers</code>, <code class="language-plaintext highlighter-rouge">rules</code>, <code class="language-plaintext highlighter-rouge">enforce_residency</code>, and <code class="language-plaintext highlighter-rouge">experimental_network</code></li>
</ul>

<p>This is particularly useful in enterprise environments where <code class="language-plaintext highlighter-rouge">requirements.toml</code> may silently override your <code class="language-plaintext highlighter-rouge">config.toml</code> settings<sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. If your <code class="language-plaintext highlighter-rouge">sandbox_mode = "danger-full-access"</code> is being ignored, <code class="language-plaintext highlighter-rouge">/debug-config</code> will show you that a managed policy is enforcing <code class="language-plaintext highlighter-rouge">workspace-write</code>.</p>
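
<p>A minimal illustration of the kind of conflict <code class="language-plaintext highlighter-rouge">/debug-config</code> surfaces. The managed-layer file location varies by deployment, and the fragment below is shown only to sketch the precedence, not as a verbatim policy file:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ~/.codex/config.toml (user layer, lower precedence)
sandbox_mode = "danger-full-access"

# requirements.toml (managed layer, higher precedence; path varies by deployment)
allowed_sandbox_modes = ["workspace-write"]
</code></pre></div></div>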

<h3 id="feedback--structured-bug-reports">/feedback — Structured Bug Reports</h3>

<p>The <code class="language-plaintext highlighter-rouge">/feedback</code> command collects diagnostic information and submits it directly to OpenAI’s maintainers<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. When invoked, it captures:</p>

<ul>
  <li>Request ID (essential for OpenAI support tickets)</li>
  <li>Session ID</li>
  <li>Connection status (connected/reconnecting/disconnected)</li>
  <li>Last error message</li>
  <li>Active tools count</li>
  <li>MCP server connection status</li>
</ul>

<p>Always run <code class="language-plaintext highlighter-rouge">/feedback</code> before closing a session that exhibited unexpected behaviour — the request ID is the single most useful datum when filing issues on GitHub<sup id="fnref:3:2" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>

<h2 id="the-codex-sandbox-subcommand">The codex sandbox Subcommand</h2>

<p>The <code class="language-plaintext highlighter-rouge">codex sandbox</code> subcommand<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> lets you test arbitrary commands under the exact sandbox enforcement that Codex applies during agent sessions, without starting one. This is indispensable when diagnosing why a build tool or test runner fails under sandboxing.</p>

<h3 id="platform-specific-syntax">Platform-Specific Syntax</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># macOS — test a command under Seatbelt enforcement</span>
codex sandbox macos <span class="nt">--</span> npm run build

<span class="c"># macOS — with full-auto permissions and denial logging</span>
codex sandbox macos <span class="nt">--full-auto</span> <span class="nt">--log-denials</span> <span class="nt">--</span> cargo <span class="nb">test</span>

<span class="c"># Linux — test under Landlock/bubblewrap enforcement</span>
codex sandbox linux <span class="nt">--</span> pytest tests/

<span class="c"># Linux — full-auto mode (workspace-write equivalent)</span>
codex sandbox linux <span class="nt">--full-auto</span> <span class="nt">--</span> make <span class="nb">install</span>

<span class="c"># Windows — test under restricted token enforcement</span>
codex sandbox windows <span class="nt">--full-auto</span> <span class="nt">--</span> dotnet <span class="nb">test</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">--log-denials</code> flag on macOS is particularly valuable: it prints every Seatbelt denial to stderr, showing exactly which filesystem path or network operation was blocked<sup id="fnref:6:1" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>

<h3 id="legacy-aliases">Legacy Aliases</h3>

<p>The older <code class="language-plaintext highlighter-rouge">codex debug seatbelt</code> and <code class="language-plaintext highlighter-rouge">codex debug landlock</code> commands still work as aliases<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># These are equivalent:</span>
codex sandbox macos <span class="nt">--</span> <span class="nb">ls</span> /etc
codex debug seatbelt <span class="nt">--</span> <span class="nb">ls</span> /etc
</code></pre></div></div>

<h3 id="practical-use-diagnosing-build-failures">Practical Use: Diagnosing Build Failures</h3>

<p>A common scenario: your Rust project builds fine outside Codex but fails under the agent’s sandbox. Use <code class="language-plaintext highlighter-rouge">codex sandbox</code> to isolate the issue:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Step 1: Test the build under sandbox</span>
codex sandbox linux <span class="nt">--</span> cargo build 2&gt;&amp;1 | <span class="nb">grep</span> <span class="nt">-i</span> denied

<span class="c"># Step 2: If failures appear, try with full-auto (workspace-write)</span>
codex sandbox linux <span class="nt">--full-auto</span> <span class="nt">--</span> cargo build

<span class="c"># Step 3: If it still fails, the likely culprit is network access</span>
<span class="c"># (e.g., crates.io downloads blocked by sandbox)</span>
</code></pre></div></div>

<p>This workflow avoids the cost of starting a full agent session just to debug sandbox restrictions.</p>

<h3 id="platform-implementation-details">Platform Implementation Details</h3>

<p>On macOS 12+, <code class="language-plaintext highlighter-rouge">codex sandbox</code> invokes Apple’s Seatbelt framework via <code class="language-plaintext highlighter-rouge">/usr/bin/sandbox-exec</code> with a runtime-generated profile controlling filesystem and network access<sup id="fnref:6:2" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. On Linux, the sandbox uses a dual-mode pipeline: Landlock LSM by default, or bubblewrap (vendored in <code class="language-plaintext highlighter-rouge">codex-rs/vendor/bubblewrap/</code>) when enabled via <code class="language-plaintext highlighter-rouge">features.use_linux_sandbox_bwrap = true</code><sup id="fnref:6:3" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. The bubblewrap path provides stronger isolation through PID namespace separation (<code class="language-plaintext highlighter-rouge">--unshare-pid</code>), network namespace isolation (<code class="language-plaintext highlighter-rouge">--unshare-net</code>), and seccomp filters<sup id="fnref:6:4" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>

<pre><code class="language-mermaid">flowchart LR
    subgraph macOS
        A[codex sandbox macos] --&gt; B[sandbox-exec]
        B --&gt; C[Seatbelt profile]
        C --&gt; D[Command runs isolated]
    end

    subgraph Linux
        E[codex sandbox linux] --&gt; F{bwrap enabled?}
        F --&gt;|Yes| G[bubblewrap]
        F --&gt;|No| H[Landlock + seccomp]
        G --&gt; I[Namespace isolation]
        H --&gt; I
        I --&gt; J[Command runs isolated]
    end
</code></pre>
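<p>Enabling the bubblewrap backend is a one-line configuration change using the feature flag named above (written here as a <code class="language-plaintext highlighter-rouge">[features]</code> table, the TOML equivalent of the dotted-key form):</p>

```toml
# ~/.codex/config.toml — opt in to the bubblewrap sandbox backend on Linux
[features]
use_linux_sandbox_bwrap = true
```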

<h2 id="the-codex-execpolicy-check-subcommand">The codex execpolicy check Subcommand</h2>

<p>Before deploying Starlark <code class="language-plaintext highlighter-rouge">.rules</code> files, validate them offline with <code class="language-plaintext highlighter-rouge">codex execpolicy check</code><sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>. This subcommand evaluates one or more rule files against a proposed command and reports the decision without executing anything.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Test a command against your rules</span>
codex execpolicy check <span class="se">\</span>
  <span class="nt">--pretty</span> <span class="se">\</span>
  <span class="nt">--rules</span> ~/.codex/rules/default.rules <span class="se">\</span>
  <span class="nt">--</span> gh <span class="nb">pr </span>view 7888 <span class="nt">--json</span> title,body,comments
</code></pre></div></div>

<p>The output shows:</p>

<ul>
  <li><strong>Effective decision</strong>: the strictest severity across all matched rules (<code class="language-plaintext highlighter-rouge">forbidden</code> &gt; <code class="language-plaintext highlighter-rouge">prompt</code> &gt; <code class="language-plaintext highlighter-rouge">allow</code>)<sup id="fnref:8:1" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup></li>
  <li><strong>matchedRules</strong>: every rule whose prefix matched, with the exact <code class="language-plaintext highlighter-rouge">matchedPrefix</code> shown<sup id="fnref:8:2" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup></li>
</ul>
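<p>The severity ordering is simple enough to sketch. The shell function below reproduces it for illustration only; the real evaluation happens inside the execpolicy engine:</p>

```bash
# Minimal sketch of the documented precedence (forbidden > prompt > allow):
# rank each decision and keep the strictest across all matched rules.
strictest() {
  local best=0 d rank
  for d in "$@"; do
    case "$d" in
      forbidden) rank=2 ;;
      prompt)    rank=1 ;;
      *)         rank=0 ;;   # treat anything else as "allow"
    esac
    if (( rank > best )); then best=$rank; fi
  done
  case "$best" in
    2) echo forbidden ;;
    1) echo prompt ;;
    *) echo allow ;;
  esac
}

strictest allow prompt   # prints "prompt"
```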

<p>You can combine multiple rule files:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>codex execpolicy check <span class="se">\</span>
  <span class="nt">--pretty</span> <span class="se">\</span>
  <span class="nt">--rules</span> ~/.codex/rules/default.rules <span class="se">\</span>
  <span class="nt">--rules</span> .codex/rules/project.rules <span class="se">\</span>
  <span class="nt">--</span> <span class="nb">rm</span> <span class="nt">-rf</span> node_modules
</code></pre></div></div>

<h3 id="unit-tests-in-rules-files">Unit Tests in Rules Files</h3>

<p>The <code class="language-plaintext highlighter-rouge">match</code> and <code class="language-plaintext highlighter-rouge">not_match</code> fields in <code class="language-plaintext highlighter-rouge">prefix_rule()</code> function as inline unit tests<sup id="fnref:8:3" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>. Codex validates these examples when it loads your rules — if a <code class="language-plaintext highlighter-rouge">match</code> example does not trigger the rule, or a <code class="language-plaintext highlighter-rouge">not_match</code> example does, loading fails. Always populate these fields:</p>

<pre><code class="language-python">prefix_rule(
    pattern = "rm -rf",
    decision = "forbidden",
    match = ["rm -rf /", "rm -rf node_modules"],
    not_match = ["rm file.txt", "rmdir empty"]
)
</code></pre>


<h2 id="the-codex-debug-subcommand">The codex debug Subcommand</h2>

<p>The <code class="language-plaintext highlighter-rouge">codex debug</code> command is the entry point for lower-level debugging utilities<sup id="fnref:7:1" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># List available debug subcommands</span>
codex debug <span class="nt">--help</span>

<span class="c"># Test the V2 app-server protocol with a single message</span>
codex debug app-server send-message-v2 <span class="s2">"Hello, world"</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">send-message-v2</code> subcommand initialises the app-server, starts a thread, sends a single user message, and streams all server notifications back to the terminal<sup id="fnref:7:2" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>. This is useful for verifying that the app-server protocol is functioning correctly without starting the full TUI.</p>

<h2 id="authentication-diagnostics">Authentication Diagnostics</h2>

<p>When sessions fail to start with authentication errors, two commands help isolate the issue:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Check current auth state without triggering a login flow</span>
codex login status

<span class="c"># Inspect the auth token file directly</span>
<span class="nb">cat</span> ~/.codex/auth.json | jq <span class="s1">'.expires_at'</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">codex login status</code> command reports whether you are authenticated, the method used (browser OAuth, device code, or API key), and whether the token is valid<sup id="fnref:7:3" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>. A common failure pattern is a corrupted or expired <code class="language-plaintext highlighter-rouge">auth.json</code> file — the fix is to run <code class="language-plaintext highlighter-rouge">codex logout</code> followed by <code class="language-plaintext highlighter-rouge">codex login</code><sup id="fnref:3:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>
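<p>For scripting, the same expiry check can be automated. The snippet below is a sketch that assumes <code class="language-plaintext highlighter-rouge">expires_at</code> is stored as an ISO-8601 UTC timestamp (the field queried above); it uses a stand-in file so it is self-contained, so point it at <code class="language-plaintext highlighter-rouge">~/.codex/auth.json</code> in practice:</p>

```bash
# Stand-in for ~/.codex/auth.json, for illustration only
auth=$(mktemp)
printf '%s\n' '{"expires_at":"2020-01-01T00:00:00Z"}' > "$auth"

# Extract the expires_at field without jq, then compare against now.
expires=$(sed -n 's/.*"expires_at":"\([^"]*\)".*/\1/p' "$auth")
now=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# ISO-8601 UTC timestamps compare correctly as plain strings
if [[ "$expires" < "$now" ]]; then
  echo "token expired at $expires"
fi
rm -f "$auth"
```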

<h2 id="opentelemetry-integration">OpenTelemetry Integration</h2>

<p>For production observability beyond ad-hoc tracing, Codex CLI supports OpenTelemetry export via the <code class="language-plaintext highlighter-rouge">[otel]</code> config section<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[otel]</span>
<span class="py">enabled</span> <span class="p">=</span> <span class="kc">true</span>
<span class="py">endpoint</span> <span class="p">=</span> <span class="s">"http://localhost:4317"</span>
<span class="py">sampling_ratio</span> <span class="p">=</span> <span class="mf">1.0</span>
<span class="py">service_name</span> <span class="p">=</span> <span class="s">"codex-cli"</span>
</code></pre></div></div>

<p>This exports spans covering API calls, tool invocations, and sandbox operations to any OTLP-compatible backend (Jaeger, Grafana Tempo, SigNoz)<sup id="fnref:9:1" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>. Environment variables <code class="language-plaintext highlighter-rouge">OTEL_EXPORTER_OTLP_ENDPOINT</code> and <code class="language-plaintext highlighter-rouge">OTEL_SERVICE_NAME</code> also work<sup id="fnref:2:6" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
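<p>The environment-variable route (using the two variable names cited above) is useful in CI jobs where editing <code class="language-plaintext highlighter-rouge">config.toml</code> is impractical:</p>

```bash
# Env-var equivalents of the endpoint and service_name keys in [otel]
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_SERVICE_NAME="codex-cli"
```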

<p>⚠️ Note: <code class="language-plaintext highlighter-rouge">codex exec</code> does not yet export OTel metrics, and <code class="language-plaintext highlighter-rouge">codex mcp-server</code> mode has no telemetry support as of v0.118.0<sup id="fnref:9:2" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>.</p>

<h2 id="post-session-analysis-with-jsonl-rollout-files">Post-Session Analysis with JSONL Rollout Files</h2>

<p>Every Codex session writes a JSONL rollout file to <code class="language-plaintext highlighter-rouge">~/.codex/sessions/</code><sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>. These files contain <code class="language-plaintext highlighter-rouge">RolloutItem</code> events (SessionMeta, UserMessage, ResponseItem, EventMsg, ApprovalDecision) and are invaluable for understanding what happened during a session that went wrong.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Find the latest session rollout</span>
<span class="nb">ls</span> <span class="nt">-t</span> ~/.codex/sessions/<span class="k">*</span>.jsonl | <span class="nb">head</span> <span class="nt">-1</span>

<span class="c"># Count tool calls in a session</span>
<span class="nb">cat</span> ~/.codex/sessions/&lt;session&gt;.jsonl | <span class="se">\</span>
  jq <span class="s1">'select(.type == "ResponseItem") | .item.type'</span> | <span class="se">\</span>
  <span class="nb">sort</span> | <span class="nb">uniq</span> <span class="nt">-c</span> | <span class="nb">sort</span> <span class="nt">-rn</span>

<span class="c"># Extract all approval decisions</span>
<span class="nb">cat</span> ~/.codex/sessions/&lt;session&gt;.jsonl | <span class="se">\</span>
  jq <span class="s1">'select(.type == "ApprovalDecision")'</span>
</code></pre></div></div>

<p>The community <code class="language-plaintext highlighter-rouge">codex-replay</code> tool renders these JSONL files as browsable HTML, and the <code class="language-plaintext highlighter-rouge">ccusage</code> project provides daily and monthly cost reports parsed from rollout token counters<sup id="fnref:10:1" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>.</p>

<h2 id="a-diagnostic-workflow-checklist">A Diagnostic Workflow Checklist</h2>

<p>When something goes wrong, work through this sequence:</p>

<ol>
  <li><strong>Check config</strong>: Run <code class="language-plaintext highlighter-rouge">/debug-config</code> to verify your settings are taking effect</li>
  <li><strong>Check auth</strong>: Run <code class="language-plaintext highlighter-rouge">codex login status</code> to rule out credential issues</li>
  <li><strong>Check sandbox</strong>: Use <code class="language-plaintext highlighter-rouge">codex sandbox &lt;platform&gt; -- &lt;command&gt;</code> to test commands in isolation</li>
  <li><strong>Check rules</strong>: Use <code class="language-plaintext highlighter-rouge">codex execpolicy check --pretty --rules &lt;file&gt; -- &lt;command&gt;</code> to validate execution policies</li>
  <li><strong>Enable tracing</strong>: Restart with <code class="language-plaintext highlighter-rouge">RUST_LOG=debug codex</code> and monitor <code class="language-plaintext highlighter-rouge">~/.codex/log/codex-tui.log</code></li>
  <li><strong>Review the rollout</strong>: Inspect the JSONL session file for the failed session</li>
  <li><strong>File a report</strong>: Run <code class="language-plaintext highlighter-rouge">/feedback</code> to capture diagnostic context before closing</li>
</ol>

<p>This top-down approach moves from cheap (no restart required) to expensive (restart with tracing), minimising disruption to your workflow.</p>

<h2 id="citations">Citations</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://github.com/openai/codex/blob/main/codex-rs/README.md">codex-rs README — OpenAI Codex GitHub repository</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://www.mintlify.com/openai/codex/advanced/tracing">Tracing &amp; Verbose Logging — Codex CLI Advanced Documentation</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:2:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:2:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:2:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:2:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://smartscope.blog/en/generative-ai/chatgpt/codex-cli-diagnostic-logs-deep-dive/">Codex CLI Logs: Location, Debug Flags &amp; 401 Error Fix — SmartScope</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:3:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:3:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/cli/slash-commands">Slash Commands in Codex CLI — OpenAI Developers</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/config-reference">Configuration Reference — Codex OpenAI Developers</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://www.mintlify.com/openai/codex/architecture/sandboxing">Sandboxing Architecture — Codex CLI Documentation</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:6:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:6:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:6:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/cli/reference">Command Line Options — Codex CLI Reference — OpenAI Developers</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:7:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:7:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:7:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p><a href="https://github.com/openai/codex/blob/main/codex-rs/execpolicy/README.md">Execution Policy (execpolicy) README — OpenAI Codex GitHub</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:8:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:8:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:8:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:9" role="doc-endnote">
      <p><a href="https://danielvaughan.github.io/codex-resources/articles/2026-03-28-codex-cli-opentelemetry-observability/">Codex CLI OpenTelemetry: Observability and Metrics in Production — Codex Resources</a> <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:9:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:9:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:10" role="doc-endnote">
      <p><a href="https://danielvaughan.github.io/codex-resources/articles/2026-03-30-codex-cli-session-analytics-jsonl-rollout/">Codex CLI Session Analytics: Mining the JSONL Rollout Format — Codex Resources</a> <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:10:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
  </ol>
</div>]]></content><author><name>Daniel Vaughan</name></author><summary type="html"><![CDATA[Codex CLI Diagnostic Toolkit: Tracing, Sandbox Testing, and the Built-In Debugging Commands]]></summary></entry><entry><title type="html">How to Be a Codex CLI Forward Deployed Engineer</title><link href="https://codex.danielvaughan.com/2026/04/07/codex-cli-forward-deployed-engineer/" rel="alternate" type="text/html" title="How to Be a Codex CLI Forward Deployed Engineer" /><published>2026-04-07T00:00:00+01:00</published><updated>2026-04-07T00:00:00+01:00</updated><id>https://codex.danielvaughan.com/2026/04/07/codex-cli-forward-deployed-engineer</id><content type="html" xml:base="https://codex.danielvaughan.com/2026/04/07/codex-cli-forward-deployed-engineer/"><![CDATA[<h1 id="how-to-be-a-codex-cli-forward-deployed-engineer">How to Be a Codex CLI Forward Deployed Engineer</h1>

<p>The forward deployed engineer (FDE) has become the most sought-after role in AI-native companies. Job postings for the position grew 800–1,000% through 2025<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, and in 2026, organisations like OpenAI, Anthropic, and Palantir continue to hire aggressively<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. For engineers who have mastered Codex CLI, the FDE path offers a natural career escalation — combining deep tool expertise with client-facing delivery in high-stakes enterprise environments.</p>

<p>This article examines what it means to specialise as a Codex CLI FDE: the workflows, the technical stack, and the career mechanics.</p>

<h2 id="what-a-forward-deployed-engineer-actually-does">What a Forward Deployed Engineer Actually Does</h2>

<p>An FDE embeds directly with enterprise customers to ship custom, production-grade solutions. OpenAI’s own FDE job descriptions specify that candidates will “lead complex end-to-end deployments of frontier models in production alongside OpenAI’s most strategic customers”<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. Unlike a traditional software engineer building product features behind a PM layer, an FDE owns the full lifecycle:</p>

<pre><code class="language-mermaid">flowchart LR
    A[Discovery &amp; Scoping] --&gt; B[Rapid Prototyping]
    B --&gt; C[Production Hardening]
    C --&gt; D[Deployment &amp; Rollout]
    D --&gt; E[Feedback to Product]
    E --&gt;|Next engagement| A
</code></pre>

<p>The critical distinction is <strong>ownership scope</strong>. A solutions architect designs and demos pre-sale. A core engineer ships features for all users. An FDE builds and deploys the final solution for a specific customer, post-sale, and remains accountable for its production stability<sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<h2 id="the-codex-cli-fde-technical-stack">The Codex CLI FDE Technical Stack</h2>

<p>An FDE specialising in Codex CLI needs to operate across three layers: the CLI itself, the harness/integration layer, and the enterprise infrastructure layer.</p>

<h3 id="layer-1-codex-cli-mastery">Layer 1: Codex CLI Mastery</h3>

<p>At minimum, an FDE must be fluent in the full configuration surface. Codex CLI’s <code class="language-plaintext highlighter-rouge">config.toml</code> hierarchy — global (<code class="language-plaintext highlighter-rouge">~/.codex/config.toml</code>), project-scoped (<code class="language-plaintext highlighter-rouge">.codex/config.toml</code>), and enterprise-managed (<code class="language-plaintext highlighter-rouge">requirements.toml</code>) — is the foundation of every deployment<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>

<p>Key configuration areas an FDE works with daily:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Custom model provider for client's LLM proxy</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.1"</span>
<span class="py">model_provider</span> <span class="p">=</span> <span class="s">"client-proxy"</span>

<span class="nn">[model_providers.client-proxy]</span>
<span class="py">name</span> <span class="p">=</span> <span class="s">"Client LLM Gateway"</span>
<span class="py">base_url</span> <span class="p">=</span> <span class="s">"https://llm-proxy.client.internal"</span>
<span class="py">env_key</span> <span class="p">=</span> <span class="s">"CLIENT_API_KEY"</span>
<span class="py">wire_api</span> <span class="p">=</span> <span class="s">"responses"</span>

<span class="c"># Enterprise sandbox policy</span>
<span class="py">sandbox_mode</span> <span class="p">=</span> <span class="s">"workspace-write"</span>

<span class="nn">[sandbox_workspace_write]</span>
<span class="py">writable_roots</span> <span class="p">=</span> <span class="nn">["/home/developer/project"]</span>
<span class="py">network_access</span> <span class="p">=</span> <span class="kc">true</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">approval_policy</code> system — with granular controls for <code class="language-plaintext highlighter-rouge">sandbox_approval</code>, <code class="language-plaintext highlighter-rouge">mcp_elicitations</code>, <code class="language-plaintext highlighter-rouge">skill_approval</code>, and <code class="language-plaintext highlighter-rouge">request_permissions</code> — lets FDEs tune the autonomy level to match each client’s security posture<sup id="fnref:4:1" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>. Enterprise clients behind TLS-inspecting proxies require custom CA certificate configuration via <code class="language-plaintext highlighter-rouge">SSL_CERT_FILE</code> and related environment variables<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>.</p>

<h3 id="layer-2-custom-harness-construction">Layer 2: Custom Harness Construction</h3>

<p>The Codex app server exposes a JSON-RPC protocol that lets external tools drive the same agent loop used by the CLI and VS Code extension<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. For FDEs, this is the integration point where Codex becomes part of a client’s existing toolchain.</p>

<p>Common harness patterns an FDE builds:</p>

<pre><code class="language-mermaid">flowchart TB
    subgraph Client Infrastructure
        CI[CI/CD Pipeline]
        IDE[IDE Extension]
        WEB[Internal Web App]
    end
    subgraph Codex Harness
        AS[App Server - JSON-RPC]
        AL[Agent Loop]
        SB[Sandbox]
    end
    subgraph Models
        API[Responses API]
        PROXY[Client LLM Proxy]
    end
    CI --&gt; AS
    IDE --&gt; AS
    WEB --&gt; AS
    AS --&gt; AL
    AL --&gt; SB
    AL --&gt; API
    AL --&gt; PROXY
</code></pre>

<p>The Python SDK enables programmatic access for embedding Codex into automation workflows, CI pipelines, and custom tooling<sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. An FDE building a client integration typically wires the app server into the client’s deployment pipeline, configures model routing through their LLM proxy, and sets up the hooks system for audit logging.</p>

<h3 id="layer-3-enterprise-infrastructure">Layer 3: Enterprise Infrastructure</h3>

<p>This is where 80% of the actual FDE work happens<sup id="fnref:1:2" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Getting a demo working is straightforward; navigating corporate SSO, network policies, compliance requirements, and production credentials is the real challenge.</p>

<p>Enterprise deployment concerns an FDE handles:</p>

<table>
  <thead>
    <tr>
      <th>Concern</th>
      <th>Codex CLI Mechanism</th>
      <th>FDE Responsibility</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Authentication</td>
      <td>ChatGPT device-code sign-in or API key auth<sup id="fnref:5:2" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></td>
      <td>Integrate with client IdP, configure <code class="language-plaintext highlighter-rouge">forced_login_method</code> and <code class="language-plaintext highlighter-rouge">forced_chatgpt_workspace_id</code><sup id="fnref:4:2" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></td>
    </tr>
    <tr>
      <td>Network security</td>
      <td>Configurable domain allowlists/denylists, SOCKS5 proxy support<sup id="fnref:4:3" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></td>
      <td>Map client firewall rules to <code class="language-plaintext highlighter-rouge">allowed_domains</code>, configure egress policies</td>
    </tr>
    <tr>
      <td>Audit &amp; compliance</td>
      <td>Hooks system, OpenTelemetry export<sup id="fnref:4:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></td>
      <td>Wire into client SIEM, configure <code class="language-plaintext highlighter-rouge">otel</code> exporters with TLS certs</td>
    </tr>
    <tr>
      <td>Cost management</td>
      <td>Pay-as-you-go Codex seats for Business/Enterprise<sup id="fnref:5:3" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></td>
      <td>Model token budgets, <code class="language-plaintext highlighter-rouge">model_reasoning_effort</code> tuning</td>
    </tr>
    <tr>
      <td>Device management</td>
      <td><code class="language-plaintext highlighter-rouge">requirements.toml</code> for managed machines<sup id="fnref:4:5" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></td>
      <td>Work with client MDM to distribute configuration profiles</td>
    </tr>
  </tbody>
</table>

<p>The <code class="language-plaintext highlighter-rouge">requirements.toml</code> mechanism is particularly important — it lets an organisation enforce constraints such as disallowing <code class="language-plaintext highlighter-rouge">approval_policy = "never"</code> or <code class="language-plaintext highlighter-rouge">sandbox_mode = "danger-full-access"</code>, ensuring that individual developers cannot bypass security policies<sup id="fnref:4:6" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>
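<p>As a sketch of what such a constraint file might express (the key names below are illustrative assumptions, not the documented <code class="language-plaintext highlighter-rouge">requirements.toml</code> schema; the policy and sandbox values are the ones named in this article):</p>

```toml
# Illustrative only — key names are assumptions. The intent mirrors the
# constraints described above: exclude approval_policy = "never" and
# sandbox_mode = "danger-full-access" from what developers can select.
allowed_approval_policies = ["untrusted", "on-request", "on-failure"]
allowed_sandbox_modes = ["read-only", "workspace-write"]
```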

<h2 id="a-day-in-the-life">A Day in the Life</h2>

<p>A typical FDE engagement follows a compressed timeline. Where a traditional project might take quarters, an FDE ships in weeks.</p>

<p><strong>Week 1 — Discovery</strong>: Embed with the client engineering team. Map their existing development workflow. Identify where Codex CLI slots in — code generation, test authoring, migration automation, documentation. Set up a proof-of-concept with their model provider and network configuration.</p>

<p><strong>Week 2 — Prototype</strong>: Build an AGENTS.md constitution tailored to their codebase conventions. Configure domain-expert agents in <code class="language-plaintext highlighter-rouge">.codex/agents/</code> for their specific stack. Wire the app server into their CI pipeline for automated code review or test generation. Demo to stakeholders.</p>

<p><strong>Week 3–4 — Production hardening</strong>: Lock down sandbox policies via <code class="language-plaintext highlighter-rouge">requirements.toml</code>. Configure OpenTelemetry export to their observability stack. Set up the hooks system for compliance audit trails. Load-test the app server under realistic concurrency. Train their team on prompt engineering patterns.</p>

<p><strong>Ongoing — Feedback loop</strong>: Channel field insights back to the core product team. Identify feature gaps that affect multiple enterprise clients. Propose configuration additions or SDK improvements.</p>

<h2 id="skills-beyond-the-terminal">Skills Beyond the Terminal</h2>

<p>OpenAI’s FDE postings require 7+ years of full-stack engineering experience, with customer-facing experience “highly desirable”<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. The role demands travel — up to 50% for the NYC position<sup id="fnref:3:2" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. Total compensation at OpenAI and Anthropic ranges from $350K to $550K at mid-to-senior levels<sup id="fnref:1:3" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<p>The skills profile is T-shaped<sup id="fnref:1:4" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

<ul>
  <li><strong>Vertical depth</strong>: Codex CLI internals, Responses API, model behaviour, sandbox architecture, TOML configuration surface</li>
  <li><strong>Horizontal breadth</strong>: Customer empathy, problem decomposition in ambiguous environments, rapid prototyping under pressure, product sense for identifying patterns across clients</li>
</ul>

<p>Technical interviewing for FDE roles typically includes a decomposition case study — receiving an ambiguous real-world problem and structuring a solution iteratively, not just solving a LeetCode problem<sup id="fnref:1:5" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Palantir pioneered this format, and it has become industry-standard for FDE hiring<sup id="fnref:1:6" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<h2 id="from-codex-user-to-fde-the-career-path">From Codex User to FDE: The Career Path</h2>

<p>The strongest FDE candidates come from backgrounds that combine building and deploying<sup id="fnref:1:7" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

<ol>
  <li><strong>Early-stage startup engineers</strong> — accustomed to wearing multiple hats and shipping under pressure</li>
  <li><strong>Solutions architects who build PoCs</strong> — already comfortable in client-facing technical contexts</li>
  <li><strong>Platform/DevOps engineers</strong> — experienced with the infrastructure layer that consumes most FDE time</li>
  <li><strong>Power users of AI coding tools</strong> — deep familiarity with Codex CLI, Claude Code, or similar agentic tools</li>
</ol>

<p>The progression typically runs: power user → internal champion (rolling out Codex CLI within your own organisation) → FDE candidate. Building a portfolio of custom harness integrations, AGENTS.md configurations, and enterprise deployment case studies is the most direct path<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>.</p>

<h2 id="the-integration-wall">The Integration Wall</h2>

<p>The FDE role exists because of what the industry calls the “integration wall”<sup id="fnref:1:8" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> — the gap between a powerful platform and an enterprise-ready deployment. Codex CLI is a sophisticated tool with a deep configuration surface, but every enterprise has unique network policies, compliance requirements, model provider preferences, and development workflows.</p>

<p>No amount of documentation closes that gap entirely. Someone has to sit with the client, understand their constraints, and build the bridge. That someone is the FDE.</p>

<h2 id="citations">Citations</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://hashnode.com/blog/a-complete-2026-guide-to-the-forward-deployed-engineer">Tech’s secret weapon: The complete 2026 guide to the forward deployed engineer</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:1:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:1:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:1:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:1:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a> <a href="#fnref:1:7" class="reversefootnote" role="doc-backlink">&#8617;<sup>8</sup></a> <a href="#fnref:1:8" class="reversefootnote" role="doc-backlink">&#8617;<sup>9</sup></a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://www.ai-daily.news/articles/forward-deployed-engineers-ais-key-role-in-2026">Forward-Deployed Engineers: AI’s Key Role in 2026 — AI Daily</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://openai.com/careers/forward-deployed-engineer-(fde)-nyc-new-york-city/">Forward Deployed Engineer (FDE) - NYC — OpenAI Careers</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:3:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/config-reference">Configuration Reference — Codex CLI, OpenAI Developers</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:4:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:4:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:4:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:4:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:4:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://www.augmentcode.com/learn/openai-codex-cli-enterprise">OpenAI Codex CLI ships v0.116.0 with enterprise features — Augment Code</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:5:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:5:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://openai.com/index/unlocking-the-codex-harness/">Unlocking the Codex harness: how we built the App Server — OpenAI</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="https://www.rocketlane.com/blogs/forward-deployed-engineer">Forward Deployed Engineer (FDE): The Essential 2026 Guide — Rocketlane</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Daniel Vaughan</name></author><summary type="html"><![CDATA[How to Be a Codex CLI Forward Deployed Engineer]]></summary></entry><entry><title type="html">Codex CLI on GitLab: Duo Agent Platform, CI/CD Pipelines, and MCP Integration</title><link href="https://codex.danielvaughan.com/2026/04/07/codex-cli-gitlab-integration-duo-agent-platform/" rel="alternate" type="text/html" title="Codex CLI on GitLab: Duo Agent Platform, CI/CD Pipelines, and MCP Integration" /><published>2026-04-07T00:00:00+01:00</published><updated>2026-04-07T00:00:00+01:00</updated><id>https://codex.danielvaughan.com/2026/04/07/codex-cli-gitlab-integration-duo-agent-platform</id><content type="html" xml:base="https://codex.danielvaughan.com/2026/04/07/codex-cli-gitlab-integration-duo-agent-platform/"><![CDATA[<h1 id="codex-cli-on-gitlab-duo-agent-platform-cicd-pipelines-and-mcp-integration">Codex CLI on GitLab: Duo Agent Platform, CI/CD Pipelines, and MCP Integration</h1>

<hr />

<p>While Codex CLI’s GitHub integration has received extensive coverage — from <code class="language-plaintext highlighter-rouge">openai/codex-action</code> to issue assignment via Copilot — GitLab teams have been building their own integration story. That story now has three distinct layers: the <strong>Duo Agent Platform</strong> for mention-driven automation, <strong>CI/CD pipeline jobs</strong> using <code class="language-plaintext highlighter-rouge">codex exec</code> for structured analysis, and <strong>MCP server connections</strong> for real-time repository access. This article covers all three, with production-ready configuration for each.</p>

<h2 id="the-three-integration-layers">The Three Integration Layers</h2>

<p>Before diving into configuration, it helps to understand where each layer fits in a GitLab workflow.</p>

<pre><code class="language-mermaid">graph TD
    A["Developer Action"] --&gt; B{"Integration Layer"}
    B --&gt;|"@codex mention in MR/issue"| C["Duo Agent Platform&lt;br/&gt;External Agent"]
    B --&gt;|"Pipeline trigger on MR"| D["CI/CD Job&lt;br/&gt;codex exec --full-auto"]
    B --&gt;|"Local development"| E["MCP Server&lt;br/&gt;GitLab API access"]

    C --&gt; F["Codex reads repo context&lt;br/&gt;+ CODEX.md rules"]
    D --&gt; G["Structured JSON/Markdown&lt;br/&gt;output as artifacts"]
    E --&gt; H["Issue/MR/branch tools&lt;br/&gt;in Codex session"]

    F --&gt; I["Inline comment or&lt;br/&gt;draft MR created"]
    G --&gt; J["CodeClimate report in&lt;br/&gt;MR widget"]
    H --&gt; K["Agent-driven GitLab&lt;br/&gt;operations"]
</code></pre>

<p>Each layer serves a different need: Duo for ad-hoc delegation, CI/CD for systematic analysis on every merge request, and MCP for interactive agent sessions that need GitLab API access.</p>

<h2 id="layer-1-duo-agent-platform--external-agents">Layer 1: Duo Agent Platform — External Agents</h2>

<p>GitLab’s Duo Agent Platform reached general availability on 15 January 2026<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, bringing first-class support for external AI agents — including Codex CLI — directly into the GitLab workflow. Premium and Ultimate customers on GitLab 18.8+ (both SaaS and self-managed) can enable the Codex agent through the AI Catalog<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<h3 id="how-it-works">How It Works</h3>

<p>When a developer mentions <code class="language-plaintext highlighter-rouge">@codex</code> (or the configured service account) in an issue comment or merge request discussion, GitLab triggers the external agent<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. The agent:</p>

<ol>
  <li>Reads the repository tree and surrounding context</li>
  <li>Loads project-specific rules from <code class="language-plaintext highlighter-rouge">CODEX.md</code> at the repository root</li>
  <li>Decides whether code changes, review feedback, or clarification is needed</li>
  <li>Responds inline with either a ready-to-merge change or a comment</li>
</ol>

<p>The trigger mechanisms are<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Trigger</th>
      <th>Where</th>
      <th>What Happens</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Mention</strong></td>
      <td>Issue or MR comment</td>
      <td>Agent analyses context and responds</td>
    </tr>
    <tr>
      <td><strong>Assignment</strong></td>
      <td>Issue or MR assignee</td>
      <td>Agent works the issue autonomously</td>
    </tr>
    <tr>
      <td><strong>Reviewer assignment</strong></td>
      <td>MR reviewer</td>
      <td>Agent performs code review</td>
    </tr>
  </tbody>
</table>

<h3 id="configuration">Configuration</h3>

<p>The Codex agent uses GitLab-managed credentials through the AI Gateway, so there is no separate <code class="language-plaintext highlighter-rouge">OPENAI_API_KEY</code> to configure<sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Administrators add the agent via <strong>Settings → AI Catalog → GitLab-managed external agents → Add to AI Catalog</strong><sup id="fnref:2:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<p>For self-managed instances, the external agent configuration requires the gateway token injection<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># External agent configuration (admin-level)</span>
<span class="na">injectGatewayToken</span><span class="pi">:</span> <span class="no">true</span>
</code></pre></div></div>

<p>This automatically provides <code class="language-plaintext highlighter-rouge">AI_FLOW_AI_GATEWAY_TOKEN</code> and <code class="language-plaintext highlighter-rouge">AI_FLOW_AI_GATEWAY_HEADERS</code> environment variables to the agent runtime<sup id="fnref:4:1" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>
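<p>If you are building your own agent wrapper, those injected variables can be consumed directly. A minimal sketch — note that the encoding of <code class="language-plaintext highlighter-rouge">AI_FLOW_AI_GATEWAY_HEADERS</code> is defined by GitLab and not publicly specified; this assumes the conventional JSON object of header name/value pairs:</p>

```python
import json
import os

def gateway_headers() -> dict:
    """Build HTTP headers for AI Gateway calls from the injected variables.

    Assumes AI_FLOW_AI_GATEWAY_HEADERS holds a JSON object (an assumption,
    not documented behaviour) and falls back to a bearer token header.
    """
    headers = json.loads(os.environ.get("AI_FLOW_AI_GATEWAY_HEADERS", "{}"))
    token = os.environ.get("AI_FLOW_AI_GATEWAY_TOKEN")
    if token:
        headers.setdefault("Authorization", f"Bearer {token}")
    return headers

# Simulated injection, as the GitLab runtime would do it:
os.environ["AI_FLOW_AI_GATEWAY_TOKEN"] = "tok"
print(gateway_headers())  # → {'Authorization': 'Bearer tok'}
```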

<h3 id="codexmd-the-project-rules-file">CODEX.md: The Project Rules File</h3>

<p>All project-specific rules — style, testing, security policies — come from <code class="language-plaintext highlighter-rouge">CODEX.md</code> at the repository root<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. This is distinct from <code class="language-plaintext highlighter-rouge">AGENTS.md</code> used by the CLI directly; GitLab’s integration reads <code class="language-plaintext highlighter-rouge">CODEX.md</code> specifically. A minimal example:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># Project Rules</span>

<span class="gu">## Code Style</span>
<span class="p">-</span> Use TypeScript strict mode
<span class="p">-</span> All functions must have JSDoc comments
<span class="p">-</span> Prefer <span class="sb">`const`</span> over <span class="sb">`let`</span>

<span class="gu">## Testing</span>
<span class="p">-</span> Every new function needs a unit test
<span class="p">-</span> Run <span class="sb">`npm test`</span> before proposing changes
<span class="p">-</span> Minimum 80% branch coverage

<span class="gu">## Security</span>
<span class="p">-</span> Never commit secrets or API keys
<span class="p">-</span> Use parameterised queries for all database access
<span class="p">-</span> Validate all user input at the controller boundary
</code></pre></div></div>

<h3 id="current-limitations">Current Limitations</h3>

<p>The Duo Agent Platform integration is still maturing. As of April 2026, the <code class="language-plaintext highlighter-rouge">@codex</code> mention workflow runs Codex in the background and responds asynchronously — there is no interactive steering<sup id="fnref:3:2" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. The agent creates merge requests linked back to the originating issue but cannot yet trigger downstream pipelines automatically<sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. ⚠️ The exact latency and token limits for Duo-triggered Codex sessions are not publicly documented.</p>

<h2 id="layer-2-cicd-pipeline-integration-with-codex-exec">Layer 2: CI/CD Pipeline Integration with codex exec</h2>

<p>For systematic, repeatable analysis on every merge request, embedding <code class="language-plaintext highlighter-rouge">codex exec</code> directly into <code class="language-plaintext highlighter-rouge">.gitlab-ci.yml</code> is the more mature approach. The official OpenAI Cookbook published a comprehensive guide to this pattern in March 2026<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>

<h3 id="code-quality-reports">Code Quality Reports</h3>

<p>The core pattern runs <code class="language-plaintext highlighter-rouge">codex exec --full-auto</code> with a structured prompt that generates GitLab-compliant CodeClimate JSON. The output appears directly in the merge request widget alongside native GitLab code quality results.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">stages</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">codex</span>

<span class="na">default</span><span class="pi">:</span>
  <span class="na">image</span><span class="pi">:</span> <span class="s">node:24</span>

<span class="na">codex_review</span><span class="pi">:</span>
  <span class="na">stage</span><span class="pi">:</span> <span class="s">codex</span>
  <span class="na">variables</span><span class="pi">:</span>
    <span class="na">CODEX_QA_PATH</span><span class="pi">:</span> <span class="s2">"</span><span class="s">gl-code-quality-report.json"</span>
    <span class="na">CODEX_RAW_LOG</span><span class="pi">:</span> <span class="s2">"</span><span class="s">artifacts/codex-raw.log"</span>
  <span class="na">rules</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">if</span><span class="pi">:</span> <span class="s1">'</span><span class="s">$CI_PIPELINE_SOURCE</span><span class="nv"> </span><span class="s">==</span><span class="nv"> </span><span class="s">"merge_request_event"'</span>
      <span class="na">when</span><span class="pi">:</span> <span class="s">on_success</span>
    <span class="pi">-</span> <span class="na">when</span><span class="pi">:</span> <span class="s">never</span>
  <span class="na">script</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">npm -g i @openai/codex@latest</span>
    <span class="pi">-</span> <span class="s">FILE_LIST="$(git ls-files | sed 's/^/- /')"</span>
    <span class="pi">-</span> <span class="pi">|</span>
      <span class="s">codex exec --full-auto "Review this repository and output a GitLab Code Quality report in CodeClimate JSON format.</span>
      <span class="s">OUTPUT MUST BE A SINGLE JSON ARRAY between markers:</span>
      <span class="s">=== BEGIN_CODE_QUALITY_JSON ===</span>
      <span class="s">&lt;JSON ARRAY&gt;</span>
      <span class="s">=== END_CODE_QUALITY_JSON ===</span>
      <span class="s">Each issue: description, check_name, fingerprint, severity, location with path and lines.begin.</span>
      <span class="s">Only report issues in: ${FILE_LIST}" \</span>
        <span class="s">| tee "${CODEX_RAW_LOG}" &gt;/dev/null</span>
    <span class="pi">-</span> <span class="pi">|</span>
      <span class="s">sed -E 's/\x1B\[[0-9;]*[A-Za-z]//g' "${CODEX_RAW_LOG}" \</span>
        <span class="s">| awk '/BEGIN_CODE_QUALITY_JSON/{grab=1;next}/END_CODE_QUALITY_JSON/{grab=0}grab' \</span>
        <span class="s">&gt; "${CODEX_QA_PATH}"</span>
    <span class="pi">-</span> <span class="s1">'</span><span class="s">node</span><span class="nv"> </span><span class="s">-e</span><span class="nv"> </span><span class="s">"JSON.parse(require(\"fs\").readFileSync(\"${CODEX_QA_PATH}\",\"utf8\"))"</span><span class="nv"> </span><span class="s">||</span><span class="nv"> </span><span class="s">echo</span><span class="nv"> </span><span class="s">"[]"</span><span class="nv"> </span><span class="s">&gt;</span><span class="nv"> </span><span class="s">"${CODEX_QA_PATH}"'</span>
  <span class="na">artifacts</span><span class="pi">:</span>
    <span class="na">reports</span><span class="pi">:</span>
      <span class="na">codequality</span><span class="pi">:</span> <span class="s">gl-code-quality-report.json</span>
    <span class="na">paths</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">artifacts/</span>
    <span class="na">expire_in</span><span class="pi">:</span> <span class="s">14 days</span>
</code></pre></div></div>

<p>The marker-based extraction pattern (<code class="language-plaintext highlighter-rouge">=== BEGIN_... ===</code> / <code class="language-plaintext highlighter-rouge">=== END_... ===</code>) is critical for reliability<sup id="fnref:6:1" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. LLM output is inherently variable; the markers give the pipeline a deterministic extraction boundary. The ANSI escape stripping (<code class="language-plaintext highlighter-rouge">sed -E 's/\x1B\[[0-9;]*[A-Za-z]//g'</code>) handles terminal colour codes that <code class="language-plaintext highlighter-rouge">codex exec</code> may emit<sup id="fnref:6:2" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>
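<p>The same extraction is easy to reproduce off-pipeline when debugging a failed job. A minimal Python sketch of the <code class="language-plaintext highlighter-rouge">sed</code>/<code class="language-plaintext highlighter-rouge">awk</code> step above (the marker names match the prompt; the regex is the standard ANSI CSI escape pattern):</p>

```python
import re

# Matches ANSI colour/control sequences such as \x1B[32m
ANSI_RE = re.compile(r"\x1B\[[0-9;]*[A-Za-z]")

def extract_report(raw_log: str) -> str:
    """Return only the text between the BEGIN/END markers, ANSI codes removed."""
    clean = ANSI_RE.sub("", raw_log)
    grab, captured = False, []
    for line in clean.splitlines():
        if "BEGIN_CODE_QUALITY_JSON" in line:
            grab = True
            continue
        if "END_CODE_QUALITY_JSON" in line:
            grab = False
            continue
        if grab:
            captured.append(line)
    return "\n".join(captured)

raw = ("model chatter...\n"
       "=== BEGIN_CODE_QUALITY_JSON ===\n"
       "\x1b[32m[]\x1b[0m\n"
       "=== END_CODE_QUALITY_JSON ===\n")
print(extract_report(raw))  # → []
```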

<h3 id="security-remediation-pipeline">Security Remediation Pipeline</h3>

<p>The cookbook’s second pattern is more ambitious: a two-stage pipeline where Codex first triages SAST findings, then generates validated patches for high/critical vulnerabilities<sup id="fnref:6:3" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>

<pre><code class="language-mermaid">graph LR
    A["GitLab SAST Scanner"] --&gt;|"gl-sast-report.json"| B["codex_recommendations&lt;br/&gt;Stage 1: Triage"]
    B --&gt;|"security_priority.md"| C["Human Review"]
    A --&gt;|"gl-sast-report.json"| D["codex_resolution&lt;br/&gt;Stage 2: Patch Gen"]
    D --&gt;|"codex_patches/*.patch"| E["git apply --check&lt;br/&gt;Validation"]
    E --&gt;|"Valid patches"| F["Merge Request&lt;br/&gt;with fixes"]
</code></pre>

<p>The remediation stage iterates over each high/critical vulnerability, constructs a per-finding prompt, and validates the generated diff with <code class="language-plaintext highlighter-rouge">git apply --check</code> before storing it as a <code class="language-plaintext highlighter-rouge">.patch</code> artefact<sup id="fnref:6:4" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. Invalid patches are discarded automatically — only clean-applying fixes survive.</p>

<p>Key design decisions in this pattern:</p>

<ul>
  <li><strong>Severity whitelisting</strong>: Only <code class="language-plaintext highlighter-rouge">high</code> and <code class="language-plaintext highlighter-rouge">critical</code> findings trigger remediation, avoiding wasted tokens on informational findings<sup id="fnref:6:5" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></li>
  <li><strong>Per-vulnerability isolation</strong>: Each finding gets its own <code class="language-plaintext highlighter-rouge">codex exec</code> invocation, preventing cross-contamination between fixes<sup id="fnref:6:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></li>
  <li><strong>Unified diff validation</strong>: <code class="language-plaintext highlighter-rouge">git apply --check</code> runs before any patch is stored, ensuring no broken diffs reach reviewers<sup id="fnref:6:7" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></li>
</ul>
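<p>The severity whitelisting step can be sketched in a few lines. This assumes the standard GitLab SAST report shape — findings under a <code class="language-plaintext highlighter-rouge">vulnerabilities</code> array, each with a <code class="language-plaintext highlighter-rouge">severity</code> field — and is an illustration of the filter, not the cookbook's exact script:</p>

```python
import json

REMEDIATE = {"critical", "high"}  # severity whitelist

def select_findings(report_json: str) -> list:
    """Pick only the findings worth a per-vulnerability codex exec invocation."""
    report = json.loads(report_json)
    return [v for v in report.get("vulnerabilities", [])
            if v.get("severity", "").lower() in REMEDIATE]

sample = json.dumps({"vulnerabilities": [
    {"id": "a", "severity": "Critical", "name": "SQL injection"},
    {"id": "b", "severity": "Info", "name": "Verbose error page"},
]})
for finding in select_findings(sample):
    # Each surviving finding would get its own isolated codex exec call,
    # and the resulting diff would be validated with `git apply --check`.
    print(finding["id"], finding["name"])  # → a SQL injection
```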

<h3 id="authentication-in-cicd">Authentication in CI/CD</h3>

<p>Authentication uses masked CI/CD variables. Store <code class="language-plaintext highlighter-rouge">OPENAI_API_KEY</code> as a protected, masked variable in <strong>Settings → CI/CD → Variables</strong><sup id="fnref:6:8" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. For self-managed instances using Azure OpenAI instead, configure the <code class="language-plaintext highlighter-rouge">CODEX_MODEL</code> and endpoint variables accordingly.</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">variables</span><span class="pi">:</span>
  <span class="na">OPENAI_API_KEY</span><span class="pi">:</span> <span class="s">$OPENAI_API_KEY</span>
  <span class="na">CODEX_MODEL</span><span class="pi">:</span> <span class="s2">"</span><span class="s">gpt-5.4"</span>  <span class="c1"># or your preferred model</span>
</code></pre></div></div>

<h3 id="cost-control">Cost Control</h3>

<p>Each <code class="language-plaintext highlighter-rouge">codex exec</code> invocation in CI/CD consumes API tokens. For cost management:</p>

<ul>
  <li>Use <code class="language-plaintext highlighter-rouge">gpt-5.4-mini</code> for triage/quality jobs and reserve <code class="language-plaintext highlighter-rouge">gpt-5.4</code> for remediation<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></li>
  <li>Set <code class="language-plaintext highlighter-rouge">--model</code> explicitly in the <code class="language-plaintext highlighter-rouge">codex exec</code> command to avoid inheriting a more expensive default</li>
  <li>Monitor token usage via the <code class="language-plaintext highlighter-rouge">postTaskComplete</code> hook pattern or OpenTelemetry<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup></li>
</ul>
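<p>One way to keep the cheap and expensive models from being mixed up across jobs is Codex's profile mechanism in <code class="language-plaintext highlighter-rouge">config.toml</code>. The profile names here are illustrative, not a documented convention:</p>

```toml
# ~/.codex/config.toml — hypothetical profile names
[profiles.ci-triage]
model = "gpt-5.4-mini"   # cheap model for quality/triage jobs

[profiles.ci-remediation]
model = "gpt-5.4"        # reserved for patch generation
```

<p>The pipeline job then selects a profile explicitly, e.g. <code class="language-plaintext highlighter-rouge">codex exec --profile ci-triage …</code>, so no job silently inherits a costlier default.</p>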

<h2 id="layer-3-gitlab-mcp-server-integration">Layer 3: GitLab MCP Server Integration</h2>

<p>For interactive Codex sessions that need to read issues, manage merge requests, or create branches on GitLab, the MCP integration provides structured API access.</p>

<h3 id="gitlabs-native-mcp-server">GitLab’s Native MCP Server</h3>

<p>GitLab ships its own MCP server<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup> that exposes repository, issue, merge request, and pipeline tools. Configure it in your Codex <code class="language-plaintext highlighter-rouge">config.toml</code>:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[mcp_servers.gitlab]</span>
<span class="py">url</span> <span class="p">=</span> <span class="s">"https://gitlab.example.com/api/v4/mcp"</span>
</code></pre></div></div>

<p>Or add it directly via the CLI<sup id="fnref:9:1" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>codex mcp add <span class="nt">--url</span> <span class="s2">"https://gitlab.example.com/api/v4/mcp"</span>
</code></pre></div></div>

<h3 id="composios-gitlab-mcp">Composio’s GitLab MCP</h3>

<p>For teams wanting a managed MCP endpoint that bundles GitLab alongside other services, Composio provides a Tool Router that dynamically loads GitLab tools based on the task<sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[mcp_servers.composio]</span>
<span class="py">url</span> <span class="p">=</span> <span class="s">"https://connect.composio.dev/mcp"</span>
<span class="nn">http_headers</span> <span class="o">=</span> <span class="p">{</span> <span class="py">"x-api-key"</span> <span class="p">=</span> <span class="s">"${COMPOSIO_API_KEY}"</span> <span class="p">}</span>
</code></pre></div></div>

<p>This gives Codex access to GitLab operations — creating projects, managing issues, handling branches, and triggering pipelines — through a single MCP endpoint that also supports other integrations<sup id="fnref:10:1" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>.</p>

<h3 id="practical-mcp-use-case-issue-triage">Practical MCP Use Case: Issue Triage</h3>

<p>With the GitLab MCP server configured, you can run an issue triage workflow locally:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>codex <span class="nb">exec</span> <span class="nt">--full-auto</span> <span class="se">\</span>
  <span class="s2">"Read the open issues labelled 'needs-triage' in this project. </span><span class="se">\</span><span class="s2">
   For each, add a priority label (P1/P2/P3) based on severity </span><span class="se">\</span><span class="s2">
   and add a comment summarising the issue and suggested next steps."</span>
</code></pre></div></div>

<p>The MCP server handles the GitLab API calls — listing issues, adding labels, posting comments — while Codex handles the reasoning and decision-making.</p>

<h2 id="choosing-the-right-layer">Choosing the Right Layer</h2>

<table>
  <thead>
    <tr>
      <th>Criterion</th>
      <th>Duo Agent Platform</th>
      <th>CI/CD Pipeline</th>
      <th>MCP Server</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Trigger</strong></td>
      <td><code class="language-plaintext highlighter-rouge">@codex</code> mention</td>
      <td>MR/pipeline event</td>
      <td>Manual/scripted</td>
    </tr>
    <tr>
      <td><strong>Output</strong></td>
      <td>Inline comments, draft MRs</td>
      <td>Artefacts, reports</td>
      <td>GitLab API operations</td>
    </tr>
    <tr>
      <td><strong>Authentication</strong></td>
      <td>GitLab-managed</td>
      <td>API key variable</td>
      <td>API key + token</td>
    </tr>
    <tr>
      <td><strong>Cost visibility</strong></td>
      <td>Bundled in GitLab Credits<sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></td>
      <td>Direct API billing</td>
      <td>Direct API billing</td>
    </tr>
    <tr>
      <td><strong>Best for</strong></td>
      <td>Ad-hoc delegation</td>
      <td>Systematic analysis</td>
      <td>Interactive workflows</td>
    </tr>
    <tr>
      <td><strong>Maturity</strong></td>
      <td>GA (Jan 2026)</td>
      <td>Production-ready</td>
      <td>Stable</td>
    </tr>
  </tbody>
</table>

<p>For most teams, the recommended approach is: <strong>Duo for ad-hoc requests</strong>, <strong>CI/CD for every-MR analysis</strong>, and <strong>MCP for local development workflows</strong> that need GitLab context.</p>

<h2 id="enterprise-considerations">Enterprise Considerations</h2>

<h3 id="self-managed-deployment">Self-Managed Deployment</h3>

<p>Self-managed GitLab instances (18.8+) can enable external agents through the AI Catalog<sup id="fnref:2:3" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. The key requirement is network connectivity to the AI Gateway — or, for air-gapped environments, routing through Azure OpenAI endpoints configured as custom model providers in the Codex <code class="language-plaintext highlighter-rouge">config.toml</code><sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>.</p>

<h3 id="audit-trail">Audit Trail</h3>

<p>All three integration layers produce audit evidence:</p>

<ul>
  <li><strong>Duo</strong>: GitLab tracks agent interactions as system events</li>
  <li><strong>CI/CD</strong>: <code class="language-plaintext highlighter-rouge">codex exec</code> produces JSONL rollout files stored as pipeline artefacts</li>
  <li><strong>MCP</strong>: Standard MCP request/response logging via <code class="language-plaintext highlighter-rouge">RUST_LOG=codex_core::mcp=debug</code></li>
</ul>
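<p>The JSONL rollout artefacts lend themselves to quick audit summaries. The exact rollout schema is not publicly specified; this sketch assumes only the JSONL convention of one JSON object per line with a <code class="language-plaintext highlighter-rouge">type</code> field:</p>

```python
import json
from collections import Counter

def summarise_rollout(jsonl_text: str) -> Counter:
    """Count event types in a JSONL session log for a quick audit summary."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("type", "unknown")] += 1
    return counts

sample = '{"type": "command"}\n{"type": "command"}\n{"type": "response"}\n'
print(summarise_rollout(sample))  # → Counter({'command': 2, 'response': 1})
```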

<h3 id="gitlab-vs-github-integration-comparison">GitLab vs GitHub: Integration Comparison</h3>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>GitHub (codex-action)</th>
      <th>GitLab (CI/CD + Duo)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Native agent</strong></td>
      <td>Copilot issue assignment</td>
      <td>Duo Agent Platform</td>
    </tr>
    <tr>
      <td><strong>CI/CD</strong></td>
      <td><code class="language-plaintext highlighter-rouge">openai/codex-action</code></td>
      <td><code class="language-plaintext highlighter-rouge">codex exec</code> in <code class="language-plaintext highlighter-rouge">.gitlab-ci.yml</code></td>
    </tr>
    <tr>
      <td><strong>Code quality</strong></td>
      <td>PR checks</td>
      <td>CodeClimate artefact in MR widget</td>
    </tr>
    <tr>
      <td><strong>Security</strong></td>
      <td>Dependabot + Codex Security</td>
      <td>SAST + Codex remediation pipeline</td>
    </tr>
    <tr>
      <td><strong>MCP</strong></td>
      <td>GitHub MCP server</td>
      <td>GitLab MCP server</td>
    </tr>
  </tbody>
</table>

<p>The GitLab integration requires more manual configuration than GitHub’s first-party action, but offers equivalent capabilities once set up. The CI/CD pipeline approach is particularly powerful because GitLab’s artefact system natively understands CodeClimate JSON, making Codex quality findings appear in the same MR widget as native GitLab scanners<sup id="fnref:6:9" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>

<h2 id="citations">Citations</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>GitLab Inc., “GitLab Announces the General Availability of GitLab Duo Agent Platform,” 15 January 2026. <a href="https://ir.gitlab.com/news/news-details/2026/GitLab-Announces-the-General-Availability-of-GitLab-Duo-Agent-Platform/default.aspx">https://ir.gitlab.com/news/news-details/2026/GitLab-Announces-the-General-Availability-of-GitLab-Duo-Agent-Platform/default.aspx</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>GitLab Docs, “External agents.” <a href="https://docs.gitlab.com/user/duo_agent_platform/agents/external/">https://docs.gitlab.com/user/duo_agent_platform/agents/external/</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:2:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>GitLab Docs, “External agent configuration examples.” <a href="https://docs.gitlab.com/user/duo_agent_platform/agents/external_examples/">https://docs.gitlab.com/user/duo_agent_platform/agents/external_examples/</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:3:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>GitLab Docs, “AI Catalog.” <a href="https://docs.gitlab.com/user/duo_agent_platform/ai_catalog/">https://docs.gitlab.com/user/duo_agent_platform/ai_catalog/</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>GitLab.org, “Product Requirements — Claude Code and OpenAI Codex CLI Integration for GitLab CI/CD (#557820).” <a href="https://gitlab.com/gitlab-org/gitlab/-/issues/557820">https://gitlab.com/gitlab-org/gitlab/-/issues/557820</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p>OpenAI Cookbook, “Automating Code Quality and Security Fixes with Codex CLI on GitLab.” <a href="https://developers.openai.com/cookbook/examples/codex/secure_quality_gitlab">https://developers.openai.com/cookbook/examples/codex/secure_quality_gitlab</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:6:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:6:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:6:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:6:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:6:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a> <a href="#fnref:6:7" class="reversefootnote" role="doc-backlink">&#8617;<sup>8</sup></a> <a href="#fnref:6:8" class="reversefootnote" role="doc-backlink">&#8617;<sup>9</sup></a> <a href="#fnref:6:9" class="reversefootnote" role="doc-backlink">&#8617;<sup>10</sup></a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p>OpenAI Developers, “Models.” <a href="https://developers.openai.com/api/docs/models">https://developers.openai.com/api/docs/models</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p>OpenAI Developers, “Codex CLI Features.” <a href="https://developers.openai.com/codex/cli/features">https://developers.openai.com/codex/cli/features</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:9" role="doc-endnote">
      <p>GitLab Docs, “GitLab MCP server.” <a href="https://docs.gitlab.com/user/gitlab_duo/model_context_protocol/mcp_server/">https://docs.gitlab.com/user/gitlab_duo/model_context_protocol/mcp_server/</a> <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:9:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:10" role="doc-endnote">
      <p>Composio, “How to integrate Gitlab MCP with Codex.” <a href="https://composio.dev/toolkits/gitlab/framework/codex">https://composio.dev/toolkits/gitlab/framework/codex</a> <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:10:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:11" role="doc-endnote">
      <p>OpenAI Developers, “Codex Configuration Reference.” <a href="https://developers.openai.com/codex/config-reference">https://developers.openai.com/codex/config-reference</a> <a href="#fnref:11" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Daniel Vaughan</name></author><summary type="html"><![CDATA[Codex CLI on GitLab: Duo Agent Platform, CI/CD Pipelines, and MCP Integration]]></summary></entry><entry><title type="html">Codex CLI Model Lifecycle: Navigating Deprecations, Migrations, and the GPT-5.x Transition</title><link href="https://codex.danielvaughan.com/2026/04/07/codex-cli-model-lifecycle-deprecations-migrations/" rel="alternate" type="text/html" title="Codex CLI Model Lifecycle: Navigating Deprecations, Migrations, and the GPT-5.x Transition" /><published>2026-04-07T00:00:00+01:00</published><updated>2026-04-07T00:00:00+01:00</updated><id>https://codex.danielvaughan.com/2026/04/07/codex-cli-model-lifecycle-deprecations-migrations</id><content type="html" xml:base="https://codex.danielvaughan.com/2026/04/07/codex-cli-model-lifecycle-deprecations-migrations/"><![CDATA[<h1 id="codex-cli-model-lifecycle-navigating-deprecations-migrations-and-the-gpt-5x-transition">Codex CLI Model Lifecycle: Navigating Deprecations, Migrations, and the GPT-5.x Transition</h1>

<hr />

<p>OpenAI’s model release cadence has accelerated dramatically. In the seven months since the original GPT-5-Codex launched in September 2025, we have seen five major Codex-optimised model generations — and one sweeping deprecation wave, with a second already scheduled.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> If you maintain Codex CLI configurations across teams, CI pipelines, or custom harnesses, the churn is real. This article maps the full model timeline, explains the deprecation mechanics, and provides a practical migration playbook for the April 2026 landscape.</p>

<h2 id="the-codex-model-timeline">The Codex Model Timeline</h2>

<p>The following timeline captures every Codex-optimised model release and its current status.</p>

<pre><code class="language-mermaid">gantt
    title Codex Model Lifecycle (Sep 2025 – Jun 2026)
    dateFormat YYYY-MM-DD
    axisFormat %b %Y

    section Flagship
    GPT-5-Codex         :done,    gpt5c,   2025-09-23, 2026-04-01
    GPT-5.1-Codex       :done,    gpt51c,  2025-11-19, 2026-04-01
    GPT-5.2-Codex       :active,  gpt52c,  2025-12-18, 2026-06-05
    GPT-5.3-Codex       :active,  gpt53c,  2026-02-05, 2026-10-01
    GPT-5.4 (unified)   :active,  gpt54,   2026-03-05, 2026-10-01

    section Specialist
    GPT-5.1-Codex-Max   :done,    gpt51m,  2025-11-19, 2026-04-01
    GPT-5.3-Codex-Spark :active,  spark,   2026-02-12, 2026-10-01

    section Mini / Nano
    GPT-5-Codex-Mini     :done,   gpt5cm,  2025-09-23, 2026-04-01
    GPT-5.1-Codex-Mini   :done,   gpt51cm, 2025-11-19, 2026-04-01
    GPT-5.4-mini         :active, gpt54m,  2026-03-17, 2026-10-01
    GPT-5.4-nano         :active, gpt54n,  2026-03-17, 2026-10-01
</code></pre>

<h3 id="key-dates">Key dates</h3>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Released</th>
      <th>Deprecated</th>
      <th>Replacement</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPT-5-Codex</td>
      <td>23 Sep 2025<sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></td>
      <td>1 Apr 2026<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></td>
      <td>gpt-5.3-codex</td>
    </tr>
    <tr>
      <td>GPT-5.1-Codex</td>
      <td>19 Nov 2025<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></td>
      <td>1 Apr 2026<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></td>
      <td>gpt-5.3-codex</td>
    </tr>
    <tr>
      <td>GPT-5.1-Codex-Max</td>
      <td>19 Nov 2025<sup id="fnref:4:1" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></td>
      <td>1 Apr 2026<sup id="fnref:3:2" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></td>
      <td>gpt-5.3-codex</td>
    </tr>
    <tr>
      <td>GPT-5.1-Codex-Mini</td>
      <td>19 Nov 2025<sup id="fnref:4:2" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></td>
      <td>1 Apr 2026<sup id="fnref:3:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></td>
      <td>gpt-5.4-mini</td>
    </tr>
    <tr>
      <td>GPT-5-Codex-Mini</td>
      <td>23 Sep 2025<sup id="fnref:1:2" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></td>
      <td>1 Apr 2026<sup id="fnref:3:4" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></td>
      <td>gpt-5.4-mini</td>
    </tr>
    <tr>
      <td>GPT-5.2-Codex</td>
      <td>18 Dec 2025<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></td>
      <td>5 Jun 2026<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></td>
      <td>gpt-5.3-codex</td>
    </tr>
    <tr>
      <td>GPT-5.3-Codex</td>
      <td>5 Feb 2026<sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></td>
      <td>Current</td>
      <td>—</td>
    </tr>
    <tr>
      <td>GPT-5.3-Codex-Spark</td>
      <td>12 Feb 2026<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup></td>
      <td>Research preview</td>
      <td>—</td>
    </tr>
    <tr>
      <td>GPT-5.4</td>
      <td>5 Mar 2026<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup></td>
      <td>Current (recommended)</td>
      <td>—</td>
    </tr>
    <tr>
      <td>GPT-5.4-mini</td>
      <td>17 Mar 2026<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup></td>
      <td>Current</td>
      <td>—</td>
    </tr>
    <tr>
      <td>GPT-5.4-nano</td>
      <td>17 Mar 2026<sup id="fnref:9:1" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup></td>
      <td>Current</td>
      <td>—</td>
    </tr>
  </tbody>
</table>

<p>The April 1 deprecation wiped out the entire GPT-5.0 and GPT-5.1 Codex family in a single sweep.<sup id="fnref:3:5" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> The next deprecation wave — GPT-5.2-Codex on 5 June 2026 — is less than two months away.<sup id="fnref:6:1" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup></p>

<h2 id="what-happens-when-a-model-is-deprecated">What Happens When a Model Is Deprecated</h2>

<p>When OpenAI deprecates a Codex model, the behaviour depends on your access method:</p>

<ol>
  <li>
    <p><strong>ChatGPT-authenticated users</strong> (the default for Codex CLI): the model silently disappears from the picker. If your <code class="language-plaintext highlighter-rouge">config.toml</code> still references it, Codex falls back to the current default model.<sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup></p>
  </li>
  <li>
    <p><strong>API key users</strong>: requests to a deprecated model return an error. There is no automatic fallback — your pipeline breaks.</p>
  </li>
  <li>
    <p><strong>GitHub Copilot users</strong>: deprecated models are removed from all Copilot experiences including Chat, inline edits, and agent modes. Enterprise administrators must enable replacement models through Copilot settings policies.<sup id="fnref:3:6" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
  </li>
  <li>
    <p><strong>Azure OpenAI / Microsoft Foundry</strong>: Azure maintains its own retirement schedule which may lag behind or precede OpenAI’s by several weeks.<sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup></p>
  </li>
</ol>
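<p>For API-key pipelines, where a retired model means a hard failure, it can help to probe availability up front. The sketch below uses the public <code>GET /v1/models</code> listing endpoint; the crude <code>grep</code>-based JSON check and the function names are illustrative, not part of Codex CLI.</p>

```shell
#!/bin/sh
# Fail fast when a pinned model has been retired, rather than
# discovering it mid-pipeline. Expects OPENAI_API_KEY in the environment.

# Succeed if the model ID appears in a /v1/models JSON body read from stdin.
model_available() {
  grep -q "\"id\": *\"$1\"" -
}

# Query the live endpoint and check the pinned model.
check_model() {
  curl -fsS https://api.openai.com/v1/models \
    -H "Authorization: Bearer ${OPENAI_API_KEY}" \
    | model_available "$1" \
    || { echo "model $1 is not available; migrate before CI breaks" >&2; return 1; }
}
```

<p>Running something like <code>check_model gpt-5.2-codex</code> as an early CI step stops the pipeline with a clear message instead of an opaque mid-run API error.</p>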

<h2 id="the-configtoml-migration-mechanism">The config.toml Migration Mechanism</h2>

<p>Codex CLI includes a built-in migration map for model names. When a deprecated model is referenced in configuration, Codex can recognise the old name and suggest or apply a replacement.<sup id="fnref:12" role="doc-noteref"><a href="#fn:12" class="footnote" rel="footnote">12</a></sup></p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># ~/.codex/config.toml — before migration</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.1-codex"</span>
</code></pre></div></div>

<p>After the April 1 deprecation, this configuration will either fall back to the default or fail, depending on your authentication method. The fix is straightforward:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># ~/.codex/config.toml — after migration</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4"</span>
</code></pre></div></div>
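<p>The replacement column from the table above can also be applied mechanically. This is a sketch, not a built-in Codex CLI command: the function rewrites the retired model IDs to their replacements with <code>sed</code>, keeping a <code>.bak</code> copy of each edited file.</p>

```shell
#!/bin/sh
# Rewrite retired model IDs to their replacements (per the table above),
# leaving a .bak backup next to the edited file.
migrate_models() {
  sed -i.bak \
    -e 's/"gpt-5-codex"/"gpt-5.3-codex"/g' \
    -e 's/"gpt-5.1-codex"/"gpt-5.3-codex"/g' \
    -e 's/"gpt-5.1-codex-max"/"gpt-5.3-codex"/g' \
    -e 's/"gpt-5-codex-mini"/"gpt-5.4-mini"/g' \
    -e 's/"gpt-5.1-codex-mini"/"gpt-5.4-mini"/g' \
    "$1"
}

# Only touch the file if it exists.
if [ -f "$HOME/.codex/config.toml" ]; then
  migrate_models "$HOME/.codex/config.toml"
fi
```

<p>Because each pattern includes the closing quote, <code>"gpt-5.1-codex"</code> does not accidentally match inside <code>"gpt-5.1-codex-max"</code>, so rule order does not matter.</p>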

<h3 id="the-recommended-model-stack-april-2026">The recommended model stack (April 2026)</h3>

<p>For most workflows, OpenAI now recommends <code class="language-plaintext highlighter-rouge">gpt-5.4</code> as the default.<sup id="fnref:10:1" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup> Here is the current recommended stack:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># ~/.codex/config.toml</span>

<span class="c"># Primary model — GPT-5.4 unifies coding + reasoning + computer use</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4"</span>

<span class="c"># Review model — match or exceed your primary</span>
<span class="py">review_model</span> <span class="p">=</span> <span class="s">"gpt-5.4"</span>

<span class="c"># Reasoning effort — adjust per task complexity</span>
<span class="py">model_reasoning_effort</span> <span class="p">=</span> <span class="s">"high"</span>
<span class="py">plan_mode_reasoning_effort</span> <span class="p">=</span> <span class="s">"xhigh"</span>
</code></pre></div></div>

<h2 id="profile-based-model-management">Profile-Based Model Management</h2>

<p>The profiles system (experimental, March 2026) is the cleanest way to manage multiple model configurations and prepare for deprecation waves.<sup id="fnref:13" role="doc-noteref"><a href="#fn:13" class="footnote" rel="footnote">13</a></sup> Define profiles that isolate model choices, so a single deprecation requires only one line change per affected profile.</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># ~/.codex/config.toml</span>

<span class="c"># Default profile</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4"</span>

<span class="nn">[profiles.fast]</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4-mini"</span>
<span class="py">model_reasoning_effort</span> <span class="p">=</span> <span class="s">"low"</span>

<span class="nn">[profiles.deep]</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4"</span>
<span class="py">model_reasoning_effort</span> <span class="p">=</span> <span class="s">"xhigh"</span>
<span class="py">plan_mode_reasoning_effort</span> <span class="p">=</span> <span class="s">"xhigh"</span>

<span class="nn">[profiles.spark]</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.3-codex-spark"</span>
<span class="py">model_reasoning_effort</span> <span class="p">=</span> <span class="s">"medium"</span>

<span class="nn">[profiles.legacy-52]</span>
<span class="c"># ⚠️ Retiring 5 June 2026 — migrate to gpt-5.3-codex or gpt-5.4</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.2-codex"</span>
</code></pre></div></div>

<p>Switch profiles on the command line:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Quick task with mini</span>
codex <span class="nt">--profile</span> fast <span class="s2">"add error handling to parse_config"</span>

<span class="c"># Deep architectural review</span>
codex <span class="nt">--profile</span> deep <span class="s2">"review the authentication module for security issues"</span>

<span class="c"># Real-time iteration with Spark</span>
codex <span class="nt">--profile</span> spark <span class="s2">"refactor this function step by step"</span>
</code></pre></div></div>

<h2 id="the-gpt-54-unification">The GPT-5.4 Unification</h2>

<p>GPT-5.4, released 5 March 2026, represents a significant architectural shift.<sup id="fnref:8:1" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> It is the first mainline reasoning model to incorporate the frontier coding capabilities previously exclusive to the Codex-specific model line. In practical terms:</p>

<ul>
  <li><strong>GPT-5.3-Codex</strong> remains the best pure coding model, scoring highest on SWE-bench Verified<sup id="fnref:2:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></li>
  <li><strong>GPT-5.4</strong> matches or exceeds GPT-5.3-Codex on coding while adding native computer use (75% OSWorld), stronger reasoning, and 1M token extended context<sup id="fnref:8:2" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup><sup id="fnref:14" role="doc-noteref"><a href="#fn:14" class="footnote" rel="footnote">14</a></sup></li>
  <li><strong>GPT-5.4-mini</strong> delivers 54.4% on SWE-Bench Pro at 30% of the credit consumption of the flagship — purpose-built for subagents<sup id="fnref:9:2" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup></li>
</ul>

<pre><code class="language-mermaid">flowchart TD
    A[Task arrives] --&gt; B{Task complexity?}
    B --&gt;|Simple fix / subagent| C[gpt-5.4-mini]
    B --&gt;|Standard development| D[gpt-5.4]
    B --&gt;|Pure coding, max accuracy| E[gpt-5.3-codex]
    B --&gt;|Real-time iteration| F[gpt-5.3-codex-spark]

    C --&gt; G{Cost sensitive?}
    G --&gt;|Yes| H[gpt-5.4-nano]
    G --&gt;|No| C

    D --&gt; I[Default recommendation]
    E --&gt; J[Legacy Codex-line — still current]
    F --&gt; K[Pro subscribers only]

    style I fill:#2d6,stroke:#333,color:#fff
    style J fill:#26d,stroke:#333,color:#fff
    style K fill:#d62,stroke:#333,color:#fff
</code></pre>

<p>The question on many developers’ minds — raised publicly by Simon Willison — is whether the Codex model line will merge entirely into the mainline GPT series.<sup id="fnref:15" role="doc-noteref"><a href="#fn:15" class="footnote" rel="footnote">15</a></sup> The introduction of <code class="language-plaintext highlighter-rouge">gpt-5-codex</code> and <code class="language-plaintext highlighter-rouge">gpt-5-codex-mini</code> as unified model identifiers in late March 2026 suggests the answer is yes.<sup id="fnref:16" role="doc-noteref"><a href="#fn:16" class="footnote" rel="footnote">16</a></sup></p>

<h2 id="subagent-model-configuration-for-multi-agent-workflows">Subagent Model Configuration for Multi-Agent Workflows</h2>

<p>Deprecations hit hardest in multi-agent configurations where different agents may reference different models. With the April 2026 changes, audit every agent TOML file in <code class="language-plaintext highlighter-rouge">.codex/agents/</code>:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># .codex/agents/reviewer.toml — BEFORE (broken after April 1)</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.1-codex-max"</span>
<span class="py">model_reasoning_effort</span> <span class="p">=</span> <span class="s">"xhigh"</span>

<span class="c"># .codex/agents/reviewer.toml — AFTER</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4"</span>
<span class="py">model_reasoning_effort</span> <span class="p">=</span> <span class="s">"xhigh"</span>
</code></pre></div></div>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># .codex/agents/worker.toml — BEFORE (broken after April 1)</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.1-codex-mini"</span>
<span class="py">model_reasoning_effort</span> <span class="p">=</span> <span class="s">"medium"</span>

<span class="c"># .codex/agents/worker.toml — AFTER</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4-mini"</span>
<span class="py">model_reasoning_effort</span> <span class="p">=</span> <span class="s">"medium"</span>
</code></pre></div></div>

<p>For the <code class="language-plaintext highlighter-rouge">[agents]</code> section controlling subagent defaults:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[agents]</span>
<span class="py">max_threads</span> <span class="p">=</span> <span class="mi">4</span>
<span class="py">max_depth</span> <span class="p">=</span> <span class="mi">2</span>
<span class="c"># Subagent model — use mini for cost efficiency</span>
<span class="c"># Previously gpt-5.1-codex-mini, now:</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4-mini"</span>
</code></pre></div></div>
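<p>A quick way to audit the agent files in bulk is a small shell loop. This is a sketch: the directory layout follows the convention above, and the pattern matches only the retired GPT-5.0 and GPT-5.1 Codex IDs, not the still-current <code>gpt-5.3-codex</code>.</p>

```shell
#!/bin/sh
# List every agent definition still pinned to a retired Codex model.
# Matches gpt-5-codex* and gpt-5.1-codex* but not gpt-5.3-codex or gpt-5.4.
audit_agents() {
  dir="${1:-.codex/agents}"
  found=0
  for f in "$dir"/*.toml; do
    [ -e "$f" ] || continue
    if grep -nE '"gpt-5(\.1)?-codex' "$f"; then
      echo "retired model reference in $f" >&2
      found=1
    fi
  done
  return $found
}
```

<p>The function returns non-zero when any agent file needs migrating, so it can gate a CI job or a pre-commit hook.</p>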

<h2 id="cicd-pipeline-migration">CI/CD Pipeline Migration</h2>

<p>Pipelines using <code class="language-plaintext highlighter-rouge">codex exec</code> with explicit model flags are the most fragile. A deprecated model causes an immediate hard failure in CI.</p>

<h3 id="defensive-pattern-environment-variable-indirection">Defensive pattern: environment variable indirection</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># .github/workflows/codex-review.yml</span>
<span class="nb">env</span>:
  CODEX_MODEL: <span class="s2">"gpt-5.4"</span>
  CODEX_SUBAGENT_MODEL: <span class="s2">"gpt-5.4-mini"</span>

steps:
  - name: Run Codex review
    run: |
      codex <span class="nb">exec</span> <span class="se">\</span>
        <span class="nt">-c</span> <span class="nv">model</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">CODEX_MODEL</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
        <span class="nt">-c</span> <span class="nv">review_model</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">CODEX_MODEL</span><span class="k">}</span><span class="s2">"</span> <span class="se">\</span>
        <span class="s2">"Review all changed files for security issues"</span>
</code></pre></div></div>

<p>When the next deprecation arrives, update a single environment variable rather than hunting through workflow files.</p>

<h3 id="defensive-pattern-profile-based-ci">Defensive pattern: profile-based CI</h3>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># .codex/config.toml (committed to repo)</span>
<span class="nn">[profiles.ci]</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4"</span>
<span class="py">model_reasoning_effort</span> <span class="p">=</span> <span class="s">"high"</span>
<span class="py">approval_policy</span> <span class="p">=</span> <span class="s">"full-auto"</span>
<span class="py">sandbox_mode</span> <span class="p">=</span> <span class="s">"locked-network"</span>
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>codex <span class="nb">exec</span> <span class="nt">--profile</span> ci <span class="s2">"run the test suite and fix failures"</span>
</code></pre></div></div>

<h2 id="the-june-2026-deprecation-preparing-now">The June 2026 Deprecation: Preparing Now</h2>

<p>GPT-5.2-Codex retires on 5 June 2026.<sup id="fnref:6:2" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup> If you or your team still reference <code class="language-plaintext highlighter-rouge">gpt-5.2-codex</code> anywhere, here is a migration checklist:</p>

<ol>
  <li><strong>Audit all config files</strong>: <code class="language-plaintext highlighter-rouge">grep -rn "gpt-5.2" ~/.codex/ .codex/</code> (the recursive search already covers <code class="language-plaintext highlighter-rouge">.codex/agents/</code>)</li>
  <li><strong>Check CI/CD</strong>: search workflow files for hardcoded model strings</li>
  <li><strong>Update AGENTS.md</strong>: if any agent instructions reference specific model names, update them</li>
  <li><strong>Test with the replacement</strong>: switch to <code class="language-plaintext highlighter-rouge">gpt-5.3-codex</code> or <code class="language-plaintext highlighter-rouge">gpt-5.4</code> and verify your workflows produce equivalent output</li>
  <li><strong>Update custom harnesses</strong>: any code using the Responses API with explicit model parameters needs updating</li>
  <li><strong>Notify the team</strong>: if you use project-scoped <code class="language-plaintext highlighter-rouge">.codex/config.toml</code>, push the model change as a PR</li>
</ol>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Quick audit across a monorepo (.codex/ already covers .codex/agents/)</span>
<span class="nb">grep</span> <span class="nt">-rnE</span> <span class="s2">"gpt-5</span><span class="se">\.</span><span class="s2">[12]"</span> <span class="se">\</span>
  ~/.codex/config.toml <span class="se">\</span>
  .codex/ <span class="se">\</span>
  .github/workflows/ <span class="se">\</span>
  2&gt;/dev/null
</code></pre></div></div>

<h2 id="azure-openai-considerations">Azure OpenAI Considerations</h2>

<p>Azure OpenAI maintains a separate retirement schedule through Microsoft Foundry.<sup id="fnref:11:1" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup> Key differences:</p>

<ul>
  <li>Azure deployments use deployment names, not model IDs — a deprecation requires redeploying, not just changing a string</li>
  <li>Azure retirements may lag behind OpenAI’s by weeks</li>
  <li>The <code class="language-plaintext highlighter-rouge">api-version</code> query parameter in your <code class="language-plaintext highlighter-rouge">[model_providers]</code> block must match the deployment’s API version</li>
</ul>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[model_providers.azure]</span>
<span class="py">base_url</span> <span class="p">=</span> <span class="s">"https://your-resource.openai.azure.com/openai"</span>
<span class="py">wire_api</span> <span class="p">=</span> <span class="s">"responses"</span>

<span class="nn">[model_providers.azure.query_params]</span>
<span class="py">api-version</span> <span class="p">=</span> <span class="s">"2026-03-01-preview"</span>
</code></pre></div></div>

<p>⚠️ Azure Entra ID token authentication with Codex CLI has a known limitation (issue #13241) — static API keys remain more reliable for automated workflows.</p>

<h2 id="best-practices-for-model-lifecycle-management">Best Practices for Model Lifecycle Management</h2>

<ol>
  <li><strong>Never hardcode model names in scripts</strong> — use config.toml profiles or environment variables</li>
  <li><strong>Pin to the recommended model</strong> (<code class="language-plaintext highlighter-rouge">gpt-5.4</code>) unless you have a specific reason not to</li>
  <li><strong>Subscribe to the changelog</strong> at <a href="https://developers.openai.com/codex/changelog">developers.openai.com/codex/changelog</a> and the <a href="https://github.blog/changelog/">GitHub Changelog</a> for deprecation notices</li>
  <li><strong>Test model changes in a branch</strong> before rolling out to the team</li>
  <li><strong>Use the <code class="language-plaintext highlighter-rouge">codex exec</code> structured output</strong> (<code class="language-plaintext highlighter-rouge">--output-schema</code>) to detect regressions when switching models</li>
  <li><strong>Keep subagent models one tier below the primary</strong> — <code class="language-plaintext highlighter-rouge">gpt-5.4-mini</code> for subagents, <code class="language-plaintext highlighter-rouge">gpt-5.4</code> for the orchestrator</li>
  <li><strong>Set calendar reminders</strong> for announced deprecation dates — the June 5 GPT-5.2 retirement is next</li>
</ol>

<h2 id="what-is-next">What Is Next</h2>

<p>The model identifier consolidation — with <code class="language-plaintext highlighter-rouge">gpt-5-codex</code> and <code class="language-plaintext highlighter-rouge">gpt-5-codex-mini</code> appearing as unified aliases in late March<sup id="fnref:16:1" role="doc-noteref"><a href="#fn:16" class="footnote" rel="footnote">16</a></sup> — suggests OpenAI may move toward rolling model identifiers that always point to the latest Codex-optimised model. If this happens, explicit version pinning would become opt-in rather than the default, significantly reducing deprecation churn.</p>

<p>Until then, treat model lifecycle management as a first-class operational concern. The seven-month pattern is clear: new Codex models arrive every 6–10 weeks, and old ones retire within roughly four to six months. Plan accordingly.</p>

<hr />

<h2 id="citations">Citations</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://developers.openai.com/api/docs/models/gpt-5-codex">GPT-5-Codex Model documentation</a> — OpenAI API reference, September 2025 <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://openai.com/index/introducing-gpt-5-3-codex/">Introducing GPT-5.3-Codex</a> — OpenAI blog, 5 February 2026 <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://github.blog/changelog/2026-04-03-gpt-5-1-codex-gpt-5-1-codex-max-and-gpt-5-1-codex-mini-deprecated/">GPT-5.1 Codex, GPT-5.1-Codex-Max, and GPT-5.1-Codex-Mini deprecated</a> — GitHub Changelog, 3 April 2026 <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:3:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:3:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:3:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:3:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:3:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://openai.com/index/gpt-5-1-codex-max/">Building more with GPT-5.1-Codex-Max</a> — OpenAI blog, 19 November 2025 <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:4:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://openai.com/index/introducing-gpt-5-2-codex/">Introducing GPT-5.2-Codex</a> — OpenAI blog, 18 December 2025 <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://openai.com/index/retiring-gpt-4o-and-older-models/">Retiring GPT-4o and older models</a> — OpenAI blog, February 2026; GPT-5.2 Thinking retires 5 June 2026 <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:6:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/changelog">Codex CLI Changelog — Codex-Spark research preview</a> — OpenAI Developers, February 2026 <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p><a href="https://openai.com/index/introducing-gpt-5-4/">Introducing GPT-5.4</a> — OpenAI blog, 5 March 2026 <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:8:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:8:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:9" role="doc-endnote">
      <p><a href="https://openai.com/index/introducing-gpt-5-4-mini-and-nano/">Introducing GPT-5.4 mini and nano</a> — OpenAI blog, 17 March 2026 <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:9:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:9:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:10" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/models">Codex Models documentation</a> — OpenAI Developers, current <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:10:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:11" role="doc-endnote">
      <p><a href="https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/model-retirements">Azure OpenAI Model Retirements</a> — Microsoft Learn, current <a href="#fnref:11" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:11:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:12" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/config-reference">Codex Configuration Reference</a> — OpenAI Developers, current <a href="#fnref:12" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:13" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/config-sample">Codex Sample Configuration</a> — OpenAI Developers, current <a href="#fnref:13" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:14" role="doc-endnote">
      <p><a href="https://www.nxcode.io/resources/news/gpt-5-4-complete-guide-features-pricing-models-2026">GPT-5.4 Complete Guide 2026</a> — NxCode, March 2026 <a href="#fnref:14" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:15" role="doc-endnote">
      <p><a href="https://openai.com/index/introducing-gpt-5-4/">GPT-5.4 discussion — Simon Willison’s question on Codex model line merger</a> — referenced in community discussion, March 2026 <a href="#fnref:15" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:16" role="doc-endnote">
      <p><a href="https://danielvaughan.com/codex-resources/articles/2026-03-30-gpt-5-codex-new-flagship-model-guide/">gpt-5-codex: The New Codex Flagship and What It Means for Your Workflow</a> — Daniel Vaughan / Codex Resources, 30 March 2026 <a href="#fnref:16" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:16:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
  </ol>
</div>]]></content><author><name>Daniel Vaughan</name></author><summary type="html"><![CDATA[Codex CLI Model Lifecycle: Navigating Deprecations, Migrations, and the GPT-5.x Transition]]></summary></entry><entry><title type="html">Codified Context: The Three-Tier Knowledge Architecture for AI Coding Agents</title><link href="https://codex.danielvaughan.com/2026/04/07/codified-context-three-tier-knowledge-architecture/" rel="alternate" type="text/html" title="Codified Context: The Three-Tier Knowledge Architecture for AI Coding Agents" /><published>2026-04-07T00:00:00+01:00</published><updated>2026-04-07T00:00:00+01:00</updated><id>https://codex.danielvaughan.com/2026/04/07/codified-context-three-tier-knowledge-architecture</id><content type="html" xml:base="https://codex.danielvaughan.com/2026/04/07/codified-context-three-tier-knowledge-architecture/"><![CDATA[<h1 id="codified-context-the-three-tier-knowledge-architecture-for-ai-coding-agents">Codified Context: The Three-Tier Knowledge Architecture for AI Coding Agents</h1>

<hr />

<p>Dumping everything into a single <code class="language-plaintext highlighter-rouge">AGENTS.md</code> file works until it doesn’t. At some point—typically around 20,000 lines of code—you hit the context wall: the constitution grows unwieldy, the agent forgets domain nuances, and you find yourself re-explaining the same architectural constraints every session. Aristidis Vasilopoulos’s February 2026 paper, <em>Codified Context: Infrastructure for AI Agents in a Complex Codebase</em> <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, offers a rigorous, empirically validated alternative: a three-tier knowledge architecture that maps cleanly onto Codex CLI’s existing primitives.</p>

<p>This article unpacks the paper’s findings, maps them to Codex CLI’s current feature set, and provides concrete implementation patterns.</p>

<h2 id="the-three-tier-model">The Three-Tier Model</h2>

<p>The core insight is straightforward: not all context is equal. Some knowledge must be present in every session (hot memory), some is needed only for specific task types (warm memory, specialist knowledge), and some is referenced rarely but must be queryable on demand (cold memory). The paper formalises this into three tiers <sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

<pre><code class="language-mermaid">graph TD
    A[Human Prompt] --&gt; B{Session Start}
    B --&gt; C[Tier 1: Constitution&lt;br/&gt;Always Loaded]
    C --&gt; D{Task Classification}
    D --&gt; E[Tier 2: Specialist Agent&lt;br/&gt;Invoked Per Task]
    D --&gt; F[Tier 2: Another Specialist&lt;br/&gt;Invoked Per Task]
    E --&gt; G{Need Reference Data?}
    F --&gt; G
    G --&gt;|Yes| H[Tier 3: MCP Knowledge Server&lt;br/&gt;Queried On Demand]
    G --&gt;|No| I[Execute Task]
    H --&gt; I
</code></pre>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Role</th>
      <th>Codex CLI Mapping</th>
      <th>Files</th>
      <th>Lines</th>
      <th>% of Codebase</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>T1</td>
      <td>Constitution (Hot Memory)</td>
      <td><code class="language-plaintext highlighter-rouge">AGENTS.md</code></td>
      <td>1</td>
      <td>~660</td>
      <td>0.6%</td>
    </tr>
    <tr>
      <td>T2</td>
      <td>Specialist Agents (Warm)</td>
      <td><code class="language-plaintext highlighter-rouge">.codex/agents/*.toml</code></td>
      <td>19</td>
      <td>~9,300</td>
      <td>8.6%</td>
    </tr>
    <tr>
      <td>T3</td>
      <td>Knowledge Base (Cold Memory)</td>
      <td>MCP knowledge servers</td>
      <td>34</td>
      <td>~16,250</td>
      <td>15.0%</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td> </td>
      <td> </td>
      <td><strong>54</strong></td>
      <td><strong>~26,200</strong></td>
      <td><strong>24.2%</strong></td>
    </tr>
  </tbody>
</table>

<p>The metrics come from a real 108,000-line C# distributed system tracked across 283 development sessions and 2,801 human prompts <sup id="fnref:1:2" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Crucially, Vasilopoulos warns that the 24.2% context infrastructure ratio reflects this project’s complexity and domain—it is not a universal target <sup id="fnref:1:3" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<h2 id="tier-1-the-constitution-agentsmd">Tier 1: The Constitution (AGENTS.md)</h2>

<p>The constitution is the only file that loads into every session. It defines non-negotiable rules: coding standards, architectural boundaries, forbidden patterns, and the trigger table that routes tasks to specialist agents.</p>

<p>In Codex CLI, this maps directly to <code class="language-plaintext highlighter-rouge">AGENTS.md</code> <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Since February 2026, AGENTS.md has been an open standard under the Linux Foundation’s Agentic AI Foundation, readable by Codex, Cursor, Copilot, Amp, Windsurf, and Gemini CLI <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. Codex loads AGENTS.md from both <code class="language-plaintext highlighter-rouge">~/.codex/</code> (global) and per-directory (repo-scoped), with closer files taking precedence <sup id="fnref:2:1" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
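
<p>That layering rule can be made concrete with a short sketch. This is an illustration of the documented precedence, not Codex’s actual loader; the <code class="language-plaintext highlighter-rouge">repo_root</code> parameter is a simplifying assumption standing in for wherever the real discovery stops:</p>

```python
from pathlib import Path

def agents_md_chain(cwd, repo_root, home):
    """Return AGENTS.md paths in load order: the global file first, then
    each directory from the repo root down to cwd. Later entries are
    closer to the working directory and take precedence on conflict."""
    chain = []
    global_file = Path(home) / ".codex" / "AGENTS.md"
    if global_file.is_file():
        chain.append(global_file)
    cwd_path = Path(cwd).resolve()
    root = Path(repo_root).resolve()
    # Directories from the repo root down to cwd; closer files load later.
    lineage = [d for d in [*reversed(cwd_path.parents), cwd_path]
               if d == root or root in d.parents]
    for directory in lineage:
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            chain.append(candidate)
    return chain
```

<p>Reading the chain in order and letting later files override earlier ones reproduces the “closest file wins” behaviour described above.</p>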

<p>A well-structured constitution for a tiered architecture includes the trigger table directly:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># AGENTS.md</span>

<span class="gu">## Routing Rules</span>

When the task involves <span class="gs">**network protocols or sync logic**</span>, delegate to
the <span class="sb">`network-protocol-designer`</span> agent before making changes.

When the task involves <span class="gs">**coordinates, camera, or spatial transforms**</span>,
delegate to the <span class="sb">`coordinate-wizard`</span> agent.

After any structural change, invoke the <span class="sb">`code-reviewer-game-dev`</span> agent
for review.

<span class="gu">## Architectural Boundaries</span>
<span class="p">
-</span> ECS components MUST NOT hold references to MonoBehaviours
<span class="p">-</span> Network messages MUST be defined in the shared assembly
<span class="p">-</span> All coordinate transforms go through CoordinateService
</code></pre></div></div>

<p>Research by Santos et al. found that well-structured AGENTS.md files correlate with a 29% reduction in median runtime and 17% reduction in output token consumption <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>

<h2 id="tier-2-specialist-agents">Tier 2: Specialist Agents</h2>

<p>This is where the paper’s approach diverges from the “one massive context file” pattern. Rather than cramming domain knowledge into the constitution, each specialist area gets its own agent definition with focused expertise.</p>

<p>In Codex CLI, custom agents live in <code class="language-plaintext highlighter-rouge">.codex/agents/</code> (project-scoped) or <code class="language-plaintext highlighter-rouge">~/.codex/agents/</code> (personal) as TOML files <sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. Subagents and custom agents reached GA on 16 March 2026 <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># .codex/agents/network-protocol-designer.toml</span>
<span class="py">name</span> <span class="p">=</span> <span class="s">"network-protocol-designer"</span>
<span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4"</span>
<span class="py">model_reasoning_effort</span> <span class="p">=</span> <span class="s">"high"</span>
<span class="py">sandbox_mode</span> <span class="p">=</span> <span class="s">"read-only"</span>

<span class="nn">[instructions]</span>
<span class="py">content</span> <span class="p">=</span> <span class="s">"""
You are the network protocol specialist for ProjectX.
Key constraints:
- All messages use the NetworkMessage base class
- Serialisation uses MessagePack, never JSON
- Maximum message size: 512 bytes
- Tick rate: 20Hz server, 60Hz client interpolation
- See specs/network-protocol-v3.md for the full wire format
"""</span>
</code></pre></div></div>

<p>The paper’s 108K-line project used 19 specialist agents. Across 757 classifiable agent invocations, 432 (57%) went to project-specific specialists rather than built-in tool agents <sup id="fnref:1:4" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. The most frequently invoked were the code reviewer (154 invocations) and the network-protocol-designer (85 invocations) <sup id="fnref:1:5" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<h3 id="the-trigger-table-pattern">The Trigger Table Pattern</h3>

<p>The paper formalises task routing through a trigger table—a mapping from signals in the human prompt to the appropriate specialist <sup id="fnref:1:6" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Trigger Phase</th>
      <th>Signal</th>
      <th>Agent</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Pre-change</td>
      <td>Network, sync</td>
      <td>network-protocol-designer</td>
    </tr>
    <tr>
      <td>Pre-change</td>
      <td>Coordinates, camera</td>
      <td>coordinate-wizard</td>
    </tr>
    <tr>
      <td>Pre-change</td>
      <td>Abilities end-to-end</td>
      <td>ability-designer</td>
    </tr>
    <tr>
      <td>Post-change</td>
      <td>Architecture, design</td>
      <td>systems-designer</td>
    </tr>
    <tr>
      <td>Post-change</td>
      <td>ECS or network files</td>
      <td>code-reviewer-game-dev</td>
    </tr>
  </tbody>
</table>

<p>In practice, you encode this in your AGENTS.md (Tier 1) and rely on the model to follow the routing. Note that Codex CLI does not currently auto-spawn custom subagents—explicit delegation prompts are required <sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. There is an open issue (#14161) regarding <code class="language-plaintext highlighter-rouge">[[skills.config]]</code> in agent TOML being ignored for sub-agents <sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>.</p>
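
<p>The routing behaviour the constitution asks the model to follow is simple enough to sketch deterministically. The snippet below is an illustrative reimplementation of the trigger table above, not part of Codex CLI; the signal keywords and agent names are taken directly from the table:</p>

```python
# Trigger table: (phase, signal keywords) -> specialist agent.
TRIGGERS = [
    ("pre-change", {"network", "sync"}, "network-protocol-designer"),
    ("pre-change", {"coordinates", "camera"}, "coordinate-wizard"),
    ("pre-change", {"abilities"}, "ability-designer"),
    ("post-change", {"architecture", "design"}, "systems-designer"),
    ("post-change", {"ecs", "network"}, "code-reviewer-game-dev"),
]

def route(prompt, phase):
    """Return the specialist agents whose signals appear in the prompt."""
    words = set(prompt.lower().split())
    return [agent for p, signals, agent in TRIGGERS
            if p == phase and signals & words]
```

<p>In the real workflow the model performs this matching itself from the prose rules in Tier 1; the point of the sketch is that the routing is mechanical, which is why terse trigger tables work.</p>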

<h2 id="tier-3-mcp-knowledge-servers">Tier 3: MCP Knowledge Servers</h2>

<p>Cold memory—specification documents, API references, wire format definitions—lives behind MCP (Model Context Protocol) servers. These are queried on demand rather than loaded into every session, keeping the base context window lean.</p>

<p>Codex CLI treats MCP as a first-class citizen <sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>. Configuration lives in <code class="language-plaintext highlighter-rouge">.codex/config.toml</code>:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># .codex/config.toml</span>
<span class="nn">[mcp_servers.knowledge-base]</span>
<span class="py">type</span> <span class="p">=</span> <span class="s">"stdio"</span>
<span class="py">command</span> <span class="p">=</span> <span class="s">"node"</span>
<span class="py">args</span> <span class="p">=</span> <span class="nn">["./mcp-servers/knowledge-retriever/index.js"]</span>

<span class="nn">[mcp_servers.specs-server]</span>
<span class="py">type</span> <span class="p">=</span> <span class="s">"http"</span>
<span class="py">url</span> <span class="p">=</span> <span class="s">"http://localhost:3001/mcp"</span>
</code></pre></div></div>

<p>MCP servers are managed via <code class="language-plaintext highlighter-rouge">codex mcp add</code>, <code class="language-plaintext highlighter-rouge">codex mcp list</code>, and <code class="language-plaintext highlighter-rouge">codex mcp login</code> <sup id="fnref:8:1" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>. Servers launch automatically when a session starts and support both STDIO and streaming HTTP transports <sup id="fnref:8:2" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>.</p>

<p>The paper’s companion repository provides a reference MCP retrieval server that exposes two key tools <sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">find_relevant_context(task)</code> — returns matching specification fragments</li>
  <li><code class="language-plaintext highlighter-rouge">suggest_agent(task)</code> — recommends the appropriate Tier 2 specialist</li>
</ul>
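
<p>The reference server’s internals are not reproduced in the paper, but the shape of both tools can be approximated with plain keyword overlap. The fragment corpus and scoring below are hypothetical stand-ins, not the companion repository’s code:</p>

```python
def find_relevant_context(task, fragments):
    """Rank specification fragments by keyword overlap with the task.
    `fragments` maps a source name (e.g. a spec file) to its text."""
    task_words = set(task.lower().split())
    scored = []
    for name, text in fragments.items():
        overlap = len(task_words & set(text.lower().split()))
        if overlap:
            scored.append((overlap, name))
    return [name for _, name in sorted(scored, reverse=True)]

def suggest_agent(task, trigger_table):
    """Recommend the Tier 2 specialist whose signals best match the task.
    `trigger_table` maps agent names to their signal keyword sets."""
    task_words = set(task.lower().split())
    best = max(trigger_table.items(),
               key=lambda item: len(item[1] & task_words))
    return best[0] if best[1] & task_words else "general"
```

<p>A production server would use embeddings or a proper index rather than word overlap, but the contract is the same: small, targeted fragments in, so the base context window stays lean.</p>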

<pre><code class="language-mermaid">sequenceDiagram
    participant H as Human
    participant C as Codex CLI
    participant A as AGENTS.md (T1)
    participant S as Specialist Agent (T2)
    participant M as MCP Server (T3)

    H-&gt;&gt;C: "Refactor network handshake"
    C-&gt;&gt;A: Load constitution
    A--&gt;&gt;C: Route to network-protocol-designer
    C-&gt;&gt;S: Spawn specialist agent
    S-&gt;&gt;M: find_relevant_context("network handshake")
    M--&gt;&gt;S: specs/network-protocol-v3.md (relevant sections)
    S--&gt;&gt;C: Proposed changes
    C-&gt;&gt;A: Route to code-reviewer-game-dev
    Note over C: Post-change review trigger
</code></pre>

<h2 id="practical-implementation">Practical Implementation</h2>

<h3 id="bootstrapping-the-architecture">Bootstrapping the Architecture</h3>

<p>The companion repository includes three factory agents for bootstrapping the tier infrastructure in an existing project <sup id="fnref:9:1" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>:</p>

<ol>
  <li><strong>Constitution Generator</strong> — analyses the codebase and drafts an initial AGENTS.md</li>
  <li><strong>Agent Extractor</strong> — identifies domain clusters and generates specialist TOML files</li>
  <li><strong>Knowledge Indexer</strong> — catalogues specification documents for MCP serving</li>
</ol>

<h3 id="maintenance-budget">Maintenance Budget</h3>

<p>The paper reports a maintenance overhead of approximately 1–2 hours per week: twice-weekly review passes of 30–45 minutes each <sup id="fnref:1:7" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Meta-infrastructure prompts—those specifically about building and maintaining the knowledge architecture itself—accounted for just 4.3% of substantive prompts <sup id="fnref:1:8" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<h3 id="prompt-efficiency">Prompt Efficiency</h3>

<p>A striking finding: over 80% of human prompts in the study were 100 words or fewer <sup id="fnref:1:9" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. The tiered architecture front-loads context so thoroughly that terse prompts suffice. This aligns with the broader principle that good context engineering reduces prompt engineering effort.</p>

<h2 id="connecting-to-the-agentic-pod-pattern">Connecting to the Agentic Pod Pattern</h2>

<p>The three-tier model maps naturally onto the emerging agentic pod architecture, where multiple AI agents collaborate on a shared codebase:</p>

<pre><code class="language-mermaid">graph LR
    subgraph Pod
        O[Orchestrator] --&gt; A1[Specialist: Architecture]
        O --&gt; A2[Specialist: Testing]
        O --&gt; A3[Specialist: Security]
        O --&gt; A4[Specialist: Domain Expert]
    end
    A1 &amp; A2 &amp; A3 &amp; A4 --&gt; KB[MCP Knowledge Servers]
    O --&gt; CONST[AGENTS.md Constitution]
</code></pre>

<p>Each pod member is effectively a Tier 2 specialist, the shared constitution (Tier 1) ensures consistency, and MCP servers (Tier 3) provide the shared reference library. The Codex CLI subagent system supports spawning specialists in parallel and collecting results <sup id="fnref:6:1" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>, making this pattern directly implementable today.</p>
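
<p>The fan-out/collect step of the pod pattern reduces to a small orchestration skeleton. In the sketch below, <code class="language-plaintext highlighter-rouge">run_specialist</code> is a placeholder for however you actually spawn a subagent (for example a <code class="language-plaintext highlighter-rouge">codex exec</code> subprocess); it is not a Codex CLI API:</p>

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(task, specialists, run_specialist):
    """Send the same task to every pod specialist in parallel and
    collect their results. `run_specialist(name, task)` is a stand-in
    for spawning a subagent and returning its output."""
    with ThreadPoolExecutor(max_workers=len(specialists)) as pool:
        futures = {name: pool.submit(run_specialist, name, task)
                   for name in specialists}
        # Block until every specialist reports back.
        return {name: future.result() for name, future in futures.items()}
```

<p>The orchestrator then reconciles the per-specialist results against the constitution before committing any change.</p>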

<h2 id="current-model-considerations">Current Model Considerations</h2>

<p>When configuring Tier 2 agents, note the current model landscape. As of April 2026, <strong>GPT-5.4</strong> is the recommended default model, combining coding, reasoning, and native computer use <sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>. The GPT-5.1-Codex family was deprecated on 3 April 2026 <sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>. GPT-5.3-Codex and GPT-5.2-Codex remain available for specific use cases <sup id="fnref:10:1" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>. Authentication now primarily uses “Sign in with ChatGPT” rather than API keys <sup id="fnref:10:2" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>.</p>

<h2 id="key-takeaways">Key Takeaways</h2>

<ol>
  <li><strong>Separate hot, warm, and cold context</strong> — not everything belongs in AGENTS.md</li>
  <li><strong>The trigger table is the glue</strong> — encode routing rules in Tier 1, domain knowledge in Tier 2</li>
  <li><strong>MCP servers keep the context window lean</strong> — query specifications on demand, don’t pre-load them</li>
  <li><strong>57% specialist usage</strong> validates the investment in domain-specific agents <sup id="fnref:1:10" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></li>
  <li><strong>1–2 hours per week</strong> is a realistic maintenance budget for a complex project <sup id="fnref:1:11" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></li>
  <li><strong>24.2% is not a target</strong> — measure your own ratio and adjust to your project’s needs</li>
</ol>

<p>The paper’s companion repository at <a href="https://github.com/arisvas4/codified-context-infrastructure">github.com/arisvas4/codified-context-infrastructure</a> provides a complete reference implementation <sup id="fnref:9:2" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>.</p>

<h2 id="citations">Citations</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Vasilopoulos, A. (2026). “Codified Context: Infrastructure for AI Agents in a Complex Codebase.” arXiv:2602.20478v1. <a href="https://arxiv.org/abs/2602.20478">https://arxiv.org/abs/2602.20478</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:1:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:1:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:1:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:1:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a> <a href="#fnref:1:7" class="reversefootnote" role="doc-backlink">&#8617;<sup>8</sup></a> <a href="#fnref:1:8" class="reversefootnote" role="doc-backlink">&#8617;<sup>9</sup></a> <a href="#fnref:1:9" class="reversefootnote" role="doc-backlink">&#8617;<sup>10</sup></a> <a href="#fnref:1:10" class="reversefootnote" role="doc-backlink">&#8617;<sup>11</sup></a> <a href="#fnref:1:11" class="reversefootnote" role="doc-backlink">&#8617;<sup>12</sup></a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>OpenAI. “Codex CLI — AGENTS.md Guide.” <a href="https://developers.openai.com/codex/guides/agents-md">https://developers.openai.com/codex/guides/agents-md</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>Linux Foundation Agentic AI Foundation. AGENTS.md open standard. Referenced in: “AGENTS.md: The Open Standard for Cross-Tool AI Agent Portability.” <a href="https://developers.openai.com/codex/guides/agents-md">https://developers.openai.com/codex/guides/agents-md</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>Santos, R. et al. Referenced in Substack analysis: “Scaling your coding agent’s context beyond a single AGENTS.md-file.” <a href="https://ursula8sciform.substack.com/p/scaling-your-coding-agents-context">https://ursula8sciform.substack.com/p/scaling-your-coding-agents-context</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>OpenAI. “Codex CLI — Subagents.” <a href="https://developers.openai.com/codex/subagents">https://developers.openai.com/codex/subagents</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p>Simon Willison. “Codex Subagents.” 16 March 2026. <a href="https://simonwillison.net/2026/Mar/16/codex-subagents/">https://simonwillison.net/2026/Mar/16/codex-subagents/</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p>GitHub issue #14161 — <code class="language-plaintext highlighter-rouge">[[skills.config]]</code> in agent TOML ignored for sub-agents. <a href="https://github.com/openai/codex/issues/14161">https://github.com/openai/codex/issues/14161</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p>OpenAI. “Codex CLI — MCP Integration.” <a href="https://developers.openai.com/codex/mcp">https://developers.openai.com/codex/mcp</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:8:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:8:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:9" role="doc-endnote">
      <p>Vasilopoulos, A. Codified Context Infrastructure — companion repository. <a href="https://github.com/arisvas4/codified-context-infrastructure">https://github.com/arisvas4/codified-context-infrastructure</a> <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:9:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:9:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:10" role="doc-endnote">
      <p>OpenAI. “Codex CLI — Models.” <a href="https://developers.openai.com/codex/models">https://developers.openai.com/codex/models</a> <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:10:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:10:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:11" role="doc-endnote">
      <p>GitHub Blog. “GPT-5.1-Codex, GPT-5.1-Codex-Max, and GPT-5.1-Codex-Mini deprecated.” 3 April 2026. <a href="https://github.blog/changelog/2026-04-03-gpt-5-1-codex-gpt-5-1-codex-max-and-gpt-5-1-codex-mini-deprecated/">https://github.blog/changelog/2026-04-03-gpt-5-1-codex-gpt-5-1-codex-max-and-gpt-5-1-codex-mini-deprecated/</a> <a href="#fnref:11" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Daniel Vaughan</name></author><summary type="html"><![CDATA[Codified Context: The Three-Tier Knowledge Architecture for AI Coding Agents]]></summary></entry><entry><title type="html">Automating the Cross-Model Review Loop: Three Levels from SKILL.md to Multi-AI Pipeline</title><link href="https://codex.danielvaughan.com/2026/04/07/cross-model-review-loop-automation/" rel="alternate" type="text/html" title="Automating the Cross-Model Review Loop: Three Levels from SKILL.md to Multi-AI Pipeline" /><published>2026-04-07T00:00:00+01:00</published><updated>2026-04-07T00:00:00+01:00</updated><id>https://codex.danielvaughan.com/2026/04/07/cross-model-review-loop-automation</id><content type="html" xml:base="https://codex.danielvaughan.com/2026/04/07/cross-model-review-loop-automation/"><![CDATA[<h1 id="automating-the-cross-model-review-loop-three-levels-from-skillmd-to-multi-ai-pipeline">Automating the Cross-Model Review Loop: Three Levels from SKILL.md to Multi-AI Pipeline</h1>

<hr />

<p>The cross-model review pattern — where one AI writes code and a structurally different AI reviews it — has become a core quality practice in agentic development. Claude Code and Codex CLI have different training distributions and different blind spots, making their disagreements genuinely informative<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. As of late March 2026, the ecosystem offers three distinct automation tiers, each trading setup complexity for hands-off operation. This article walks through all three, with concrete configuration and the security caveats you need to understand before deploying them.</p>

<h2 id="why-cross-model-review-works">Why Cross-Model Review Works</h2>

<p>Single-model review suffers from sycophancy bias: the same system that wrote the code tends to approve it<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. Cross-provider review sidesteps this because Claude and GPT-5.x have fundamentally different failure modes. When both models flag the same issue, confidence is high. When only one flags it, that disagreement is the signal worth investigating — the “two doctors, same patient” heuristic<sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>
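
<p>The heuristic can be stated as a tiny triage function: findings both reviewers agree on are high-confidence, and findings only one flags are the ones worth a human look. A minimal sketch, assuming findings have already been normalised to comparable strings (real deduplication is fuzzier):</p>

```python
def triage(claude_findings, codex_findings):
    """Partition findings from two independent reviewers.
    Agreement implies high confidence; disagreement is the signal
    worth human investigation."""
    return {
        "high_confidence": claude_findings & codex_findings,
        "investigate": claude_findings ^ codex_findings,  # symmetric difference
    }
```
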

<p>The standard execution path uses <code class="language-plaintext highlighter-rouge">codex exec</code> in non-interactive mode with a read-only sandbox, ensuring the reviewer cannot modify the codebase it is assessing<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>codex <span class="nb">exec</span> <span class="nt">-m</span> gpt-5.3-codex <span class="nt">-s</span> read-only <span class="s2">"Review the following diff for bugs, security issues, and style violations: </span><span class="si">$(</span>git diff HEAD~1<span class="si">)</span><span class="s2">"</span>
</code></pre></div></div>

<h2 id="level-1-skillmd--manual-trigger-minimal-setup">Level 1: SKILL.md — Manual Trigger, Minimal Setup</h2>

<p>A SKILL.md file is a single Markdown document placed under <code class="language-plaintext highlighter-rouge">.claude/skills/</code> that any LLM agent can parse<sup id="fnref:1:2" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. This is the lowest-friction entry point: no plugins, no hooks, no external dependencies beyond a working <code class="language-plaintext highlighter-rouge">codex</code> binary.</p>

<h3 id="directory-structure">Directory Structure</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.claude/
  skills/
    codex-review/
      SKILL.md
</code></pre></div></div>

<h3 id="the-review-loop">The Review Loop</h3>

<p>The SKILL.md defines a <code class="language-plaintext highlighter-rouge">/codex-review</code> slash command that executes a sequential fix loop:</p>

<pre><code class="language-mermaid">flowchart TD
    A["/codex-review invoked"] --&gt; B["Export current plan/diff"]
    B --&gt; C["codex exec read-only review"]
    C --&gt; D{"Verdict?"}
    D --&gt;|PASS| E["Review complete"]
    D --&gt;|CONCERNS| F["Claude addresses findings"]
    F --&gt; G{"Round &lt; 5?"}
    G --&gt;|Yes| C
    G --&gt;|No| H["Escalate to human"]
</code></pre>

<p>Each round uses a UUID-bound session ID for concurrency safety, and the review runs under <code class="language-plaintext highlighter-rouge">--sandbox read-only</code> to enforce immutability<sup id="fnref:1:3" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. The key <code class="language-plaintext highlighter-rouge">codex exec</code> invocations:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Initial review</span>
codex <span class="nb">exec</span> <span class="nt">-m</span> gpt-5.3-codex <span class="nt">-s</span> read-only <span class="se">\</span>
  <span class="s2">"Review this plan against the codebase. Respond PASS or CONCERNS with details."</span>

<span class="c"># Re-review after fixes (resume session for context continuity)</span>
codex <span class="nb">exec </span>resume &lt;session-id&gt; <span class="se">\</span>
  <span class="s2">"Re-review the updated plan. Previous concerns were: ..."</span>
</code></pre></div></div>

<h3 id="level-15-fresh-session-audit">Level 1.5: Fresh-Session Audit</h3>

<p>A refinement worth adopting early: after the fix loop converges, spawn a fresh Codex session for a final audit<sup id="fnref:1:4" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. This eliminates context bias from the iterative conversation and catches systemic issues the loop might have normalised. The audit uses a distinct verdict format — <code class="language-plaintext highlighter-rouge">AUDIT: PASS</code> or <code class="language-plaintext highlighter-rouge">AUDIT: CONCERNS</code> — to differentiate it from loop rounds.</p>
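<p>In shell terms, the distinct marker makes the audit outcome easy to branch on. The helper below is an illustrative sketch, not part of any plugin; only the <code class="language-plaintext highlighter-rouge">AUDIT:</code> verdict format comes from the convention above:</p>

```shell
# Extract the final AUDIT verdict from a reviewer transcript so the
# orchestrating script can branch on it. How the transcript is produced
# is assumed; only the "AUDIT: PASS" / "AUDIT: CONCERNS" format is fixed.
audit_verdict() {
  printf '%s\n' "$1" | grep -o 'AUDIT: [A-Z]*' | tail -n 1
}

transcript="Reviewed 14 files. No systemic issues found. AUDIT: PASS"
if [ "$(audit_verdict "$transcript")" = "AUDIT: PASS" ]; then
  echo "audit clean"
else
  echo "escalate to human"
fi
```

<p>Matching on the <code class="language-plaintext highlighter-rouge">AUDIT:</code> prefix ensures loop-round verdicts (<code class="language-plaintext highlighter-rouge">PASS</code>/<code class="language-plaintext highlighter-rouge">CONCERNS</code> without the prefix) are never mistaken for the final audit result.</p>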

<p><strong>When to use Level 1:</strong> Solo developers or small teams wanting to validate the cross-model approach before investing in automation infrastructure. Setup time is under five minutes.</p>

<h2 id="level-2-stop-hook-plugins--automatic-trigger">Level 2: Stop Hook Plugins — Automatic Trigger</h2>

<p>Level 2 eliminates the manual <code class="language-plaintext highlighter-rouge">/codex-review</code> invocation by hooking into Codex CLI’s lifecycle system. When Claude Code attempts to complete a turn, a Stop hook intercepts the exit and triggers a Codex review automatically<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>

<h3 id="how-codex-hooks-work">How Codex Hooks Work</h3>

<p>Hooks are defined in <code class="language-plaintext highlighter-rouge">hooks.json</code> at user level (<code class="language-plaintext highlighter-rouge">~/.codex/hooks.json</code>) or repository level (<code class="language-plaintext highlighter-rouge">&lt;repo&gt;/.codex/hooks.json</code>)<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. The Stop hook fires at conversation turn completion:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"Stop"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="p">{</span><span class="w">
        </span><span class="nl">"hooks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
          </span><span class="p">{</span><span class="w">
            </span><span class="nl">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"command"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"command"</span><span class="p">:</span><span class="w"> </span><span class="s2">".claude-plugin/hooks/stop-hook.sh"</span><span class="p">,</span><span class="w">
            </span><span class="nl">"statusMessage"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Running cross-model review..."</span><span class="p">,</span><span class="w">
            </span><span class="nl">"timeout"</span><span class="p">:</span><span class="w"> </span><span class="mi">900</span><span class="w">
          </span><span class="p">}</span><span class="w">
        </span><span class="p">]</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">]</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The hook communicates its decision via exit codes<sup id="fnref:5:1" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>:</p>

<ul>
  <li><strong>Exit 0</strong> with JSON <code class="language-plaintext highlighter-rouge">{"decision": "block", "reason": "..."}</code> — blocks the stop, feeds the reason back as a continuation prompt</li>
  <li><strong>Exit 0</strong> without blocking JSON — permits the stop</li>
  <li><strong>Exit 2</strong> — blocks; reads reason from stderr</li>
</ul>
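<p>A sketch of how a hook script might implement that contract. This is illustrative shell, not code from either plugin; the JSON shape is taken directly from the list above:</p>

```shell
# Map a review verdict onto the Stop-hook protocol: emit the blocking
# JSON payload for CONCERNS, or emit nothing (permitting the stop) for
# PASS. Exit status 0 is used in both cases, per the first two bullets.
decide_stop() {
  verdict="$1"
  reason="$2"
  if [ "$verdict" = "CONCERNS" ]; then
    printf '{"decision": "block", "reason": "%s"}' "$reason"
  fi
  return 0
}

decide_stop CONCERNS "unvalidated input in request handler"
# prints: {"decision": "block", "reason": "unvalidated input in request handler"}
```

<p>A production hook would also need to JSON-escape the reason string; a bare <code class="language-plaintext highlighter-rouge">printf</code> does not.</p>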

<h3 id="option-a-codex-plugin-cc-official">Option A: codex-plugin-cc (Official)</h3>

<p>OpenAI released <code class="language-plaintext highlighter-rouge">codex-plugin-cc</code> on 30 March 2026<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>, providing a single-command review gate:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install</span>
/plugin marketplace add openai/codex-plugin-cc
/plugin <span class="nb">install </span>codex@openai-codex
/codex:setup

<span class="c"># Enable automatic review gate</span>
/codex:setup <span class="nt">--enable-review-gate</span>
</code></pre></div></div>

<p>When enabled, every Claude Code turn completion triggers a targeted Codex review. If issues are found, the stop is blocked and Claude addresses the findings before the turn can end<sup id="fnref:6:1" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. The plugin also exposes manual commands:</p>

<table>
  <thead>
    <tr>
      <th>Command</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/codex:review --base main</code></td>
      <td>Diff review against a branch</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/codex:adversarial-review</code></td>
      <td>Devil’s advocate design challenge</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/codex:rescue --background</code></td>
      <td>Delegate a task to Codex asynchronously</td>
    </tr>
  </tbody>
</table>

<p>⚠️ <strong>Cost warning:</strong> The review gate can create long-running loops that rapidly consume usage limits. OpenAI’s own documentation recommends enabling it only under human supervision<sup id="fnref:6:2" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>

<h3 id="option-b-claude-review-loop-community">Option B: claude-review-loop (Community)</h3>

<p>The <code class="language-plaintext highlighter-rouge">claude-review-loop</code> plugin by Hamel Husain takes a more opinionated approach, spawning up to four parallel Codex sub-agents based on project type<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Sub-Agent</th>
      <th>Trigger</th>
      <th>Focus</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Diff Review</td>
      <td>Always</td>
      <td>Code quality, tests, OWASP Top 10</td>
    </tr>
    <tr>
      <td>Holistic Review</td>
      <td>Always</td>
      <td>Architecture, documentation</td>
    </tr>
    <tr>
      <td>Next.js Review</td>
      <td><code class="language-plaintext highlighter-rouge">next.config.*</code> present</td>
      <td>App Router, Server Components, caching</td>
    </tr>
    <tr>
      <td>UX Review</td>
      <td>Frontend code detected</td>
      <td>Browser E2E via agent-browser, accessibility</td>
    </tr>
  </tbody>
</table>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Install</span>
/plugin marketplace add hamelsmu/claude-review-loop
/plugin <span class="nb">install </span>review-loop@hamel-review
</code></pre></div></div>

<p>Codex deduplicates findings across agents and writes consolidated output to <code class="language-plaintext highlighter-rouge">reviews/review-&lt;id&gt;.md</code><sup id="fnref:7:1" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>. State is tracked in <code class="language-plaintext highlighter-rouge">.claude/review-loop.local.md</code> (gitignored).</p>

<h3 id="security-the-bypass-sandbox-default">Security: The bypass-sandbox Default</h3>

<p>Both community plugins default to <code class="language-plaintext highlighter-rouge">--dangerously-bypass-approvals-and-sandbox</code> for Codex execution<sup id="fnref:7:2" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>. This spares the review agents approval prompts and file-system restrictions, but it also means Codex runs without any sandbox constraints. Override this with:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">REVIEW_LOOP_CODEX_FLAGS</span><span class="o">=</span><span class="s2">"--sandbox read-only"</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">codex-plugin-cc</code>, the official plugin uses the Codex app server, which applies its own sandbox policy, making this less of a concern<sup id="fnref:6:3" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</p>

<h3 id="preventing-infinite-loops">Preventing Infinite Loops</h3>

<p>A critical implementation detail: your stop hook must check a <code class="language-plaintext highlighter-rouge">stop_hook_active</code> flag before spawning another review<sup id="fnref:1:5" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Without this guard, the review’s own completion triggers another stop hook, creating an infinite loop:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nv">STATE_FILE</span><span class="o">=</span><span class="s2">".claude/review-loop.local.md"</span>
<span class="k">if </span><span class="nb">grep</span> <span class="nt">-q</span> <span class="s2">"stop_hook_active: true"</span> <span class="s2">"</span><span class="nv">$STATE_FILE</span><span class="s2">"</span> 2&gt;/dev/null<span class="p">;</span> <span class="k">then
  </span><span class="nb">exit </span>0  <span class="c"># Permit stop — we're already in a review cycle</span>
<span class="k">fi</span>
</code></pre></div></div>
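<p>The counterpart, again a sketch rather than plugin code, is to set the flag before spawning the reviewer and clear it once the cycle finishes, so the guard above has something to read:</p>

```shell
# Write or clear the re-entrancy flag that the stop-hook guard checks.
# File path and key name follow the claude-review-loop state file; the
# truncating ">" redirect is sketch-only (real state holds more fields).
STATE_FILE=".claude/review-loop.local.md"

set_review_flag() {
  mkdir -p "$(dirname "$STATE_FILE")"
  printf 'stop_hook_active: %s\n' "$1" > "$STATE_FILE"
}

set_review_flag true       # entering the review cycle
# ... spawn the Codex review here ...
set_review_flag false      # cycle done; the next stop may trigger review
```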

<h2 id="level-3-multi-ai-pipeline-governance">Level 3: Multi-AI Pipeline Governance</h2>

<p>Level 3 moves beyond a single reviewer to orchestrated multi-model pipelines where different AI systems handle distinct quality dimensions.</p>

<h3 id="claude-codex-sequential-review-chain">claude-codex: Sequential Review Chain</h3>

<p>The <code class="language-plaintext highlighter-rouge">claude-codex</code> plugin (Z-M-Huang) implements a three-reviewer pipeline<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>:</p>

<pre><code class="language-mermaid">flowchart LR
    A["Implementation\n(Claude Sonnet)"] --&gt; B["Review 1\n(Claude Sonnet)"]
    B --&gt; C["Review 2\n(Claude Opus)"]
    C --&gt; D["Final Gate\n(Codex CLI)"]
    D --&gt;|Pass| E["Approved"]
    D --&gt;|Fail| F["Fix + Re-review"]
</code></pre>

<p>Each reviewer independently validates against OWASP Top 10 vulnerabilities<sup id="fnref:8:1" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>. The pipeline enforces sequential dependencies via <code class="language-plaintext highlighter-rouge">blockedBy</code> constraints — Review 2 cannot start until Review 1 approves. If any reviewer requests changes, a fix task and re-review are automatically created.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Feature development with full pipeline</span>
/claude-codex:multi-ai Add rate limiting to the authentication endpoint

<span class="c"># Bug fix with dual root-cause analysis</span>
/claude-codex:bug-fix Session tokens not invalidated on password change
</code></pre></div></div>

<p>Configuration controls iteration limits<sup id="fnref:8:2" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>:</p>

<ul>
  <li>Plan review loop: 10 iterations maximum</li>
  <li>Code review loop: 15 iterations maximum</li>
  <li>Auto-resolve attempts: 3 retries before pausing for human input</li>
</ul>

<p>⚠️ Note: This repository was archived on 22 February 2026; development continues at <code class="language-plaintext highlighter-rouge">Z-M-Huang/vcp/plugins/dev-buddy</code><sup id="fnref:8:3" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>.</p>

<h3 id="github-agent-hq-platform-level-integration">GitHub Agent HQ: Platform-Level Integration</h3>

<p>GitHub’s Agent HQ, in public preview since February 2026, achieves platform-level cross-model integration<sup id="fnref:1:6" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. From a single issue, you can launch Copilot, Claude Code, and Codex agents simultaneously, comparing their outputs. This requires Copilot Pro+ or Enterprise licensing.</p>

<h3 id="mapping-to-agentic-pod-roles">Mapping to Agentic Pod Roles</h3>

<p>The three levels map naturally to agentic pod structures<sup id="fnref:1:7" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Level</th>
      <th>Pod Role Equivalent</th>
      <th>Team Size</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Level 1 (SKILL.md)</td>
      <td>Solo developer self-review</td>
      <td>1–2</td>
    </tr>
    <tr>
      <td>Level 2 (Stop Hook)</td>
      <td>Quality Engineer in the loop</td>
      <td>3–8</td>
    </tr>
    <tr>
      <td>Level 3 (Pipeline)</td>
      <td>Full pod with dedicated QA</td>
      <td>8+</td>
    </tr>
  </tbody>
</table>

<h2 id="choosing-your-level">Choosing Your Level</h2>

<pre><code class="language-mermaid">flowchart TD
    A["Starting cross-model review?"] --&gt; B{"Team size?"}
    B --&gt;|"Solo / pair"| C["Level 1: SKILL.md\n5 min setup"]
    B --&gt;|"Small team"| D{"Want automatic triggers?"}
    D --&gt;|Yes| E["Level 2: Stop Hook\ncodex-plugin-cc or\nclaude-review-loop"]
    D --&gt;|No| C
    B --&gt;|"Large team / enterprise"| F["Level 3: Pipeline\nclaude-codex or\nGitHub Agent HQ"]
    E --&gt; G{"Need multi-reviewer?"}
    G --&gt;|Yes| F
    G --&gt;|No| E
</code></pre>

<p>Start with Level 1 to validate that cross-model review catches real issues in your codebase. Promote to Level 2 when you find yourself routinely forgetting to invoke the review. Graduate to Level 3 when your team needs formalised quality gates with audit trails.</p>

<h2 id="practical-recommendations">Practical Recommendations</h2>

<ol>
  <li><strong>Always enforce read-only sandbox</strong> for review agents. A reviewer that can modify code is a reviewer that can mask its own findings.</li>
  <li><strong>Set explicit timeouts.</strong> The default 900-second timeout for stop hooks is generous; most reviews complete in under 60 seconds. Reduce to 120 seconds to fail fast on stuck sessions.</li>
  <li><strong>Monitor token consumption.</strong> Level 2 and 3 multiply your API usage significantly. Use <code class="language-plaintext highlighter-rouge">--model gpt-5.4-mini</code> for routine reviews and reserve full models for adversarial passes<sup id="fnref:6:4" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>.</li>
  <li><strong>Git-ignore review state files.</strong> Both <code class="language-plaintext highlighter-rouge">.claude/review-loop.local.md</code> and <code class="language-plaintext highlighter-rouge">.task/</code> directories contain transient state that should not enter version control.</li>
  <li><strong>Pin your reviewer model.</strong> Use explicit model identifiers in configuration rather than aliases to avoid unexpected behaviour when model defaults change.</li>
</ol>
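<p>Recommendation 2 maps directly onto the <code class="language-plaintext highlighter-rouge">hooks.json</code> schema shown in Level 2; only the <code class="language-plaintext highlighter-rouge">timeout</code> value changes:</p>

```json
{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": ".claude-plugin/hooks/stop-hook.sh",
            "statusMessage": "Running cross-model review...",
            "timeout": 120
          }
        ]
      }
    ]
  }
}
```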

<h2 id="citations">Citations</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>SmartScope, “Automating the Claude Code × Codex Review Loop — Three Levels,” March 2026. <a href="https://smartscope.blog/en/blog/claude-code-codex-review-loop-automation-2026/">https://smartscope.blog/en/blog/claude-code-codex-review-loop-automation-2026/</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:1:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:1:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a> <a href="#fnref:1:5" class="reversefootnote" role="doc-backlink">&#8617;<sup>6</sup></a> <a href="#fnref:1:6" class="reversefootnote" role="doc-backlink">&#8617;<sup>7</sup></a> <a href="#fnref:1:7" class="reversefootnote" role="doc-backlink">&#8617;<sup>8</sup></a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>MindStudio, “What Is the OpenAI Codex Plugin for Claude Code? How Cross-Provider AI Review Works,” 2026. <a href="https://www.mindstudio.ai/blog/openai-codex-plugin-claude-code-cross-provider-review">https://www.mindstudio.ai/blog/openai-codex-plugin-claude-code-cross-provider-review</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>OpenAI, “Agent approvals &amp; security – Codex,” 2026. <a href="https://developers.openai.com/codex/agent-approvals-security">https://developers.openai.com/codex/agent-approvals-security</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>OpenAI, “Introducing Codex Plugin for Claude Code,” OpenAI Developer Community, March 2026. <a href="https://community.openai.com/t/introducing-codex-plugin-for-claude-code/1378186">https://community.openai.com/t/introducing-codex-plugin-for-claude-code/1378186</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>OpenAI, “Hooks – Codex,” OpenAI Developers, 2026. <a href="https://developers.openai.com/codex/hooks">https://developers.openai.com/codex/hooks</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:5:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p>OpenAI, “codex-plugin-cc,” GitHub, March 2026. <a href="https://github.com/openai/codex-plugin-cc">https://github.com/openai/codex-plugin-cc</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:6:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:6:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a> <a href="#fnref:6:4" class="reversefootnote" role="doc-backlink">&#8617;<sup>5</sup></a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p>Hamel Husain, “claude-review-loop,” GitHub, 2026. <a href="https://github.com/hamelsmu/claude-review-loop">https://github.com/hamelsmu/claude-review-loop</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:7:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:7:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p>Z-M-Huang, “claude-codex: Multi-AI orchestration plugin,” GitHub, 2026. <a href="https://github.com/Z-M-Huang/claude-codex">https://github.com/Z-M-Huang/claude-codex</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:8:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:8:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a> <a href="#fnref:8:3" class="reversefootnote" role="doc-backlink">&#8617;<sup>4</sup></a></p>
    </li>
  </ol>
</div>]]></content><author><name>Daniel Vaughan</name></author><summary type="html"><![CDATA[Automating the Cross-Model Review Loop: Three Levels from SKILL.md to Multi-AI Pipeline]]></summary></entry><entry><title type="html">Learning Plan for Becoming a Codex CLI Expert</title><link href="https://codex.danielvaughan.com/2026/04/07/learning-plan-becoming-codex-cli-expert/" rel="alternate" type="text/html" title="Learning Plan for Becoming a Codex CLI Expert" /><published>2026-04-07T00:00:00+01:00</published><updated>2026-04-07T00:00:00+01:00</updated><id>https://codex.danielvaughan.com/2026/04/07/learning-plan-becoming-codex-cli-expert</id><content type="html" xml:base="https://codex.danielvaughan.com/2026/04/07/learning-plan-becoming-codex-cli-expert/"><![CDATA[<h1 id="learning-plan-for-becoming-a-codex-cli-expert">Learning Plan for Becoming a Codex CLI Expert</h1>

<hr />

<p>Codex CLI has grown from a prototype terminal assistant into a full agentic coding platform — sub-agents, skills, MCP integrations, worktrees, cloud tasks, and an enterprise governance model<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. The surface area is large enough that a structured learning plan pays for itself quickly. This guide maps a four-phase path from first install to production-grade orchestration, with concrete exercises and milestones at each level.</p>

<h2 id="phase-1--foundations-week-12">Phase 1 — Foundations (Week 1–2)</h2>

<p>The goal is a working installation, confident navigation of the TUI, and an intuitive feel for the approval model.</p>

<h3 id="11-installation-and-authentication">1.1 Installation and Authentication</h3>

<p>Install via npm, or use the native Windows installer, which reached full feature parity in March 2026<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npm <span class="nb">install</span> <span class="nt">-g</span> @openai/codex
codex login          <span class="c"># OAuth or API key</span>
codex <span class="nt">--version</span>      <span class="c"># confirm 0.118.x or later</span>
</code></pre></div></div>

<p>Verify your default model. As of April 2026 the recommended default is <code class="language-plaintext highlighter-rouge">gpt-5.4</code>, which combines the coding strength of <code class="language-plaintext highlighter-rouge">gpt-5.3-codex</code> with stronger reasoning and native computer use<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>

<h3 id="12-the-approval-model">1.2 The Approval Model</h3>

<p>Codex CLI’s security posture rests on three approval modes<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>:</p>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>File edits</th>
      <th>Shell commands</th>
      <th>Network</th>
      <th>Best for</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">suggest</code> (default)</td>
      <td>Approval required</td>
      <td>Approval required</td>
      <td>Blocked</td>
      <td>Learning, auditing</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">auto-edit</code></td>
      <td>Auto-applied</td>
      <td>Approval required</td>
      <td>Blocked</td>
      <td>Day-to-day development</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">full-auto</code></td>
      <td>Auto-applied</td>
      <td>Auto-executed</td>
      <td>Available</td>
      <td>CI/CD, automation</td>
    </tr>
  </tbody>
</table>

<p>Switch at launch or mid-session:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>codex <span class="nt">--approval-mode</span> auto-edit
<span class="c"># or inside the TUI:</span>
/permissions
</code></pre></div></div>

<p>The sandbox layer underneath (<code class="language-plaintext highlighter-rouge">read-only</code>, <code class="language-plaintext highlighter-rouge">workspace-write</code>, <code class="language-plaintext highlighter-rouge">danger-full-access</code>) is orthogonal to approval mode<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>. Understanding both dimensions is the first genuine milestone.</p>

<h3 id="13-first-exercises">1.3 First Exercises</h3>

<ol>
  <li><strong>Explain a file</strong> — open a repository you know well, run <code class="language-plaintext highlighter-rouge">codex</code> in <code class="language-plaintext highlighter-rouge">suggest</code> mode, and ask it to explain a complex module. Observe how it reads files.</li>
  <li><strong>Fix a bug</strong> — switch to <code class="language-plaintext highlighter-rouge">auto-edit</code>, paste a stack trace, and let Codex propose a patch. Review the diff before accepting.</li>
  <li><strong>Run tests</strong> — use <code class="language-plaintext highlighter-rouge">/permissions</code> to switch to <code class="language-plaintext highlighter-rouge">full-auto</code> inside the session and ask Codex to run the test suite and fix any failures.</li>
</ol>

<pre><code class="language-mermaid">flowchart LR
    A[suggest] --&gt;|"/permissions"| B[auto-edit]
    B --&gt;|"/permissions"| C[full-auto]
    C --&gt;|"/permissions"| A
    style A fill:#e8f5e9
    style B fill:#fff3e0
    style C fill:#ffebee
</code></pre>

<p><strong>Milestone:</strong> You can install Codex, authenticate, switch between approval modes, and explain the sandbox/approval matrix to a colleague.</p>

<hr />

<h2 id="phase-2--configuration-and-context-week-34">Phase 2 — Configuration and Context (Week 3–4)</h2>

<p>The goal is to make Codex consistently useful by giving it durable project knowledge and personalised defaults.</p>

<h3 id="21-configtoml">2.1 config.toml</h3>

<p>Codex reads <code class="language-plaintext highlighter-rouge">~/.codex/config.toml</code> for persistent settings<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup>. A sensible starter:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">model</span> <span class="p">=</span> <span class="s">"gpt-5.4"</span>
<span class="py">approval_mode</span> <span class="p">=</span> <span class="s">"auto-edit"</span>

<span class="nn">[history]</span>
<span class="py">persistence</span> <span class="p">=</span> <span class="s">"across-sessions"</span>

<span class="py">project_doc_max_bytes</span> <span class="p">=</span> <span class="mi">65536</span>
<span class="py">project_doc_fallback_filenames</span> <span class="p">=</span> <span class="p">[</span><span class="s">"TEAM_GUIDE.md"</span><span class="p">,</span> <span class="s">".agents.md"</span><span class="p">]</span>
</code></pre></div></div>

<p>Profiles let you maintain separate configurations per client or project:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>codex <span class="nt">--profile</span> enterprise-client
</code></pre></div></div>
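<p>A profile is a named section in <code class="language-plaintext highlighter-rouge">config.toml</code> whose values override the top-level defaults when selected. The section layout below is a reasonable sketch rather than documented schema; check the configuration reference before relying on it:</p>

```toml
# Hypothetical profile for client work: stricter review posture
[profiles.enterprise-client]
model = "gpt-5.4"
approval_mode = "suggest"
```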

<h3 id="22-agentsmd--your-constitution">2.2 AGENTS.md — Your Constitution</h3>

<p>AGENTS.md is Codex’s instruction discovery system<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>. It follows a three-tier hierarchy:</p>

<ol>
  <li><strong>Global</strong> — <code class="language-plaintext highlighter-rouge">~/.codex/AGENTS.md</code> (or <code class="language-plaintext highlighter-rouge">AGENTS.override.md</code> for highest priority)</li>
  <li><strong>Repository root</strong> — checked into version control with the team</li>
  <li><strong>Subdirectory</strong> — progressively more specific guidance, concatenated from root downward</li>
</ol>

<p>Files are merged until <code class="language-plaintext highlighter-rouge">project_doc_max_bytes</code> (32 KiB by default) is reached<sup id="fnref:7:1" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup>. A minimal project-level example:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh"># AGENTS.md</span>

<span class="gu">## Language &amp; Style</span>
<span class="p">-</span> TypeScript with strict mode; no <span class="sb">`any`</span> types
<span class="p">-</span> Prefer <span class="sb">`pnpm`</span> over <span class="sb">`npm`</span>
<span class="p">-</span> British English in comments and documentation

<span class="gu">## Testing</span>
<span class="p">-</span> Every public function needs a unit test
<span class="p">-</span> Use Vitest, not Jest

<span class="gu">## Restrictions</span>
<span class="p">-</span> Never modify <span class="sb">`package-lock.json`</span> directly
<span class="p">-</span> Do not install new dependencies without asking
</code></pre></div></div>

<p>Verify what loaded:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>codex <span class="nt">--ask-for-approval</span> never <span class="s2">"Summarise the current instructions."</span>
</code></pre></div></div>

<h3 id="23-exercise-build-your-agentsmd-stack">2.3 Exercise: Build Your AGENTS.md Stack</h3>

<ol>
  <li>Create a global <code class="language-plaintext highlighter-rouge">~/.codex/AGENTS.md</code> with your personal coding preferences.</li>
  <li>Add a repository-level <code class="language-plaintext highlighter-rouge">AGENTS.md</code> with project conventions.</li>
  <li>Add a subdirectory <code class="language-plaintext highlighter-rouge">AGENTS.override.md</code> in a module that has stricter rules (e.g. no external network calls in a security module).</li>
  <li>Run the verification command and confirm all three layers appear.</li>
</ol>

<p><strong>Milestone:</strong> You have a <code class="language-plaintext highlighter-rouge">config.toml</code> with sensible defaults, a layered AGENTS.md stack, and can explain the merge order.</p>

<hr />

<h2 id="phase-3--intermediate-patterns-week-58">Phase 3 — Intermediate Patterns (Week 5–8)</h2>

<h3 id="31-mcp-integration">3.1 MCP Integration</h3>

<p>Model Context Protocol connects Codex to external tools and data sources<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>. Two transport types are supported:</p>

<p><strong>STDIO</strong> — local processes, configured via CLI or config.toml:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>codex mcp add context7 <span class="nt">--</span> npx <span class="nt">-y</span> @upstash/context7-mcp
</code></pre></div></div>

<p><strong>Streaming HTTP</strong> — remote servers with bearer token authentication:</p>

<div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">[mcp_servers.docs-server]</span>
<span class="py">url</span> <span class="p">=</span> <span class="s">"https://docs.internal.co/mcp"</span>
<span class="py">bearer_token_env_var</span> <span class="p">=</span> <span class="s">"DOCS_MCP_TOKEN"</span>
<span class="py">tool_timeout_sec</span> <span class="p">=</span> <span class="mi">30</span>
</code></pre></div></div>

<p>Use <code class="language-plaintext highlighter-rouge">/mcp</code> in the TUI to inspect active servers. Use <code class="language-plaintext highlighter-rouge">enabled_tools</code> and <code class="language-plaintext highlighter-rouge">disabled_tools</code> to control which tools from a server are exposed<sup id="fnref:8:1" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup>.</p>
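<p>Tool filtering can be sketched in <code class="language-plaintext highlighter-rouge">config.toml</code> alongside the server entry. The tool names below are hypothetical examples — check <code class="language-plaintext highlighter-rouge">/mcp</code> for the names your server actually exposes.</p>

```toml
# Sketch only: expose a subset of a server's tools.
# "search_docs" and "fetch_page" are hypothetical tool names.
[mcp_servers.docs-server]
url = "https://docs.internal.co/mcp"
bearer_token_env_var = "DOCS_MCP_TOKEN"
enabled_tools = ["search_docs", "fetch_page"]
```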

<p>For OAuth-enabled servers:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>codex mcp login docs-server
</code></pre></div></div>

<h3 id="32-skills">3.2 Skills</h3>

<p>A skill packages instructions, resources, and optional scripts so Codex can follow a workflow reliably<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>. The minimum structure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.agents/skills/lint-fix/
├── SKILL.md
└── agents/
    └── openai.yaml   # optional: UI metadata, tool deps
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">SKILL.md</code> front matter:</p>

<div class="language-markdown highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">lint-fix</span>
<span class="na">description</span><span class="pi">:</span> <span class="s">Fix all ESLint errors in staged files</span>
<span class="nn">---</span>

<span class="p">1.</span> Run <span class="sb">`npx eslint --fix $(git diff --cached --name-only)`</span>
<span class="p">2.</span> Stage the fixed files
<span class="p">3.</span> Report remaining unfixable errors
</code></pre></div></div>
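<p>The minimal layout can also be created by hand — a sketch reproducing the example above (<code class="language-plaintext highlighter-rouge">$skill-creator</code> is the supported scaffolding route; this just shows the file shape):</p>

```shell
# Create the minimal skill layout: a directory under .agents/skills/
# containing a SKILL.md with name/description front matter
mkdir -p .agents/skills/lint-fix/agents
cat > .agents/skills/lint-fix/SKILL.md <<'EOF'
---
name: lint-fix
description: Fix all ESLint errors in staged files
---

1. Run `npx eslint --fix $(git diff --cached --name-only)`
2. Stage the fixed files
3. Report remaining unfixable errors
EOF
```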

<p>Skills are discovered from four scopes: repository (<code class="language-plaintext highlighter-rouge">.agents/skills/</code>), user (<code class="language-plaintext highlighter-rouge">$HOME/.agents/skills</code>), admin (<code class="language-plaintext highlighter-rouge">/etc/codex/skills</code>), and built-in<sup id="fnref:9:1" role="doc-noteref"><a href="#fn:9" class="footnote" rel="footnote">9</a></sup>. Use <code class="language-plaintext highlighter-rouge">$skill-creator</code> to scaffold new skills interactively.</p>

<p>Invoke explicitly with <code class="language-plaintext highlighter-rouge">/skills</code> or <code class="language-plaintext highlighter-rouge">$skill-name</code>, or let Codex match implicitly based on task description.</p>

<h3 id="33-model-selection-strategy">3.3 Model Selection Strategy</h3>

<p>Not every task needs <code class="language-plaintext highlighter-rouge">gpt-5.4</code>. A practical model allocation<sup id="fnref:3:1" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>:</p>

<pre><code class="language-mermaid">flowchart TD
    T[Task arrives] --&gt; Q{Complexity?}
    Q --&gt;|High: architecture, refactoring| A["gpt-5.4"]
    Q --&gt;|Medium: feature implementation| B["gpt-5.3-codex"]
    Q --&gt;|Low: search, formatting, docs| C["gpt-5.4-mini"]
    A --&gt; R[Review output]
    B --&gt; R
    C --&gt; R
</code></pre>

<p>Switch mid-session with <code class="language-plaintext highlighter-rouge">/model</code> — no restart needed<sup id="fnref:3:2" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>

<h3 id="34-exercises">3.4 Exercises</h3>

<ol>
  <li><strong>MCP</strong> — connect a documentation MCP server and ask Codex to answer questions using it.</li>
  <li><strong>Skills</strong> — create a skill that runs your team’s code review checklist and packages results into a PR comment.</li>
  <li><strong>Model switching</strong> — use <code class="language-plaintext highlighter-rouge">gpt-5.4-mini</code> for a codebase search task, then switch to <code class="language-plaintext highlighter-rouge">gpt-5.4</code> for a refactoring task, and compare cost and quality.</li>
</ol>

<p><strong>Milestone:</strong> You have at least one MCP server connected, one custom skill, and a model selection heuristic you can articulate.</p>

<hr />

<h2 id="phase-4--advanced-orchestration-week-912">Phase 4 — Advanced Orchestration (Week 9–12)</h2>

<h3 id="41-sub-agents-and-worktrees">4.1 Sub-Agents and Worktrees</h3>

<p>Sub-agents let you parallelise larger tasks<sup id="fnref:10" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>. Since version 0.117.0, sub-agents use readable path-based addresses like <code class="language-plaintext highlighter-rouge">/root/agent_a</code> with structured messaging<sup id="fnref:11" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>.</p>

<p>Worktrees isolate each agent in its own Git branch, so multiple agents can modify the same repository without conflicts<sup id="fnref:10:1" role="doc-noteref"><a href="#fn:10" class="footnote" rel="footnote">10</a></sup>. The desktop app handles worktree lifecycle automatically; from the CLI you manage it via <code class="language-plaintext highlighter-rouge">/agent</code> commands.</p>

<p>A practical pattern: use <code class="language-plaintext highlighter-rouge">gpt-5.4</code> as a planning coordinator that delegates narrower subtasks (file review, test writing, documentation) to <code class="language-plaintext highlighter-rouge">gpt-5.4-mini</code> sub-agents running in parallel worktrees.</p>

<h3 id="42-cicd-integration">4.2 CI/CD Integration</h3>

<p><code class="language-plaintext highlighter-rouge">codex exec</code> is the non-interactive mode designed for pipelines<sup id="fnref:12" role="doc-noteref"><a href="#fn:12" class="footnote" rel="footnote">12</a></sup>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># In a GitHub Actions workflow</span>
codex <span class="nb">exec</span> <span class="nt">--full-auto</span> <span class="nt">--model</span> gpt-5.4-mini <span class="se">\</span>
  <span class="s2">"Review this PR diff and post a summary comment"</span> <span class="se">\</span>
  &lt; &lt;<span class="o">(</span>gh <span class="nb">pr </span>diff <span class="nv">$PR_NUMBER</span><span class="o">)</span>
</code></pre></div></div>

<p>As of 0.118.0, <code class="language-plaintext highlighter-rouge">codex exec</code> supports prompt-plus-stdin workflows, so you can pipe input and still pass a separate prompt<sup id="fnref:11:1" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>.</p>

<p>For scheduled maintenance:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># .github/workflows/codex-sweep.yml</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">Weekly dependency sweep</span>
<span class="na">on</span><span class="pi">:</span>
  <span class="na">schedule</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">cron</span><span class="pi">:</span> <span class="s1">'</span><span class="s">0</span><span class="nv"> </span><span class="s">9</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">*</span><span class="nv"> </span><span class="s">1'</span>
<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">sweep</span><span class="pi">:</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>
    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span>
      <span class="pi">-</span> <span class="na">run</span><span class="pi">:</span> <span class="s">npm i -g @openai/codex</span>
      <span class="pi">-</span> <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">codex exec --full-auto \</span>
            <span class="s">"Update outdated dependencies, run tests, \</span>
             <span class="s">and open a PR if everything passes"</span>
</code></pre></div></div>

<h3 id="43-enterprise-governance">4.3 Enterprise Governance</h3>

<p>For teams, governance comes through version-controlled configuration<sup id="fnref:13" role="doc-noteref"><a href="#fn:13" class="footnote" rel="footnote">13</a></sup>:</p>

<ul>
  <li><strong>AGENTS.md in source control</strong> — policy changes go through PR review, providing an audit trail</li>
  <li><strong>Profiles</strong> — <code class="language-plaintext highlighter-rouge">codex --profile production</code> loads a locked-down config with <code class="language-plaintext highlighter-rouge">suggest</code> mode and <code class="language-plaintext highlighter-rouge">read-only</code> sandbox</li>
  <li><strong>Plugins</strong> — since 0.117.0, plugins are first-class with product-scoped syncing at startup<sup id="fnref:11:2" role="doc-noteref"><a href="#fn:11" class="footnote" rel="footnote">11</a></sup>, enabling centralised distribution of approved skills and MCP servers</li>
</ul>
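<p>The two-profile pattern can be sketched in <code class="language-plaintext highlighter-rouge">config.toml</code>. Treat the exact key names and accepted values as assumptions to verify against the configuration reference for your Codex version:</p>

```toml
# Sketch of the two-profile pattern described above.
# Verify key names and accepted values against the config reference.
[profiles.dev]
model = "gpt-5.4"
approval_policy = "on-request"

[profiles.production]
model = "gpt-5.4-mini"
approval_policy = "suggest"
sandbox_mode = "read-only"
```

<p>Select one at launch with <code class="language-plaintext highlighter-rouge">codex --profile production</code>, keeping the locked-down settings out of individual developers' hands.</p>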

<pre><code class="language-mermaid">flowchart TD
    subgraph Governance
        A[AGENTS.md in repo] --&gt;|PR review| B[Approved policies]
        C[config.toml profiles] --&gt; D[Environment-specific settings]
        E[Plugin registry] --&gt;|Startup sync| F[Approved skills + MCP]
    end
    B --&gt; G[Developer workstation]
    D --&gt; G
    F --&gt; G
    G --&gt; H[Codex CLI session]
</code></pre>

<h3 id="44-exercises">4.4 Exercises</h3>

<ol>
  <li><strong>Sub-agents</strong> — set up a planning agent that delegates test writing to three sub-agents working in parallel worktrees.</li>
  <li><strong>CI/CD</strong> — add a GitHub Actions workflow that uses <code class="language-plaintext highlighter-rouge">codex exec</code> to auto-review PRs.</li>
  <li><strong>Enterprise config</strong> — create two profiles (<code class="language-plaintext highlighter-rouge">dev</code> and <code class="language-plaintext highlighter-rouge">production</code>) with different approval modes and model selections.</li>
</ol>

<p><strong>Milestone:</strong> You can orchestrate multi-agent workflows, integrate Codex into CI/CD pipelines, and explain your governance model.</p>

<hr />

<h2 id="mastery-checklist">Mastery Checklist</h2>

<p>Use this as a self-assessment. Tick each item when you can demonstrate it confidently:</p>

<table>
  <thead>
    <tr>
      <th>Level</th>
      <th>Skill</th>
      <th>✓</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Foundation</td>
      <td>Install, authenticate, explain approval × sandbox matrix</td>
      <td>☐</td>
    </tr>
    <tr>
      <td>Foundation</td>
      <td>Navigate the TUI, use <code class="language-plaintext highlighter-rouge">/permissions</code>, attach images</td>
      <td>☐</td>
    </tr>
    <tr>
      <td>Configuration</td>
      <td>Maintain a layered AGENTS.md stack</td>
      <td>☐</td>
    </tr>
    <tr>
      <td>Configuration</td>
      <td>Customise <code class="language-plaintext highlighter-rouge">config.toml</code> with profiles</td>
      <td>☐</td>
    </tr>
    <tr>
      <td>Intermediate</td>
      <td>Connect and manage MCP servers</td>
      <td>☐</td>
    </tr>
    <tr>
      <td>Intermediate</td>
      <td>Create and distribute custom skills</td>
      <td>☐</td>
    </tr>
    <tr>
      <td>Intermediate</td>
      <td>Select models by task complexity</td>
      <td>☐</td>
    </tr>
    <tr>
      <td>Advanced</td>
      <td>Orchestrate sub-agents in parallel worktrees</td>
      <td>☐</td>
    </tr>
    <tr>
      <td>Advanced</td>
      <td>Integrate <code class="language-plaintext highlighter-rouge">codex exec</code> into CI/CD pipelines</td>
      <td>☐</td>
    </tr>
    <tr>
      <td>Advanced</td>
      <td>Implement enterprise governance with profiles and plugins</td>
      <td>☐</td>
    </tr>
  </tbody>
</table>

<h2 id="recommended-reading-order">Recommended Reading Order</h2>

<p>If you are working through Daniel’s Codex CLI knowledge base, this learning plan maps to the following article sequence:</p>

<ol>
  <li>Installation and first steps → <em>Getting Started</em> articles</li>
  <li>AGENTS.md deep dive → <em>Codified Context: Three-Tier Knowledge Architecture</em></li>
  <li>MCP integration → <em>MCP configuration</em> articles</li>
  <li>Skills → <em>Agent Skills</em> articles</li>
  <li>Competitive context → <em>Codex CLI Competitive Position April 2026</em></li>
  <li>Advanced internals → <em>How the Codex CLI Agentic Loop Works</em></li>
</ol>

<hr />

<h2 id="citations">Citations</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/cli">Codex CLI official documentation — OpenAI Developers</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/changelog">Codex CLI changelog — Windows launch March 2026</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/cli/features">Codex CLI features — model selection and gpt-5.4</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:3:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/cli/reference">Codex CLI command reference — approval modes</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p><a href="https://inventivehq.com/knowledge-base/openai/how-to-configure-sandbox-modes">How to Configure Approval and Sandbox Modes — Inventive HQ</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/config-reference">Codex configuration reference</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/guides/agents-md">Custom instructions with AGENTS.md — OpenAI Developers</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:7:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/mcp">Model Context Protocol — Codex | OpenAI Developers</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:8:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:9" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/skills">Agent Skills — Codex | OpenAI Developers</a> <a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:9:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:10" role="doc-endnote">
      <p><a href="https://kingy.ai/ai/the-codex-app-super-guide-2026-from-hello-world-to-worktrees-skills-mcp-ci-and-enterprise-governance/">The Codex App Super Guide 2026 — Kingy AI</a> <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:10:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:11" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/changelog">Codex CLI changelog — v0.117.0 and v0.118.0</a> <a href="#fnref:11" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:11:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:11:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p>
    </li>
    <li id="fn:12" role="doc-endnote">
      <p><a href="https://developers.openai.com/codex/learn/best-practices">Best practices — Codex | OpenAI Developers</a> <a href="#fnref:12" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:13" role="doc-endnote">
      <p><a href="https://www.bighatgroup.com/blog/agentic-coding-harnesses-claude-code-codex-gemini-enterprise-guide/">Agentic Coding Harnesses: Enterprise Guide — Big Hat Group</a> <a href="#fnref:13" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Daniel Vaughan</name></author><summary type="html"><![CDATA[Learning Plan for Becoming a Codex CLI Expert]]></summary></entry></feed>