Codex CLI in Practice: Real-World Benchmarks and What They Mean

Benchmark numbers dominate marketing copy, but most developers lack the context to interpret them critically. A model claiming “80% on SWE-bench” means something very different from “56% on SWE-bench Pro”, and neither number directly answers “will this make me more productive?” This article unpacks the benchmark landscape as it stands in early 2026, explains the methodology behind each evaluation, and gives you a practical framework for translating scores into tooling decisions.

The Benchmark Landscape

Three evaluations dominate discourse around AI coding agents in 2026:

Benchmark	Tasks	Scope	Contamination resistance
SWE-bench Verified	500 Python tasks	Bug-fix patches	Low — OpenAI confirmed contamination¹
SWE-bench Pro	1,865 multi-language tasks	Multi-file changes	High — GPL + proprietary codebases²
Terminal-Bench 2.0	89 end-to-end tasks	Full terminal workflows	High — Docker environments, crowdsourced³

Each measures a different capability slice, and the choice of benchmark often reflects whose model performs best on it.

SWE-bench: The Origin and Its Variants

What SWE-bench Measures

SWE-bench, from Princeton, evaluates LLMs on real GitHub issues drawn from 12 popular Python repositories.⁴ The agent receives the issue description, the repository at that commit, and must generate a patch. Success is binary: apply the patch, run the existing test suite — pass or fail, no partial credit.⁵

This makes it meaningfully harder than code-generation benchmarks (HumanEval, MBPP): the model must navigate an unfamiliar codebase, understand cross-file interactions, and produce a patch that satisfies existing tests written by a different engineer.
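The all-or-nothing scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the official SWE-bench harness: the `TaskResult` type, its field names, and the sample task IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str        # hypothetical identifier, for illustration only
    tests_passed: int   # tests passing after the patch is applied
    tests_total: int    # tests in the task's suite


def is_resolved(result: TaskResult) -> bool:
    """A task counts only if every test passes: no partial credit."""
    return result.tests_total > 0 and result.tests_passed == result.tests_total


def resolved_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks fully resolved."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


results = [
    TaskResult("task-a", 10, 10),  # resolved
    TaskResult("task-b", 9, 10),   # one failing test: counts as a failure
    TaskResult("task-c", 4, 4),    # resolved
]
print(f"Resolved rate: {resolved_rate(results):.1%}")
```

A patch that passes nine of ten tests scores exactly the same as one that passes none, which is why headline percentages say nothing about near misses.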

SWE-bench Verified

In August 2024, OpenAI released SWE-bench Verified — 500 human-validated tasks from the original 2,294.⁶ It became the de-facto standard for comparing agents.

The problem: by 2026, OpenAI’s own audit confirmed every frontier model (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) can reproduce verbatim gold patches for a subset of tasks.¹ 59.4% of hard tasks have flawed tests. OpenAI has since stopped self-reporting Verified scores.

SWE-bench Verified still differentiates weaker models, but treat any score above ~60% from a frontier model with scepticism.

SWE-bench Pro

SWE-bench Pro addresses Verified’s limitations directly:²

1,865 tasks vs. 500
Multi-language (not Python-only)
Average patch: 107 lines across 4.1 files (vs. Verified’s median of 4 lines)
Sourced from GPL-licensed and proprietary codebases to create legal barriers to training-data inclusion
Includes a private held-out subset visible only on the Scale Labs leaderboard

The resulting scores are humbling. Claude Opus 4.5 scores 80.9% on Verified but 45.9% on Pro — same model, roughly half the score.² On the private subset, the gap widens further: Claude Opus 4.1 drops from 22.7% to 17.8%, GPT-5 from 23.1% to 14.9%.² These drops are the clearest signal that Verified scores partially reflect memorisation.
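To make the size of these drops concrete, the short sketch below computes both the absolute drop (in percentage points) and the relative drop (as a share of the starting score) for the figures quoted above. The helper name `score_drop` is an illustrative choice; the numbers are taken directly from the paragraph.

```python
def score_drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop as a fraction)."""
    absolute = before - after
    relative = absolute / before
    return absolute, relative


# Score pairs as quoted in the text (percentages).
pairs = {
    "Claude Opus 4.5, Verified to Pro": (80.9, 45.9),
    "Claude Opus 4.1, to private subset": (22.7, 17.8),
    "GPT-5, to private subset": (23.1, 14.9),
}

for label, (before, after) in pairs.items():
    absolute, relative = score_drop(before, after)
    print(f"{label}: -{absolute:.1f} points ({relative:.0%} relative)")
```

Keeping the two measures separate matters when comparing claims: a 35-point drop from 80.9% is a 43% relative drop, and the two figures are easy to conflate in marketing copy.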

For Codex CLI with GPT-5.3-Codex, OpenAI reports 56.8% on SWE-bench Pro (Public).⁷ That number uses OpenAI’s own agent scaffolding. On the Scale AI SEAL leaderboard with standardised scaffolding, GPT-5 (High) scores 41.8% and GPT-5.2 Codex 41.0%.² Custom harnesses add 4–12 points.

Terminal-Bench 2.0: The CLI-Native Benchmark

Structure

Terminal-Bench 2.0, from the Laude Institute, consists of 89 curated tasks that run inside Docker containers.³ Each task has:

A natural-language instruction
A sandboxed environment
A test script that verifies the outcome automatically

Task categories span software engineering, machine learning model training, system administration, security, and data science.³ Tasks were crowdsourced from 93 contributors, with 229 submissions narrowed to 89 based on quality and difficulty.³

This is the benchmark where Codex CLI’s terminal-native strengths are most visible.

Running Terminal-Bench Yourself

The benchmark runs via the Harbor framework:³

# Install Harbor
pip install harbor-framework

# Run Terminal-Bench 2.0 against Codex CLI
harbor run -d terminal-bench@2.0 --agent codex-cli --model gpt-5.3-codex

Costs range from roughly $1 to $100 per full run depending on model pricing.³

Current Scores (March 2026)

Agent + Model	Score
ForgeCode + Claude Opus 4.6	81.8%
ForgeCode + GPT-5.4	81.8%
TongAgents + Gemini 3.1 Pro	80.2%
ForgeCode + Gemini 3.1 Pro	78.4%
SageAgent + GPT-5.3-Codex	78.4%
Codex CLI + GPT-5.2	63%

The frontier is clustered between 78% and 82%; no model dominates.⁸
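One way to see why a few points of difference say little on an 89-task benchmark is to estimate the sampling uncertainty of a single score. The sketch below uses the normal approximation to a binomial proportion and assumes each score comes from one independent pass over the 89 tasks. That is an assumption of this illustration: leaderboard entries that average several runs would have narrower intervals.

```python
import math

N_TASKS = 89  # Terminal-Bench 2.0 task count


def normal_ci_half_width(score: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a pass rate.

    Uses the normal approximation to the binomial, treating tasks as
    independent pass/fail trials (an assumption of this sketch).
    """
    return z * math.sqrt(score * (1 - score) / n_tasks)


# How many percentage points one task is worth.
points_per_task = 100 / N_TASKS

for score in (0.818, 0.784):
    half_width = normal_ci_half_width(score, N_TASKS)
    print(f"{score:.1%} +/- {half_width * 100:.1f} points")

print(f"One task is worth about {points_per_task:.2f} points")
```

Under these assumptions each score carries an uncertainty of roughly eight points, and the 3.4-point gap between 81.8% and 78.4% corresponds to about three tasks.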

A notable finding: Codex CLI’s resolution rate increases by 52% when using GPT-5.2 instead of GPT-5-Nano.⁹ Model capability matters more than scaffold choice when optimising for Terminal-Bench performance.

The Scaffolding Effect

Agent scaffolding — the orchestration layer, memory management, tool-use protocol, retry logic — can matter as much as the underlying model. The Confucius Code Agent (CCA), using Claude 4 Sonnet, achieves 74.6% on SWE-bench, outperforming a mini-SWE-Agent variant using the more capable Claude 4.5 Sonnet.¹⁰ Better orchestration closed — and surpassed — a one-generation model gap.

This is relevant for Codex CLI users: the CLI’s native subagent support, AGENTS.md context loading, and hook system compose the scaffolding layer. A well-written AGENTS.md that accurately scopes the codebase and a SessionStart hook that pre-loads relevant context are therefore likely to improve outcomes on tasks equivalent to benchmark scenarios, in the same way that better orchestration lifted CCA’s score.

flowchart TD
    A[User Prompt] --> B[Codex CLI Scaffold]
    B --> C{Context Loading}
    C -->|AGENTS.md| D[Codebase Scope]
    C -->|SessionStart Hook| E[Pre-loaded Context]
    C -->|MCP Tools| F[External Resources]
    D & E & F --> G[Model — gpt-5.3-codex / gpt-5.4]
    G --> H{Reasoning + Planning}
    H --> I[Subagents spawned]
    H --> J[Direct patch generation]
    I & J --> K[Test Verification]
    K -->|Pass| L[✅ Resolved]
    K -->|Fail| H

Goodhart’s Law and the Contamination Problem

When a measure becomes a target, it ceases to be a good measure. — Goodhart’s Law

This is playing out in AI coding benchmarks right now.¹¹ Once SWE-bench Verified became the primary ranking mechanism, model developers optimised for it — through fine-tuning, scaffolding tuning, and, critically, training data inclusion. The result is Verified scores that no longer meaningfully discriminate between frontier models.

SWE-bench Pro’s response is structural: GPL and proprietary codebases create legal barriers to inclusion in training data, and the multi-file, multi-language task format resists the 4-line Python patch memorisation pattern that inflated Verified scores.²

A starker illustration: on SWE-EVO (a benchmark testing sustained evolution of existing systems), GPT-5 with OpenHands scores 21% — compared to 65% on SWE-bench Verified.¹² On commercial/enterprise codebases, the best models score under 20%.¹² Real codebases are harder than benchmarks suggest.

What the Numbers Actually Mean for You

Translating benchmark scores into tooling decisions requires a few calibrations:

1. Match the benchmark to your workload

Your primary use case	Most predictive benchmark
Bug fixing in Python repos	SWE-bench Pro (not Verified)
Terminal automation, devops, ML pipelines	Terminal-Bench 2.0
Multi-language, multi-file features	SWE-bench Pro private subset
Greenfield development	⚠️ No benchmark captures this well

2. Subtract the scaffold inflation

Any score citing a vendor’s own scaffolding inflates by 4–12 points versus standardised evaluation.² For apples-to-apples comparison, prefer SEAL leaderboard numbers or Terminal-Bench (which specifies the agent framework explicitly).
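As a rough worked example of this calibration, the sketch below turns a vendor-reported score into a plausible range under standardised scaffolding by subtracting the 4–12 point inflation quoted above. The function name and the idea of reporting a range are illustrative choices, and the result is an estimate, not a measured score.

```python
def standardised_range(
    vendor_score: float,
    inflation: tuple[float, float] = (4.0, 12.0),
) -> tuple[float, float]:
    """Estimate (low, high) standardised scores from a vendor-scaffold score.

    `inflation` is the (min, max) number of points a custom harness is
    assumed to add, taken from the 4-12 point figure in the text.
    """
    min_inflation, max_inflation = inflation
    return vendor_score - max_inflation, vendor_score - min_inflation


# GPT-5.3-Codex on SWE-bench Pro (Public), as reported with OpenAI's scaffold.
low, high = standardised_range(56.8)
print(f"Estimated standardised range: {low:.1f}% to {high:.1f}%")
```

The output range (about 44.8% to 52.8%) is only a sanity check on vendor numbers. It is not a substitute for an actual run on a standardised leaderboard.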

3. Model selection within Codex CLI

Given current benchmarks, the practical hierarchy for Codex CLI is:

# config.toml: model selection by task type
# Profile names are illustrative; select one with `codex --profile <name>`.

# Default: highest reasoning quality, complex multi-file tasks
model = "gpt-5.4"

# Specialist coding tasks
# 56.8% SWE-bench Pro (OpenAI scaffold), 78.4% Terminal-Bench 2.0 (SageAgent scaffold)
[profiles.coding]
model = "gpt-5.3-codex"

# Subagent-style tasks, exploration, large-file review
[profiles.subagent]
model = "gpt-5.4-mini"

For tasks analogous to SWE-bench scenarios (isolated bug fix, single PR scope), gpt-5.3-codex is the purpose-built choice. For broader agentic tasks with planning and computer use, gpt-5.4 leads.⁷

4. Run your own micro-benchmark

The only benchmark that truly matters is performance on your codebase. Terminal-Bench 2.0 offers a useful template: 5–10 representative tasks from your actual workflows, each with a Docker environment and a deterministic test. Run them against each model and scaffold configuration you’re considering. Costs are low enough ($1–10 for a small suite) that this is practical.³
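A micro-benchmark of this kind needs little more than a task list and a deterministic check per task. The following is a minimal sketch, not the Harbor or Terminal-Bench task format: the `MicroTask` type, its fields, and the demo commands are all hypothetical. It assumes the agent has already been run in each task's environment, and only scores the outcome via a verifier command that exits 0 on success.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class MicroTask:
    name: str               # hypothetical task label
    instruction: str        # natural-language prompt given to the agent
    verify_cmd: list[str]   # command that exits 0 if the outcome is correct


def run_verifier(task: MicroTask, timeout_s: int = 60) -> bool:
    """Run the task's deterministic check; a timeout counts as a failure."""
    try:
        completed = subprocess.run(
            task.verify_cmd, capture_output=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


def score_suite(tasks: list[MicroTask]) -> tuple[dict[str, bool], float]:
    """Return per-task outcomes and the overall pass rate."""
    outcomes = {task.name: run_verifier(task) for task in tasks}
    rate = sum(outcomes.values()) / len(outcomes) if outcomes else 0.0
    return outcomes, rate


# Placeholder verifiers; in practice these would be your own test scripts.
demo_tasks = [
    MicroTask("always-passes", "placeholder",
              [sys.executable, "-c", "raise SystemExit(0)"]),
    MicroTask("always-fails", "placeholder",
              [sys.executable, "-c", "raise SystemExit(1)"]),
]

outcomes, rate = score_suite(demo_tasks)
print(outcomes, f"pass rate: {rate:.0%}")
```

In a real suite, each `verify_cmd` would point at a test script executed inside the task's container, and the same suite would be rerun for every model and scaffold configuration under comparison.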

The Benchmark Hierarchy in 2026

graph LR
    A[SWE-bench Verified\n⚠️ Contaminated\nFrontier models ~80%] -->|replaced by| B[SWE-bench Pro\nFrontier models 40-57%\nMulti-language, GPL-licensed]
    B -->|private subset\nlower scores| C[SWE-bench Pro Private\nFrontier models 15-25%\nHardest contamination test]
    D[Terminal-Bench 2.0\n89 tasks, Docker\nFrontier models 63-82%] -->|domain-specific| E[Your Own Benchmark\nYour codebase\nYour tasks]
    B --> E

Summary

SWE-bench Verified is contaminated at the frontier. Stop citing it as the gold standard — it isn’t.
SWE-bench Pro is the honest measure. Scores are 30–40 points lower than Verified; that’s the reality.
Terminal-Bench 2.0 is the most relevant benchmark for Codex CLI users — it tests the kind of end-to-end terminal workflows the tool is built for.
Scaffolding adds 4–12 points over standardised baselines. Don’t compare vendor-reported scores against SEAL scores directly.
Model selection matters more than scaffold when optimising for Terminal-Bench (52% improvement from GPT-5-Nano to GPT-5.2).⁹
The best benchmark is your own codebase. Use Terminal-Bench’s task format as a template.

Citations

1. OpenAI audit confirming contamination on SWE-bench Verified: SWE-Bench Pro Leaderboard (2026): Why 46% Beats 81%
2. SWE-bench Pro methodology and score comparisons: Scale Labs Leaderboard: SWE-Bench Pro (Public Dataset)
3. Terminal-Bench 2.0 structure, tasks, costs, and Harbor framework: Terminal-Bench 2.0; Terminal-Bench 2.0 launches alongside Harbor (VentureBeat)
4. SWE-bench origin, methodology (Princeton, 12 repos, 2,294 tasks): SWE-bench/SWE-bench (GitHub)
5. SWE-bench pass/fail evaluation and patch-application methodology: Introducing SWE-bench Verified (OpenAI)
6. SWE-bench Verified: 500 human-validated tasks released August 2024: Introducing SWE-bench Verified (OpenAI)
7. GPT-5.3-Codex benchmark scores (SWE-bench Pro 56.8%, Terminal-Bench 2.0 77.3%): OpenAI debuts GPT-5.3-Codex (Neowin)
8. Terminal-Bench 2.0 leaderboard top scores (March 2026): Terminal-Bench 2.0 (tbench.ai)
9. Codex CLI resolution rate increase (52%) from model upgrade on Terminal-Bench: Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in CLI
10. Confucius Code Agent outperforming higher-capability model via scaffold: Confucius Code Agent (arXiv)
11. Goodhart’s Law in AI agent benchmarks: Goodhart’s Law Is Now an AI Agent Problem (DEV Community)
12. SWE-EVO benchmark showing 21% vs 65% gap, enterprise codebase performance: SWE-EVO: Benchmarking Coding Agents (arXiv)