Evaluating Codex Agents: Evals, Long-Horizon Benchmarks, and the 4-File Pattern

Research compiled 2026-03-28. Sources: OpenAI eval-skills guide, Run long horizon tasks with Codex (Feb 23, 2026), LongCLI-Bench (Feb 2026).

Agentic coding breaks the assumptions that make unit tests sufficient. When Codex runs for hours or spawns a dozen subagents, “did the tests pass?” is necessary but not enough — it doesn’t tell you whether the agent took the right path, consumed a reasonable number of tokens, or avoided modifying files it shouldn’t have touched.

This article covers three layers of Codex evaluation: skill-level evals for short tasks, the 4-file durable memory pattern for long-horizon runs, and what LongCLI-Bench reveals about the state of the art.

Layer 1: Skill Evals (Short Tasks, ~10–20 Prompts)

OpenAI’s eval-skills guide defines evaluation for Codex agents as: a prompt → a captured run (trace + artifacts) → a set of checks → a comparable score.

Three goal types to measure:

Goal type	Question	How to check
Process	Did Codex invoke the right tools in the right order?	JSONL trace: check `command_execution` events
Style	Does output follow project conventions?	`--output-schema` rubric grading
Efficiency	Did it get there without thrashing?	Token count + step count thresholds
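
For the Style row, one way to make rubric grading machine-checkable is to give `--output-schema` a JSON Schema that forces the grader into a fixed response shape. The sketch below builds such a schema in Python. The field names (`overall_pass`, `score`, `checks`) are illustrative assumptions, not a rubric format defined by OpenAI.

```python
import json


def build_rubric_schema() -> dict:
    """Return a JSON Schema for a grader response (all field names are hypothetical)."""
    check = {
        "type": "object",
        "properties": {
            "id": {"type": "string"},      # e.g. "follows-heading-convention"
            "pass": {"type": "boolean"},
            "notes": {"type": "string"},
        },
        "required": ["id", "pass", "notes"],
        "additionalProperties": False,
    }
    return {
        "type": "object",
        "properties": {
            "overall_pass": {"type": "boolean"},
            "score": {"type": "integer"},  # 0-100 by convention in this sketch
            "checks": {"type": "array", "items": check},
        },
        "required": ["overall_pass", "score", "checks"],
        "additionalProperties": False,
    }


def write_schema(path: str) -> None:
    """Write the schema to disk so it can be passed to the grader run."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_rubric_schema(), fh, indent=2)
```

Calling `write_schema("release-eval-schema.json")` would produce a file usable as the schema argument; listing every property as required and disallowing extras keeps scores comparable across runs.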

The core eval loop:

# 1. Run skill and capture trace
codex exec --json "Run the release notes skill" > trace.jsonl

# 2. Deterministic checks: did it call expected tools?
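# (JSONL is one event per line; this assumes the command is nested under .item)
jq -r 'select(.type == "item.completed" and .item.type == "command_execution") | .item.command' trace.jsonl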

# 3. Structured rubric grading
codex exec --output-schema release-eval-schema.json "Grade this output: $(cat output.md)"
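
The deterministic check in step 2 can also be scripted so it returns a pass/fail result. This is a minimal sketch, assuming each line of `trace.jsonl` is a standalone JSON event and that shell commands appear as `item.completed` events whose `item` has `type: "command_execution"` and a `command` field; verify that shape against your own traces. The function names are hypothetical.

```python
import json
from typing import Iterable


def extract_commands(jsonl_lines: Iterable[str]) -> list[str]:
    """Collect the command strings from completed command_execution items."""
    commands = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON noise in the stream
        if not isinstance(event, dict):
            continue
        item = event.get("item")
        if (
            event.get("type") == "item.completed"
            and isinstance(item, dict)
            and item.get("type") == "command_execution"
        ):
            commands.append(item.get("command", ""))
    return commands


def ran_in_order(commands: list[str], expected: list[str]) -> bool:
    """True if each expected substring matches some command, in the given order."""
    remaining = iter(commands)  # shared iterator enforces ordering
    return all(any(needle in cmd for cmd in remaining) for needle in expected)
```

A process check then becomes `ran_in_order(extract_commands(open("trace.jsonl")), ["git log", "CHANGELOG"])`, where the expected substrings are whatever the skill is supposed to run.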

Scale guidance: 10–20 prompts is enough to surface regressions in a single skill. Grow the dataset as you encounter real failures. For CI, gate on an 80–90% pass rate: at least N of the M eval samples must pass for the build to go green.
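
The N-of-M gate is simple arithmetic, sketched here under the assumption that each eval sample has already been reduced to a boolean. `gate_passes` is a hypothetical helper; integer math is used so the threshold does not depend on float rounding.

```python
def required_passes(total: int, pass_pct: int) -> int:
    """Smallest number of passing samples that meets pass_pct percent of total."""
    return (total * pass_pct + 99) // 100  # ceiling division


def gate_passes(results: list[bool], pass_pct: int = 80) -> bool:
    """Return True when enough eval samples passed to clear the CI gate."""
    if not results:
        raise ValueError("no eval results to gate on")
    return sum(results) >= required_passes(len(results), pass_pct)
```

With 10 samples and an 80% gate, 8 passes clear it and 7 do not; raising `pass_pct` to 90 requires 9.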

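Measure variance: The same prompt can produce different results. Run each prompt several times (three is a reasonable start) and compare pass rates across the runs:

# Repeat the run in a shell loop, keeping one trace per attempt
for i in 1 2 3; do
  codex exec --json "Run the security audit skill" > "trace-$i.jsonl"
done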
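
Once repeated runs have been graded, variance shows up as prompts that pass on some attempts and fail on others. The following sketch aggregates per-prompt pass rates; the `(prompt_id, passed)` tuples are an assumed input format, not something Codex emits directly.

```python
from collections import defaultdict


def pass_rates(runs: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each prompt id to the fraction of its repeated runs that passed."""
    totals = defaultdict(lambda: [0, 0])  # prompt id -> [passes, attempts]
    for prompt_id, passed in runs:
        totals[prompt_id][0] += int(passed)
        totals[prompt_id][1] += 1
    return {pid: passes / attempts for pid, (passes, attempts) in totals.items()}


def flaky_prompts(rates: dict[str, float]) -> list[str]:
    """Prompts that neither always pass nor always fail across repeats."""
    return sorted(pid for pid, rate in rates.items() if 0.0 < rate < 1.0)
```

Flaky prompts are worth inspecting before they are counted against a CI gate, since they may point to an ambiguous prompt rather than a regression.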

Cost guardrails: Set thresholds to catch regressions:

A security audit skill should cost < $0.25 and complete in < 30 seconds
A spike in latency or cost usually means the agent is thrashing
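
A guardrail like this can be expressed as a small budget check. The sketch assumes cost and wall-clock time are measured elsewhere and passed in; the `Budget` defaults mirror the security audit example, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_cost_usd: float = 0.25
    max_seconds: float = 30.0


def budget_violations(cost_usd: float, seconds: float, budget: Budget = Budget()) -> list[str]:
    """Return a human-readable message for each threshold the run met or exceeded."""
    problems = []
    if cost_usd >= budget.max_cost_usd:
        problems.append(f"cost ${cost_usd:.2f} is not below ${budget.max_cost_usd:.2f}")
    if seconds >= budget.max_seconds:
        problems.append(f"latency {seconds:.1f}s is not below {budget.max_seconds:.1f}s")
    return problems
```

An empty list means the run stayed within budget; any entries can be surfaced directly in CI output.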

Layer 2: Long-Horizon Run Patterns (Hours, Not Minutes)

The OpenAI blog post (Feb 23, 2026) documents a GPT-5.3-Codex run of ~25 hours, 13M tokens, ~30K lines of code to build a design tool from scratch. The key enabler wasn’t the model — it was the 4-file durable project memory pattern.

The 4-File Pattern

Instead of relying on context window alone, maintain four markdown files as external state:

File	Purpose
`Prompt.md`	Frozen spec — non-negotiable requirements, never modified during the run
`Plan.md`	Verifiable milestones with acceptance criteria (checked off as work proceeds)
`Implement.md`	Operational instructions, validation protocols, coding conventions
`Documentation.md`	Real-time status and decision log — Codex updates this as it works

Why it works: Context compaction prunes conversation history, but these files survive. When Codex compacts and re-reads them, it regains orientation without losing project state. Prompt.md acts as a constraint anchor — the model can’t drift from the original spec.
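
Because Prompt.md is meant to stay frozen, an eval harness can verify that mechanically instead of trusting the agent. This is one possible check, not something described in the OpenAI post: record a hash before the run and compare it afterwards. The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(path: str) -> str:
    """SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def spec_unchanged(path: str, baseline: str) -> bool:
    """True if the file still matches the digest recorded before the run."""
    return fingerprint(path) == baseline
```

Recording `baseline = fingerprint("Prompt.md")` at kickoff and asserting `spec_unchanged("Prompt.md", baseline)` at each milestone turns "never modified during the run" into a testable property.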

AGENTS.md integration:

## Long Horizon Run Protocols

When given a multi-day task:
1. Read Prompt.md (do not modify)
2. Read Plan.md (update status as milestones complete)
3. Read Documentation.md (append decisions and blockers)
4. After each milestone: run verification commands (lint, typecheck, tests, build)
5. Repair failures before continuing to next milestone

Practical verification rhythm: After each milestone, Codex ran lint checks, typechecks, the test suite, and a build. Failures were repaired before continuing. This continuous verification prevented error accumulation over the long run.
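
The verification rhythm amounts to running a fixed list of commands and stopping at the first failure. Here is a minimal sketch of that loop. The post does not name the commands Codex ran, so any command list passed in is an assumption about your own project.

```python
import subprocess
from typing import Optional, Sequence


def verify_milestone(commands: Sequence[Sequence[str]]) -> Optional[Sequence[str]]:
    """Run each command in order; return the first one that fails, or None if all pass."""
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return cmd  # stop here so the failure is repaired before moving on
    return None
```

For a Node project this might be called with `[["npm", "run", "lint"], ["npm", "run", "typecheck"], ["npm", "test"], ["npm", "run", "build"]]` (hypothetical script names). Returning the failing command, rather than a bare boolean, tells the agent or reviewer what to repair.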

Layer 3: LongCLI-Bench — What Research Says

LongCLI-Bench (arXiv:2602.14337, submitted Feb 15, 2026) is the first benchmark designed specifically to test long-horizon CLI agent behaviour at realistic task lengths.

Key findings:

All tested agents — including Claude Code (Claude-Opus-4.6) and Codex (GPT-5.3-Codex) — achieve < 20% pass rate
Most failures occur in the early stage of tasks (< 30% completion), not at the finish line
Self-correction provides only marginal improvement — the bottleneck is initial planning, not in-flight correction
Human-agent collaboration (plan injection, interactive guidance) significantly outperforms fully autonomous runs

Benchmark design:

20 curated tasks across 4 engineering categories: from-scratch development, feature addition, bug fixing, refactoring
Dual-set testing: requirement fulfillment (fail-to-pass) + regression avoidance (pass-to-pass)
Expert completion time average: 1,000+ minutes per task (vs 207 min for Terminal-Bench@2)
Step-level scoring gives granular failure analysis
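
Dual-set testing can be read as two independent conditions that must both hold. The sketch below illustrates that idea only. It does not reproduce the benchmark's actual scoring code, and the per-set rates are an assumed way of reporting partial progress.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DualSetResult:
    requirement_rate: float  # share of fail-to-pass tests now passing
    regression_rate: float   # share of pass-to-pass tests still passing
    passed: bool             # True only if both sets are fully green


def score_task(fail_to_pass: list[bool], pass_to_pass: list[bool]) -> DualSetResult:
    """Score one task from its two test sets."""
    if not fail_to_pass:
        raise ValueError("a task needs at least one fail-to-pass test")
    requirement = sum(fail_to_pass) / len(fail_to_pass)
    # With no pass-to-pass tests there is nothing to regress.
    regression = sum(pass_to_pass) / len(pass_to_pass) if pass_to_pass else 1.0
    return DualSetResult(requirement, regression, all(fail_to_pass) and all(pass_to_pass))
```

A run that implements every requirement but breaks one existing test scores `requirement_rate == 1.0` and still fails, which is the point of the second set.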

Implication for Daniel’s agentic pod design: The < 20% autonomous pass rate reinforces why plan injection (the 4-file pattern) and human milestone checkpoints produce better outcomes than pure “fire-and-forget” automation. The data supports a supervised autonomy model: agent executes, human reviews milestones, agent continues.

Practical rule from LongCLI-Bench: Front-load planning. Failures cluster at the start of tasks, not the end, so a well-structured Plan.md with clear acceptance criteria targets exactly the stage where agents most often go wrong.
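
One way to enforce clear acceptance criteria is to lint Plan.md before the run starts. The sketch assumes a hypothetical layout in which each milestone is a markdown checkbox line (`- [ ] Title`) followed by at least one line beginning with `Accept:`. Neither the source post nor the benchmark prescribes this format.

```python
import re

# Matches "- [ ] Title" or "- [x] Title" at the start of a line.
MILESTONE = re.compile(r"^- \[( |x)\] (.+)$")


def milestones_missing_criteria(plan_text: str) -> list[str]:
    """Return titles of milestones that have no 'Accept:' line beneath them."""
    missing = []
    current = None
    has_criteria = False
    for line in plan_text.splitlines():
        match = MILESTONE.match(line)
        if match:
            if current is not None and not has_criteria:
                missing.append(current)
            current, has_criteria = match.group(2).strip(), False
        elif current is not None and line.strip().lower().startswith("accept:"):
            has_criteria = True
    if current is not None and not has_criteria:
        missing.append(current)
    return missing
```

Refusing to start a long run while this list is non-empty is a cheap form of plan injection: the human fixes the plan up front rather than debugging the agent later.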

Putting It Together: The Evaluation Stack

Short tasks (< 30 min)
  → Skill evals: 10-20 prompts, JSONL trace checks, --output-schema rubric
  → CI gate: 80–90% pass rate, cost < $0.25, latency < 30s

Medium tasks (30 min – 4 hours)
  → TDD feedback loop: tests define "done", agent iterates until green
  → PostToolUse hook: auto-run tests after every file write
  → /fork to explore alternatives without losing main progress

Long tasks (4+ hours)
  → 4-file durable memory: Prompt.md + Plan.md + Implement.md + Documentation.md
  → Milestone verification: lint + typecheck + tests + build after each milestone
  → gpt-5.1-codex-max for compaction-aware extended runs
  → xhigh reasoning effort for tasks that genuinely resist solution at lower budgets
  → Human checkpoint at each milestone (not continuous monitoring)
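
The stack above is effectively a lookup keyed on expected task duration. A small sketch of that mapping follows, with the boundaries taken from the tiers listed (30 minutes and 4 hours). The function and dictionary names are hypothetical.

```python
PRACTICES = {
    "short": ["skill evals", "JSONL trace checks", "rubric grading", "CI gate"],
    "medium": ["TDD feedback loop", "tests after every file write", "fork to explore"],
    "long": ["4-file durable memory", "milestone verification", "human checkpoints"],
}


def evaluation_tier(expected_minutes: float) -> str:
    """Classify a task as short, medium, or long by its expected duration."""
    if expected_minutes < 0:
        raise ValueError("duration cannot be negative")
    if expected_minutes < 30:
        return "short"
    if expected_minutes < 240:  # 4 hours
        return "medium"
    return "long"
```

`PRACTICES[evaluation_tier(90)]` returns the medium-tier checklist, and a task estimated at exactly four hours is treated as long.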

Model Selection for Evaluation Tasks

Scenario	Model	Reasoning	Why
Eval runs (fast, many repeats)	`gpt-5-codex-mini`	`low`	4x more included usage, 2x faster
Standard task evaluation	`gpt-5-codex`	`medium`	Default; good quality/cost balance
Long-horizon autonomous run	`gpt-5.1-codex-max`	`high`	Compaction-aware; designed for extended runs
Hard algorithmic / security eval	`gpt-5.3-codex`	`xhigh`	For tasks that resist solution at lower budgets
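
In a harness, the table can be kept as data so the model choice is not hard-coded per script. The sketch below copies the rows above and builds an argument list for `codex exec`. It assumes the CLI accepts `--model` and a `-c model_reasoning_effort="..."` config override, which should be checked against your installed version. The scenario keys are hypothetical.

```python
# Scenario key -> (model, reasoning effort), copied from the table above.
MODEL_TABLE = {
    "eval_repeats": ("gpt-5-codex-mini", "low"),
    "standard": ("gpt-5-codex", "medium"),
    "long_horizon": ("gpt-5.1-codex-max", "high"),
    "hard": ("gpt-5.3-codex", "xhigh"),
}


def build_exec_argv(scenario: str, prompt: str) -> list[str]:
    """Build a codex exec argument list for the given scenario (flags are assumed)."""
    model, effort = MODEL_TABLE[scenario]
    return [
        "codex", "exec",
        "--model", model,
        "-c", f'model_reasoning_effort="{effort}"',
        prompt,
    ]
```

The resulting list can be passed to `subprocess.run` without shell quoting concerns, and swapping the model for a scenario becomes a one-line change to the table.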

Key Takeaway

Evaluation for Codex agents is a spectrum. At the short end, JSONL traces + rubric scoring + 10-prompt eval sets catch regressions in skills cheaply. At the long end, the 4-file durable memory pattern + milestone verification enables coherent 25-hour runs. The benchmark research confirms that neither end is “solved” — but structured context management is what separates 10-minute demos from production-grade agentic workflows.

Related:

notes/best-practices.md

articles/2026-03-28-test-first-development-codex-tdd-feedback-loop.md

articles/2026-03-28-codex-agent-loop-deep-dive.md