Self-Harness: What Autonomous Agent Framework Improvement Means for Codex CLI AGENTS.md and Hook Optimisation

Every Codex CLI practitioner has done it: noticed a recurring failure, opened AGENTS.md, added a rule, tested again, kept or reverted the change. This manual loop — observe failure, hypothesise cause, edit harness, validate — is the dominant workflow for improving agent performance today. Zhang et al.’s “Self-Harness: Harnesses That Improve Themselves” (arXiv:2606.09498, 8 June 2026) formalises that loop and hands it to the agent itself¹. The results are striking: MiniMax M2.5 jumped from 40.5% to 61.9% on Terminal-Bench 2.0, a +52.8% relative gain, with no human engineering and no stronger external model¹.

This article examines Self-Harness’s three-stage loop, what its findings reveal about harness optimisation in general, and how Codex CLI practitioners can apply the same principles — today — using AGENTS.md iteration, PostToolUse hooks, and named profiles.

The Research: Agents That Improve Their Own Operating Framework

The Harness Problem

The “Agent = Model + Harness” formulation is now well established². The harness is everything that is not the model: system prompts, tool definitions, runtime mechanisms, verification rules, orchestration logic, and failure-recovery procedures¹. Different models exhibit different behavioural weaknesses, which means effective harness design is inherently model-specific¹. Yet harnesses are still largely hand-engineered by humans — a process that scales poorly as models diversify and evolve rapidly¹.

The Self-Harness Loop

Self-Harness replaces human engineering with a three-stage iterative loop:

flowchart TD
    A[Run agent on tasks] --> B[Collect execution traces]
    B --> C[Weakness Mining]
    C --> D[Cluster failure signatures]
    D --> E[Harness Proposal]
    E --> F[Generate K candidate modifications]
    F --> G[Proposal Validation]
    G --> H{Improves held-in without\ndegrading held-out?}
    H -->|Yes| I[Accept modification]
    H -->|No| J[Reject modification]
    I --> A
    J --> E

Weakness Mining analyses execution traces from failed tasks, clustering them by verifier-grounded failure signatures — terminal causes and agent mechanisms rather than surface symptoms¹. A surface symptom might be “test failed”; a failure signature might be “agent entered infinite tool-call loop after dependency installation error.”

Harness Proposal generates K diverse yet minimal candidate modifications, each grounded in a primary failure mechanism and mapped to a concrete editable surface: system prompt text, tool wrappers, middleware logic, or runtime configuration¹.

Proposal Validation applies a regression-gated acceptance criterion: a candidate is accepted only if it improves performance on at least one data split without degradation on the other¹. This prevents the common trap where fixing one failure class introduces new regressions.
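
The acceptance rule can be sketched in a few lines of shell. This is an illustration rather than the paper's implementation: it assumes each data split is summarised as an integer pass count, and the function name accept_candidate and its argument order are hypothetical.

```shell
#!/bin/sh
# Regression-gated acceptance, sketched with integer pass counts.
# Arguments (hypothetical): baseline and candidate pass counts on the
# held-in split, then baseline and candidate pass counts on the held-out split.
accept_candidate() {
    local base_in=$1 cand_in=$2 base_out=$3 cand_out=$4

    # Reject if either split got worse.
    if [ "$cand_in" -lt "$base_in" ] || [ "$cand_out" -lt "$base_out" ]; then
        return 1
    fi
    # Accept only if at least one split strictly improved.
    [ "$cand_in" -gt "$base_in" ] || [ "$cand_out" -gt "$base_out" ]
}

# Example: held-in improves from 10 to 12 passes, held-out stays at 8.
if accept_candidate 10 12 8 8; then
    echo "accept"
else
    echo "reject"
fi
```

A candidate that leaves both splits unchanged is rejected as well, which keeps no-op edits from accumulating in the harness.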

Quantitative Results

The authors tested Self-Harness across three diverse model families on Terminal-Bench 2.0, a benchmark of 64 containerised terminal tasks spanning ML, systems, security, and biology domains¹:

Model	Baseline	After Self-Harness	Absolute Gain	Relative Gain
MiniMax M2.5	40.5%	61.9%	+21.4 pts	+52.8%
Qwen3.5-35B-A3B	23.8%	38.1%	+14.3 pts	+60.1%
GLM-5	42.9%	57.1%	+14.2 pts	+33.1%
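
The relative-gain column follows from the baseline and final scores as (after - baseline) / baseline. A short check with awk reproduces the figures; the helper name relative_gain is made up for this sketch.

```shell
#!/bin/sh
# Relative gain in percent, rounded to one decimal place.
relative_gain() {
    awk -v base="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - base) / base * 100 }'
}

relative_gain 40.5 61.9   # MiniMax M2.5
relative_gain 23.8 38.1   # Qwen3.5-35B-A3B
relative_gain 42.9 57.1   # GLM-5
```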

Critically, the modifications were model-specific. MiniMax M2.5 needed early artifact creation, bounded tool-message limits, and redirection after prolonged tool use. Qwen3.5 needed dependency prechecking, retry discipline, and loop-breaking mechanisms. GLM-5 needed persistent environment changes across shell sessions and transitions from exploration to implementation phases¹. No single modification worked universally.

Convergence Behaviour

Performance gains accumulated iteratively, with the system converging after approximately 5–7 proposal rounds per model³. MiniMax reached its final performance through roughly three accepted edits over multiple proposal rounds, suggesting that a small number of well-targeted modifications can deliver outsized gains³.

Why This Matters for Codex CLI

Your Harness Is Your Performance Lever

Self-Harness offers strong empirical evidence that harness quality is a major lever on practical agent performance. A 21.4-point absolute gain for MiniMax M2.5 from harness modifications alone, with no model change, no fine-tuning, and no stronger external model, should recalibrate how practitioners allocate their optimisation effort. The gain is not free, since the loop itself consumes evaluation compute, but it requires no change to the model.

For Codex CLI, your harness surfaces are:

AGENTS.md — repository-level instructions that shape agent behaviour⁴
config.toml — model selection, token budgets, compaction thresholds, and named profiles⁵
Hooks — PostToolUse, Stop, UserPromptSubmit callbacks that wrap tool execution with validation logic⁵
Named profiles — per-task configuration bundles that adjust the harness for specific workflows⁵

Model-Specific Harness Design

Self-Harness’s most provocative finding is that modifications are model-specific rather than generic instruction additions¹. This has direct implications for Codex CLI’s named profile system. If you switch between GPT-5.5 and o4-mini (or any third-party model via custom providers), your AGENTS.md instructions may need to differ per model.

Codex CLI offers a hook for this in AGENTS.override.md, which is read in preference to the AGENTS.md alongside it⁴. The file is not tied to a model by itself, so the Self-Harness results suggest practitioners should maintain separate override files tuned to each model’s behavioural weaknesses and put the matching one in place when switching models:

# config.toml — model-specific profiles with different AGENTS.md guidance
[profiles.deep-reasoning]
model = "o4-mini"
# This profile benefits from explicit step-by-step decomposition instructions

[profiles.fast-implementation]
model = "gpt-5.5"
# This profile benefits from artifact-first instructions and bounded tool limits
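
One way to pair a profile with its own guidance is a small wrapper that puts the right file in place before launching Codex. The sketch below assumes a hypothetical repository layout, harness/AGENTS.<profile>.md, and a made-up helper name, use_profile_guidance; neither is a Codex CLI feature.

```shell
#!/bin/sh
# Hypothetical layout: harness/AGENTS.<profile>.md holds per-model guidance.
# use_profile_guidance copies the matching file to AGENTS.override.md in the
# given repository root, or removes a stale override if no file matches.
use_profile_guidance() {
    local profile=$1 repo_root=${2:-.}
    local src="$repo_root/harness/AGENTS.$profile.md"
    local dest="$repo_root/AGENTS.override.md"

    if [ -f "$src" ]; then
        cp "$src" "$dest"
    else
        rm -f "$dest"
        echo "no guidance file for profile '$profile'" >&2
        return 1
    fi
}

# Usage, with the profile names from the config.toml fragment:
#   use_profile_guidance deep-reasoning && codex --profile deep-reasoning
```

Removing the stale override on a miss is a deliberate choice: running one model with another model's guidance is the failure this is meant to prevent.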

The Weakness Mining Analogue: Trace Analysis

Self-Harness’s Weakness Mining stage clusters failures by mechanism, not symptom. Codex CLI practitioners can approximate this with structured trace analysis:

Collect failure traces: Capture full execution traces from failed sessions, for example by saving the event stream from codex exec --json or by raising log verbosity through the RUST_LOG environment variable⁵
Cluster by mechanism: Group failures by root cause, such as tool-call loops, dependency errors, context overflow, verification failures, and wrong file edits
Map to harness surface: Identify which harness component each failure cluster maps to
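
The clustering step can be roughed out with standard text tools. This is only a sketch: it assumes one plain-text .log file per failed session, the grep patterns are illustrative guesses rather than strings Codex CLI is known to emit, and the function names classify_trace and cluster_traces are invented for the example.

```shell
#!/bin/sh
# classify_trace prints one mechanism label for a trace file.
# Replace the patterns with signatures taken from your own traces.
classify_trace() {
    local trace=$1 max_repeat
    # Highest number of times any single line repeats in the trace.
    max_repeat=$(sort "$trace" | uniq -c | sort -rn | awk 'NR==1 {print $1}')

    if grep -qiE 'ModuleNotFoundError|command not found|No matching distribution' "$trace"; then
        echo "dependency-error"
    elif grep -qiE 'context (window|length)|maximum.*tokens' "$trace"; then
        echo "context-overflow"
    elif [ "${max_repeat:-0}" -ge 5 ]; then
        # The same line five or more times suggests a tool-call loop.
        echo "tool-call-loop"
    else
        echo "unclassified"
    fi
}

# cluster_traces prints "count label" pairs, most frequent first.
cluster_traces() {
    local dir=$1 f
    for f in "$dir"/*.log; do
        [ -e "$f" ] || continue
        classify_trace "$f"
    done | sort | uniq -c | sort -rn
}
```

Running cluster_traces on a directory of failed-session logs gives a ranked list of mechanisms, which is usually enough to decide which harness surface to edit first. Anything that lands in "unclassified" still needs a manual read.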

flowchart LR
    subgraph Failure Clusters
        A[Tool-call loops]
        B[Dependency errors]
        C[Context overflow]
        D[Verification failures]
    end
    subgraph Harness Surfaces
        E["AGENTS.md rules"]
        F["PostToolUse hooks"]
        G["config.toml budgets"]
        H["Stop hook gates"]
    end
    A --> E
    A --> F
    B --> E
    C --> G
    D --> H

The Proposal Validation Analogue: Regression-Gated AGENTS.md Changes

The most transferable insight from Self-Harness is its regression-gated acceptance criterion. Too many practitioners add rules to AGENTS.md after a single failure without checking whether the new rule breaks existing workflows.

A disciplined approach mirrors Self-Harness’s validation stage:

# 1. Baseline: run your eval suite before the change
codex exec --profile baseline-eval \
  "Run the test suite and report pass/fail counts" \
  > baseline-results.txt

# 2. Apply the AGENTS.md modification

# 3. Re-run: check for improvement without regression
codex exec --profile baseline-eval \
  "Run the test suite and report pass/fail counts" \
  > modified-results.txt

# 4. Compare: accept only if improved without degradation.
# The reports are free-form text, so read the diff for the pass/fail
# counts rather than treating any textual difference as a regression.
diff baseline-results.txt modified-results.txt

For teams with established eval harnesses, tools like promptfoo or adk eval can automate this comparison with trajectory assertions⁶.
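
Without such a tool, a per-task comparison is more reliable than diffing free-form reports. The sketch below assumes a hypothetical format: one file per run listing the IDs of passing tasks, one per line. The helper names are invented for the example.

```shell
#!/bin/sh
# compare_pass_lists FLAG BASELINE MODIFIED
# Sorts both lists and hands them to comm with the given column flag.
compare_pass_lists() {
    local flag=$1 a b
    a=$(mktemp)
    b=$(mktemp)
    sort -u "$2" > "$a"
    sort -u "$3" > "$b"
    comm "$flag" "$a" "$b"
    rm -f "$a" "$b"
}

# regressions: tasks that passed at baseline but not after the change.
regressions() { compare_pass_lists -23 "$1" "$2"; }

# fixes: tasks that pass only after the change.
fixes() { compare_pass_lists -13 "$1" "$2"; }

# Accept only when something was fixed and nothing regressed.
accept_change() {
    [ -z "$(regressions "$1" "$2")" ] && [ -n "$(fixes "$1" "$2")" ]
}
```

Listing regressions by task ID also tells you which workflow the new rule broke, which a bare pass count does not.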

PostToolUse Hooks as Runtime Harness Modifications

Self-Harness’s concrete modifications — bounded tool-message limits, redirection after prolonged tool use, loop-breaking mechanisms — map directly to Codex CLI’s hook system.

Bounded tool limits (MiniMax M2.5’s key modification):

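#!/bin/bash
# hooks/post_tool_use.sh: break tool loops after repeated consecutive failures.
# Assumption: the hook runner exposes the tool's exit status as TOOL_EXIT_CODE.
# Check the hooks documentation for how your Codex CLI version passes it.
FAILURE_COUNT_FILE="/tmp/codex-tool-failures"
MAX_FAILURES=3

# Treat an unset TOOL_EXIT_CODE as success so the test does not error out.
if [ "${TOOL_EXIT_CODE:-0}" -ne 0 ]; then
    count=$(cat "$FAILURE_COUNT_FILE" 2>/dev/null || echo 0)
    count=$((count + 1))
    echo "$count" > "$FAILURE_COUNT_FILE"

    if [ "$count" -ge "$MAX_FAILURES" ]; then
        echo "⚠️ $MAX_FAILURES consecutive tool failures detected. Stop and reassess approach." >&2
        echo 0 > "$FAILURE_COUNT_FILE"
        exit 1
    fi
else
    # Any successful tool call resets the streak.
    echo 0 > "$FAILURE_COUNT_FILE"
fi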

Dependency prechecking (Qwen3.5’s key modification) can be encoded in AGENTS.md:

## Dependency Management
Before running any command that depends on external packages:
1. Check whether the dependency is already installed
2. If not, install it and verify the installation succeeded before proceeding
3. Never assume a package is available — always verify
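
The same check can be made mechanical in any script the agent runs. A minimal sketch using the POSIX command -v builtin follows; the helper name require_cmd is hypothetical.

```shell
#!/bin/sh
# require_cmd succeeds only if every named command is found on PATH.
# It reports each missing command on stderr instead of stopping at the first.
require_cmd() {
    local cmd missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing dependency: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: verify before running, rather than assuming.
#   require_cmd git python3 && python3 -m pytest
```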

Automating the Self-Harness Loop with `codex exec`

The most ambitious application is approximating the full Self-Harness loop using Codex CLI itself. The codex exec command can run tasks in batch mode, collect results, and even propose AGENTS.md modifications⁵:

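#!/bin/bash
# self-harness.sh: one iteration of the Self-Harness loop
mkdir -p traces

# JSON Schema for each task's final message. Assumption: --output-schema
# takes the path to a schema file rather than an inline string.
cat > trace-schema.json <<'EOF'
{
  "type": "object",
  "properties": {
    "pass": { "type": "boolean" },
    "trace": { "type": "string" }
  },
  "required": ["pass", "trace"],
  "additionalProperties": false
}
EOF

# Stage 1: Run tasks and collect traces.
# stderr goes to a separate log so the JSON result stays parseable.
for task in tasks/*.md; do
    name=$(basename "$task" .md)
    codex exec --profile eval-profile \
      --output-schema trace-schema.json \
      "$(cat "$task")" \
      > "traces/$name.json" 2> "traces/$name.log"
done

# Stage 2: Mine weaknesses and propose modifications
codex exec --profile harness-engineer \
  "Analyse the failure traces in traces/ directory. \
   Cluster them by failure mechanism. \
   Propose a single, minimal AGENTS.md modification \
   that addresses the most common failure cluster. \
   Output the proposed diff." \
  > proposed-modification.diff

# Stage 3: Apply, validate, accept/reject
# (apply diff, re-run eval, compare results)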
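
Stage 3 is left as a comment in the script. One possible shape for it is sketched below, under two assumptions: the candidate arrives as a complete replacement file rather than a diff, and the eval command (a placeholder here) exits non-zero when the change should be rejected. The function name try_edit is hypothetical.

```shell
#!/bin/sh
# try_edit FILE CANDIDATE EVAL_CMD...
# Replaces FILE with CANDIDATE, runs EVAL_CMD, and restores the original
# FILE if the eval command fails.
try_edit() {
    local file=$1 candidate=$2 backup
    shift 2
    backup=$(mktemp)
    cp "$file" "$backup"
    cp "$candidate" "$file"

    if "$@"; then
        rm -f "$backup"
        echo "accepted"
    else
        # Roll back so a rejected edit never lingers in the harness.
        cp "$backup" "$file"
        rm -f "$backup"
        echo "rejected"
        return 1
    fi
}

# Usage, with run-eval.sh standing in for your own eval script:
#   try_edit AGENTS.md candidate-AGENTS.md ./run-eval.sh
```

Keeping the rollback inside the same function means an interrupted or failed eval cannot silently leave an unvalidated rule in AGENTS.md.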

Community tools like pi-reflect already automate parts of this workflow, generating self-reviews that evolve AGENTS.md rules from actual mistakes⁷.

Limitations and Caveats

Self-Harness has important limitations that practitioners should weigh:

Benchmark overfitting risk — The paper evaluates on Terminal-Bench 2.0 but does not provide direct comparisons with human-engineered harnesses¹. The modifications may be overfitted to the benchmark’s task distribution rather than reflecting generalisable improvements.

Computational overhead — Full benchmark re-evaluation per iteration is expensive³. For Codex CLI practitioners, this means the Self-Harness loop is best suited to teams with established eval suites that can run affordably.

Fundamental capability gaps — Self-Harness cannot address cases where the model simply lacks the knowledge or reasoning capacity for a task¹. Harness modifications amplify existing capabilities; they do not create new ones.

Production harness applicability — ⚠️ The paper does not evaluate whether Self-Harness works on already-optimised production harnesses. Gains may be largest when starting from a naive baseline³.

Practical Takeaways

Treat AGENTS.md as code: version it, diff it, regression-test it. Self-Harness’s regression-gated acceptance criterion should be your default for any harness modification.
Model-specific configurations matter: Use named profiles and AGENTS.override.md to maintain model-specific harness tuning rather than one-size-fits-all instructions.
Cluster failures by mechanism, not symptom: When analysing agent failures, look for the behavioural pattern (tool loops, dependency assumptions, exploration without implementation) rather than the surface error.
Small, targeted modifications beat large rewrites: For MiniMax M2.5, Self-Harness gained 21.4 points through roughly three accepted edits. Prefer minimal, specific AGENTS.md additions over comprehensive rewrites.
Automate the feedback loop: Use codex exec batch runs, PostToolUse hooks for runtime data collection, and structured eval comparisons to approximate the Self-Harness loop in your own workflow.

Citations

1. Zhang, H., Zhang, S., Li, K., Zhang, C., Chen, Y., Zhang, Y., Bai, L. & Hu, S. (2026). “Self-Harness: Harnesses That Improve Themselves.” arXiv:2606.09498. https://arxiv.org/abs/2606.09498
2. Bui, N. D. Q. (2026). “Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned.” arXiv:2603.05344. https://arxiv.org/abs/2603.05344
3. “Self-Harness: AI Agents That Autonomously Improve Their Own Framework.” explainx.ai, 10 June 2026. https://explainx.ai/blog/self-harness-agents-improve-themselves-arxiv-2026
4. OpenAI. “AGENTS.md — Codex CLI.” OpenAI Developers. https://developers.openai.com/codex/cli/agents-md
5. OpenAI. “CLI — Codex.” OpenAI Developers. https://developers.openai.com/codex/cli
6. OpenAI. “Changelog — Codex.” OpenAI Developers. https://developers.openai.com/codex/changelog
7. “pi-reflect: Self-improving behavioral files for coding agents.” Community tool for automated AGENTS.md self-reviews. https://github.com/bradAGI/awesome-cli-coding-agents
