Software Delegation Contracts: What Reviewability Research Reveals About Trusting Coding Agent Output — and How Codex CLI’s Guardian Architecture Delivers It

The Reviewability Problem

Every team that adopts a coding agent eventually hits the same wall: the agent produces work faster than humans can verify it. The bottleneck is no longer generation — it is review. A diff that took seconds to produce can take twenty minutes to understand, and without structured evidence of what was changed and why, reviewers default to either rubber-stamping or rejecting wholesale.

Müller, Hess and Koziolek formalised this problem in their June 2026 paper Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work ¹. Their controlled study of 64 agent executions across two model tiers demonstrates that explicit delegation contracts — structured agreements covering task scope, authority boundaries, expected evidence, and acceptance criteria — buy measurable reviewability gains even when they do not improve correctness.

This article examines the paper’s findings, maps them onto Codex CLI’s existing architecture, and provides a practical configuration stack for teams who want reviewable agent output by default.

What the Research Found

Study Design

The researchers built a dependency-free TypeScript API task environment with seeded defects and documentation gaps ¹. Ten tasks across five families were executed under three conditions:

Issue-style prompt — a realistic but unstructured task description
Explicit delegation contract — structured scope, authority, and deliverables
Contract + evidence bundle — the contract plus a required evidence package (changed-file lists, known limitations, residual-risk sections, reviewer checklists)

Each of the 64 runs was scored against hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three condition-blinded model-based reviewers using a fixed rubric (192 reviews total) ¹.

Key Findings

graph LR
    A[Issue Prompt] -->|Baseline| B[Reviewability Score]
    C[Delegation Contract] -->|+0.83 pts| B
    D[Contract + Evidence] -->|Best| B
    B --> E[Evidence sufficiency ↑ in 22/30 pairs]
    B --> F[Reviewer ambiguity ↓ p=0.035]
    B --> G[Zero worsened comparisons]

Metric	Improvement	Significance
Evidence sufficiency	+0.83 on 5-point scale	p < 0.0001, Cliff’s δ = 0.66
Reviewer ambiguity	Decreased	p = 0.035
Changed-file lists	Present only with contract	—
Known-limitations sections	Present only with contract	—
Residual-risk disclosure	Present only with contract	—

Critically, all 64 runs passed hidden acceptance checks with zero scope violations ¹. The contracts did not improve correctness — they improved the human’s ability to trust and verify that correctness.
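
The effect size quoted in the table, Cliff’s δ, is the share of cross-condition pairs in which the contract run scores higher, minus the share in which it scores lower. The sketch below shows that calculation. The function name `cliffs_delta` is hypothetical and the scores are made up for illustration; they are not the paper’s data.

```shell
#!/usr/bin/env bash
# Cliff's delta for two samples of ordinal scores.
# delta = (#pairs with x > y - #pairs with x < y) / (n * m)
# The scores used below are illustrative only, not the paper's data.

cliffs_delta() {
  # $1: space-separated treatment scores, $2: space-separated baseline scores
  local -a xs=($1) ys=($2)
  local gt=0 lt=0 x y
  for x in "${xs[@]}"; do
    for y in "${ys[@]}"; do
      if (( x > y )); then gt=$((gt + 1)); fi
      if (( x < y )); then lt=$((lt + 1)); fi
    done
  done
  # awk handles the floating-point division that bash lacks
  awk -v d="$((gt - lt))" -v t="$(( ${#xs[@]} * ${#ys[@]} ))" \
    'BEGIN { printf "%.2f\n", d / t }'
}

# 12 of 16 pairs favour the first sample, none favour the second: prints 0.75
cliffs_delta "4 5 4 5" "3 4 3 4"
```

A value of 1.00 would mean every contract run outscored every baseline run; 0.00 means the two conditions are indistinguishable on this measure.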

The Cost

Contracts cost +13% agent tokens and +38% wall-clock time, with larger effects for the weaker model tier ¹. This is the central trade-off: reviewability is not free, but unverifiable output that blocks merge queues usually costs more in reviewer time than the contract costs in tokens.

Mapping Delegation Contracts to Codex CLI

The delegation contract framework decomposes into four components: Task (what to do), Authority (what the agent may touch), Work Package (what comes back), and Acceptance Context (how the reviewer decides) ¹. Codex CLI’s architecture maps to each:

graph TD
    subgraph "Delegation Contract"
        T[Task]
        AU[Authority]
        WP[Work Package]
        AC[Acceptance Context]
    end
    subgraph "Codex CLI"
        T --> G["/goal + AGENTS.md"]
        AU --> P["Permission Profiles + Sandbox"]
        WP --> S["--output-schema + PostToolUse hooks"]
        AC --> GU["Guardian auto-review subagent"]
    end
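
To make the four components concrete, here is a minimal sketch that prints a contract skeleton. Only the four section names and the work-package items come from the paper as summarised above; the Markdown layout, the function name `render_contract`, and the example values are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Emit a delegation-contract skeleton with the paper's four components.
# The Markdown layout and the example values are hypothetical.

render_contract() {
  # $1: task summary, $2: writable path, $3: acceptance command
  cat <<EOF
# Delegation Contract

## Task
$1

## Authority
- May write: $2
- Must not touch: anything outside the path above

## Work Package
- Changed-file list
- Known limitations
- Residual risks
- Reviewer checklist

## Acceptance Context
- Reviewer runs: $3
EOF
}

render_contract "Fix pagination off-by-one" "src/api/" "npm test"
```

A file like this could be pasted into a prompt or kept next to AGENTS.md so the reviewer and the agent read the same terms.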

Task: Goals and AGENTS.md

The /goal command encodes persistent objectives with structured completion criteria ². Combined with AGENTS.md constraint encoding, this serves as the task portion of the delegation contract — explicit scope that survives compaction and session interruptions.

# .codex/config.toml — Goal mode with structured completion
[goal]
require_verification = true
verification_strategy = "test_and_review"

Authority: Permission Profiles

Codex CLI’s permission profiles implement authority boundaries at the OS level ³. The :workspace profile restricts writes to active workspace roots; custom profiles can further narrow filesystem and network access:

[permissions.review-safe]
extends = ":workspace"
filesystem.deny_write = ["*.lock", "package.json", ".env*"]
network.allow = ["registry.npmjs.org"]

This is the authority boundary the paper calls for — explicit, machine-enforced, and visible to the reviewer.
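
A reviewer can also check scope after the fact. The sketch below tests paths against the same deny globs as the profile above. It copies the patterns by hand and does not read config.toml, and it matches on the file's basename, which is an assumption about how the globs are meant to apply. `is_denied` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Check changed paths against the deny_write globs from the profile above.
# Patterns are mirrored by hand for illustration; config.toml is not parsed.

DENY_PATTERNS=("*.lock" "package.json" ".env*")

is_denied() {
  # Returns 0 if the basename of $1 matches any deny pattern.
  local name="${1##*/}" pattern
  for pattern in "${DENY_PATTERNS[@]}"; do
    # Unquoted $pattern on the right of == is treated as a glob by [[ ]].
    if [[ "$name" == $pattern ]]; then
      return 0
    fi
  done
  return 1
}

for path in src/api/users.ts yarn.lock config/.env.local; do
  if is_denied "$path"; then
    echo "DENIED  $path"
  else
    echo "ok      $path"
  fi
done
```

Running the same check over a changed-file list gives the reviewer a quick second opinion on scope adherence, independent of the sandbox that enforced it.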

Work Package: Structured Output and Evidence

The codex exec --output-schema flag forces structured JSON output that can include changed-file manifests, rationale, and known limitations ⁴:

codex exec \
  --output-schema '{"changed_files": ["string"], "rationale": "string", "known_limitations": ["string"], "residual_risks": ["string"]}' \
  --prompt-file .github/codex/prompts/feature-task.md

PostToolUse hooks can enforce that every file-write operation is accompanied by structured metadata:

[[hooks]]
event = "PostToolUse"
match_tool = "file_write"
command = "scripts/evidence-collector.sh"

The hook script accumulates a CHANGES.md evidence bundle as the agent works, producing exactly the artefact the paper found most improves reviewability.
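
As a rough sketch of what that bundle might look like, the function below turns a list of touched paths into a CHANGES.md-style document. The section names follow the evidence items the paper lists; the exact layout and the name `render_changes_md` are assumptions, not a format prescribed by the paper or by Codex CLI.

```shell
#!/usr/bin/env bash
# Render a CHANGES.md-style evidence bundle from a list of touched paths.
# Section names follow the paper's evidence items; the layout is hypothetical.

render_changes_md() {
  # Reads one path per line on stdin; duplicates are collapsed by sort -u.
  echo "# Evidence Bundle"
  echo
  echo "## Changed files"
  sort -u | sed 's/^/- /'
  echo
  echo "## Known limitations"
  echo "- (to be filled in by the agent)"
  echo
  echo "## Residual risks"
  echo "- (to be filled in by the agent)"
}

printf '%s\n' src/b.ts src/a.ts src/b.ts | render_changes_md
```

Deduplicating matters because a hook fires on every write, so a file edited three times would otherwise appear three times in the manifest.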

Acceptance Context: Guardian Auto-Review

Codex CLI’s Guardian subagent — an architecturally distinct reviewer running the purpose-built codex-auto-review model ⁵ — evaluates each boundary-crossing request against a security policy. This maps to the acceptance context: an independent assessor that produces structured verdicts with risk level, authorisation decision, and human-readable rationale.

# Enable Guardian with escalation for high-risk operations
[approvals]
approvals_reviewer = "auto_review"
escalate_on = ["network_write", "file_delete", "exec_unknown"]

OpenAI reports that the Guardian catches 96.1% of malicious behaviour while reducing human interruptions by approximately 200× ⁵. Those figures concern security approvals rather than reviewability, but they echo the paper’s finding that structured evidence reduces reviewer ambiguity.

Practical Configuration: The Delegation Contract Stack

Combining the research findings with Codex CLI’s feature set, the following starting configuration covers all four contract components:

# .codex/config.toml — Delegation Contract Configuration

# Top-level keys come first: in TOML, a bare key placed after a [table]
# header belongs to that table.

# 2. Authority: select the bounded profile defined below
default_permissions = "review-safe"

# 3. Work Package: Token budget controls diff size
rollout_token_budget = 50000

# 1. Task: Structured goals with verification
[goal]
require_verification = true

# 2. Authority: Bounded permissions
[permissions.review-safe]
extends = ":workspace"
filesystem.deny_write = ["*.lock", ".env*", "infrastructure/"]

# 4. Acceptance: Guardian auto-review
[approvals]
approvals_reviewer = "auto_review"

# PostToolUse hook to build evidence bundle
[[hooks]]
event = "PostToolUse"
match_tool = "file_write"
command = "scripts/build-evidence-bundle.sh"

# Stop hook to enforce evidence completeness
[[hooks]]
event = "Stop"
command = "scripts/verify-evidence-bundle.sh"

The Evidence Bundle Script

#!/usr/bin/env bash
# scripts/build-evidence-bundle.sh
# Accumulates structured evidence as the agent works

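# The hook payload arrives as JSON on stdin
INPUT=$(cat)
FILE=$(jq -r '.tool_call.arguments.path // empty' <<< "$INPUT")

if [[ -n "$FILE" ]]; then
  # Create the evidence directory on first use so the append cannot fail
  mkdir -p .codex/evidence
  printf '%s\n' "$FILE" >> .codex/evidence/changed-files.txt
fi

# Always let the agent continue
echo '{"continue": true}'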

The Verification Gate

#!/usr/bin/env bash
# scripts/verify-evidence-bundle.sh
# Blocks completion if evidence bundle is incomplete

EVIDENCE=.codex/evidence/changed-files.txt

# -s is true only when the file exists and is non-empty,
# so a missing file is handled without a redirect error
if [[ -s "$EVIDENCE" ]]; then
  echo '{"continue": true}'
else
  echo '{"continue": false, "stopReason": "No evidence bundle generated"}'
fi
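
The gate above only checks that some file was recorded. A stricter variant could also require the evidence sections the paper found valuable. The sketch below lists which required headings are absent from a bundle file. The heading names and the helper `missing_sections` are hypothetical and would need to match whatever layout your bundle actually uses.

```shell
#!/usr/bin/env bash
# List evidence sections missing from a bundle file.
# Heading names are hypothetical; adapt them to your own bundle layout.

REQUIRED_SECTIONS=("## Changed files" "## Known limitations" "## Residual risks")

missing_sections() {
  # $1: path to the bundle; prints one missing heading per line.
  local bundle="$1" heading
  for heading in "${REQUIRED_SECTIONS[@]}"; do
    # -x matches whole lines, -F treats the heading as a literal string
    if [[ ! -f "$bundle" ]] || ! grep -qxF -e "$heading" "$bundle"; then
      echo "$heading"
    fi
  done
}

# Demo: a bundle that only has the changed-file list
demo="$(mktemp)"
printf '%s\n' "## Changed files" "- src/a.ts" > "$demo"
missing_sections "$demo"
rm -f "$demo"
```

A Stop hook could emit the same `{"continue": false, ...}` response as the gate above whenever this function prints anything, naming the missing sections in the stop reason.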

The +13% Token Cost Is Worth It

The paper finds that contracts cost +13% tokens but produce measurably more reviewable output ¹. The rollout_token_budget setting in Codex CLI ⁶ provides the lever: teams can reserve part of their token budget for evidence generation, on the expectation that downstream review-time savings exceed the upstream generation cost.

For a typical feature task consuming 40,000 tokens without contracts, the +13% overhead adds approximately 5,200 tokens, roughly £0.03 at current GPT-5.1-Codex-Mini pricing ⁷. Set against that, a 20-minute review that proper evidence shortens to 5 minutes (an illustrative figure, not one the paper measured) saves far more in reviewer time.
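
The overhead arithmetic is simple enough to script when sizing budgets. This is a minimal sketch using the paper’s two reported percentages; the function name `contract_overhead` and the 600-second example run are assumptions.

```shell
#!/usr/bin/env bash
# Contract overhead using integer arithmetic.
# 13 and 38 are the paper's reported token and wall-clock overheads in percent.

contract_overhead() {
  # $1: baseline amount (tokens or seconds), $2: overhead percent (default 13)
  local baseline="$1" percent="${2:-13}"
  echo $(( baseline * percent / 100 ))
}

contract_overhead 40000      # extra tokens on a 40,000-token task: 5200
contract_overhead 600 38     # extra seconds on a 10-minute run: 228
```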

When Contracts Matter Most

The paper studied small, bounded tasks where all runs achieved correctness regardless of condition ¹. In production, tasks are larger, more ambiguous, and more likely to fail. The reviewability benefits are likely to matter more as:

Diff size grows: larger changes are much harder to review without a changed-file manifest
Multiple agents collaborate: subagent delegation creates nested work packages that need their own evidence chains
Compliance requires audit trails: regulated industries need provable scope adherence, not just passing tests
Model tiers vary: the paper found weaker models benefit more from contracts, and mixed-model workflows (using gpt-5.1-codex-mini for routine tasks) are common ⁸

Limitations and Open Questions

The study has acknowledged constraints that practitioners should note:

Scale: Ten tasks across five families in a single TypeScript environment is a pilot, not a population study ¹
Correctness ceiling: All runs passed — the interaction between contracts and failure-prone tasks remains unstudied
Model-based reviewers: The 192 reviews used model-based assessors, not human developers; calibration with human judgement is assumed but unvalidated ⚠️
Evidence overhead at scale: Whether the +38% wall-clock cost holds for 500-line diffs versus the study’s smaller tasks is unknown ⚠️

Conclusion

The delegation contract study offers early empirical evidence that how you ask an agent to work changes how reviewable the output becomes, independent of correctness. Codex CLI’s architecture already covers each contract component: /goal + AGENTS.md (task), permission profiles (authority), --output-schema + PostToolUse hooks (work package), and Guardian auto-review (acceptance context).

The practical implication is clear: configure the evidence-generation stack, accept the +13% token overhead, and transform the review bottleneck from “can I understand this diff?” to “does the evidence bundle match the contract?”

Citations

1. Müller, S., Hess, T. & Koziolek, H. (2026). Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work. arXiv:2606.17099. https://arxiv.org/abs/2606.17099
2. OpenAI. (2026). Goal Mode — Codex CLI Features. https://developers.openai.com/codex/cli/features
3. OpenAI. (2026). Permissions — Codex. https://developers.openai.com/codex/permissions
4. OpenAI. (2026). Non-interactive mode — Codex. https://developers.openai.com/codex/noninteractive
5. OpenAI. (2026). Auto-review — Codex. https://developers.openai.com/codex/concepts/sandboxing/auto-review
6. OpenAI. (2026). Configuration Reference — Codex. https://developers.openai.com/codex/config-reference
7. OpenAI. (2026). Codex Rate Card. https://help.openai.com/en/articles/20001106-codex-rate-card
8. Murphy-Hill, E., Butler, S. & Savelieva, A. (2026). CLI Coding Agent Adoption at Scale. arXiv:2607.01418. https://arxiv.org/abs/2607.01418