The Productivity-Reliability Paradox: Why 98 Per Cent More Pull Requests Broke Nothing — Except Your Review Pipeline — and How Specification Governance Fixes It with Codex CLI

The Numbers That Should Worry You

Your team adopted Codex CLI six months ago. Individual throughput is up. Pull requests are flowing. Everyone feels faster. And yet, lead time is flat, rollbacks haven’t budged, and reviewers are drowning.

You are not imagining this. In May 2026, Sabry Farrag of the University of East London published a systematic multivocal literature review spanning 67 sources (January 2022 through April 2026) that gave this phenomenon a name: the Productivity-Reliability Paradox (PRP) ¹. The headline statistics paint an uncomfortable picture:

Teams using AI coding tools merge 98% more pull requests ²
PR review time increases by 91% ²
Average PR size grows by 154% ²
Bugs per developer rise by 9% ²
Organisational DORA metrics remain flat ³

Meanwhile, the most methodologically rigorous randomised controlled trial — 16 experienced open-source developers across 246 tasks — found that AI tooling produced a 19% slowdown, despite developers forecasting a 24% speed-up beforehand and estimating a 20% speed-up afterwards ⁴.

The paradox is real, it is structural, and it will not fix itself.
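A back-of-envelope calculation shows why flat delivery metrics are plausible. The sketch below assumes, beyond what the source states, that the 91% increase applies to review time per PR and is independent of the 98% increase in merged PRs. Under those assumptions total review effort scales with the product of the two factors. The function name is illustrative.

```python
def review_load_multiplier(pr_growth: float, review_time_growth: float) -> float:
    """Relative total review effort versus baseline.

    Assumes per-PR review time and PR count grow independently; this is an
    assumption for illustration, not a finding of the cited report.
    """
    return (1 + pr_growth) * (1 + review_time_growth)


# Headline figures quoted above: 98% more PRs, 91% longer reviews.
multiplier = review_load_multiplier(pr_growth=0.98, review_time_growth=0.91)
print(f"Total review effort: about {multiplier:.2f}x baseline")
```

If reviewer headcount stays constant, a load multiplier well above 1 has to show up somewhere: as queueing, or as shallower review.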

Anatomy of the Paradox

Farrag identifies three moderating variables that determine whether AI-assisted coding delivers genuine gains or amplifies organisational drag ¹:

1. Task Abstraction Level

Well-scoped, self-contained tasks (adding a utility function, writing a data-transfer object) consistently show 20–56% productivity gains across multiple controlled studies ⁵. Cross-cutting changes touching multiple modules — the kind that matter in production systems — show far weaker or negative effects.

2. Codebase Maturity

Greenfield projects benefit most. Mature codebases with established conventions, implicit architectural constraints, and accumulated technical debt punish AI-generated code that compiles and looks plausible but subtly violates project norms.

3. Developer Experience

Senior developers suffer the “verification tax”: an average of 4.3 minutes per AI suggestion reviewing code that looks correct but may not be ¹. Junior developers face a different problem — they accept suggestions they cannot fully evaluate, creating latent defects.

graph TD
    A[AI Code Generation] --> B{Task Scope?}
    B -- Self-contained --> C[20-56% Faster]
    B -- Cross-cutting --> D[Marginal / Negative]

    C --> E{Codebase?}
    D --> E
    E -- Greenfield --> F[Gains Realised]
    E -- Mature --> G[Convention Violations]

    F --> H{Review Pipeline}
    G --> H
    H --> I[91% Longer Reviews]
    I --> J[Flat DORA Metrics]

    style J fill:#f66,stroke:#333
    style I fill:#f96,stroke:#333

The Two Amplifying Mechanisms

Even where individual gains are real, two mechanisms erode them at the organisational level ¹:

Code Review Bottleneck. AI-generated code is plausible by construction, not correct by construction ¹. Every additional PR requires human review. When generation accelerates without proportional acceleration of the review pipeline, queues grow. The Faros AI 2026 study — doubling the sample to 22,000 developers across 4,000 teams — showed the paradox worsening: bugs per developer up 54%, production incidents per PR tripled ⁶.

Context Window Constraint. Coding agents operate within finite context windows. As sessions lengthen, earlier specification context gets compacted or evicted, causing the agent to drift from the original intent. The semantic distance between what the developer means and what the program does widens silently.

The Specification Governance Model

Farrag’s core thesis is pointed: specification discipline, not model capability, is the binding constraint on AI-assisted software dependability ¹.

Drawing on Transaction Cost Economics ⁷, the paper frames AI code generation as a principal-agent transaction characterised by:

Property	Effect
Asset specificity	Domain-specific code has limited portability, creating lock-in costs
Behavioural uncertainty	Non-deterministic generators produce unpredictable output
Frequency	Dozens of daily invocations amortise governance investment

This combination makes upfront specification investment economically rational — the same logic that drives formal contracts in high-frequency, high-uncertainty procurement.

Four Governance Levels

The Specification Governance Model (SGM) proposes four levels of increasing rigour ¹:

graph LR
    A[Post-hoc Review] --> B[Natural-Language Spec]
    B --> C[Executable Contract]
    C --> D[Constitutional Governance]

    style A fill:#9f9,stroke:#333
    style B fill:#ff9,stroke:#333
    style C fill:#f96,stroke:#333
    style D fill:#f66,stroke:#333

Task Profile	Recommended Level
Self-contained, greenfield, cosmetic risk	Post-hoc review
Self-contained, mature codebase, functional risk	Natural-language spec + tests
Cross-cutting, any codebase, functional risk	Executable contract
Cross-cutting, any codebase, security/data risk	Constitutional governance
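The table can be read as a small decision function. The sketch below encodes its four rows. The table does not cover every combination (for example, a self-contained change with security risk), so the rule that unlisted combinations resolve to the stricter level is an assumption of this example, not part of the SGM as described. All names are hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    POST_HOC_REVIEW = 1
    NATURAL_LANGUAGE_SPEC = 2
    EXECUTABLE_CONTRACT = 3
    CONSTITUTIONAL = 4


SCOPES = {"self-contained", "cross-cutting"}
CODEBASES = {"greenfield", "mature"}
RISKS = {"cosmetic", "functional", "security"}  # "security" stands for security/data risk


def recommend_level(scope: str, codebase: str, risk: str) -> Level:
    """Map a task profile to a governance level, following the table rows.

    Combinations the table does not list fall through to the stricter
    applicable rule (an assumption of this sketch).
    """
    if scope not in SCOPES or codebase not in CODEBASES or risk not in RISKS:
        raise ValueError(f"unknown task profile: {scope!r}, {codebase!r}, {risk!r}")
    if risk == "security":
        return Level.CONSTITUTIONAL
    if scope == "cross-cutting":
        return Level.EXECUTABLE_CONTRACT
    if codebase == "mature" or risk == "functional":
        return Level.NATURAL_LANGUAGE_SPEC
    return Level.POST_HOC_REVIEW


print(recommend_level("cross-cutting", "mature", "functional").name)
```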

Practical Instantiations: Spec Kit and TDAD

Two tools put the SGM into practice.

GitHub Spec Kit

Released under the MIT licence and, as of May 2026, at v0.11.0 with 93,000+ GitHub stars ⁸, Spec Kit provides a four-phase workflow:

/speckit.constitution — non-negotiable project principles
/speckit.specify — formal specification of what to build
/speckit.plan — technical implementation design
/speckit.tasks — decomposition into verifiable units

Spec Kit works with 30+ coding agents including Codex CLI ⁸. Over 70 community extensions add compliance gates, architecture guards, and visibility hooks.

TDAD (Test-Driven AI Agent Definition)

The TDAD pipeline ⁹ treats agent prompts as compiled artefacts:

Engineers provide behavioural specifications
TestSmith converts specifications to executable tests
PromptSmith iteratively refines prompts until tests pass
MutationSmith generates faulty prompt variants post-compilation to validate test suite quality

Across 24 independent trials, TDAD achieves ⁹:

92% v1 compilation success
97% mean hidden-test pass rate
86–100% mutation scores
97% regression safety under specification evolution
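For readers unfamiliar with mutation scores, the idea is that a good test suite should fail when the thing under test is deliberately broken. The sketch below uses the conventional definition (detected mutants divided by total mutants) on a toy example. It is not TDAD's implementation: TDAD mutates prompts, whereas here each "agent" is modelled as a plain function, and every name is hypothetical.

```python
from typing import Callable, Sequence

Agent = Callable[[int], str]      # toy agent: refund amount -> decision
Test = Callable[[Agent], bool]    # returns True when the agent passes


def mutation_score(tests: Sequence[Test], mutants: Sequence[Agent]) -> float:
    """Fraction of faulty variants that at least one test detects."""
    if not mutants:
        raise ValueError("at least one mutant is required")
    killed = sum(1 for mutant in mutants if any(not test(mutant) for test in tests))
    return killed / len(mutants)


# Behavioural specification (toy): approve refunds up to 100, escalate above.
TESTS: list[Test] = [
    lambda agent: agent(50) == "approve",
    lambda agent: agent(5000) == "escalate",
]

# Deliberately faulty variants of the intended behaviour.
MUTANTS: list[Agent] = [
    lambda amount: "approve",                                    # never escalates
    lambda amount: "escalate",                                   # never approves
    lambda amount: "approve" if amount <= 10000 else "escalate",  # threshold too high
    lambda amount: "approve" if amount <= 1000 else "escalate",   # subtle threshold drift
]

print(f"Mutation score: {mutation_score(TESTS, MUTANTS):.0%}")
```

The last mutant survives because no test probes amounts between 100 and 1000. A score below 100% therefore points at a gap in the tests rather than in the agent.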

The Pilot Evidence

Farrag ran a four-month within-subject pilot with 14 mid-to-senior engineers across three full-stack projects (React/Node, Vue/NestJS, React/Go microservices) ¹:

Metric	Baseline (2 months)	Spec Kit Phase (2 months)
Median lead time	8–12 days	6–9 days
Late-stage hotfixes per sprint	3–5	1–2
Rollbacks per month	2–4	0–1
Code churn (reverted <2 weeks)	12–18%	6–10%
Developer confidence (Likert 1–5)	3.1	3.9
Spec authoring overhead	—	45–90 min per medium feature

The critical finding: specification-driven workflows shift verification effort earlier (spec authorship costs 45–90 minutes per medium feature) rather than eliminating it. The payoff comes in reduced late-stage regressions and rollbacks ¹.
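The pilot table reports ranges, which makes the size of each change hard to compare at a glance. The sketch below takes the midpoint of each range and computes the relative change. Using midpoints is an assumption made here for a rough summary; the source reports only the ranges, and with 14 engineers the figures are indicative rather than statistical.

```python
Range = tuple[float, float]

# (baseline range, Spec Kit phase range), copied from the pilot table.
PILOT: dict[str, tuple[Range, Range]] = {
    "Median lead time (days)": ((8, 12), (6, 9)),
    "Late-stage hotfixes per sprint": ((3, 5), (1, 2)),
    "Rollbacks per month": ((2, 4), (0, 1)),
    "Code churn (%)": ((12, 18), (6, 10)),
}


def midpoint(r: Range) -> float:
    return (r[0] + r[1]) / 2


def relative_change(baseline: Range, treated: Range) -> float:
    """Relative change between range midpoints (negative means a reduction)."""
    return (midpoint(treated) - midpoint(baseline)) / midpoint(baseline)


for metric, (baseline, treated) in PILOT.items():
    print(f"{metric}: {relative_change(baseline, treated):+.1%}")
```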

Mapping the SGM to Codex CLI

Codex CLI’s configuration architecture maps cleanly onto all four SGM governance levels.

Level 1: Post-hoc Review (Low-Risk Tasks)

Default Codex CLI behaviour — the agent generates, the developer reviews. Suitable for isolated utility functions in greenfield code.

Level 2: Natural-Language Specification via AGENTS.md

AGENTS.md files provide hierarchical, repository-scoped natural-language constraints that Codex CLI reads automatically ¹⁰. Research across 138 repositories found that developer-written AGENTS.md files reduce agent-generated bugs by 35–55% ¹¹.

<!-- AGENTS.md -->
# Specification Constraints

## Architecture
- All new endpoints MUST have an OpenAPI schema before implementation
- Database migrations require a rollback script in the same PR

## Testing
- Minimum 80% branch coverage for new modules
- Integration tests required for any cross-service communication

Level 3: Executable Contracts via PostToolUse Hooks

PostToolUse hooks execute after every tool call, enforcing executable contracts in real time ¹²:

# config.toml — executable specification enforcement
[hooks.post_tool_use]
command = "bash -c 'npm run lint && npm run test:affected -- --bail'"
timeout_ms = 60000
on_failure = "block"

Combined with Spec Kit’s /speckit.tasks output as a structured specification, this creates a closed loop: the specification defines what should be true, and the hook verifies it after every agent action.
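As one illustration of an executable contract, the AGENTS.md rule above that database migrations need a rollback script in the same PR can be checked mechanically. The sketch below is a hypothetical helper, not part of Codex CLI or Spec Kit. The `migrations/<name>.up.sql` / `<name>.down.sql` naming convention and the function name are assumptions made for the example.

```python
def missing_rollbacks(changed_files: list[str]) -> list[str]:
    """Return changed migration files that lack a rollback in the same change set.

    Assumes a hypothetical convention: migrations/<name>.up.sql is paired
    with migrations/<name>.down.sql.
    """
    changed = set(changed_files)
    missing = []
    for path in sorted(changed):
        if path.startswith("migrations/") and path.endswith(".up.sql"):
            rollback = path[: -len(".up.sql")] + ".down.sql"
            if rollback not in changed:
                missing.append(path)
    return missing


# Example change set: the second migration has no rollback script.
example = [
    "migrations/001_users.up.sql",
    "migrations/001_users.down.sql",
    "migrations/002_orders.up.sql",
    "src/orders/service.ts",
]
print(missing_rollbacks(example))
```

A hook script could feed this function the output of `git diff --name-only` and exit non-zero when the returned list is not empty, turning a natural-language rule into a blocking check.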

Level 4: Constitutional Governance via requirements.toml

For enterprise environments, requirements.toml provides admin-enforced, non-overridable constraints ¹³:

# requirements.toml — constitutional constraints
[approval_policy]
default = "unless-allow-listed"

[sandbox]
mode = "full"

[hooks.post_tool_use]
command = "bash -c './scripts/security-scan.sh && ./scripts/spec-compliance.sh'"
on_failure = "block"

This maps directly to Spec Kit’s /speckit.constitution — principles that no individual developer or agent can override.

graph TD
    subgraph "Codex CLI Specification Governance Stack"
        A[requirements.toml<br/>Constitutional Governance] --> B[AGENTS.md<br/>Natural-Language Specification]
        B --> C[PostToolUse Hooks<br/>Executable Contracts]
        C --> D[Approval Policy<br/>Post-hoc Review]
    end

    subgraph "SGM Mapping"
        E[Level 4: Constitutional] -.-> A
        F[Level 2: NL Spec] -.-> B
        G[Level 3: Executable] -.-> C
        H[Level 1: Review] -.-> D
    end

Breaking the Review Bottleneck

The review bottleneck amplifies the paradox because human review capacity is roughly fixed whilst AI generation capacity scales almost without limit. Codex CLI offers three mechanisms to reduce the review burden without removing the human reviewer:

1. Hook-Based Pre-Review Filtering

PostToolUse hooks catch specification violations before a PR reaches a human reviewer. If the hook blocks, the agent has to correct the change before it can continue, so the reviewer sees only diffs that already pass the automated checks.

2. Subagent Delegation with Scoped Specifications

Codex CLI’s subagent model supports per-directory AGENTS.md files ¹⁰. Each subagent inherits a narrower specification scope, producing smaller, more reviewable diffs:

<!-- services/payments/AGENTS.md -->
# Payment Service Specification
- All monetary values use decimal types, never floating-point
- PCI DSS: never log card numbers, even in debug mode
- Every state transition requires an idempotent retry test

3. Structured Diff Review via Stop Hooks

A Stop hook can generate a structured summary of all changes made during a session, formatted for rapid reviewer consumption:

[hooks.stop]
command = "bash -c './scripts/generate-review-brief.sh'"
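What such a review brief contains is left open. One minimal sketch is to group the output of `git diff --numstat` (tab-separated added lines, deleted lines, and path, with `-` for binary files) by top-level directory. The function below is a hypothetical helper that a script like the one referenced above could call; it is not something Codex CLI ships.

```python
from collections import defaultdict


def review_brief(numstat: str) -> str:
    """Summarise `git diff --numstat` output per top-level directory."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0, 0])  # files, added, deleted
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        added, deleted, path = parts
        area = path.split("/", 1)[0] if "/" in path else "(root)"
        entry = totals[area]
        entry[0] += 1
        # Binary files report "-" instead of line counts; count them as zero.
        entry[1] += int(added) if added.isdigit() else 0
        entry[2] += int(deleted) if deleted.isdigit() else 0
    return "\n".join(
        f"{area}: {files} file(s), +{added} / -{deleted}"
        for area, (files, added, deleted) in sorted(totals.items())
    )


sample = "10\t2\tservices/payments/api.py\n3\t0\tservices/auth/login.py\n-\t-\tdocs/diagram.png\n1\t1\tREADME.md"
print(review_brief(sample))
```

Grouping by directory is a design choice for this example. Where per-directory AGENTS.md files define the specification scopes, a reviewer can see at a glance which scopes a session touched.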

The Practitioner Decision Framework

Adapted from Farrag’s governance decision guide and mapped to Codex CLI configuration ¹:

Situation	Governance Level	Codex CLI Configuration
Isolated function, greenfield	Post-hoc review	Default behaviour
Module change, mature codebase	NL specification	AGENTS.md per-directory
Cross-module refactor	Executable contract	PostToolUse hooks + Spec Kit tasks
Security-critical, regulated	Constitutional	requirements.toml + approval_policy

One caveat the paper stresses: specification authorship cannot replace code literacy. Junior developers must develop dual competency in both implementation and specification authorship ¹. Using Codex CLI’s suggest mode — where the agent proposes but the developer decides — provides deliberate practice that full-auto mode does not.

Conclusion

The Productivity-Reliability Paradox is not a bug in AI coding tools. It is an emergent property of accelerating generation without proportionally governing the output. The fix is not better models — it is better specifications.

Codex CLI already ships the machinery to implement specification governance at every level: AGENTS.md for natural-language constraints, PostToolUse hooks for executable contracts, requirements.toml for constitutional governance, and subagent delegation for scope isolation. The missing piece, for most teams, is the discipline to write the specification before prompting the agent.

Forty-five to ninety minutes of specification authorship per medium feature is the tax. Roughly halving your rollbacks and late-stage hotfixes is the return.

Citations

1. Farrag, S.E. (2026). “The Productivity-Reliability Paradox: Specification-Driven Governance for AI-Augmented Software Development.” arXiv:2605.01160. https://arxiv.org/abs/2605.01160
2. Faros AI (2025). “AI Engineering Impact Report.” Based on telemetry from 10,000+ developers across 1,255 teams. https://www.faros.ai/research
3. DORA (2024). “Accelerate State of DevOps Report.” Survey of 3,000 organisations showing 7.2% delivery stability decrease with high AI adoption. https://dora.dev/research/
4. Becker, J. et al. (2025). “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity.” METR RCT, 16 developers, 246 tasks. https://arxiv.org/abs/2507.09089
5. Peng, S. et al. (2023). “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot.” RCT with 95 developers showing 55–56% faster task completion. https://arxiv.org/abs/2302.06590
6. Faros AI (2026). “Acceleration Whiplash: AI Coding Throughput vs. Quality.” 22,000 developers, 4,000 teams. Bugs per developer up 54%, production incidents per PR tripled. https://www.faros.ai/research
7. Williamson, O.E. (1985). The Economic Institutions of Capitalism. Free Press.
8. GitHub (2026). “Spec Kit: Toolkit for Spec-Driven Development.” v0.11.0, MIT licence, 93,000+ stars. https://github.com/github/spec-kit
9. TDAD (2026). “Test-Driven AI Agent Definition: Compiling Tool-Using Agents from Behavioral Specifications.” arXiv:2603.08806. https://arxiv.org/abs/2603.08806
10. OpenAI (2026). “Custom instructions with AGENTS.md.” Codex CLI documentation. https://developers.openai.com/codex/guides/agents-md
11. “On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents.” arXiv:2601.20404. https://arxiv.org/abs/2601.20404
12. OpenAI (2026). “Codex CLI Features — Hooks.” https://developers.openai.com/codex/cli/features
13. OpenAI (2026). “Codex CLI Configuration Reference.” https://developers.openai.com/codex/config-reference