What 33,000 Agentic Pull Requests Reveal: Empirical Lessons for Codex CLI Practitioners

AI coding agents are no longer experimental curiosities: they now submit pull requests to real repositories at a scale large enough to study empirically. But how do those contributions compare to human ones? And, more importantly, what should Codex CLI users do differently based on the evidence?

A cluster of five papers accepted at MSR 2026 (the IEEE/ACM International Conference on Mining Software Repositories) analysed the AIDev dataset, which covers over 933,000 agentic PRs across 61,000 repositories¹. Some of the studies work from a 33,596-PR sample, hence the figure in the title. Their findings challenge several assumptions about agentic workflows. This article synthesises the key results and maps them to actionable Codex CLI configuration and workflow patterns.

The Data: Five Studies, One Dataset

The MSR 2026 Mining Challenge produced a rare convergence: multiple independent research teams analysing the same corpus of agent-generated pull requests. The studies examined different dimensions of the same phenomenon:

Study	Focus	Sample	Key Metric
Ogenrwot & Businge²	Code modification patterns	24,014 agentic vs 5,081 human PRs	Commit structure, files touched
Ehsani et al.³	Failure taxonomy	33,596 agentic PRs (9,582 failed)	28.5% failure rate
Pinna et al.⁴	Task-stratified acceptance	7,156 PRs across 5 agents	82.1% docs vs 66.1% features
Pham & Ghaleb⁵	Description quality & code churn	33,596 agentic vs 6,618 human PRs	Symbol survival: 3 vs 34 days
Watanabe et al.⁶	Reviewer interaction dynamics	AIDev dataset, 5 agents	PR style ↔ review engagement

graph TD
    A[AIDev Dataset<br/>933K+ agentic PRs] --> B[Ogenrwot & Businge<br/>Code Modification Patterns]
    A --> C[Ehsani et al.<br/>Failure Taxonomy]
    A --> D[Pinna et al.<br/>Task-Stratified Acceptance]
    A --> E[Pham & Ghaleb<br/>Description Quality]
    A --> F[Watanabe et al.<br/>Reviewer Dynamics]
    B --> G[Practical Implications<br/>for Codex CLI Users]
    C --> G
    D --> G
    E --> G
    F --> G

Finding 1: More Than One in Four Agentic PRs Fails

Ehsani et al. found that 28.52% of agentic PRs fail to merge³. But the reasons are not what most developers expect. The failure taxonomy breaks down as follows:

Category	Share	Primary Cause
Reviewer abandonment	38%	No meaningful human interaction
PR-level issues	31%	Duplicates (23%), unwanted features (4%)
Code-level failures	22%	CI/test failures (17%), incorrect implementations (3%)
Agentic misalignment	2%	Agent ignores reviewer feedback
Licensing issues	<1%	CLA violations

The dominant failure mode is not bad code — it is social and workflow misalignment. Agents submit PRs that nobody asked for, duplicate existing work, or receive no reviewer attention at all³.

What This Means for Codex CLI

The implication: the agent’s code quality matters less than its integration into your team’s workflow. The studies did not test configurations directly, but the failure taxonomy suggests that a setup which constrains scope and keeps a human in the loop is a safer bet than one that maximises autonomy.

# config.toml: constrain scope to reduce abandonment and duplication
approval_policy = "on-request"     # the agent asks before stepping outside the sandbox
sandbox_mode = "workspace-write"   # limit filesystem writes to the project directory

[agents]
max_subagents = 2                  # cap parallel work (verify this key against your Codex version's config reference)

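For exploratory tasks where the agent might otherwise produce unwanted features, work plan-first: start a read-only, approval-gated session (for example codex --sandbox read-only --ask-for-approval on-request) and have the agent propose a plan before it edits anything. Reserve --full-auto for well-scoped, repeatable work like documentation updates and CI fixes.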

Finding 2: Task Type Predicts Success Better Than Agent Choice

Pinna et al. found that task categorisation is the dominant factor in PR acceptance, with a 16-percentage-point gap between the best and worst task types — larger than the typical inter-agent variance⁴.

Task Type	Acceptance Rate
Documentation	82.1%
CI/Build	~78%
Testing	~75%
Refactoring	~72%
Bug fixes	~68%
New features	66.1%

Agent-by-agent merge rates from Ehsani et al. reinforce this: OpenAI Codex achieved 82.59% overall, Claude Code 59.04%, and Copilot 43.04%³. But Pinna et al. showed that the differences narrow substantially once task type is controlled for⁴. Codex’s advantage partly reflects the types of tasks its users assign to it.

What This Means for Codex CLI

Match the agent to the task, not the other way round. Use Codex CLI’s profiles to create task-specific configurations:

# Profile for documentation tasks: high confidence, relaxed approval
[profiles.docs]
model = "gpt-5.3-codex"          # smaller model suffices for docs
approval_policy = "on-failure"   # only ask when a sandboxed command fails

# Profile for bug fixes: tighter constraints, plan-first
[profiles.bugfix]
model = "gpt-5.4"
approval_policy = "on-request"   # the agent asks before stepping outside the sandbox

# Launch with task-appropriate profile
codex --profile docs "Update the API reference for the new /users endpoint"
codex --profile bugfix "Fix the race condition in OrderProcessor.submit()"
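
The task-to-profile matching above can be automated with a thin wrapper. The following is a minimal sketch, assuming the two profiles named `docs` and `bugfix` from the configuration above; the category names and the helper functions `pick_profile` and `build_command` are hypothetical and not part of Codex CLI.

```python
# Hypothetical mapping from task category to the profiles defined above.
# Categories with the highest acceptance rates get the lightweight profile;
# everything else falls back to the stricter one.
LIGHTWEIGHT_TASKS = {"documentation", "ci", "build"}


def pick_profile(task_type: str) -> str:
    """Return the profile name for a task category (case-insensitive)."""
    return "docs" if task_type.strip().lower() in LIGHTWEIGHT_TASKS else "bugfix"


def build_command(task_type: str, prompt: str) -> list[str]:
    """Build the argv for launching Codex with a task-appropriate profile."""
    return ["codex", "--profile", pick_profile(task_type), prompt]


if __name__ == "__main__":
    print(build_command("documentation", "Update the API reference"))
    print(build_command("feature", "Add CSV export to the reports page"))
```

Defaulting unknown categories to the stricter profile is a deliberate choice here: a mislabelled task then gets more oversight rather than less.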

For documentation-class tasks, where agents excel, consider running them on a schedule, either through thread automations or by calling codex exec from a scheduled job:

# Weekly documentation freshness check
codex exec --profile docs \
  "Review all public API endpoints and update any stale examples in docs/"

Finding 3: Smaller PRs Win, and Each CI Failure Costs 15%

Two findings from the failure analysis have direct operational implications:

- Rejected PRs involve 17% more lines of code and touch 10% more files than successful ones³
- Each failed CI check reduces merge odds by approximately 15%, with a Cliff’s delta effect size of 0.24³

The message is unambiguous: agents should produce smaller, focused changesets and validate against CI before submission.
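
To get a feel for what a 15% reduction per failed check means, the sketch below converts a baseline merge probability to odds, applies the reduction once per failed check, and converts back. Two assumptions are made that the source does not state: that "odds" is meant in the strict statistical sense, and that the effect compounds multiplicatively across checks. The 0.715 baseline is simply 100% minus the 28.5% failure rate quoted earlier.

```python
def merge_probability_after_failures(
    baseline: float, failed_checks: int, odds_reduction: float = 0.15
) -> float:
    """Estimate merge probability after a number of failed CI checks.

    Assumes each failed check multiplies the merge odds by (1 - odds_reduction).
    """
    if not 0.0 < baseline < 1.0:
        raise ValueError("baseline must be strictly between 0 and 1")
    if failed_checks < 0:
        raise ValueError("failed_checks must be non-negative")
    odds = baseline / (1.0 - baseline)            # probability -> odds
    odds *= (1.0 - odds_reduction) ** failed_checks
    return odds / (1.0 + odds)                    # odds -> probability


if __name__ == "__main__":
    for k in range(4):
        p = merge_probability_after_failures(0.715, k)
        print(f"{k} failed checks: merge probability ~{p:.3f}")
```

Under these assumptions the drop in probability is smaller than 15 points per failure, because the reduction applies to odds rather than to the probability itself.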

What This Means for Codex CLI

Structure your AGENTS.md to enforce small, atomic changes:

<!-- AGENTS.md -->
## Pull Request Guidelines

- Each PR should address exactly ONE concern (single bug fix, single feature, single refactor)
- Maximum 300 lines changed per PR — if a task requires more, break it into sequential PRs
- Always run the full test suite before committing: `npm test`
- Always run the linter before committing: `npm run lint`
- If CI fails, fix the failure before opening the PR — never submit with known failures
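
The 300-line limit in the guidelines above can also be checked mechanically rather than left to the agent's judgement. This is a minimal sketch that sums added and deleted lines from the output of `git diff --numstat`; the function names and the sample data are illustrative, and the limit of 300 comes from the guideline above.

```python
def changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output.

    Each line has the form "<added>\t<deleted>\t<path>". Binary files are
    reported as "-\t-\t<path>" and are skipped.
    """
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # blank or malformed line
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary file: no line counts available
        total += int(added) + int(deleted)
    return total


def within_limit(numstat: str, limit: int = 300) -> bool:
    """Return True if the change fits within the per-PR line budget."""
    return changed_lines(numstat) <= limit


# Illustrative sample of numstat output (not real data).
SAMPLE = "10\t2\tsrc/auth.ts\n-\t-\tassets/logo.png\n250\t40\tsrc/orders.ts\n"

if __name__ == "__main__":
    print(changed_lines(SAMPLE), within_limit(SAMPLE))
```

In practice the input would be the output of `git diff --numstat <base>...HEAD` for whichever base branch the project uses.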

For automated workflows via codex exec or codex-action, add explicit CI validation:

# Run tests as part of the agent workflow, not after
codex exec "Fix the failing test in auth.spec.ts. \
  After making changes, run 'npm test' to verify all tests pass. \
  Only commit if tests succeed."

Finding 4: Agent Code Is Removed Faster Than Human Code

Pham & Ghaleb found that symbols introduced by agents have a median survival time of 3 days, compared to 34 days for human-introduced symbols⁵. The churn rate for agent code is 7.33% versus 4.10% for human code⁵.

This does not necessarily mean agent code is worse — it may reflect that agents are assigned more experimental or iterative tasks. But it does mean that agent-generated code receives less durable trust from maintainers.

What This Means for Codex CLI

If agent code is revised quickly, the cost of that revision should be minimised. Two strategies help:

1. Cross-model review before merge. Use the /review command or a cross-provider review loop to catch issues before they reach human reviewers:

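# Local review before pushing: open an interactive session in the repository
codex
# then, at the session prompt, run the slash command and choose what to review
/review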

2. Treat agent commits as drafts. Use draft PRs for agent-generated work and promote to ready-for-review only after local validation:

# Create as draft PR via GitHub CLI
gh pr create --draft --title "docs: update API reference" \
  --body "Agent-generated, pending human review"

Finding 5: Agents Write Better Commits but Worse PR Summaries

Pham & Ghaleb found a curious asymmetry: agents produce commit messages with higher semantic similarity to their diffs (0.72 vs 0.68 for humans) but lower-quality PR-level summaries (0.86 vs 0.88)⁵. Agents are precise at the micro level but struggle to synthesise multiple changes into a coherent narrative.

Watanabe et al. confirmed that PR description style is associated with measurable differences in reviewer engagement, response time, and merge outcomes⁶. How an agent presents its work appears to matter alongside the work itself.
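
For intuition about what a message-to-diff similarity score measures, here is a crude lexical stand-in: cosine similarity over bag-of-words token counts. This is not the measure used by Pham & Ghaleb, whose method is not described here, and it will not reproduce the 0.72 or 0.68 figures; it only illustrates the idea of scoring how much a description overlaps with the change it describes.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Lowercase the text and count alphanumeric/underscore tokens."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the token-count vectors of two texts."""
    ta, tb = tokens(a), tokens(b)
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no overlap with anything
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    message = "Fix race condition in order submit"
    diff = "def submit(order): lock.acquire()  # fix race condition"
    print(f"{cosine_similarity(message, diff):.2f}")
```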

What This Means for Codex CLI

Compensate for the PR-summary weakness by encoding your team’s PR template in AGENTS.md:

<!-- AGENTS.md -->
## PR Description Format

When creating a PR description, always include:
1. **Summary**: One paragraph explaining WHY the change was made (not what changed)
2. **Changes**: Bullet list of what was modified and why each change matters
3. **Testing**: How the changes were verified
4. **Risk Assessment**: What could go wrong and how to roll back

Do NOT simply list the files changed or repeat commit messages.
Focus on the reviewer's perspective: what context do they need to approve this?
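
A template is only useful if it is followed, so it can help to verify the PR body before opening the PR. The sketch below checks that each of the four sections listed above appears as a markdown heading or bold label; the function name and the accepted formats are assumptions made for illustration.

```python
import re

# Section names taken from the PR description format above.
REQUIRED_SECTIONS = ("Summary", "Changes", "Testing", "Risk Assessment")


def missing_sections(body: str) -> list[str]:
    """Return the required sections that do not appear in the PR body.

    A section counts as present if a line starts with it as a markdown
    heading ("## Summary") or a bold label ("**Summary**"), optionally
    preceded by a list marker such as "1." or "-".
    """
    missing = []
    for name in REQUIRED_SECTIONS:
        pattern = (
            r"^\s*(?:\d+\.\s*|[-*]\s+)?(?:#+\s*|\*\*)" + re.escape(name) + r"\b"
        )
        if not re.search(pattern, body, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(name)
    return missing


if __name__ == "__main__":
    draft = "## Summary\nWhy.\n\n## Changes\n- x\n\n## Testing\nRan npm test.\n"
    print(missing_sections(draft))
```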

Finding 6: Code Smells Dominate Post-Merge Quality Issues

A complementary study by Jain et al. analysed 1,210 merged agent-generated bug-fix PRs using SonarQube static analysis⁷. The most common post-merge quality issues were:

Issue Type	Most Common Violation	Count
Code smells	Duplicated string literals	212
Code smells	Excessive cognitive complexity	157
Code smells	Unused function parameters	114
Bugs	Incorrect number of function arguments	23
Security	Weak encryption operations	33

Critically, after normalising for PR size (lines of code changed), the differences between agents largely disappeared⁷. Larger PRs produce more issues, regardless of which agent wrote them — further supporting the “smaller is better” principle.
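
The effect of normalising by size is easy to see with a worked example. The numbers below are hypothetical and are not taken from Jain et al.; they only show how a threefold gap in raw issue counts can vanish once issues are expressed per thousand changed lines.

```python
def issues_per_kloc(issues: int, lines_changed: int) -> float:
    """Issues per 1,000 changed lines of code."""
    if lines_changed <= 0:
        raise ValueError("lines_changed must be positive")
    return 1000.0 * issues / lines_changed


# Hypothetical agents: (issues found, total lines changed across their PRs).
HYPOTHETICAL = {
    "agent_a": (30, 6000),
    "agent_b": (10, 2000),
}

if __name__ == "__main__":
    for name, (issues, loc) in HYPOTHETICAL.items():
        rate = issues_per_kloc(issues, loc)
        print(f"{name}: {issues} issues raw, {rate:.1f} per KLOC")
```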

What This Means for Codex CLI

Add static analysis to your agent workflow. Codex can run linters as part of its execution loop:

<!-- AGENTS.md -->
## Code Quality Gates

Before committing any changes:
1. Run `npx eslint --max-warnings 0 .` (or your project's linter)
2. Run `npx tsc --noEmit` for TypeScript projects
3. Fix any issues found — do not commit with warnings
4. For Python projects: `ruff check . && mypy .`
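
If you prefer a single entry point for the agent to call, the gates above can be wrapped in a small runner that stops at the first failure. This is a sketch, not a Codex feature: `run_gates` and `GATES` are hypothetical names, and the commands are the ones listed in the AGENTS.md example for a TypeScript project.

```python
import subprocess
import sys

# Commands from the quality gates above (TypeScript project assumed).
GATES = [
    ["npx", "eslint", "--max-warnings", "0", "."],
    ["npx", "tsc", "--noEmit"],
]


def run_gates(gates: list[list[str]]) -> bool:
    """Run each gate in order; return False at the first failure."""
    for cmd in gates:
        try:
            result = subprocess.run(cmd)
        except FileNotFoundError:
            print(f"Gate command not found: {cmd[0]}", file=sys.stderr)
            return False
        if result.returncode != 0:
            print(f"Gate failed: {' '.join(cmd)}", file=sys.stderr)
            return False
    return True
```

A wrapper script could call `run_gates(GATES)` and exit non-zero on failure, which gives the agent one command to run and one signal to act on.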

For enterprise teams, combine with the Guardian auto-review model (codex-auto-review) to catch quality issues at the model level before they reach static analysis⁸.

The Synthesis: A Research-Backed Codex CLI Workflow

flowchart TD
    A[Identify Task] --> B{Task Type?}
    B -->|Docs / CI / Build| C[Use lightweight profile<br/>gpt-5.3-codex, relaxed approval]
    B -->|Bug fix / Feature| D[Use plan mode<br/>gpt-5.4, on-request]
    C --> E[Keep PR small<br/>≤300 lines, single concern]
    D --> E
    E --> F[Run tests & linter<br/>inside agent session]
    F --> G{CI passes?}
    G -->|No| H[Fix in same session<br/>before committing]
    G -->|Yes| I[Run /review locally]
    H --> F
    I --> J[Write PR summary<br/>using template from AGENTS.md]
    J --> K[Open as draft PR]
    K --> L[Human review & promote]

The research converges on six principles:

1. Constrain scope: rejected PRs are larger and touch more files than merged ones
2. Match task to profile: documentation and CI tasks need less oversight than features and bug fixes
3. Validate before submission: each failed CI check reduces merge odds by roughly 15%
4. Invest in PR presentation: description style is associated with reviewer engagement and merge outcomes
5. Treat agent output as draft: the 3-day median symbol survival suggests maintainers will revise anyway
6. Run static analysis early: quality issues scale with PR size, not with agent choice

Limitations and Caveats

These studies carry important caveats. The AIDev dataset captures open-source GitHub contributions, which may not reflect private enterprise repositories where contribution norms differ. The agent identification methodology relies on bot signatures and may misattribute some PRs³. The studies also predate the April 2026 Codex update that introduced built-in memory, thread automations, and the codex-auto-review model⁸ — features that may alter the dynamics observed.

⚠️ The merge rates reported (e.g., Codex at 82.59%) reflect the AIDev dataset’s agent detection heuristics and should not be interpreted as absolute quality rankings. Different detection methods, repository populations, and time windows could produce different figures.

Citations

1. AIDev Dataset, MSR 2026 Mining Challenge. Available at https://www.emergentmind.com/topics/agentic-pull-requests-prs
2. Ogenrwot, D. & Businge, J. (2026). “How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests.” MSR 2026 Mining Challenge Track. arXiv:2601.17581
3. Ehsani, R., Pathak, S., Rawal, S., Al Mujahid, A., Imran, M. M. & Chatterjee, P. (2026). “Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub.” MSR 2026. arXiv:2601.15195
4. Pinna, G., Gong, J., Williams, D. & Sarro, F. (2026). “Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance.” MSR 2026 Mining Challenge Track. arXiv:2602.08915
5. Pham, D. & Ghaleb, T. A. (2026). “Code Change Characteristics and Description Alignment: A Comparative Study of Agentic versus Human Pull Requests.” MSR 2026 Mining Challenge Track. arXiv:2601.17627
6. Watanabe, K., Tsuchida, R., Monno, T., Huang, B., Yamasaki, K., Fan, Y., Shimari, K. & Matsumoto, K. (2026). “How AI Coding Agents Communicate: A Study of Pull Request Description Characteristics and Human Review Responses.” arXiv:2602.17084
7. Jain, S. et al. (2026). “Beyond Bug Fixes: An Empirical Investigation of Post-Merge Code Quality Issues in Agent-Generated Pull Requests.” MSR 2026. arXiv:2601.20109
8. OpenAI (2026). Codex CLI v0.122.0-alpha.5: codex-auto-review model for Guardian. PR #18169, github.com/openai/codex