ABTest and Behaviour-Driven Fuzzing: What 647 Fuzzing Cases Reveal About Coding Agent Robustness — and How to Defend Your Codex CLI Workflows
The Problem: Benchmarks Measure Correctness, Not Robustness
Most coding agent evaluations — SWE-bench Verified, Terminal-Bench, PolyBench — ask a single question: did the agent produce the right output? They say nothing about what happens when workflows hit edge cases, conflicting instructions, interrupted operations, or boundary conditions. Dai et al.’s ABTest framework (arXiv:2604.03362, April 2026) addresses this gap directly by fuzzing three production coding agents — including Codex CLI — with systematically generated behavioural tests derived from real user-reported failures 1.
The results are sobering. Across 647 repository-grounded fuzzing cases, ABTest flagged 1,573 behavioural anomalies, of which 642 were manually confirmed as genuine — a 40.8% detection precision that exposes failure modes invisible to conventional benchmarks 1.
How ABTest Works: A Five-Stage Pipeline
ABTest’s architecture converts historical bug reports into executable behavioural tests through a structured pipeline:
```mermaid
flowchart LR
    A["Mine\n400 confirmed\nfailure reports"] --> B["Compose\n47 Interaction Patterns\n× 128 Action Types"]
    B --> C["Instantiate\n647 repo-grounded\nfuzzing cases"]
    C --> D["Execute\nRun on agents\nrecord traces"]
    D --> E["Detect\nAutomated + manual\nanomaly validation"]
```
Stage 1 — Mine. The researchers collected 400 developer-confirmed bug reports from the GitHub issue trackers of Claude Code, Codex CLI, and Gemini CLI between July 2025 and January 2026. Each report was abstracted into reusable patterns 1.
Stage 2 — Compose. Two taxonomies emerged: 47 Interaction Patterns (workflow skeletons describing multi-step agent behaviours) and 128 Action Types (specific tool-level operations that stress failure boundaries). Templates pair compatible patterns with actions 1.
Stage 3 — Instantiate. An LLM-based task generator binds each template to a real repository, filling in concrete file paths, command targets, and validation criteria 1.
Stage 4 — Execute. Each test case runs against the target agent whilst recording prompts, step traces, file changes, and final artifacts 1.
Stage 5 — Detect. Automated checks flag suspicious behaviours, followed by manual verification against two criteria: instruction-following consistency and action consistency between expected and observed repository effects 1.
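The action-consistency criterion in Stage 5 amounts to comparing the repository effects a test case expects with those actually observed. A minimal sketch of that comparison, assuming effects are represented as sets of changed file paths; `check_action_consistency`, `ConsistencyResult` and the example paths are hypothetical names for illustration, not taken from the paper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsistencyResult:
    missing: frozenset      # expected changes the agent never made
    unexpected: frozenset   # changes the agent made that were not expected

    @property
    def consistent(self) -> bool:
        return not self.missing and not self.unexpected


def check_action_consistency(expected, observed) -> ConsistencyResult:
    """Compare expected and observed sets of changed file paths."""
    expected, observed = frozenset(expected), frozenset(observed)
    return ConsistencyResult(
        missing=expected - observed,
        unexpected=observed - expected,
    )


result = check_action_consistency(
    expected={"reports/summary.json"},
    observed={"reports/summary.json", "src/config.py"},
)
print(result.consistent, sorted(result.unexpected))
```

A non-empty `unexpected` set corresponds roughly to overreach, whilst a non-empty `missing` set corresponds to an agent that stopped short of the required effects.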
The Interaction Pattern Taxonomy
The 47 Interaction Patterns encode recurring workflow structures that historically trigger agent failures. Several are directly relevant to Codex CLI users:
| Pattern | Description | Anomalies Found |
|---|---|---|
| IP-28 | Validate outcome against constraints, then emit result | 23 |
| IP-39 | Run operation → derive structured result → validate → persist | 19 |
| IP-26 | Generate output → persist narrative → persist structured → validate both | 18 |
| IP-44 | Attempt restricted-path transformation → validate permission handling | 17 |
| IP-41 | Start operation → interrupt mid-run → verify partial artifact → re-run | — |
| IP-47 | Set usage cap → run → verify bounded stop → capture evidence | — |
The pattern that should concern Codex CLI users most is IP-44: when an agent encounters a restricted path (outside writable_roots, for instance), does it fail gracefully or escalate with misleading claims? 1
The 128 Action Types: Where Agents Actually Break
Action Types capture the specific operations that expose failure modes:
- “Satisfy conflicting output instructions” — 19 anomalies. When AGENTS.md says one thing and the user prompt says another, agents often silently pick one without acknowledging the conflict 1.
- “Run with verbose logging while protecting sensitive values” — 17 anomalies. Agents leak secrets into trace output or suppress logging entirely 1.
- “Write result without overwriting existing artifact” — 14 anomalies. File-write operations that should append or create new files instead clobber existing work 1.
- “Run with configuration plus conflicting environment overrides” — 13 anomalies. Config.toml settings vs. environment variables produce unpredictable precedence 1.
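One way to take the guesswork out of the configuration-versus-environment case is to resolve precedence in your own tooling and report every conflict instead of merging silently. A minimal sketch, assuming a policy in which environment overrides win; the `resolve_settings` helper, the policy and the setting names are illustrative assumptions, not Codex CLI's documented behaviour:

```python
def resolve_settings(config: dict, env_overrides: dict):
    """Merge settings with a fixed precedence and report every conflict.

    Assumed policy for this sketch: environment overrides win over file config.
    """
    merged = dict(config)
    conflicts = []
    for key, env_value in env_overrides.items():
        if key in config and config[key] != env_value:
            # Record the disagreement instead of resolving it silently.
            conflicts.append((key, config[key], env_value))
        merged[key] = env_value
    return merged, conflicts


merged, conflicts = resolve_settings(
    {"model": "model-a", "log_level": "info"},
    {"log_level": "debug"},
)
for key, file_value, env_value in conflicts:
    print(f"conflict on {key!r}: config={file_value!r}, env={env_value!r} (env wins)")
```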
Per-Agent Results: Model Choice Dramatically Affects Robustness
The per-agent breakdown reveals that robustness varies wildly not just between agents but between model configurations within the same agent:
| Agent | Model | Anomalies Flagged | Verified | Precision |
|---|---|---|---|---|
| Codex CLI | GPT-5.1-Codex-Mini | 277 | 166 | 59.9% |
| Codex CLI | GPT-4o-mini | 334 | 95 | 28.4% |
| Claude Code | Claude 4.5 Haiku | 259 | 119 | 45.9% |
| Claude Code | Claude 3.5 Haiku | 376 | 87 | 23.1% |
| Gemini CLI | Gemini 2.5 Flash-Lite | 327 | 175 | 53.5% |
Two findings stand out for Codex CLI users:
- GPT-5.1-Codex-Mini produced about 75% more verified anomalies than GPT-4o-mini (166 vs 95), at more than double the precision (59.9% vs 28.4%). The larger model “keeps pushing to complete under pressure and over-acts,” whilst the smaller model “stops short of required effects and substitutes helper workflows” 1.
- The two Codex CLI configurations share only one critical anomaly, suggesting that model capability doesn’t uniformly improve robustness; it shifts failure modes 1.
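The precision column follows directly from the flagged and verified counts. A quick check using only the numbers from the table above (the `precision` helper and `RESULTS` mapping are illustrative names):

```python
# (agent, model) -> (anomalies flagged, anomalies verified), from the table above
RESULTS = {
    ("Codex CLI", "GPT-5.1-Codex-Mini"): (277, 166),
    ("Codex CLI", "GPT-4o-mini"): (334, 95),
    ("Claude Code", "Claude 4.5 Haiku"): (259, 119),
    ("Claude Code", "Claude 3.5 Haiku"): (376, 87),
    ("Gemini CLI", "Gemini 2.5 Flash-Lite"): (327, 175),
}


def precision(flagged: int, verified: int) -> float:
    """Share of flagged anomalies that survived manual verification."""
    return verified / flagged


for (agent, model), (flagged, verified) in RESULTS.items():
    print(f"{agent} / {model}: {precision(flagged, verified):.1%}")

total_flagged = sum(flagged for flagged, _ in RESULTS.values())
total_verified = sum(verified for _, verified in RESULTS.values())
print(f"overall: {total_verified}/{total_flagged} = "
      f"{precision(total_flagged, total_verified):.1%}")
```

The per-configuration counts sum to the 1,573 flagged and 642 confirmed anomalies reported for the study as a whole.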
Anomaly Severity: Three Tiers
ABTest classifies confirmed anomalies into three severity tiers 1:
```mermaid
pie title Anomaly Distribution (642 confirmed)
    "Critical (boundary overreach)" : 134
    "Expected Outcome (missing artifacts)" : 140
    "Minor (low-severity deviations)" : 368
```
- Critical (134): Boundary overreach, permission violations, or misleading escalation claims (e.g., reporting “files are lost” rather than acknowledging uncertainty) 1.
- Expected Outcome (140): The agent fails to materialise required artifacts or reach the expected completion state 1.
- Minor (368): Low-severity deviations — formatting inconsistencies, unnecessary verbose output, or suboptimal but functional solutions 1.
Gemini CLI showed the highest critical anomaly concentration at 36.0% of its verified set, whilst Codex CLI’s critical anomalies concentrated in the GPT-5.1-Codex-Mini configuration 1.
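For a sense of proportion, the tier counts can be converted to shares of the 642 confirmed anomalies. The percentages below are derived from the counts above rather than quoted from the paper:

```python
TIERS = {"Critical": 134, "Expected Outcome": 140, "Minor": 368}

total = sum(TIERS.values())
shares = {tier: count / total for tier, count in TIERS.items()}

for tier, share in shares.items():
    print(f"{tier}: {TIERS[tier]} of {total} ({share:.1%})")
```

Roughly one confirmed anomaly in five is critical, and a little over half are minor.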
Mapping ABTest Findings to Codex CLI Defence Patterns
ABTest’s findings map directly to Codex CLI’s configuration and hook system. Here’s a defence-in-depth configuration that addresses the most common anomaly categories:
1. Conflict Detection via AGENTS.md
The “conflicting output instructions” action type (19 anomalies) points to a need for explicit precedence rules in your AGENTS.md:
```markdown
# AGENTS.md: Conflict Resolution Rules

## Instruction Precedence
1. AGENTS.md constraints are non-negotiable
2. If user prompt conflicts with AGENTS.md, flag the conflict explicitly
3. Never silently resolve ambiguity; ask for clarification

## File Safety
- Never overwrite existing files without explicit confirmation
- Use timestamped suffixes for new artifacts: `result-YYYYMMDD-HHMMSS.ext`
```
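The file-safety rules can also be enforced by tooling rather than left to the agent's discretion. A minimal sketch of a no-clobber writer that follows the timestamped naming rule; `write_new_artifact` is a hypothetical helper, and the numeric collision suffix is an assumption added for the case where two writes land in the same second:

```python
import tempfile
from datetime import datetime
from pathlib import Path


def write_new_artifact(directory, stem, ext, content, now=None):
    """Write content to a fresh timestamped file, never overwriting one."""
    now = now or datetime.now()
    base = f"{stem}-{now:%Y%m%d-%H%M%S}"
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    counter = 0
    while True:
        suffix = f"-{counter}" if counter else ""
        path = directory / f"{base}{suffix}.{ext}"
        try:
            # Mode "x" creates the file exclusively and raises if it exists.
            with open(path, "x", encoding="utf-8") as handle:
                handle.write(content)
            return path
        except FileExistsError:
            counter += 1


with tempfile.TemporaryDirectory() as tmp:
    fixed = datetime(2026, 1, 2, 3, 4, 5)
    first = write_new_artifact(tmp, "result", "json", "{}", now=fixed)
    second = write_new_artifact(tmp, "result", "json", "{}", now=fixed)
    print(first.name, second.name)
```

Exclusive creation avoids the race between checking that a file exists and writing it, which a separate `exists()` test would leave open.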
2. PostToolUse Hooks for Artifact Validation
The “expected outcome” anomaly tier (140 cases where agents failed to produce required artifacts) can be mitigated with PostToolUse validation hooks 2:
```toml
# config.toml: PostToolUse validation hook
[hooks.post_tool_use]
command = "bash /path/to/validate-artifacts.sh"
```

```bash
#!/bin/bash
# validate-artifacts.sh: check that expected outputs exist.
# Runs after every tool invocation.
# Assumes the hook environment exports CODEX_LAST_COMMAND and CODEX_WORKING_DIR.
LAST_CMD="${CODEX_LAST_COMMAND:-}"
WORKDIR="${CODEX_WORKING_DIR:-$PWD}"

# If the last command was supposed to create files, verify that something changed.
if echo "$LAST_CMD" | grep -qE "(write|create|generate|save)"; then
  # git status --porcelain also lists untracked files, which git diff HEAD misses.
  MODIFIED=$(git -C "$WORKDIR" status --porcelain 2>/dev/null)
  if [ -z "$MODIFIED" ]; then
    echo "WARNING: Command claimed to create files but no changes detected"
    exit 1
  fi
fi
```
3. Sensitive Value Protection
The “verbose logging while protecting sensitive values” action type (17 anomalies) maps to Codex CLI’s trace log controls. The v0.142.5 release (1 July 2026) specifically addressed this by preventing full Responses WebSocket request payloads from being written to trace logs 3:
```toml
# config.toml: restrict trace verbosity
[logging]
trace_level = "summary" # Avoid full payload logging
```
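Restricting verbosity reduces exposure but does not help when a secret appears in a line you still want to keep. A complementary approach is to redact values before log lines are stored or shared. The sketch below is a generic filter, not a Codex CLI feature; the `redact` helper and the patterns are illustrative assumptions and will need extending for the secret formats in your environment:

```python
import re

# Hypothetical patterns; extend them to match the secrets your environment uses.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[=:]\s*)(\S+)"),
    re.compile(r"(?i)\b(bearer)(\s+)(\S+)"),
]


def redact(line: str) -> str:
    """Replace secret values with a placeholder, keeping the key visible."""
    for pattern in SECRET_PATTERNS:
        line = pattern.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", line)
    return line


print(redact("export API_KEY=sk-12345"))
print(redact("Authorization: Bearer abc.def"))
```

Keeping the key name visible preserves the diagnostic value of verbose logs whilst removing the sensitive part.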
4. Profile-Based Model Routing for Robustness
Given ABTest’s finding that model choice shifts failure modes rather than eliminating them, consider routing different workflow types to different models via named profiles 4:
```toml
# config.toml: route by workflow risk level

# Lower-capability model for read-only exploration (less overreach)
[profiles.exploration]
model = "gpt-5.4-mini"

# Reasoning model for complex implementation (better constraint adherence)
[profiles.implementation]
model = "o4-mini"

# Conservative model plus strict approval for validation workflows
[profiles.validation]
model = "gpt-5.4-mini"
approval_policy = "untrusted"
```
Repeatability and Production Implications
ABTest achieved over 90% repeatability across reruns, confirming that the anomalies represent genuine behavioural patterns rather than transient model stochasticity 1. This has a practical implication: if your Codex CLI workflow hits one of these failure modes, it will likely hit it again under similar conditions.
The framework also complements recent work on property-based testing for coding agents. PBT-Bench (arXiv:2605.15229, May 2026) demonstrates that agents struggle with semantic invariant derivation — the kind of deep specification reasoning that ABTest’s IP-28 pattern (“validate outcome against constraints”) also stresses 5. Together, these benchmarks suggest that the robustness gap in coding agents is structural, not merely a matter of scaling model capability.
What This Means for Your Codex CLI Practice
Three actionable takeaways:
- Don’t assume model upgrades improve robustness. ABTest shows that GPT-5.1-Codex-Mini produces more verified anomalies than GPT-4o-mini, not fewer. Test your specific workflows when switching models 1.
- Validation patterns trigger the most failures. Workflows combining generation → persistence → validation → re-validation (IP-39, IP-26, IP-28) are where agents most commonly break. Add explicit PostToolUse hooks at each validation checkpoint 2.
- Conflicting instructions are your biggest risk. When AGENTS.md rules, user prompts, and environment configuration disagree, agents silently resolve ambiguity. Make your instruction hierarchy explicit and add PreToolUse hooks that flag conflicts before execution 2.
The broader lesson from ABTest is that end-result correctness — the metric our benchmarks optimise for — masks a substantial process-level fragility. Codex CLI’s hook system gives you the tools to catch these failures at runtime, but only if you know where to look.
Citations
1. Dai, W., Openja, M., Pham, H.V., Uddin, G., Yang, J., & Wang, S. (2026). “ABTest: Behavior-Driven Testing for AI Coding Agents.” arXiv:2604.03362v2. https://arxiv.org/abs/2604.03362
2. OpenAI. (2026). “Codex CLI Hooks — Events, Policy, and Patterns.” OpenAI Developers Documentation. https://developers.openai.com/codex/cli/hooks
3. OpenAI. (2026). “Codex CLI v0.142.5 Release Notes.” GitHub Releases. https://github.com/openai/codex/releases/tag/v0.142.5
4. OpenAI. (2026). “Codex CLI Configuration Reference — Named Profiles.” OpenAI Developers Documentation. https://developers.openai.com/codex/config-reference
5. Jing, L., Wang, X., Zhang, L., & Du, S.S. (2026). “PBT-Bench: Benchmarking AI Agents on Property-Based Testing.” arXiv:2605.15229v3. https://arxiv.org/abs/2605.15229