AgentAssay and the Regression Testing Gap: Statistical Verification for Non-Deterministic Codex CLI Agent Workflows
You changed a single line in your AGENTS.md. Your codex exec CI pipeline still passes. But your agent now takes 40% more steps, calls tools in the wrong order, and occasionally hallucinates a dependency. Binary pass/fail testing told you nothing was wrong.
This is the regression testing gap for non-deterministic agent workflows — and AgentAssay, a framework backed by peer-reviewed research (arXiv:2603.02601), offers the first principled methodology for closing it 1. This article examines how its statistical testing primitives map onto Codex CLI’s hook pipeline, codex exec automation, and multi-model deployment patterns.
The Problem: Why Binary Verdicts Fail for Agent Workflows
Traditional software testing relies on determinism: the same input produces the same output. Coding agents violate this assumption fundamentally. The same prompt, tools, and model can produce divergent tool-call sequences, different code outputs, and variable step counts across runs 2. A Codex CLI session that calls grep, sed, then git commit on Monday might call rg, awk, then git add && git commit on Tuesday — both correct, both different.
This non-determinism means that:
- Single-run testing is statistically meaningless. One passing run proves nothing about the next.
- Binary verdicts mask behavioural drift. An agent can “pass” whilst degrading across every other quality dimension.
- Prompt and config changes propagate unpredictably. Editing AGENTS.md, switching models, or updating an MCP server can shift agent behaviour in ways that functional tests never surface.
No principled methodology existed for verifying that an agent has not regressed after changes to its prompts, tools, models, or orchestration logic — until AgentAssay introduced stochastic three-valued verdicts and behavioural fingerprinting 1.
AgentAssay’s Core Architecture
AgentAssay replaces binary pass/fail with a statistical testing pipeline designed for non-deterministic workflows. The framework comprises four layers:
flowchart TD
A[Agent Under Test] --> B[Trace Collection]
B --> C[Behavioural Fingerprinting]
B --> D[Five-Dimensional Coverage]
B --> E[Metamorphic Relations]
C --> F[Hotelling T² Regression Detection]
D --> G[Geometric Mean Coverage Score]
E --> H[Invariant Verification]
F --> I{SPRT Verdict Engine}
G --> I
H --> I
I --> J[PASS / FAIL / INCONCLUSIVE]
J --> K[CI/CD Deployment Gate]
Three-Valued Verdicts
Rather than forcing a binary decision, AgentAssay uses Wald’s Sequential Probability Ratio Test (SPRT) to return one of three verdicts: PASS, FAIL, or INCONCLUSIVE 1. The INCONCLUSIVE verdict is critical — it explicitly signals insufficient evidence rather than silently defaulting to pass. In Codex CLI CI pipelines, an INCONCLUSIVE result can trigger additional trial runs or flag the change for human review via a PostToolUse hook.
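To make the three-valued verdict concrete, here is a minimal sketch of Wald's SPRT applied to a stream of pass/fail trial outcomes. It is not AgentAssay's implementation: the function name `sprt_verdict` and the two hypothesised pass rates `p0` (unacceptable) and `p1` (acceptable) are illustrative assumptions, and only the error rates alpha and beta mirror parameters that appear elsewhere in this article.

```python
import math
from typing import Iterable


def sprt_verdict(outcomes: Iterable[bool], p0: float = 0.70, p1: float = 0.90,
                 alpha: float = 0.05, beta: float = 0.10) -> str:
    """Wald's SPRT on Bernoulli trial outcomes.

    H0: true pass rate is p0 (hypothetical "regressed" level).
    H1: true pass rate is p1 (hypothetical "healthy" level).
    Returns PASS, FAIL, or INCONCLUSIVE if the trials run out first.
    """
    upper = math.log((1 - beta) / alpha)   # crossing this accepts H1 -> PASS
    lower = math.log(beta / (1 - alpha))   # crossing this accepts H0 -> FAIL
    llr = 0.0                              # running log-likelihood ratio
    for passed in outcomes:
        if passed:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "PASS"
        if llr <= lower:
            return "FAIL"
    # Neither boundary was crossed: the evidence is insufficient either way.
    return "INCONCLUSIVE"


if __name__ == "__main__":
    print(sprt_verdict([True] * 12))          # long clean streak
    print(sprt_verdict([False] * 3))          # repeated failures
    print(sprt_verdict([True, False, True]))  # too few trials to decide
```

The point of the design is that the test stops as soon as the evidence is strong enough in either direction, and reports INCONCLUSIVE rather than guessing when the trial budget is exhausted first.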
Behavioural Fingerprinting
The breakthrough contribution is behavioural fingerprinting: converting execution traces into compact vectors capturing tool invocation patterns, decision frequency distributions, output characteristics, and cost metrics 1. These vectors are compared via Hotelling’s T² test — a multivariate statistical test that detects distributional shifts across multiple dimensions simultaneously.
The paper reports 86% detection power where binary testing achieves 0% 1. That figure deserves emphasis: for certain classes of regression (behavioural drift without functional failure), binary testing is literally incapable of detection.
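For readers unfamiliar with Hotelling's T², the following is a minimal pure-Python sketch of the standard two-sample statistic applied to fingerprint vectors. It is an illustration under stated assumptions, not AgentAssay's code: the function names are hypothetical, each fingerprint is assumed to be a plain list of numbers (for example mean step count and mean tool calls per run), and the sketch stops at the F statistic rather than computing a p-value.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows: Sequence[Vector], mean: Vector) -> List[List[float]]:
    """Sum of outer products of deviations from the mean."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def hotelling_t2(baseline: Sequence[Vector],
                 candidate: Sequence[Vector]) -> Tuple[float, float]:
    """Return (T², F) for two samples of fingerprint vectors.

    Under the null hypothesis F follows an F distribution with
    (p, n1 + n2 - p - 1) degrees of freedom.
    """
    n1, n2 = len(baseline), len(candidate)
    m1, m2 = mean_vector(baseline), mean_vector(candidate)
    p = len(m1)
    s1, s2 = scatter_matrix(baseline, m1), scatter_matrix(candidate, m2)
    pooled = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(p)]
              for i in range(p)]
    diff = [a - b for a, b in zip(m1, m2)]
    weighted = solve(pooled, diff)  # pooled^-1 @ diff without an explicit inverse
    t2 = (n1 * n2 / (n1 + n2)) * sum(d * w for d, w in zip(diff, weighted))
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f_stat
```

Because the statistic weighs the mean difference by the pooled covariance, a shift that is small in every single dimension can still register when the dimensions move together, which is the property the fingerprint comparison relies on.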
Five-Dimensional Coverage Metrics
AgentAssay defines coverage across five complementary dimensions 1:
| Dimension | What It Measures | Codex CLI Mapping |
|---|---|---|
| Tool Coverage | Fraction of available tools exercised | Shell, file, search, web tools used |
| Decision-Path Coverage | Distinct execution sequences observed | Different code-generation strategies |
| State-Space Coverage | Agent states visited (normalised for infinity) | Context window states, compaction points |
| Boundary Coverage | Tool parameter extremes tested | Max file sizes, edge-case inputs |
| Model Coverage | Cross-model compatibility verified | GPT-5.5 vs GPT-5.3-Codex-Spark |
Overall coverage uses the geometric mean across all five dimensions, which penalises imbalance — a score of 95% tool coverage but 10% boundary coverage produces a low composite, forcing teams to address gaps rather than cherry-pick metrics 1.
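The penalty for imbalance is easy to check by hand. The sketch below computes a geometric-mean composite next to the arithmetic mean; the function name and the three mid-range scores are made up for illustration, and only the 95% and 10% figures come from the example above.

```python
import math
from typing import Mapping


def composite_coverage(scores: Mapping[str, float]) -> float:
    """Geometric mean of per-dimension coverage scores in [0, 1]."""
    values = list(scores.values())
    if not values:
        raise ValueError("at least one coverage dimension is required")
    if any(v < 0.0 or v > 1.0 for v in values):
        raise ValueError("coverage scores must lie in [0, 1]")
    return math.prod(values) ** (1.0 / len(values))


if __name__ == "__main__":
    # Hypothetical scores: only tool (0.95) and boundary (0.10) come from the text.
    scores = {"tool": 0.95, "decision_path": 0.80, "state_space": 0.80,
              "boundary": 0.10, "model": 0.80}
    arithmetic = sum(scores.values()) / len(scores)
    print(f"arithmetic mean: {arithmetic:.3f}")
    print(f"geometric mean:  {composite_coverage(scores):.3f}")
```

With these inputs the arithmetic mean is 0.69 while the geometric mean is roughly 0.55, and a single dimension at zero drives the composite to zero regardless of the others.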
Mapping AgentAssay to Codex CLI Workflows
Trace Collection via codex exec
Codex CLI’s non-interactive mode (codex exec) already produces structured JSONL session transcripts stored in ~/.codex/sessions/ 3. These transcripts contain every tool call, model response, approval event, and timing metric — precisely the trace data AgentAssay requires.
The CustomAdapter pattern bridges the two:
from agentassay.integrations import CustomAdapter
import subprocess
import json

def codex_agent(input_data: dict) -> dict:
    result = subprocess.run(
        ["codex", "exec", "--full-auto", "-q", input_data["prompt"]],
        capture_output=True, text=True,
        cwd=input_data.get("workspace", ".")
    )
    return {
        "output": result.stdout,
        "exit_code": result.returncode,
        "stderr": result.stderr
    }

adapter = CustomAdapter(callable_fn=codex_agent)
trace = adapter.run({"prompt": "Add input validation to api/handlers.go"})
For richer trace data, the MCPToolsAdapter can intercept MCP tool calls directly when Codex CLI acts as an MCP client 4.
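Before feeding transcripts to any adapter, it can help to inspect what a session file actually contains. The sketch below reads a JSONL file line by line and tallies events by one field. The exact schema of Codex CLI transcripts is not described here, so the field name `type` and the sample path are assumptions to be replaced with whatever your transcripts use.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Parse a JSONL file into a list of objects, skipping blank lines."""
    events = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                events.append(json.loads(line))
    return events


def count_by_field(events: List[Dict[str, Any]], field: str = "type") -> Counter:
    """Tally events by a field; "type" is a hypothetical field name."""
    return Counter(str(event.get(field, "unknown"))
                   for event in events if isinstance(event, dict))


if __name__ == "__main__":
    # Hypothetical file name: point this at one of your own session transcripts.
    sample = Path("rollout-example.jsonl")
    if sample.exists():
        for kind, count in count_by_field(load_jsonl(str(sample))).most_common():
            print(f"{kind}: {count}")
```

Note that a JSONL file must be parsed one line at a time; passing the whole file to `json.load` fails as soon as it contains more than one object.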
Hook-Based Regression Gates
Codex CLI’s hook pipeline provides natural integration points for AgentAssay’s deployment gates. A PostSession hook (triggered after each codex exec run in CI) can collect the session transcript, append it to the trial set, and evaluate the statistical verdict:
#!/usr/bin/env bash
# .codex/hooks/post-session-regression.sh
# Pick the most recent rollout transcript written today.
SESSION_FILE=$(ls -t ~/.codex/sessions/$(date +%Y/%m/%d)/rollout-*.jsonl | head -1)
python3 -c "
from agentassay.efficiency import BehavioralFingerprint
from agentassay.verdicts import VerdictFunction
import json, sys

baseline = BehavioralFingerprint.load('baseline-fingerprint.json')
# The transcript is JSONL, so parse it one line at a time rather than with json.load.
with open('$SESSION_FILE') as fh:
    current_trace = [json.loads(line) for line in fh if line.strip()]
current_fp = BehavioralFingerprint.from_traces([current_trace])
drift = baseline.distance(current_fp)
if drift > 0.15:
    print(f'REGRESSION DETECTED: drift={drift:.3f}', file=sys.stderr)
    sys.exit(1)
"
Multi-Model Regression Detection
Codex CLI v0.142 supports configurable model routing with profiles and per-turn model selection 5. When teams switch between GPT-5.5, GPT-5.3-Codex-Spark, and third-party models via provider configuration, AgentAssay’s Model Coverage dimension ensures that behavioural consistency is verified across all target models:
from agentassay.core.runner import TrialRunner
from agentassay.core.scenario import TestScenario

scenario = TestScenario(
    scenario_id="refactor-handler",
    name="Handler refactoring task",
    input_data={"prompt": "Refactor the error handling in api/handlers.go"},
    expected_properties={
        "max_steps": 15,
        "must_use_tools": ["shell", "file_write"],
        "must_not_use_tools": ["web_search"]
    }
)

# Test across all configured models
for model in ["gpt-5.5", "gpt-5.3-codex-spark", "o4-mini"]:
    runner = TrialRunner(
        agent_fn=lambda inp: codex_agent({**inp, "model": model}),
        config={"model": model}
    )
    results = runner.run_trials(scenario, n=30)
    print(f"{model}: {results.pass_rate:.1%} "
          f"[{results.ci_lower:.1%}, {results.ci_upper:.1%}]")
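The snippet above prints a pass rate with lower and upper confidence bounds. How AgentAssay derives those bounds is not stated here; as one plausible reading, the sketch below computes a Wilson score interval for a binomial pass rate using only the standard library. The function name is hypothetical and the choice of Wilson over other interval methods is an assumption.

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_interval(passes: int, trials: int,
                    confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    p_hat = passes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials
                               + z * z / (4 * trials * trials)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    low, high = wilson_interval(passes=27, trials=30)
    print(f"27/30 passes: {27 / 30:.1%} [{low:.1%}, {high:.1%}]")
```

With 30 trials the interval is still wide (27 passes out of 30 gives roughly 74% to 97%), which is a useful reminder that a high observed pass rate on a small sample does not rule out a much lower true rate.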
Mutation Testing for AGENTS.md and Skills
AgentAssay’s four mutation operator classes map directly onto Codex CLI’s configuration surface 1:
| Mutation Class | Codex CLI Target | What It Tests |
|---|---|---|
| Prompt Mutations | AGENTS.md instructions | Robustness to instruction rewording |
| Tool Mutations | MCP server availability, tool ordering | Behaviour when tools are unavailable |
| Model Mutations | model in config.toml | Cross-model consistency |
| Context Mutations | Session history, compaction | Behaviour under context pressure |
Prompt mutations are particularly valuable for AGENTS.md governance. A team can verify that their coding standards survive synonym substitution, instruction reordering, and noise injection — ensuring the agent follows the intent of the instructions rather than depending on specific phrasing:
agentassay mutate \
--scenario refactor-scenario.yaml \
--operators prompt \
--target-file .codex/AGENTS.md \
--kill-threshold 0.8
A kill rate below the threshold indicates that the test suite fails to detect when AGENTS.md instructions are degraded — a signal that the tests themselves need strengthening.
Adaptive Budget Optimisation
Running 100 trials per scenario at API rates is expensive. AgentAssay’s adaptive budget optimiser calibrates the minimum number of trials needed for a given confidence level based on observed behavioural variance 1:
from agentassay.efficiency import AdaptiveBudgetOptimizer
optimizer = AdaptiveBudgetOptimizer(alpha=0.05, beta=0.10)
estimate = optimizer.calibrate(calibration_traces)
print(f"Recommended trials: {estimate.recommended_n}")
print(f"Estimated cost: ${estimate.estimated_cost_usd:.2f}")
The paper reports 4–7× trial reduction for stable agents 1. For a Codex CLI workflow that consistently produces similar tool-call patterns (low behavioural variance), the optimiser might recommend 8 trials instead of 50, which would cut the API spend of each pipeline run by roughly the same sixfold factor.
Combined with Codex CLI’s rollout token budgets (added in v0.142) 5, teams can set hard cost ceilings on both individual agent runs and the regression testing harness around them.
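As a point of comparison for the trial counts discussed in this section, the sketch below computes the classical fixed-sample size needed to distinguish two pass rates at given error rates, using the normal approximation for a one-sided, one-sample proportion test. This is not AgentAssay's optimiser, which is described as adapting to observed behavioural variance; the function name and the example pass rates are illustrative assumptions.

```python
import math
from statistics import NormalDist


def fixed_sample_size(p0: float, p1: float,
                      alpha: float = 0.05, beta: float = 0.10) -> int:
    """Trials needed to tell pass rate p0 from p1 (normal approximation).

    alpha is the false-positive rate and beta the false-negative rate.
    """
    if not (0 < p0 < 1 and 0 < p1 < 1):
        raise ValueError("pass rates must lie strictly between 0 and 1")
    if p0 == p1:
        raise ValueError("pass rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1)))
    return math.ceil((numerator / abs(p1 - p0)) ** 2)


if __name__ == "__main__":
    # Hypothetical pass rates chosen purely for illustration.
    print(fixed_sample_size(0.70, 0.90))  # wide gap: fewer trials
    print(fixed_sample_size(0.80, 0.90))  # narrow gap: many more trials
```

The required count grows quickly as the gap between the two rates narrows, which is why fixing one trial count for every scenario tends to either overspend or underpower the test.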
CI/CD Integration Pattern
The complete integration chains AgentAssay’s statistical gates into a GitHub Actions workflow alongside codex exec:
flowchart LR
A[PR Opened] --> B[codex exec: Run Agent Task]
B --> C[Collect Session Traces]
C --> D[AgentAssay: Fingerprint Comparison]
D --> E{Verdict}
E -->|PASS| F[Merge Allowed]
E -->|FAIL| G[Block + Report Regression]
E -->|INCONCLUSIVE| H[Run Additional Trials]
H --> D
# .github/workflows/agent-regression.yml
name: Agent Regression Gate
on: pull_request
jobs:
  regression-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: openai/codex-action@v1
        with:
          codex-api-key: ${{ secrets.CODEX_API_KEY }}
      - name: Run agent trials
        # TRIAL_COUNT is not defined in this workflow; it is assumed to be set elsewhere (for example in a workflow-level env block).
        run: |
          for i in $(seq 1 $TRIAL_COUNT); do
            codex exec --full-auto -q "$(cat test-prompt.md)" \
              2>/dev/null >> traces.jsonl
          done
      - name: Evaluate regression
        run: |
          agentassay compare \
            --baseline baseline-fingerprint.json \
            --current traces.jsonl \
            --alpha 0.05 --beta 0.10 \
            --format github-actions
Limitations and Open Questions
AgentAssay addresses a genuine gap, but several limitations bear noting:
- Single-author paper. The framework has a GitHub repository with 10 adapters 4, but wider academic validation is still emerging. ⚠️
- Codex CLI adapter not yet native. The CustomAdapter and MCPToolsAdapter work, but a dedicated Codex CLI adapter that natively parses JSONL session transcripts would reduce integration friction.
- Trace-first analysis assumes trace availability. Codex CLI stores session transcripts by default, but teams using history.persistence = false lose this data.
- Cost-efficiency claims depend on scenario. The 78–100% cost reduction figures come from three specific scenarios (e-commerce, customer support, code generation) 1; mileage may vary for other workflow types.
Practical Recommendations
For teams running Codex CLI in production CI pipelines:
- Start with behavioural fingerprinting. Even without the full framework, comparing tool-call distributions across runs catches drift that binary tests miss.
- Use three-valued verdicts in deployment gates. Never default INCONCLUSIVE to PASS — treat it as a signal to gather more evidence.
- Mutation-test your AGENTS.md. If synonym substitution in your instructions changes agent behaviour, your instructions are fragile.
- Set adaptive budgets. Calibrate trial counts to your agent’s actual variance rather than picking arbitrary numbers.
- Combine with Codex CLI hooks. Wire AgentAssay into PostSession hooks for automated, per-run regression detection.
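The first recommendation can be tried without installing anything. The sketch below compares the tool-call distributions of two runs using total variation distance, a simple stand-in that is much cruder than a multivariate fingerprint test. The function names are hypothetical, and extracting the list of tool names from a transcript is left to the reader because it depends on the transcript schema.

```python
from collections import Counter
from typing import Dict, Iterable


def tool_distribution(tool_calls: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each tool name in one run (or a pooled set of runs)."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tool: count / total for tool, count in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance: 0.0 for identical distributions, 1.0 for disjoint ones."""
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)


if __name__ == "__main__":
    # Tool names taken from the Monday/Tuesday example earlier in the article.
    monday = tool_distribution(["grep", "sed", "git"])
    tuesday = tool_distribution(["rg", "awk", "git"])
    print(f"drift: {total_variation(monday, tuesday):.3f}")
```

A single pair of runs will always show some distance for a non-deterministic agent, so the number is only meaningful when compared against the spread observed between repeated baseline runs.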
The regression testing gap for non-deterministic agent workflows is real, measurable, and — with frameworks like AgentAssay — now addressable. The question is no longer whether to test agent behaviour statistically, but how to integrate statistical testing into the development workflow you already have.
Citations
1. Bhardwaj, V.P. (2026). “AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows.” arXiv:2603.02601. https://arxiv.org/abs/2603.02601
2. SitePoint (2026). “AI Agent Testing Automation: Developer Workflows for 2026.” https://www.sitepoint.com/ai-agent-testing-automation-developer-workflows-for-2026/
3. OpenAI (2026). “Non-interactive mode – Codex CLI.” OpenAI Developers. https://developers.openai.com/codex/noninteractive
4. Qualixar (2026). “agentassay: Token-efficient stochastic testing for AI agents.” GitHub. https://github.com/qualixar/agentassay
5. OpenAI (2026). “Changelog – Codex.” OpenAI Developers. https://developers.openai.com/codex/changelog