AgentLens and the Lucky Pass Problem: Why 10.7% of Your Agent’s Passing Tests Are Flukes — and How to Configure Codex CLI for Process Quality

Your coding agent passes the tests. Ship it? Not so fast. A study of 2,614 agent trajectories across eight model backends finds that 10.7% of the passing solutions it scored are Lucky Passes: solutions whose tests pass despite regression cycles, blind retries, missing verification, or temporally disordered reasoning ¹. Worse, ranking models by pass rate alone produces a leaderboard that differs from the process-quality ranking by up to five positions ¹. For teams running Codex CLI in production, this bears directly on model selection, hook configuration, and AGENTS.md discipline.

What AgentLens Measures

Sahoo et al. (arXiv:2605.12925, May 2026) introduce AgentLens, a framework that moves beyond binary pass/fail evaluation to assess the process quality of coding agent trajectories ¹. The core insight: two trajectories that produce identical patches can differ radically in how they got there. One may follow a principled explore → implement → verify sequence; the other may thrash through blind retries until something sticks.

The Four Intent Stages

AgentLens classifies every agent action into one of four cognitive phases ¹:

Exploration (E) — file reading, searching, listing directory contents
Implementation (I) — source file editing and patch application
Verification (V) — test execution, error checking, output validation
Orchestration (O) — bookkeeping, reasoning, planning

Critically, the labelling is context-sensitive: a read_file call before any patches counts as Exploration, but the same call after Implementation counts as Verification ¹. A seven-rule cascade achieves κ = 0.933 inter-annotator agreement across seven annotators ¹.
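To make the context-sensitivity concrete, here is a minimal sketch of a labeller in which the same read action maps to Exploration before the first edit and to Verification afterwards. It is a simplification under stated assumptions, not the paper's seven-rule cascade: the tool names in the three sets are hypothetical, and anything unrecognised falls through to Orchestration.

```python
from typing import Iterable, List, Tuple

# Hypothetical tool vocabularies; real agents expose different tool names.
READ_TOOLS = {"read_file", "grep", "list_dir"}
EDIT_TOOLS = {"apply_patch", "edit_file"}
TEST_TOOLS = {"run_tests"}


def label_actions(tools: Iterable[str]) -> List[Tuple[str, str]]:
    """Assign E/I/V/O labels, treating reads after the first edit as Verification."""
    labelled = []
    patched = False  # flips to True once any implementation action has been seen
    for tool in tools:
        if tool in EDIT_TOOLS:
            stage = "I"
            patched = True
        elif tool in TEST_TOOLS:
            stage = "V"
        elif tool in READ_TOOLS:
            # Same call, different intent depending on what came before it.
            stage = "V" if patched else "E"
        else:
            stage = "O"
        labelled.append((tool, stage))
    return labelled


trajectory = ["read_file", "grep", "apply_patch", "read_file", "run_tests", "update_plan"]
labels = "".join(stage for _, stage in label_actions(trajectory))
print(labels)  # EEIVVO
```

The second `read_file` is labelled V only because an edit precedes it, which is the behaviour the paragraph above describes.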

The Prefix Tree Acceptor

For each task, AgentLens merges k ≥ 2 passing trajectories into a Prefix Tree Acceptor (PTA), a directed acyclic graph representing known-good solution strategies ¹. States are matched via a confidence-weighted cascade: exact content hash (1.0), file-scope AST matching (0.90), line-range overlap with a ≥ 30% threshold (0.80–0.95), and semantic terminal grouping (0.70–0.85) ¹. The study found that k = 5 gives the best AUROC (0.777), balancing precision and coverage ¹.
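The merging step can be illustrated with a plain prefix tree. The sketch below is an assumption-laden simplification: states are matched by exact string equality only (roughly the 1.0-confidence tier of the cascade), the `stage:target` step encoding is invented for illustration, and none of the function names come from AgentLens.

```python
from typing import Dict, Sequence


def build_pta(trajectories: Sequence[Sequence[str]]) -> Dict:
    """Merge step sequences into a nested-dict prefix tree; shared prefixes share nodes."""
    root: Dict = {"children": {}, "accepting": False}
    for steps in trajectories:
        node = root
        for step in steps:
            node = node["children"].setdefault(step, {"children": {}, "accepting": False})
        node["accepting"] = True  # marks the end of one known-good trajectory
    return root


def matched_prefix_len(pta: Dict, steps: Sequence[str]) -> int:
    """Count how many leading steps of a candidate follow some known-good path."""
    node, depth = pta, 0
    for step in steps:
        child = node["children"].get(step)
        if child is None:
            break
        node, depth = child, depth + 1
    return depth


# Two hypothetical passing trajectories for the same task.
passing = [
    ["E:models.py", "I:models.py", "V:pytest"],
    ["E:models.py", "E:views.py", "I:models.py", "V:pytest"],
]
pta = build_pta(passing)

# A candidate that edits twice before verifying leaves the tree after two steps.
candidate = ["E:models.py", "I:models.py", "I:models.py", "V:pytest"]
print(matched_prefix_len(pta, candidate))  # 2
```

A fuller version would replace the exact-match lookup with the weighted cascade and carry the match confidence along each edge.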

The Quality Score

A composite 0–100 score combines four signals ¹:

f(τc, G) = 0.20·Φstruct + 0.15·Φcov + 0.30·(100·Φcoh) + 0.35·(100·Φtemp)

Signal	Weight	What it measures
Structural alignment (Φstruct)	20%	How closely the trajectory’s action graph matches the PTA
Set coverage (Φcov)	15%	Whether the trajectory touches the same files and edits as known-good solutions
Coherence (Φcoh)	30%	Consistency of intent-stage transitions (E→I→V, not random jumps)
Temporal profile (Φtemp)	35%	Whether exploration precedes implementation, which precedes verification

Thresholds: Ideal ≥ 70, Solid 47–69, Lucky < 47¹.
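As a worked example of the formula and thresholds, the sketch below computes the composite score for one made-up trajectory. It assumes, from the shape of the formula, that Φstruct and Φcov arrive on a 0–100 scale while Φcoh and Φtemp arrive on 0–1; the input values are invented for illustration and are not taken from the paper.

```python
def quality_score(struct: float, cov: float, coh: float, temp: float) -> float:
    """Composite 0-100 score; struct and cov assumed on 0-100, coh and temp on 0-1."""
    return 0.20 * struct + 0.15 * cov + 0.30 * (100 * coh) + 0.35 * (100 * temp)


def classify(score: float) -> str:
    """Map a score onto the Ideal / Solid / Lucky bands."""
    if score >= 70:
        return "Ideal"
    if score >= 47:
        return "Solid"
    return "Lucky"


# Invented signal values: decent structure, middling coherence, weak temporal order.
example = quality_score(struct=80, cov=60, coh=0.5, temp=0.4)
print(round(example, 1), classify(example))
```

Here the terms contribute 16, 9, 15 and 14 points respectively, for a total of 54, which lands in the Solid band. Because coherence and temporal profile carry 65% of the weight together, a trajectory with perfect structural alignment and coverage but zero coherence and temporal order would top out at 35 and be classed as Lucky.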

The Numbers That Should Worry You

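In the 1,815-trajectory evaluation subset (47 PTA-eligible tasks from SWE-bench Verified), AgentLens classifies 229 trajectories as Ideal (20.2%), 785 as Solid (69.1%), and 122 as Lucky (10.7%) ¹. These percentages are shares of the 1,136 classified trajectories (229 + 785 + 122), not of all 1,815, which is consistent with the Lucky rate being defined over passing solutions.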

Model Rankings Diverge

The headline finding is that pass-rate rankings and quality-score rankings disagree substantially ¹:

Model	Pass %	Pass Rank	Quality Score	Quality Rank	Lucky %
Claude Opus 4.5	87.9%	1	66.2	2	0.5%
Claude Sonnet 4.5	86.8%	2	67.4	1	1.0%
Claude Opus 4.6	77.3%	3	56.7	6	18.7%
GPT-5.2-Codex	64.6%	4	56.1	7	19.4%
GPT-4.1	59.9%	5	54.7	8	23.2%
GPT-5.3-Codex	45.9%	6	58.3	5	15.3%
Gemini 2.5 Pro	42.9%	7	59.2	4	7.6%
GPT-4o	34.9%	8	63.4	3	4.1%

Claude Opus 4.6 drops from rank 3 to rank 6. GPT-4o climbs from rank 8 to rank 3¹. The chi-square association between model and Lucky category is χ²(28) = 102.47, p < 0.0001 — models have strongly characteristic failure modes ¹.
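The rank shifts can be recomputed directly from the table. The sketch below re-derives both rankings from the pass rates and quality scores, reports the per-model shift, and adds a Spearman rank correlation as a single summary number. The correlation is computed here for illustration and is not a figure reported in the paper.

```python
# (model, pass rate %, quality score) copied from the table above.
ROWS = [
    ("Claude Opus 4.5", 87.9, 66.2),
    ("Claude Sonnet 4.5", 86.8, 67.4),
    ("Claude Opus 4.6", 77.3, 56.7),
    ("GPT-5.2-Codex", 64.6, 56.1),
    ("GPT-4.1", 59.9, 54.7),
    ("GPT-5.3-Codex", 45.9, 58.3),
    ("Gemini 2.5 Pro", 42.9, 59.2),
    ("GPT-4o", 34.9, 63.4),
]


def ranks(values):
    """Return 1-based ranks, highest value first (no ties in this data)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


pass_rank = ranks([row[1] for row in ROWS])
quality_rank = ranks([row[2] for row in ROWS])

# Positive shift = the model ranks better on quality than on pass rate.
shifts = {ROWS[i][0]: pass_rank[i] - quality_rank[i] for i in range(len(ROWS))}
largest = max(shifts, key=lambda model: abs(shifts[model]))

# Spearman rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
n = len(ROWS)
d_squared = sum(shift ** 2 for shift in shifts.values())
spearman = 1 - 6 * d_squared / (n * (n ** 2 - 1))

for model, shift in shifts.items():
    print(f"{model:18s} {shift:+d}")
print(largest, shifts[largest], round(spearman, 3))
```

The largest shift is GPT-4o at +5, matching the "up to five positions" claim, and a rank correlation of about 0.24 suggests the two leaderboards are only weakly related on this data.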

The Five Lucky Pass Mechanisms

AgentLens decomposes Lucky Passes into five recurring categories ¹:

C1: Minimal & Unverified (15.6%) — overconfident agent produces short trajectories with no verification stage
C2: Brute-Force Convergence (34.4%) — the agent lacks a plan and converges through repeated low-coherence attempts
C3: Incomplete Implementation (33.6%) — partial fixes that pass insufficient test suites
C4: Excessive Exploration (4.1%) — unfocused, extended search without termination discipline
C5: Divergent-but-Valid (12.3%) — alternative approaches that happen to work but follow no established strategy

The cost is real: Lucky trajectories waste 11.4 steps on blind retries versus 2.7 in Ideal trajectories — a 4.2× increase ¹.
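The category shares and the retry ratio can be sanity-checked with a few lines of arithmetic. The per-category counts below are reconstructed from the rounded percentages and the 122 Lucky trajectories, so treat them as approximations for illustration, not as figures quoted from the paper.

```python
LUCKY_TOTAL = 122  # Lucky trajectories in the evaluation subset

# Category shares in percent, as listed above.
SHARES = {"C1": 15.6, "C2": 34.4, "C3": 33.6, "C4": 4.1, "C5": 12.3}

# Reconstruct approximate trajectory counts per category.
counts = {name: round(LUCKY_TOTAL * pct / 100) for name, pct in SHARES.items()}

# Blind-retry steps: Lucky versus Ideal trajectories.
retry_ratio = 11.4 / 2.7

print(counts)
print(sum(counts.values()), round(retry_ratio, 1))
```

The reconstructed counts (19, 42, 41, 5, 15) sum back to 122, and the retry ratio rounds to 4.2, so the reported figures are internally consistent. Note that C2 and C3 are nearly tied as the largest categories.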

Mapping AgentLens to Codex CLI Configuration

The Lucky Pass findings map directly to Codex CLI’s configuration surface. Each mechanism has a defensive configuration pattern.

Defence Against C1 (Minimal & Unverified): Mandatory Verification in AGENTS.md

The agent skips verification entirely. Force it:

<!-- AGENTS.md -->
## Verification Rules

- After EVERY implementation change, run the relevant test suite before proceeding
- Never submit a patch without at least one successful test run
- If no tests exist for the changed code, write at least one before marking done

Combine with a PostToolUse hook that gates on test execution ²:

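#!/bin/bash
# .codex/hooks/post-tool-use.sh
# Block if the recent actions include an implementation step but no test run.
# CODEX_TOOL_NAME and CODEX_TRAJECTORY_LOG are assumed to be supplied by the
# hook runtime; confirm the actual input mechanism in the hooks reference.

if [[ "$CODEX_TOOL_NAME" == "bash" && -r "$CODEX_TRAJECTORY_LOG" ]]; then
  # Inspect only the five most recent log entries
  LAST_ACTIONS=$(tail -n 5 "$CODEX_TRAJECTORY_LOG")
  if grep -Eq 'apply_patch|edit_file' <<<"$LAST_ACTIONS" && \
     ! grep -Eq 'pytest|npm test|go test|cargo test' <<<"$LAST_ACTIONS"; then
    echo "BLOCKED: Implementation detected without subsequent verification"
    exit 2  # Exit 2 blocks the action
  fi
fi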

Defence Against C2 (Brute-Force Convergence): Token Budgets and Retry Limits

Brute-force convergence burns tokens on repeated low-coherence attempts. Codex CLI v0.142’s configurable rollout token budgets provide a hard stop ³:

# config.toml
[goal]
rollout_token_budget = 50000  # Hard ceiling per goal
tool_timeout_sec = 120        # Prevent individual tool calls from hanging

Complement with AGENTS.md retry discipline:

## Retry Policy

- Maximum 3 attempts at the same approach before stopping to re-plan
- If tests fail twice with the same error, explain your diagnosis before the next attempt
- Never apply a patch that reverts a previous patch without explicit justification

Defence Against C3 (Incomplete Implementation): Test Coverage Gates

Partial fixes pass weak test suites. Add coverage verification:

#!/bin/bash
# .codex/hooks/post-tool-use.sh (coverage section)
if [[ "$CODEX_TOOL_NAME" == "bash" ]] && echo "$CODEX_TOOL_INPUT" | grep -q "pytest"; then
  COVERAGE=$(python -m coverage report --format=total 2>/dev/null)
  if [[ -n "$COVERAGE" ]] && (( $(echo "$COVERAGE < 70" | bc -l) )); then
    echo "WARNING: Test coverage at ${COVERAGE}% — below 70% threshold"
    # Log but don't block; coverage thresholds are project-specific
  fi
fi

Defence Against C4 (Excessive Exploration): Exploration Budgets

Unfocused search wastes context. AGENTS.md can enforce exploration discipline:

## Exploration Budget

- Spend no more than 5 file reads understanding the problem before proposing a solution
- Use grep/ripgrep for targeted search — do not read entire files when a search suffices
- State your hypothesis BEFORE exploring; if exploration invalidates it, state a new one

Model Routing: Quality Over Pass Rate

The AgentLens data suggests that model selection based on pass rate alone is misleading. For Codex CLI, this informs model routing via named profiles ⁴:

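# config.toml: quality-optimised profiles
# Model names are placeholders. Neither model appears in the AgentLens table,
# so substitute models whose Lucky % you have measured on your own tasks.
[profiles.quality-first]
model = "o4-mini"  # Slot for the model with the lower Lucky %
approval_policy = "unless-allow-listed"

[profiles.throughput]
model = "gpt-5.5"  # Slot for the model with the higher pass rate on well-understood tasks
approval_policy = "on-failure"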

The principle: use models with higher process quality for unfamiliar codebases and novel tasks; reserve high-pass-rate models for well-trodden paths where Lucky Passes are less likely to cause downstream problems.

flowchart TD
    A[New Task] --> B{Task Type?}
    B -->|Novel / unfamiliar codebase| C[quality-first profile]
    B -->|Routine / well-tested area| D[throughput profile]
    C --> E[Lower Lucky % model]
    D --> F[Higher pass-rate model]
    E --> G[PostToolUse verification hooks]
    F --> G
    G --> H{Quality gate passed?}
    H -->|Yes| I[Ship]
    H -->|No| J[Re-plan with quality-first]
    J --> G

Trajectory Observability

AgentLens’s four intent labels (E, I, V, O) provide a template for monitoring your own agent runs. Codex CLI’s JSONL event stream can be piped to an analysis script that flags disordered sequences ⁵:

# Monitor trajectory phase ordering in real time
# phase-monitor.py is a script you supply; stderr is kept out of the pipe so only JSON events reach it
codex --event-stream json goal "Fix issue #123" | \
  python3 scripts/phase-monitor.py --warn-on "I→E" --block-on "I→I→I"

A trajectory with repeated Implementation and no intervening Verification (I→I→I) is a likely sign of C2 brute-force convergence, visible while the run is still in progress.
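The core of such a monitor can be sketched in a few lines. The version below is a hypothetical illustration, not the `phase-monitor.py` script itself: it assumes the events have already been reduced to single-letter stage labels (E, I, V, O) and deliberately does not parse Codex's event schema, which is not described here. Orchestration steps are skipped so that I→O→I→O→I still counts as three consecutive implementation steps.

```python
from typing import Iterable, Iterator, Tuple


def flag_patterns(
    stages: Iterable[str], warn_on: str = "IE", block_on: str = "III"
) -> Iterator[Tuple[int, str, str]]:
    """Yield (index, severity, pattern) when the recent stage history ends with a pattern."""
    history = ""
    for index, stage in enumerate(stages):
        if stage == "O":
            continue  # bookkeeping does not break up an implementation streak
        history += stage
        if history.endswith(block_on):
            yield index, "block", block_on
        elif history.endswith(warn_on):
            yield index, "warn", warn_on


# Hypothetical labelled run: explore, implement, verify, then three edits in a row.
run = list("EIVIIIOE")
alerts = list(flag_patterns(run))
print(alerts)  # [(5, 'block', 'III'), (7, 'warn', 'IE')]
```

Because the function is a generator, it can consume a live stream and raise the block alert on the third unverified edit instead of after the run has finished.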

What This Means for Teams

The Lucky Pass problem is not academic. If the 10.7% Lucky rate measured on SWE-bench Verified carries over to your workload, roughly one in ten of your agent’s “successful” patches reached correctness through a process that may not survive the next refactor, the next edge case, or the next code review. The 4.2× increase in blind-retry steps translates directly into token spend.

Three actionable takeaways:

Do not choose models by pass rate alone. AgentLens shows GPT-4o at rank 3 for quality despite rank 8 for pass rate. Your model routing should factor in process coherence, not just outcome.
Enforce the E→I→V temporal pattern. AGENTS.md instructions and PostToolUse hooks can guard against two of the Lucky mechanisms, C2 brute-force convergence (the most common) and C1 missing verification, which together account for 50% of Lucky Passes.
Budget against thrashing. Rollout token budgets, retry limits, and exploration caps prevent the agent from stumbling into correctness through sheer volume of attempts.

The goal is not to prevent the agent from succeeding — it is to ensure that when it succeeds, the process is one you can trust.

Citations

Sahoo, P., Mittal, G., Li, X., Ma, S., Steenhoek, B., Lin, P. & Hu, Y. (2026). “AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation.” arXiv:2605.12925. https://arxiv.org/abs/2605.12925
OpenAI. (2026). “Hooks – Codex.” OpenAI Developers. https://developers.openai.com/codex/hooks
OpenAI. (2026). “Changelog – Codex.” Codex CLI v0.142.0 release notes. https://developers.openai.com/codex/changelog
OpenAI. (2026). “Command line options – Codex CLI.” OpenAI Developers. https://developers.openai.com/codex/cli/reference
OpenAI. (2026). “Custom instructions with AGENTS.md – Codex.” OpenAI Developers. https://developers.openai.com/codex/guides/agents-md