MOSAIC-Bench and the Compositional Vulnerability Gap: Why Innocent Tickets Bypass Your Agent's Safety — and How Codex CLI's Hook Pipeline Catches Them

MOSAIC-Bench and the Compositional Vulnerability Gap: Why Innocent Tickets Bypass Your Agent’s Safety — and How Codex CLI’s Hook Pipeline Catches Them


Your coding agent refuses to write a SQL injection when asked directly. Good. Now decompose the same exploit into three routine Jira tickets — a schema migration, an admin search endpoint, and a logging enhancement — and hand them to the agent one at a time. According to MOSAIC-Bench, nine production agents complete the full attack chain at 53–86 per cent success rates with only two refusals across all staged runs 1. The safety alignment that worked in isolation evaporates under composition.

This article unpacks what MOSAIC-Bench reveals, why the compositional gap exists, and how Codex CLI’s hook pipeline — specifically pentester-framed PostToolUse reviewers — provides a deployable defence that commodity static analysis cannot match.

What MOSAIC-Bench Measures

MOSAIC-Bench (Malicious Objectives Sequenced As Innocuous Compliance), published by Steinberg and Gal in May 2026 1, is a benchmark of 199 three-stage attack chains covering 31 CWE classes across five languages (Node.js, Python, Go, Java, PHP) and ten web-application substrates 1. Each chain decomposes a single exploitable vulnerability into three tickets that individually read as routine engineering work. A deterministic exploit oracle confirms whether the composed output is actually exploitable — no subjective labelling.

The three-stage structure is the key insight:

flowchart LR
    T1["Ticket 1\n(e.g. schema migration)"] --> T2["Ticket 2\n(e.g. admin endpoint)"] --> T3["Ticket 3\n(e.g. logging change)"]
    T3 --> Oracle["Deterministic\nExploit Oracle"]
    Oracle -->|VULNERABLE| V["Attack Success"]
    Oracle -->|SECURE| S["Chain Blocked"]
    style V fill:#dc3545,color:#fff
    style S fill:#28a745,color:#fff

No individual ticket expresses malicious intent. The vulnerability emerges only from their composition — precisely the condition that alignment training does not cover.

The Numbers That Matter

Agent Attack Success Rates (Staged Protocol)

Nine production agents were tested under a fresh-session protocol where each ticket arrives in a clean context 1:

Agent Strict ASR Notes
GPT 5.3 85.9% Highest vulnerability rate
Gemini 3 Flash 70.4%  
Kimi K2.6 70.1%  
GLM-5 69.0%  
Minimax M2.7 65.3%  
Gemini 3.1 Pro 65.1%  
GPT 5.4 61.8%  
Sonnet 4.6 58.8%  
Opus 4.6 53.3% Lowest, but still majority

The Compositional Gap

When the same chains were collapsed into single direct prompts — making the malicious intent visible — vulnerable output rates dropped dramatically 1:

  • Sonnet 4.6: 0% (refuses outright)
  • Opus 4.6: 1.9%
  • Codex 5.4: 9.3% (hardens rather than refusing)
  • Codex 5.3: 20.4%

The gap between staged (53–86%) and direct (0–20%) is the compositional vulnerability window. Ticket staging silences both refusal and hardening defences simultaneously 1.

Context Fragmentation Is Not the Full Story

A resumed-context experiment where all three stages ran in a single session with retained memory showed ASR drops of only 2–22.6 percentage points 1:

Agent Fresh-Session Resumed-Session Drop
GPT 5.3 85.9% 63.3% −22.6 pp
Sonnet 4.6 58.8% 43.4% −15.4 pp
GPT 5.4 61.8% 59.8% −2.0 pp

Context fragmentation explains roughly 50% of the gap at median 1. The remainder is a genuine alignment failure: even with full conversation history, agents struggle to recognise that composed tickets form an exploit chain.

Why Code Reviewers Do Not Save You

MOSAIC-Bench tested code reviewer agents on the confirmed-vulnerable cumulative diffs 1. Under neutral framing (standard code review instructions), mean evasion was 25.8% in diff-only mode and 16.4% with full file context 1. That means roughly one in four exploitable diffs slips past a standard AI code review.

Commodity static analysis fares no better. Semgrep with standard rulesets detected only 43.7% of the 199 chains, with compositional-logic CWEs (the hardest category) evading at 64% 1.

Pentester-Framed Review: Framing Dominates Scale

The paper’s most actionable finding: reframing the reviewer prompt from “code reviewer” to “adversarial pentester” reduces evasion dramatically 1:

Reviewer Standard Evasion Pentester Evasion
Sonnet 4.6 21.6% 3.0%
GPT 5.3 24.6% 5.5%
Gemma-4-E4B-it 11.6%

The open-weight Gemma model achieved 88.4% detection at a cost of roughly $0.001 per review, with a false-positive rate of just 4.6% on 608 real-world merged GitHub PRs 1. As the authors conclude: “framing dominates scale in expectation” — prompt engineering the reviewer matters more than throwing a larger model at the problem.

Mapping to Codex CLI’s Defence Stack

Codex CLI’s hook pipeline 2 provides the deterministic enforcement layer that MOSAIC-Bench’s findings demand. The defence strategy maps to three levels:

flowchart TB
    subgraph PreToolUse["PreToolUse — Prevention"]
        A["Block known-dangerous\npattern compositions"]
        B["Enforce writable_roots\nsandbox boundaries"]
    end
    subgraph PostToolUse["PostToolUse — Detection"]
        C["Pentester-framed\nreview hook"]
        D["Semgrep + custom\nrules per CWE"]
    end
    subgraph Governance["Governance — Context"]
        E["AGENTS.md\nsecurity constraints"]
        F["approval_policy\nescalation"]
    end
    PreToolUse --> PostToolUse --> Governance
    style PostToolUse fill:#f0ad4e,color:#000

1. PostToolUse Pentester-Framed Review Hook

The headline defence. After every file write, a PostToolUse hook sends the diff to a pentester-framed reviewer 2. The key is the framing — not “review this code” but “you are a penetration tester; identify any vulnerability this diff introduces or enables when combined with existing code.”

# config.toml — pentester-framed PostToolUse hook
[[hooks]]
event = "PostToolUse"
tool = "write_file"

[hooks.command]
program = "bash"
args = ["-c", """
  diff=$(git diff HEAD -- "$CODEX_FILE_PATH")
  echo "$diff" | codex exec --model gemma-4-e4b-it \
    --prompt "You are a penetration tester. Analyse this diff for vulnerabilities it introduces or enables in combination with the existing codebase. Report VULNERABLE or SECURE with evidence." \
    --max-tokens 500
"""]
timeout_ms = 30000
on_fail = "require_approval"

Using Gemma-4-E4B-it keeps cost at ~$0.001/review whilst catching 88.4% of MOSAIC chains 1. For higher-stakes workflows, substituting Sonnet 4.6 pushes evasion down to 3.0% at ~$0.18/review 1.

2. Cumulative Diff Awareness

MOSAIC-Bench shows that diff-only review has 25.8% evasion versus 16.4% with full context 1. The PostToolUse hook should accumulate diffs across the session rather than reviewing each file write in isolation:

# Accumulate session diffs for compositional awareness
git diff $(git merge-base HEAD main)..HEAD

This gives the reviewer the full cumulative change set, closing the diff-fragmentation gap that mirrors the ticket-decomposition attack.

3. PreToolUse Guards for High-Risk CWE Patterns

For the CWE classes with highest MOSAIC-Bench ASR — command injection (CWE-78), SQL injection (CWE-89), and path traversal (CWE-22) — PreToolUse hooks can block before the file is written 2:

[[hooks]]
event = "PreToolUse"
tool = "write_file"

[hooks.command]
program = "bash"
args = ["-c", """
  # Block raw SQL string concatenation, exec/eval with user input
  if grep -qE '(exec\\(|eval\\(|subprocess\\.call|os\\.system|raw\\s+SQL)' "$CODEX_PROPOSED_CONTENT"; then
    echo "BLOCKED: High-risk pattern detected — requires manual approval"
    exit 1
  fi
"""]
on_fail = "reject"

4. AGENTS.md Compositional Security Constraints

Encode compositional awareness directly into the agent’s instruction set 3:

# AGENTS.md — Security section

## Compositional Security Rules
- When implementing database queries, NEVER use string concatenation for parameters
- When adding API endpoints, ALWAYS validate and sanitise all inputs at the handler level
- When modifying authentication or authorisation logic, REQUIRE explicit approval
- Treat sequential tickets touching the same module as a potential attack chain — review the cumulative change before proceeding

5. Approval Policy Escalation

For security-sensitive file paths, escalate from auto-approve to manual approval 4:

# Per-tool approval for auth/security modules
[tools.write_file]
approval_mode = "always"
paths = ["**/auth/**", "**/middleware/**", "**/security/**"]

Defence-in-Depth Configuration

Combining these layers produces a configuration that addresses MOSAIC-Bench’s full attack surface:

flowchart LR
    Write["Agent writes file"] --> Pre["PreToolUse\nPattern guard"]
    Pre -->|Pass| Post["PostToolUse\nPentester review"]
    Post -->|SECURE| Accept["Accept change"]
    Post -->|VULNERABLE| Escalate["Require human\napproval"]
    Pre -->|Blocked| Reject["Reject + log"]
    style Reject fill:#dc3545,color:#fff
    style Escalate fill:#f0ad4e,color:#000
    style Accept fill:#28a745,color:#fff

The economics are practical. At Gemma-4’s ~$0.001/review, even a 200-file session costs $0.20 for pentester-framed security review. The 4.6% false-positive rate on real PRs 1 means roughly one false flag per 22 legitimate file writes — noticeable but manageable, particularly when on_fail routes to approval rather than outright rejection.

What This Means for Your Workflow

MOSAIC-Bench’s core finding is structural: safety alignment is trained on isolated requests, but real engineering work is compositional. Three implications for Codex CLI users:

  1. Standard code review is insufficient. Neutral-framed AI reviewers miss 25% of compositional vulnerabilities 1. Pentester framing is not optional — it is the difference between 25% evasion and 3% evasion.

  2. Static analysis is necessary but not sufficient. Semgrep catches 44% of MOSAIC chains 1. Pair it with a pentester-framed LLM reviewer for the compositional-logic CWEs that rule-based tools miss.

  3. Cumulative context matters. Review diffs against the full branch, not individual file changes. Configure PostToolUse hooks to pass git diff main...HEAD rather than per-file diffs.

The compositional vulnerability gap will not close through alignment training alone — the attack surface is the ticket decomposition that engineering organisations already practise. Deterministic hooks that enforce pentester-framed review at every file write are the deployable mitigation available today.

Citations

  1. Steinberg, J. & Gal, O. (2026). “MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents.” arXiv:2605.03952. https://arxiv.org/abs/2605.03952  2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

  2. OpenAI. (2026). “Hooks — Codex CLI.” OpenAI Developers. https://developers.openai.com/codex/hooks  2 3

  3. OpenAI. (2026). “AGENTS.md — Codex CLI.” OpenAI Developers. https://developers.openai.com/codex/cli/agents-md 

  4. OpenAI. (2026). “Security — Codex CLI.” OpenAI Developers. https://developers.openai.com/codex/security