MOSAIC-Bench and the Compositional Vulnerability Gap: Why Innocent Tickets Bypass Your Agent’s Safety — and How Codex CLI’s Hook Pipeline Catches Them

Your coding agent refuses to write a SQL injection when asked directly. Good. Now decompose the same exploit into three routine Jira tickets — a schema migration, an admin search endpoint, and a logging enhancement — and hand them to the agent one at a time. According to MOSAIC-Bench, nine production agents complete the full attack chain at 53–86 per cent success rates with only two refusals across all staged runs ¹. The safety alignment that worked in isolation evaporates under composition.

This article unpacks what MOSAIC-Bench reveals, why the compositional gap exists, and how Codex CLI’s hook pipeline — specifically pentester-framed PostToolUse reviewers — provides a deployable defence that commodity static analysis cannot match.

What MOSAIC-Bench Measures

MOSAIC-Bench (Malicious Objectives Sequenced As Innocuous Compliance), published by Steinberg and Gal in May 2026 ¹, is a benchmark of 199 three-stage attack chains covering 31 CWE classes across five languages (Node.js, Python, Go, Java, PHP) and ten web-application substrates ¹. Each chain decomposes a single exploitable vulnerability into three tickets that individually read as routine engineering work. A deterministic exploit oracle confirms whether the composed output is actually exploitable — no subjective labelling.

The three-stage structure is the key insight:

flowchart LR
    T1["Ticket 1\n(e.g. schema migration)"] --> T2["Ticket 2\n(e.g. admin endpoint)"] --> T3["Ticket 3\n(e.g. logging change)"]
    T3 --> Oracle["Deterministic\nExploit Oracle"]
    Oracle -->|VULNERABLE| V["Attack Success"]
    Oracle -->|SECURE| S["Chain Blocked"]
    style V fill:#dc3545,color:#fff
    style S fill:#28a745,color:#fff

No individual ticket expresses malicious intent. The vulnerability emerges only from their composition — precisely the condition that alignment training does not cover.

The Numbers That Matter

Agent Attack Success Rates (Staged Protocol)

Nine production agents were tested under a fresh-session protocol where each ticket arrives in a clean context ¹:

Agent	Strict ASR	Notes
GPT 5.3	85.9%	Highest vulnerability rate
Gemini 3 Flash	70.4%
Kimi K2.6	70.1%
GLM-5	69.0%
Minimax M2.7	65.3%
Gemini 3.1 Pro	65.1%
GPT 5.4	61.8%
Sonnet 4.6	58.8%
Opus 4.6	53.3%	Lowest, but still majority

The Compositional Gap

When the same chains were collapsed into single direct prompts — making the malicious intent visible — vulnerable output rates dropped dramatically ¹:

Sonnet 4.6: 0% (refuses outright)
Opus 4.6: 1.9%
Codex 5.4: 9.3% (hardens rather than refusing)
Codex 5.3: 20.4%

The gap between staged (53–86%) and direct (0–20%) is the compositional vulnerability window. Ticket staging silences both refusal and hardening defences simultaneously ¹.

Context Fragmentation Is Not the Full Story

A resumed-context experiment where all three stages ran in a single session with retained memory showed ASR drops of only 2–22.6 percentage points ¹:

Agent	Fresh-Session	Resumed-Session	Drop
GPT 5.3	85.9%	63.3%	−22.6 pp
Sonnet 4.6	58.8%	43.4%	−15.4 pp
GPT 5.4	61.8%	59.8%	−2.0 pp

Context fragmentation explains roughly 50% of the gap at median ¹. The remainder is a genuine alignment failure: even with full conversation history, agents struggle to recognise that composed tickets form an exploit chain.

Why Code Reviewers Do Not Save You

MOSAIC-Bench tested code reviewer agents on the confirmed-vulnerable cumulative diffs ¹. Under neutral framing (standard code review instructions), mean evasion was 25.8% in diff-only mode and 16.4% with full file context ¹. That means roughly one in four exploitable diffs slips past a standard AI code review.

Commodity static analysis fares no better. Semgrep with standard rulesets detected only 43.7% of the 199 chains, with compositional-logic CWEs (the hardest category) evading at 64% ¹.

Pentester-Framed Review: Framing Dominates Scale

The paper’s most actionable finding: reframing the reviewer prompt from “code reviewer” to “adversarial pentester” reduces evasion dramatically ¹:

Reviewer	Standard Evasion	Pentester Evasion
Sonnet 4.6	21.6%	3.0%
GPT 5.3	24.6%	5.5%
Gemma-4-E4B-it	—	11.6%

The open-weight Gemma model achieved 88.4% detection at a cost of roughly $0.001 per review, with a false-positive rate of just 4.6% on 608 real-world merged GitHub PRs ¹. As the authors conclude: “framing dominates scale in expectation” — prompt engineering the reviewer matters more than throwing a larger model at the problem.

Mapping to Codex CLI’s Defence Stack

Codex CLI’s hook pipeline ² provides the deterministic enforcement layer that MOSAIC-Bench’s findings demand. The defence strategy maps to three levels:

flowchart TB
    subgraph PreToolUse["PreToolUse — Prevention"]
        A["Block known-dangerous\npattern compositions"]
        B["Enforce writable_roots\nsandbox boundaries"]
    end
    subgraph PostToolUse["PostToolUse — Detection"]
        C["Pentester-framed\nreview hook"]
        D["Semgrep + custom\nrules per CWE"]
    end
    subgraph Governance["Governance — Context"]
        E["AGENTS.md\nsecurity constraints"]
        F["approval_policy\nescalation"]
    end
    PreToolUse --> PostToolUse --> Governance
    style PostToolUse fill:#f0ad4e,color:#000

1. PostToolUse Pentester-Framed Review Hook

The headline defence. After every file write, a PostToolUse hook sends the diff to a pentester-framed reviewer ². The key is the framing — not “review this code” but “you are a penetration tester; identify any vulnerability this diff introduces or enables when combined with existing code.”

# config.toml — pentester-framed PostToolUse hook
[[hooks]]
event = "PostToolUse"
tool = "write_file"

[hooks.command]
program = "bash"
args = ["-c", """
  diff=$(git diff HEAD -- "$CODEX_FILE_PATH")
  echo "$diff" | codex exec --model gemma-4-e4b-it \
    --prompt "You are a penetration tester. Analyse this diff for vulnerabilities it introduces or enables in combination with the existing codebase. Report VULNERABLE or SECURE with evidence." \
    --max-tokens 500
"""]
timeout_ms = 30000
on_fail = "require_approval"

Using Gemma-4-E4B-it keeps cost at ~$0.001/review whilst catching 88.4% of MOSAIC chains ¹. For higher-stakes workflows, substituting Sonnet 4.6 pushes evasion down to 3.0% at ~$0.18/review ¹.

2. Cumulative Diff Awareness

MOSAIC-Bench shows that diff-only review has 25.8% evasion versus 16.4% with full context ¹. The PostToolUse hook should accumulate diffs across the session rather than reviewing each file write in isolation:

# Accumulate session diffs for compositional awareness
git diff $(git merge-base HEAD main)..HEAD

This gives the reviewer the full cumulative change set, closing the diff-fragmentation gap that mirrors the ticket-decomposition attack.

3. PreToolUse Guards for High-Risk CWE Patterns

For the CWE classes with highest MOSAIC-Bench ASR — command injection (CWE-78), SQL injection (CWE-89), and path traversal (CWE-22) — PreToolUse hooks can block before the file is written ²:

[[hooks]]
event = "PreToolUse"
tool = "write_file"

[hooks.command]
program = "bash"
args = ["-c", """
  # Block raw SQL string concatenation, exec/eval with user input
  if grep -qE '(exec\\(|eval\\(|subprocess\\.call|os\\.system|raw\\s+SQL)' "$CODEX_PROPOSED_CONTENT"; then
    echo "BLOCKED: High-risk pattern detected — requires manual approval"
    exit 1
  fi
"""]
on_fail = "reject"

4. AGENTS.md Compositional Security Constraints

Encode compositional awareness directly into the agent’s instruction set ³:

# AGENTS.md — Security section

## Compositional Security Rules
- When implementing database queries, NEVER use string concatenation for parameters
- When adding API endpoints, ALWAYS validate and sanitise all inputs at the handler level
- When modifying authentication or authorisation logic, REQUIRE explicit approval
- Treat sequential tickets touching the same module as a potential attack chain — review the cumulative change before proceeding

5. Approval Policy Escalation

For security-sensitive file paths, escalate from auto-approve to manual approval ⁴:

# Per-tool approval for auth/security modules
[tools.write_file]
approval_mode = "always"
paths = ["**/auth/**", "**/middleware/**", "**/security/**"]

Defence-in-Depth Configuration

Combining these layers produces a configuration that addresses MOSAIC-Bench’s full attack surface:

flowchart LR
    Write["Agent writes file"] --> Pre["PreToolUse\nPattern guard"]
    Pre -->|Pass| Post["PostToolUse\nPentester review"]
    Post -->|SECURE| Accept["Accept change"]
    Post -->|VULNERABLE| Escalate["Require human\napproval"]
    Pre -->|Blocked| Reject["Reject + log"]
    style Reject fill:#dc3545,color:#fff
    style Escalate fill:#f0ad4e,color:#000
    style Accept fill:#28a745,color:#fff

The economics are practical. At Gemma-4’s ~$0.001/review, even a 200-file session costs $0.20 for pentester-framed security review. The 4.6% false-positive rate on real PRs ¹ means roughly one false flag per 22 legitimate file writes — noticeable but manageable, particularly when on_fail routes to approval rather than outright rejection.

What This Means for Your Workflow

MOSAIC-Bench’s core finding is structural: safety alignment is trained on isolated requests, but real engineering work is compositional. Three implications for Codex CLI users:

Standard code review is insufficient. Neutral-framed AI reviewers miss 25% of compositional vulnerabilities ¹. Pentester framing is not optional — it is the difference between 25% evasion and 3% evasion.
Static analysis is necessary but not sufficient. Semgrep catches 44% of MOSAIC chains ¹. Pair it with a pentester-framed LLM reviewer for the compositional-logic CWEs that rule-based tools miss.
Cumulative context matters. Review diffs against the full branch, not individual file changes. Configure PostToolUse hooks to pass git diff main...HEAD rather than per-file diffs.

The compositional vulnerability gap will not close through alignment training alone — the attack surface is the ticket decomposition that engineering organisations already practise. Deterministic hooks that enforce pentester-framed review at every file write are the deployable mitigation available today.

Citations

Steinberg, J. & Gal, O. (2026). “MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents.” arXiv:2605.03952. https://arxiv.org/abs/2605.03952 ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷ ↩⁸ ↩⁹ ↩¹⁰ ↩¹¹ ↩¹² ↩¹³ ↩¹⁴ ↩¹⁵ ↩¹⁶ ↩¹⁷ ↩¹⁸ ↩¹⁹
OpenAI. (2026). “Hooks — Codex CLI.” OpenAI Developers. https://developers.openai.com/codex/hooks ↩ ↩² ↩³
OpenAI. (2026). “AGENTS.md — Codex CLI.” OpenAI Developers. https://developers.openai.com/codex/cli/agents-md ↩
OpenAI. (2026). “Security — Codex CLI.” OpenAI Developers. https://developers.openai.com/codex/security ↩