SWE-Bench 5G and the Domain Knowledge Wall: What the First Telecom Coding Agent Benchmark Reveals About Specification-Driven Development with Codex CLI

Most SWE-bench variants to date have evaluated coding agents on Python-heavy, general-purpose repositories, where the models’ pre-training data gives them a structural advantage. SWE-Bench 5G (arXiv:2604.26278, April 2026) breaks that pattern ¹. It is the first benchmark to evaluate AI coding agents on 5G core network software (protocol-specified, distributed telecommunications code written mainly in Go and C), and the results expose a failure mode that matters far beyond telecoms: agents can diagnose bugs at rates exceeding 91%, yet resolve only 10–30% of them ¹.

The gap between diagnosis and resolution is the domain knowledge wall. Understanding what it means for Codex CLI workflows is the subject of this article.

What SWE-Bench 5G Measures

The Dataset

SWE-Bench 5G draws from three open-source 5G core network implementations ¹:

Project	Language	Instances	Repositories	Notes
free5GC	Go	128	20	Linux Foundation; modular, one repo per Network Function ²
Open5GS	C	58	1	Monolithic; widely deployed in research testbeds ³
Magma	Go/Python	24	12	Originally Meta; hybrid access gateway architecture ⁴

That gives 210 validated task instances across 33 repositories, covering 7 Network Functions (AMF, SMF, UPF, NRF, AUSF, UDM, PCF) and spanning protocol-level bugs from nil-pointer dereferences to 3GPP specification compliance failures ¹.

The Dual Test Strategy

Standard SWE-bench relies on fail-to-pass unit tests. Telecom code makes that impractical — many functions require live SCTP connections, MongoDB contexts, or inter-NF signalling that cannot be trivially mocked. SWE-Bench 5G addresses this with a dual strategy ¹:

Strategy A (Direct Call): Invokes buggy functions with crash-triggering inputs, using Go’s defer/recover mechanism to detect panics. Applied to 68 instances.
Strategy B (Diff-Based Intent): Verifies that the patched source contains expected fix patterns rather than testing runtime behaviour. Applied to 142 instances.

flowchart TD
    A[Task Instance] --> B{Can function be<br/>called in isolation?}
    B -->|Yes| C[Strategy A:<br/>Direct Call Test]
    B -->|No: needs SCTP,<br/>MongoDB, inter-NF| D[Strategy B:<br/>Diff-Based Intent]
    C --> E[Go defer/recover<br/>detects panics]
    D --> F[Source pattern<br/>matching]
    E --> G[Pass / Fail]
    F --> G
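
The two strategies can be illustrated with a minimal Go sketch. It is not the benchmark's actual harness: `UEContext`, `buggySupi`, `fixedSupi`, `panics`, and `hasFixPatterns` are hypothetical names invented for this example. The first helper runs a function on a crash-triggering input and uses defer/recover to report a panic (Strategy A); the second only inspects the patched source text for an expected fix pattern (Strategy B).

```go
package main

import (
	"fmt"
	"strings"
)

// UEContext is a hypothetical stand-in for a network-function context struct.
type UEContext struct {
	Supi *string // optional field; nil when absent
}

// buggySupi dereferences the optional field without a nil check.
func buggySupi(ue *UEContext) string { return *ue.Supi }

// fixedSupi guards both the context and the optional field.
func fixedSupi(ue *UEContext) string {
	if ue == nil || ue.Supi == nil {
		return ""
	}
	return *ue.Supi
}

// panics reports whether calling f panics, in the spirit of Strategy A.
func panics(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

// hasFixPatterns reports whether the patched source contains every expected
// pattern, in the spirit of Strategy B. It never executes the code.
func hasFixPatterns(patched string, patterns ...string) bool {
	for _, p := range patterns {
		if !strings.Contains(patched, p) {
			return false
		}
	}
	return true
}

func main() {
	crash := &UEContext{} // Supi is nil: a crash-triggering input
	fmt.Println("buggy panics:", panics(func() { _ = buggySupi(crash) }))
	fmt.Println("fixed panics:", panics(func() { _ = fixedSupi(crash) }))

	patched := "if ue == nil || ue.Supi == nil {\n\treturn \"\"\n}"
	fmt.Println("intent matched:", hasFixPatterns(patched, "ue.Supi == nil"))
}
```

The trade-off is visible even at this scale: the direct call proves runtime behaviour but needs a function that can run in isolation, while the pattern check works on any source file but only shows that the expected text is present.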

3GPP Specification Injection

For task instances whose original issues reference 3GPP Technical Specification clauses, the researchers constructed concise specification context documents averaging 350 tokens. These documents summarise protocol semantics — field optionality constraints, NF behaviour expectations — without revealing the fix ¹.

The Results: Diagnosis Without Resolution

Four models were evaluated in multi-turn mode with a maximum of five iterations ¹:

Model	Resolve Rate	Patch Application Rate	Bug Diagnosis Rate
Claude Sonnet 4	30.0%	79.5%	>91%
GPT-4.1	20.0%	67.1%	>91%
Qwen3.5-Flash	10.0%	57.1%	>91%
Kimi-128k	10.0%	51.4%	>91%

The headline number is the chasm between diagnosis (>91% across the board) and resolution (10–30%). On general-purpose SWE-bench Verified, Claude Sonnet 4 resolves substantially more — the 30% ceiling on 5G tasks represents a significant regression attributable to domain complexity ¹.

Why Agents Fail

The dominant failure mode is incomplete fixes: agents address part of a bug but miss additional dereference sites, edge cases, or specification-mandated validation paths. Between 72 and 80 instances per model fell into this category ¹.

Patch format errors compound the problem for weaker models. Qwen3.5-Flash produced 75 instances and Kimi-128k produced 84 instances where correct reasoning produced non-matching SEARCH blocks — the agent understood the fix but could not express it as a valid patch against Go or C source with deeply nested NF signatures ¹.
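
A short Go sketch shows why a correct idea can still fail at the patch stage. It assumes a SEARCH/REPLACE edit format in which the SEARCH text must match the source exactly and only once; the paper's precise format may differ, and `applySearchReplace` is a hypothetical helper written for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// applySearchReplace swaps search for replace only when search matches the
// source exactly once. Hypothetical helper, not the benchmark's harness.
func applySearchReplace(src, search, replace string) (string, error) {
	if search == "" {
		return "", errors.New("empty SEARCH block")
	}
	switch strings.Count(src, search) {
	case 0:
		return "", errors.New("SEARCH block does not match the source")
	case 1:
		return strings.Replace(src, search, replace, 1), nil
	default:
		return "", errors.New("SEARCH block is ambiguous")
	}
}

func main() {
	src := "func supi(ue *UEContext) string {\n\treturn *ue.Supi\n}\n"
	fix := "\tif ue.Supi == nil {\n\t\treturn \"\"\n\t}\n\treturn *ue.Supi\n"

	// Right idea, wrong whitespace: four spaces where the file has a tab.
	_, err := applySearchReplace(src, "    return *ue.Supi\n", fix)
	fmt.Println("space-indented SEARCH:", err)

	// Same fix with the SEARCH text copied byte for byte.
	out, err := applySearchReplace(src, "\treturn *ue.Supi\n", fix)
	fmt.Println("tab-indented SEARCH error:", err)
	fmt.Print(out)
}
```

Both attempts carry the same fix; only the one whose SEARCH text reproduces the file's indentation applies.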

pie title Outcome Distribution (Claude Sonnet 4, 210 instances)
    "Resolved" : 63
    "Incomplete Fix" : 72
    "Patch Format Error" : 28
    "Wrong Diagnosis" : 19
    "Other" : 28

Specification Context: Conditional Gains

When 3GPP specification excerpts were injected for Claude Sonnet 4 across 50 task instances ¹:

Specification-dependent bugs (validation errors, protocol logic) saw +16.7% to +25.0% improvement in resolve rate.
Generic bugs (nil checks, crash guards) saw 0% improvement.
Token overhead: approximately 12%.

The implication is clear: domain knowledge helps, but only when the bug is specification-driven. Generic defensive coding requires iterative editing capability, not domain context.

What This Means for Codex CLI

SWE-Bench 5G is not just a telecoms curiosity. It exposes four structural challenges that affect any team using Codex CLI on domain-specific codebases: financial protocols, medical device firmware, automotive control systems, or any software governed by external specifications.

1. AGENTS.md as a Specification Bridge

The benchmark’s 350-token specification context documents are structurally identical to what belongs in a well-crafted AGENTS.md file ⁵. Codex CLI discovers AGENTS.md files by walking from the project root down to the current working directory, loading instructions at each level ⁵.
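
The root-to-working-directory walk can be pictured with a small Go sketch. This is a simplified model of the behaviour described above, not Codex CLI's implementation: it ignores global instruction files and any override mechanisms, and `agentsCandidates` is a hypothetical function that only computes the candidate paths in load order.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// agentsCandidates lists the AGENTS.md paths from the project root down to
// the working directory, in the order they would be loaded. Hypothetical
// helper; it does not check whether the files exist.
func agentsCandidates(root, cwd string) ([]string, error) {
	rel, err := filepath.Rel(root, cwd)
	if err != nil {
		return nil, err
	}
	sep := string(filepath.Separator)
	if rel == ".." || strings.HasPrefix(rel, ".."+sep) {
		return nil, fmt.Errorf("%s is outside project root %s", cwd, root)
	}
	dirs := []string{root}
	if rel != "." {
		dir := root
		for _, part := range strings.Split(rel, sep) {
			dir = filepath.Join(dir, part)
			dirs = append(dirs, dir)
		}
	}
	paths := make([]string, 0, len(dirs))
	for _, d := range dirs {
		paths = append(paths, filepath.Join(d, "AGENTS.md"))
	}
	return paths, nil
}

func main() {
	paths, err := agentsCandidates("/repo", "/repo/nf/amf")
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

The practical consequence for a multi-NF repository is that project-wide rules belong in the root file, while NF-specific specification notes can sit in a file next to the code they govern.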

For specification-driven codebases, this means your AGENTS.md should include:

## Protocol Compliance

This project implements 3GPP TS 29.503 (UDM) and TS 29.509 (AUSF).

### Key Constraints
- All NAS message IEs marked OPTIONAL in the spec MUST be nil-checked
  before access. The 5GC treats missing optional IEs as valid absent values,
  not errors.
- SMF session context updates MUST validate against TS 23.502 procedure
  flows before applying state changes.
- Inter-NF HTTP/2 calls follow TS 29.500 error handling: 4xx responses
  are terminal, 5xx trigger retry with exponential backoff.

Augment Code’s guide reports that developer-written AGENTS.md files improve agent task success rates by approximately 4% and reduce agent-generated bugs by 35–55% ⁶. SWE-Bench 5G’s specification injection results suggest that for protocol-governed code the gains can be substantially larger: up to 25% on specification-dependent bugs ¹.
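
To make the third constraint in the sample AGENTS.md concrete, here is a minimal Go sketch of the policy as that sample words it: 4xx responses are terminal, 5xx responses trigger a retry with exponential backoff. The names `classify` and `backoffMillis`, the 100 ms base delay, and the 2 s cap are assumptions made for illustration, not values taken from TS 29.500 or from any of the benchmarked projects.

```go
package main

import "fmt"

// action is what the caller should do with an inter-NF HTTP response.
type action int

const (
	succeed action = iota
	failTerminal
	retry
)

// classify maps a status code to an action, following the sample constraint.
// Anything that is neither 2xx nor 5xx is treated as terminal in this sketch.
func classify(status int) action {
	switch {
	case status >= 200 && status < 300:
		return succeed
	case status >= 500 && status < 600:
		return retry
	default:
		return failTerminal
	}
}

// backoffMillis doubles the base delay once per previous attempt and caps
// the result at maxMs.
func backoffMillis(attempt, baseMs, maxMs int) int {
	d := baseMs
	for i := 0; i < attempt && d < maxMs; i++ {
		d *= 2
	}
	if d > maxMs {
		return maxMs
	}
	return d
}

func main() {
	names := map[action]string{succeed: "succeed", failTerminal: "fail", retry: "retry"}
	for _, status := range []int{200, 404, 503} {
		fmt.Println(status, "->", names[classify(status)])
	}
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println("attempt", attempt, "waits", backoffMillis(attempt, 100, 2000), "ms")
	}
}
```

A rule this small is exactly the kind of thing an agent will not infer from the code alone, which is why stating it in AGENTS.md is cheap insurance.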

2. The Incomplete Fix Problem and PostToolUse Hooks

The benchmark’s dominant failure mode — incomplete fixes that patch one dereference site but miss three others — maps directly to a known Codex CLI mitigation: PostToolUse hooks that run static analysis after every file write ⁷.

For Go codebases (like free5GC), a PostToolUse hook running go vet and staticcheck can flag some of the nil-pointer omissions that SWE-Bench 5G agents missed, although neither tool performs exhaustive nil analysis:

# codex.toml — PostToolUse hook for Go projects
[[hooks.post_tool_use]]
command = "go vet ./..."
on_failure = "report"

[[hooks.post_tool_use]]
command = "staticcheck ./..."
on_failure = "report"

For C codebases (like Open5GS), equivalent hooks with cppcheck or clang-tidy can surface the same class of omissions:

[[hooks.post_tool_use]]
command = "cppcheck --enable=all --error-exitcode=1 src/"
on_failure = "report"
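
Alongside off-the-shelf analysers, the secondary-site problem can be attacked directly by enumerating every access to the field that was just patched. The Go sketch below does this with the standard go/parser and go/ast packages. It is an illustrative helper, not part of Codex CLI or the benchmark: `fieldAccessLines` and the sample source are hypothetical, and it matches on field name only, without type information.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// fieldAccessLines returns the line numbers of every selector expression
// whose selected name equals field (for example, every "x.Supi").
func fieldAccessLines(src, field string) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "snippet.go", src, 0)
	if err != nil {
		return nil, err
	}
	var lines []int
	ast.Inspect(file, func(n ast.Node) bool {
		if sel, ok := n.(*ast.SelectorExpr); ok && sel.Sel.Name == field {
			lines = append(lines, fset.Position(sel.Pos()).Line)
		}
		return true
	})
	return lines, nil
}

// sample has three accesses to Supi; a fix that guards only one of the
// dereferences would be an incomplete fix.
const sample = `package nf

type UE struct{ Supi *string }

func a(ue *UE) string { return *ue.Supi }

func b(ue *UE) int { return len(*ue.Supi) }

func c(ue *UE) bool { return ue.Supi != nil }
`

func main() {
	lines, err := fieldAccessLines(sample, "Supi")
	if err != nil {
		panic(err)
	}
	fmt.Println("accesses to Supi on lines:", lines)
}
```

Feeding a list like this back to the agent after each write gives it an explicit checklist of sites to review, rather than relying on it to rediscover them.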

3. Patch Format Errors and Language-Specific Configuration

SWE-Bench 5G’s patch format error rates (28–84 instances per model) stem from Go and C’s stricter type systems and deeply nested function signatures ¹. Codex CLI’s model selection directly affects this: GPT-5.5, the current recommended model for Codex, produces substantially fewer patch format errors than smaller models on typed languages ⁸.

The configuration that matters:

# ~/.codex/config.toml: model selection for typed-language codebases
model = "gpt-5.5"                 # Better patch formatting on Go/C
model_reasoning_effort = "high"   # Worth the token cost for specification-driven fixes

4. Multi-Turn Iteration Is Non-Negotiable

SWE-Bench 5G found that Qwen3.5-Flash achieved a 0% resolve rate in single-turn mode versus 10% in multi-turn (five iterations) ¹. This confirms what Codex CLI’s Goal Mode is designed for: sustained, multi-step reasoning where the agent can observe test failures, revise patches, and iterate ⁹.

For specification-heavy codebases, Goal Mode’s persistence across turns is essential. The agent needs to:

Read the failing test output
Cross-reference against specification constraints (from AGENTS.md)
Patch the primary site
Discover secondary sites via static analysis (PostToolUse hooks)
Re-run tests and iterate

sequenceDiagram
    participant Dev as Developer
    participant Goal as Goal Mode
    participant Agent as Codex Agent
    participant SA as Static Analysis<br/>(PostToolUse)
    participant Tests as Test Suite

    Dev->>Goal: Set goal: fix protocol bug
    Goal->>Agent: Read AGENTS.md spec context
    Agent->>Agent: Diagnose bug location
    Agent->>Agent: Apply primary patch
    Agent->>SA: Trigger PostToolUse hook
    SA-->>Agent: Report: 2 additional nil-check sites
    Agent->>Agent: Patch secondary sites
    Agent->>Tests: Run fail-to-pass tests
    Tests-->>Agent: 1 test still failing
    Agent->>Agent: Revise patch (iteration 2)
    Agent->>Tests: Re-run tests
    Tests-->>Goal: All tests pass
    Goal->>Dev: Goal complete
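
The control flow in the diagram reduces to a bounded propose-and-verify loop. The Go sketch below models it under stated assumptions: `iterate`, `Propose`, and `Verify` are hypothetical names, the five-iteration cap mirrors the benchmark's multi-turn setting, and the toy proposer simply guards one more dereference site per round. It says nothing about how Goal Mode is implemented internally.

```go
package main

import (
	"fmt"
	"strings"
)

// Propose produces a patch for the given iteration, using the feedback from
// the previous verification (empty on the first iteration).
type Propose func(iteration int, feedback string) string

// Verify checks a patch and returns feedback for the next iteration.
type Verify func(patch string) (ok bool, feedback string)

// iterate runs propose and verify until verification passes or maxIter
// attempts have been used.
func iterate(maxIter int, propose Propose, verify Verify) (patch string, iterations int, ok bool) {
	feedback := ""
	for i := 1; i <= maxIter; i++ {
		patch = propose(i, feedback)
		ok, feedback = verify(patch)
		if ok {
			return patch, i, true
		}
	}
	return patch, maxIter, false
}

func main() {
	const sites = 3 // dereference sites that all need a guard

	// Toy proposer: guards one more site on every iteration.
	propose := func(i int, feedback string) string {
		return strings.Repeat("guard;", i)
	}
	// Toy verifier: passes only when every site is guarded.
	verify := func(patch string) (bool, string) {
		missing := sites - strings.Count(patch, "guard;")
		if missing > 0 {
			return false, fmt.Sprintf("%d sites still unguarded", missing)
		}
		return true, ""
	}

	_, n, ok := iterate(5, propose, verify)
	fmt.Println("resolved:", ok, "after iterations:", n)

	_, n, ok = iterate(1, propose, verify)
	fmt.Println("single turn resolved:", ok, "after iterations:", n)
}
```

With a single turn the toy agent stops after guarding one site, which is the incomplete-fix failure in miniature.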

The Broader Pattern: Domain-Specific Benchmarks Are Coming

SWE-Bench 5G joins a growing family of domain-specific coding agent benchmarks that move beyond Python web applications ¹:

SWE-Bench 5G: Telecoms (Go, C)
SWE-PolyBench: Multi-language (Java, JavaScript, TypeScript, Python) ¹⁰
SecureVibeBench: Security-critical C/C++ ¹¹
ProjDevBench: Greenfield project development ¹²

The common finding across all of them is that general-purpose benchmark performance does not predict domain-specific performance. Teams evaluating Codex CLI for regulated or specification-governed codebases should treat these specialised benchmarks as the relevant baseline, not SWE-bench Verified.

Practical Takeaways

Surface domain specifications in AGENTS.md. SWE-Bench 5G demonstrates up to 25% improvement on specification-dependent bugs when protocol context is injected. Keep specification summaries under project_doc_max_bytes (32 KiB default) ⁵.
Wire PostToolUse hooks for your language’s static analysis. The incomplete fix problem is systematic: agents consistently miss secondary patch sites. go vet, staticcheck, cppcheck, and clang-tidy can catch some of what the model misses.
Select the strongest available model for typed languages. Patch format errors scale inversely with model capability. GPT-5.5 on Codex produces substantially fewer malformed patches on Go and C than smaller alternatives ⁸.
Use Goal Mode for specification-driven fixes. In the benchmark, Qwen3.5-Flash resolved 0% of instances in single-turn mode against 10% with five iterations. Multi-turn iteration with test feedback is the minimum viable workflow.
Benchmark your own domain. If your codebase is governed by external specifications (financial regulations, medical device standards, automotive safety), SWE-Bench 5G’s methodology — fail-to-pass tests in Docker containers with specification context injection — is directly replicable.

Citations

1. Chen, J., Tang, J., Yang, X., & Lv, Z. (2026). SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks. arXiv:2604.26278. https://arxiv.org/abs/2604.26278
2. free5GC Project. Open source 5G core network based on 3GPP R15. GitHub. https://github.com/free5gc/free5gc
3. Open5GS Project. C-language Open Source implementation for 5G Core and EPC (Release-19). GitHub. https://github.com/open5gs/open5gs
4. Magma Project. Platform for building access networks and modular network services. https://magmacore.org/
5. OpenAI. Custom instructions with AGENTS.md - Codex. OpenAI Developers. https://developers.openai.com/codex/guides/agents-md
6. Augment Code. How to Build Your AGENTS.md (2026). https://www.augmentcode.com/guides/how-to-build-agents-md
7. OpenAI. Security - Codex. OpenAI Developers. https://developers.openai.com/codex/security
8. OpenAI. Changelog - Codex. OpenAI Developers. https://developers.openai.com/codex/changelog
9. OpenAI. Run long horizon tasks with Codex. OpenAI Developers. https://developers.openai.com/blog/run-long-horizon-tasks-with-codex
10. SWE-PolyBench. Multi-Language Benchmark for Coding Agents. arXiv. Referenced from Codex Knowledge Base coverage.
11. SecureVibeBench. Secure Vibe Coding Benchmark (ACL 2026). arXiv:2509.22097v5.
12. ProjDevBench. Benchmarking AI Coding Agents on End-to-End Project Development. arXiv:2602.01655.