CoDA-Bench: What the Data-Intensive Task Benchmark Means for Codex CLI File Discovery and Sandbox Strategy
Most coding agent benchmarks test whether an agent can fix a bug or pass a test suite. CoDA-Bench asks a harder question: can an agent find the right data files in a noisy filesystem and then write correct code against them? The answer, even for the best agents in mid-2026, is a qualified “sometimes” — and the failure modes map directly to configuration choices Codex CLI users make every day.
The Benchmark
CoDA-Bench (Zhang et al., June 2026, accepted at ICML 2026) is the first benchmark to jointly evaluate code intelligence and data intelligence in a single environment 1. It constructs a Linux sandbox from the Kaggle ecosystem: 1,009 tasks spanning 31 data communities, each environment averaging 980.8 files in mixed formats (CSV, JSON, Parquet, images, PDFs) 1. The signal-to-noise ratio is brutal: on average, only 1.3 files per task are relevant, and the paper reports a signal ratio of 0.0105 1.
A harder subset, CoDA-Hard, raises the bar further: 119 tasks across 15 communities requiring at least two target files and 30+ effective lines of reference code, with a signal ratio of 0.0142 1.
How the Agents Performed
Seven agent configurations were tested; five are shown below. The results are sobering:
| Agent | Model | Discovery Accuracy | Execution Accuracy | Avg Tokens | Cost/Task |
|---|---|---|---|---|---|
| Mini-SWE-Agent | GPT-5.5 | 83.0% | 61.1% | — | — |
| OpenHands | GPT-5.5 | 82.1% | 59.7% | — | — |
| Codex CLI | GPT-5.5 | 74.9% | 60.3% | 380,558 | $1.39 |
| Claude Code | Sonnet 4.6 | 77.9% | 53.8% | 81,714 | $0.11 |
| Claude Code | Opus 4.7 | 77.3% | 51.9% | — | — |
On CoDA-Hard, Codex CLI scored 61.4% discovery accuracy and 47.9% execution accuracy 1.
Two findings stand out. First, Codex CLI consumed about 4.7× as many tokens as Claude Code with Sonnet 4.6 (380,558 vs 81,714) while finishing 6.5 points ahead on execution accuracy (60.3% vs 53.8%), which suggests its file discovery strategy is substantially less token-efficient 1. Second, wrapper frameworks like Mini-SWE-Agent improved GPT-5.5 execution accuracy by only 0.8 percentage points over the native Codex CLI (61.1% vs 60.3%), indicating the bottleneck is not orchestration but discovery strategy 1.
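The ratios behind these two findings can be re-derived from the table. The short sketch below uses only the figures reported above; the variable names are illustrative, and the blended cost-per-token comparison is a derived quantity, not one the paper states.

```python
# Figures copied from the results table above.
codex_tokens, claude_tokens = 380_558, 81_714
codex_cost, claude_cost = 1.39, 0.11  # USD per task

token_ratio = codex_tokens / claude_tokens   # roughly 4.66
cost_ratio = codex_cost / claude_cost        # roughly 12.6
wrapper_gain_pp = 61.1 - 60.3                # Mini-SWE-Agent vs Codex CLI, both on GPT-5.5

# Derived, not reported: how much more each token cost on average.
per_token_ratio = cost_ratio / token_ratio

print(f"token ratio:      {token_ratio:.2f}x")
print(f"cost ratio:       {cost_ratio:.1f}x")
print(f"wrapper gain:     {wrapper_gain_pp:.1f} pp")
print(f"cost per token:   {per_token_ratio:.1f}x")
```

The cost gap is wider than the token gap, so the difference in spend is not explained by token volume alone; the effective price per token also differs between the two configurations.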
The Oracle Experiment: Discovery Is the Bottleneck
The paper’s most revealing experiment provides agents with oracle file paths — the exact files needed for each task. The results are dramatic:
- Claude Code (Sonnet 4.6): 45.4% → 73.1% execution accuracy (+27.7pp) 1
- OpenHands (GPT-5.5): 44.5% → 68.9% (+24.4pp) 1
On average, agents jumped to 71.0% accuracy with oracle data 1. This lines up with the error attribution in the next section, where roughly a third of failures are pure discovery failures: the agent could have written the correct code if it had found the right files.
Why Discovery Fails
The error attribution analysis on 200 sampled failures reveals the split:
| Failure category (GPT-5.5) | Share of failures |
|---|---|
| Code Generation | 44.0% |
| Data Discovery | 33.0% |
| Task Understanding | 16.5% |
| Execution Errors | 6.5% |
For mid-tier models (Kimi-K2.6), data discovery errors dominate at 40.7%, with code generation at 34.7% 1.
Crucially, agents struggle not with navigating large file counts but with distinguishing relevant data from semantically similar distractors 1. The correlation with signal-to-noise ratio (ρ=0.466, p<0.01) is stronger than with raw file count (ρ=-0.271, p=0.148) 1. Performance also collapses above 3GB of data volume, reaching near-zero above 8GB 1.
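One way to see how much distractor noise a directory carries is to look for file names that are nearly identical. The following is a minimal sketch using the standard library's `difflib`; the function name `similar_name_pairs`, the 0.8 threshold, and the sample listing are all hypothetical, and name similarity is only a rough proxy for the semantic similarity the paper measures.

```python
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path


def similar_name_pairs(paths, threshold=0.8):
    """Return (stem_a, stem_b, score) for file names that look alike."""
    stems = sorted({Path(p).stem.lower() for p in paths})
    pairs = []
    for a, b in combinations(stems, 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    # Most confusable pairs first.
    return sorted(pairs, key=lambda t: -t[2])


if __name__ == "__main__":
    # Hypothetical listing, for illustration only.
    listing = [
        "customers_q1_2026.csv",
        "customers_q2_2026.csv",
        "customers_q2_2026_backup.csv",
        "weather_stations.json",
    ]
    for a, b, score in similar_name_pairs(listing):
        print(f"{score:.2f}  {a}  <->  {b}")
```

Pairs that score high are candidates for renaming, archiving, or an explicit note about which one is authoritative.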
What This Means for Codex CLI Configuration
CoDA-Bench exposes five configuration gaps that Codex CLI users can address today.
1. Sandbox Scope: Widen the Readable Roots
Codex CLI’s default sandbox restricts file access to the current working directory 2. Data-intensive projects typically store datasets outside the code tree — in /data, ~/datasets, or mounted volumes. If the agent cannot read these paths, it cannot discover the files.
# ~/.codex/config.toml — expose data directories as readable roots
[permissions.data-project]
sandbox_mode = "read-only"
[permissions.data-project.filesystem]
readable_roots = [
"~/projects/my-analysis",
"~/datasets",
"/mnt/shared-data"
]
Use --add-dir at launch to expose additional roots without changing the config file:
codex --add-dir ~/datasets --add-dir /mnt/shared-data \
"Analyse the customer churn data and build a retention model"
2. AGENTS.md: Encode a Discovery-First Protocol
CoDA-Bench shows that agents which discover files first and code second perform better. Encode this as a constraint in your project’s AGENTS.md:
## Data Discovery Protocol
When working with data files:
1. **Inventory first.** Before writing any analysis code, list all files
in the data directory tree using `find` or `ls -R`. Record file names,
sizes, and formats.
2. **Inspect before assuming.** Read the first 20 lines of each candidate
file. Check column names, delimiters, and encodings. Do not assume
CSV structure from the file extension alone.
3. **Confirm relevance.** State which files you will use and why before
proceeding to code generation.
4. **Prefer Parquet over CSV** when both formats are available for the
same dataset — Parquet preserves types and is faster to parse.
This mirrors what the paper calls the “discovery-then-generate” pattern that oracle experiments show can recover up to 27.7pp of lost accuracy 1.
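The first two steps of the protocol (inventory, then inspect) can also be scripted so the agent starts from a prepared listing. Below is a rough sketch, assuming a local directory tree; `inventory`, `TEXT_SUFFIXES`, and the 20-line default are illustrative choices, not part of Codex CLI or the benchmark.

```python
import os
from itertools import islice
from pathlib import Path

# Assumption: these are the text formats worth peeking into.
TEXT_SUFFIXES = {".csv", ".tsv", ".json", ".txt"}


def inventory(root, head_lines=20):
    """Walk root and record relative path, size, suffix, and the first lines of text files."""
    root = Path(root)
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            record = {
                "path": str(path.relative_to(root)),
                "bytes": path.stat().st_size,
                "suffix": path.suffix.lower(),
                "head": [],
            }
            if record["suffix"] in TEXT_SUFFIXES:
                # errors="replace" keeps the scan going on unexpected encodings.
                with open(path, "r", encoding="utf-8", errors="replace") as fh:
                    record["head"] = [line.rstrip("\n") for line in islice(fh, head_lines)]
            records.append(record)
    return sorted(records, key=lambda r: r["path"])
```

Binary formats such as Parquet are listed with their size but not opened here; inspecting them would need a third-party reader.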
3. Subagent Isolation: Separate Discovery from Generation
CoDA-Bench’s token consumption data suggests that much of Codex CLI’s 380,558-token average comes from interleaving discovery and generation in a single context 1. A subagent pattern isolates the two concerns:
flowchart LR
A[Lead Agent] -->|"Discover relevant files"| B[Discovery Subagent]
B -->|"File manifest + schema summaries"| A
A -->|"Generate analysis code using manifest"| C[Code Generation Subagent]
C -->|"Executable script"| A
A -->|"Run and validate"| D[Execution + Verification]
The discovery subagent operates with broad read access and returns a compact manifest. The code generation subagent receives only the manifest and relevant file excerpts, keeping its context window clean. This mirrors the architectural insight behind Mini-SWE-Agent’s marginal improvement — it adds a structured discovery phase before generation 1.
In Codex CLI, trigger this with explicit subagent prompts:
codex "First: scan ~/datasets/kaggle-churn/ and produce a manifest of all
data files with their schemas (columns, types, row counts).
Then: using only the files in the manifest, write a churn prediction pipeline."
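For a sense of what a compact manifest might contain, here is a small sketch that summarises delimited text files with the standard library only. The names `csv_schema` and `build_manifest` are hypothetical, the manifest layout is an assumption rather than a format Codex CLI defines, and column types are omitted because inferring them reliably would need a library such as pandas.

```python
import csv
import json
from pathlib import Path


def csv_schema(path, sample_bytes=64 * 1024):
    """Summarise one delimited text file: delimiter, column names, data row count."""
    with open(path, newline="", encoding="utf-8", errors="replace") as fh:
        sample = fh.read(sample_bytes)
        try:
            delimiter = csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
        except csv.Error:
            delimiter = ","  # fall back when sniffing is inconclusive
        fh.seek(0)
        reader = csv.reader(fh, delimiter=delimiter)
        header = next(reader, [])
        rows = sum(1 for _ in reader)
    return {
        "file": Path(path).name,
        "delimiter": delimiter,
        "columns": header,
        "rows": rows,
    }


def build_manifest(data_dir):
    """Return a JSON manifest covering every .csv file under data_dir."""
    files = sorted(Path(data_dir).rglob("*.csv"))
    return json.dumps([csv_schema(p) for p in files], indent=2)
```

Sniffing the delimiter instead of assuming a comma matters for files like a pipe-delimited transaction log, where a naive read would produce a single wide column.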
4. PostToolUse Hooks: Gate on Discovery Completeness
The 33% discovery failure rate suggests a PostToolUse hook that validates file references before execution:
#!/bin/bash
# .codex/hooks/post-tool-use-check-files.sh
# Reject generated Python that references data files which do not exist.
# Assumes CODEX_TOOL_NAME holds the tool name and CODEX_TOOL_OUTPUT the path of the written file.
if [[ "$CODEX_TOOL_NAME" == "write" && "$CODEX_TOOL_OUTPUT" == *.py ]]; then
    # Read one path per line so names containing spaces survive (grep -P needs GNU grep).
    while IFS= read -r f; do
        [[ -z "$f" ]] && continue
        # Accept the path as given or relative to the script's own directory.
        if [[ ! -f "$f" && ! -f "$(dirname "$CODEX_TOOL_OUTPUT")/$f" ]]; then
            echo "REJECT: Generated code references non-existent file: $f"
            echo "Run file discovery before generating code."
            exit 1
        fi
    done < <(grep -oP "(?:read_csv|read_parquet|open)\(['\"]\K[^'\"]+" "$CODEX_TOOL_OUTPUT")
fi
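A regular expression misses calls that span lines and can match text inside comments or strings. As an alternative sketch, the same check can be done on the parsed syntax tree with Python's `ast` module. The helper names and the `READERS` set are hypothetical, and only string-literal paths are checked; paths built at runtime are out of reach for a static check like this.

```python
import ast
import sys
from pathlib import Path

# Assumption: these are the call names worth checking.
READERS = {"read_csv", "read_parquet", "open"}


def _opens_for_writing(call):
    """True if an open() call has a literal mode starting with w, a, or x."""
    mode = call.args[1] if len(call.args) > 1 else None
    for kw in call.keywords:
        if kw.arg == "mode":
            mode = kw.value
    return (isinstance(mode, ast.Constant) and isinstance(mode.value, str)
            and mode.value[:1] in {"w", "a", "x"})


def missing_data_files(script_path):
    """Return string-literal paths passed to READERS that do not exist on disk."""
    script = Path(script_path)
    tree = ast.parse(script.read_text(encoding="utf-8"))
    missing = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        first = node.args[0]
        if name not in READERS or not isinstance(first, ast.Constant):
            continue
        if not isinstance(first.value, str):
            continue
        if name == "open" and _opens_for_writing(node):
            continue  # output files are allowed not to exist yet
        literal = Path(first.value).expanduser()
        # Accept the path as given or relative to the script's directory.
        if not any(c.is_file() for c in (literal, script.parent / literal)):
            missing.append(first.value)
    return missing


if __name__ == "__main__" and len(sys.argv) > 1:
    problems = missing_data_files(sys.argv[1])
    for p in problems:
        print(f"REJECT: script references a file that does not exist: {p}")
    sys.exit(1 if problems else 0)
```

A hook script could call this with the path of the generated file and use the exit status to decide whether to reject the write.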
5. Named Profiles: Route by Task Type
CoDA-Bench demonstrates that data-intensive tasks have fundamentally different resource profiles from code-centric tasks. A named profile encodes this:
# ~/.codex/config.toml
# Key names follow the documented config schema; verify them against your installed Codex CLI version.
[profiles.data-analysis]
model = "gpt-5.5"
sandbox_mode = "read-only"
approval_policy = "untrusted"
model_reasoning_effort = "high"
[profiles.data-analysis.filesystem]
readable_roots = ["~/datasets", "/mnt/data"]
[profiles.code-fix]
model = "gpt-5.4-mini"
sandbox_mode = "workspace-write"
approval_policy = "on-request"
model_reasoning_effort = "medium"
Launch with the appropriate profile:
codex --profile data-analysis "Analyse the Q2 revenue data"
codex --profile code-fix "Fix the failing unit tests in src/auth"
The cost difference is material. CoDA-Bench shows Codex CLI spending $1.39 per data task versus Claude Code’s $0.11 1. Profile-aware routing — using gpt-5.4-mini for code fixes and reserving gpt-5.5 with high reasoning effort for data tasks — can reduce aggregate spend substantially.
The Signal-to-Noise Problem
The paper’s most actionable finding is that semantic noise, not volume, drives discovery failure 1. A directory with 980 files is manageable if the relevant ones have distinctive names. It becomes hostile when dozens of files share similar prefixes, column structures, or domain terminology.
This has a direct Codex CLI implication: curate your data directories. Move working datasets into a clearly named subdirectory (/data/active/ vs /data/archive/). Use README files in data directories that describe what each file contains. The agent reads these — and CoDA-Bench shows it needs the help.
<!-- /data/active/README.md -->
## Active Datasets
| File | Description | Updated |
|------|-------------|---------|
| customers_q2_2026.parquet | Q2 customer records with churn labels | 2026-06-15 |
| transactions_q2_2026.csv | Raw transaction log, pipe-delimited | 2026-06-14 |
| features_engineered.parquet | Pre-computed feature matrix | 2026-06-16 |
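A README like this tends to go stale, so it can help to generate the file and date columns and write only the descriptions by hand. The sketch below assumes a flat directory; `readme_table` and its `descriptions` mapping are hypothetical names, and the "Updated" column is taken from each file's modification time.

```python
from datetime import date
from pathlib import Path


def readme_table(data_dir, descriptions=None):
    """Render a Markdown table of the files directly inside data_dir."""
    descriptions = descriptions or {}
    lines = [
        "## Active Datasets",
        "",
        "| File | Description | Updated |",
        "|------|-------------|---------|",
    ]
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file() or path.name == "README.md":
            continue
        updated = date.fromtimestamp(path.stat().st_mtime).isoformat()
        # Descriptions cannot be inferred; leave a visible placeholder.
        text = descriptions.get(path.name, "TODO: describe")
        lines.append(f"| {path.name} | {text} | {updated} |")
    return "\n".join(lines) + "\n"
```

Files without an entry in `descriptions` show up with a `TODO` marker, which makes undocumented datasets easy to spot in review.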
Limitations and Open Questions
CoDA-Bench uses Kaggle datasets, which tend toward tabular data and standard formats. Real enterprise data environments include proprietary binary formats, databases behind authentication walls, and streaming data, none of which the benchmark captures. The best execution accuracy in the table, 61.1%, may therefore be optimistic for production data engineering work.
The token consumption gap between Codex CLI and Claude Code (4.7×) warrants further investigation. It may reflect differences in default tool selection strategy, compaction behaviour, or model-level file navigation patterns rather than anything inherent to the CLI architecture 1.
Finally, the benchmark does not test incremental discovery — the common pattern where an analyst knows roughly what data exists and needs to find a specific variant. Real-world data work is rarely zero-knowledge exploration.
Key Takeaways
- Discovery is a third of the problem. One in three data task failures comes from finding the wrong files, not writing the wrong code.
- Token efficiency matters. Codex CLI used about 4.7× as many tokens as Claude Code (Sonnet 4.6) without a proportionate gain in execution accuracy (60.3% vs 53.8%).
- Structure your data directories. README files, clear naming, and separation of active from archived data directly reduce the semantic noise that CoDA-Bench identifies as the primary discovery obstacle.
- Separate discovery from generation. Subagent patterns and AGENTS.md protocols that enforce a discover-then-code sequence target the 24–28pp accuracy gap the oracle experiments reveal.
- Profile your tasks. Data-intensive work and code-centric work have different model, permission, and cost profiles — configure them separately.
Citations
1. Zhang, Y., Fan, J., Fan, M., Zhang, S., & Du, X. (2026). “CoDA-Bench: Can Code Agents Handle Data-Intensive Tasks?” arXiv:2606.15300. Accepted at ICML 2026. https://arxiv.org/abs/2606.15300
2. OpenAI. (2026). “Sandbox — Codex CLI.” OpenAI Developer Documentation. https://developers.openai.com/codex/concepts/sandboxing