The 400K LOC Threshold: What 1,281 Agent Runs Reveal About Codex CLI Performance in Large Codebases

Sourcegraph’s CodeScaleBench study analysed 1,281 agent runs across more than 40 enterprise-scale open-source repositories and surfaced a finding that should reshape how senior engineers configure Codex CLI: the difference between complete failure and near-perfect completion is not model intelligence but efficient access to context ¹. Below the 400,000 lines-of-code mark, standard tools work well enough. Above it, agents relying on grep, file read, and glob start failing systematically ². This article maps the five failure patterns the study identified onto Codex CLI and provides configuration recipes for each.

The Study: CodeScaleBench in Brief

CodeScaleBench is a living benchmark suite comprising 370 software engineering tasks divided into two parts ³:

CodeScaleBench-SDLC — 150 tasks spanning the full software development lifecycle, using SWE-Bench-style patch verification plus a ground_truth.json produced by a curator agent for context-retrieval metrics.
CodeScaleBench-Org — 220 tasks requiring organisation-level codebase navigation across multiple repositories.

The benchmark targets codebases with millions of lines, complex dependency graphs, and cross-repository context — the environments where enterprise developers actually work ³. Sourcegraph’s analysis of 1,281 runs across these tasks identified five recurring failure patterns ².

Pattern 1: The 400K LOC Threshold

At roughly 400,000 lines of code, agents hit a structural ceiling. Following imports through 22,000 files becomes unmanageable; the search space overwhelms traditional approaches ². For Codex CLI, this manifests as the agent burning context tokens on exploratory file reads that return nothing useful.
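
To see which side of the threshold a repository falls on, a rough line count is usually enough. The sketch below is one way to get it, not part of the study: the extension list and the skipped directories are assumptions to adjust per repository, and the 400,000 figure is the approximate threshold quoted above.

```python
import os

THRESHOLD_LOC = 400_000  # approximate threshold reported in the study

# Assumed defaults: adjust both sets for the languages and layout of your repository.
SOURCE_EXTENSIONS = {".go", ".py", ".ts", ".js", ".java", ".rs", ".c", ".cpp", ".h"}
SKIP_DIRS = {".git", "node_modules", "vendor", "dist", "build"}


def count_loc(root, extensions=SOURCE_EXTENSIONS, skip_dirs=SKIP_DIRS):
    """Return (file_count, line_count) for source files under root."""
    files = 0
    lines = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            if os.path.splitext(name)[1] not in extensions:
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    lines += sum(1 for _ in handle)
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            files += 1
    return files, lines


def above_threshold(line_count, threshold=THRESHOLD_LOC):
    """True when the repository is at or beyond the reported threshold."""
    return line_count >= threshold
```

Calling `count_loc(".")` from the repository root and passing the line count to `above_threshold` gives a quick indication of whether the mitigations below are likely to matter. The count includes blank lines and comments, so treat it as an estimate.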

Codex CLI Defence: Scoped Context with AGENTS.md File Maps

The most effective mitigation is pre-loading architectural knowledge into AGENTS.md so the agent starts with codebase understanding rather than reconstructing it through trial-and-error ²:

# AGENTS.md — large codebase file map

## Architecture Overview
This is a modular monolith with 14 bounded contexts under src/domains/.
Each domain follows: handler.go -> service.go -> repository.go -> models.go

## Key Entry Points
- API routes: src/api/routes/ (one file per domain)
- Domain logic: src/domains/{name}/service.go
- Shared types: src/shared/types/
- Database migrations: db/migrations/

## Navigation Rules
- Never search test files for production logic
- The canonical handler for each domain is in src/domains/{name}/handler.go
- Legacy code in src/legacy/ is read-only; do not modify
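
A file map like this can be drafted mechanically and then edited by hand. The sketch below assumes the `src/domains/{name}/` layout from the example above and takes a plain list of repository-relative paths (for instance the output of `git ls-files`); the function name `draft_entry_points` is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def draft_entry_points(paths, domains_root="src/domains"):
    """Return a Markdown section listing the files found directly under each domain."""
    root_parts = PurePosixPath(domains_root).parts
    by_domain = defaultdict(list)
    for raw in paths:
        parts = PurePosixPath(raw).parts
        # Keep only files of the form <domains_root>/<domain>/<file>.
        if len(parts) == len(root_parts) + 2 and parts[: len(root_parts)] == root_parts:
            by_domain[parts[-2]].append(parts[-1])
    lines = ["## Key Entry Points"]
    for domain in sorted(by_domain):
        files = ", ".join(sorted(by_domain[domain]))
        lines.append(f"- {domain}: {domains_root}/{domain}/ ({files})")
    return "\n".join(lines)
```

The output is only a starting point: the navigation rules and the notes about legacy code still have to come from someone who knows the codebase.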

Pair this with --add-dir for cross-repository tasks and an explicit writable_roots list, so that the directories the agent may write to are stated up front:

# config.toml — large codebase profile
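[profiles.large-repo]
model = "o4-mini"
model_reasoning_effort = "high"

# Additional writable roots for the workspace-write sandbox (a top-level table, not a profile key)
[sandbox_workspace_write]
writable_roots = ["src/domains/payments", "src/domains/billing"]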

Pattern 2: Finding Code vs. Finding the Right Code

Keyword search in a large codebase returns hundreds of matches across test files, legacy code, and documentation. Agents anchor on the wrong result when distinguishing between multiple handler.go files or __init__.py modules scattered across packages ². This is the retrieval precision problem: volume does not equal accuracy.
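
The precision problem can be made concrete with a small calculation. This is an illustrative sketch rather than anything from the study: the exclusion patterns are assumptions that mirror the kind of navigation rules a team might keep in AGENTS.md, and the hit list is invented for the example.

```python
from fnmatch import fnmatchcase

# Assumed exclusion patterns; adapt them to your repository's conventions.
NOISE_PATTERNS = ["*_test.go", "*/testdata/*", "src/legacy/*", "docs/*", "*.md"]


def is_noise(path, patterns=NOISE_PATTERNS):
    """True when a search hit falls under tests, legacy code, or documentation."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def precision(hits, relevant):
    """Fraction of returned hits that are actually relevant."""
    if not hits:
        return 0.0
    return sum(1 for hit in hits if hit in relevant) / len(hits)


# Invented keyword-search result for a query about the payments handler.
hits = [
    "src/domains/payments/handler.go",
    "src/domains/payments/handler_test.go",
    "src/legacy/payments/handler.go",
    "docs/handlers.md",
]
relevant = {"src/domains/payments/handler.go"}

filtered = [hit for hit in hits if not is_noise(hit)]
print(precision(hits, relevant))      # 0.25
print(precision(filtered, relevant))  # 1.0
```

Path filtering of this kind only removes obvious noise. It does not tell two production handler.go files apart, which is where the code intelligence tooling below comes in.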

Codex CLI Defence: Deterministic Code Search via MCP

The Sourcegraph MCP server exposes 13 tools to any MCP-compatible agent, providing deterministic, exact-match results across every repository ⁴. It is backed by SCIP (the open Protobuf-based code intelligence protocol), providing go-to-definition and find-all-references navigation across repositories ⁵.

# config.toml — Sourcegraph MCP server
[mcp_servers.sourcegraph]
command = "npx"
args = ["-y", "@anthropic-ai/sourcegraph-mcp"]
env = { SOURCEGRAPH_ACCESS_TOKEN = "sgp_..." }

With this configured, the agent can issue precise structural queries — “find all callers of PaymentService.Process” — rather than grepping for string matches. The distinction matters: grep returns text; a code intelligence server returns semantic references ⁵.
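
The gap between text matches and semantic references can be shown in miniature. The sketch below is an analogy only: it uses Python's standard `ast` module on an invented snippet, not SCIP or the Sourcegraph API, and it matches calls by name without resolving types the way a real code intelligence index would.

```python
import ast
import re

# Invented sample source; the leading newline makes the class start on line 2.
SOURCE = '''
class PaymentService:
    def process(self, order):
        return order

def checkout(service, order):
    # process the order (the comment also mentions the name)
    return service.process(order)

label = "process"
'''


def text_matches(source, name):
    """Line numbers where the name appears as plain text, grep-style."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def call_sites(source, name):
    """Line numbers where a function or method with this name is actually called."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            called = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
            if called == name:
                hits.append(node.lineno)
    return sorted(hits)


print(text_matches(SOURCE, "process"))  # [3, 7, 8, 10]
print(call_sites(SOURCE, "process"))    # [8]
```

The text search reports the definition, a comment, the call, and a string literal. The syntax-aware search reports only the call, which is the answer an agent asking for callers actually needs.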

Where Sourcegraph is unavailable, a language server via the LSP MCP bridge provides a lighter alternative for single-repository work:

[mcp_servers.lsp-bridge]
command = "npx"
args = ["-y", "mcp-lsp-bridge", "--language", "go", "--workspace", "."]

Pattern 3: Half-Finished Refactoring as Hidden Bugs

Locally correct changes can break broader systems without surface-level indication. The study found that performance drops significantly in multi-repository tasks compared to single-repo work ². The “80% problem” describes this precisely: agents reliably complete visible work but miss the invisible 20% outside their context window — auth middleware, API DTOs, audit logging paths, integration tests in sibling repositories, and frontend permission guards ⁶.

Codex CLI Defence: Multi-Directory Coordination and Stop Hooks

Use --add-dir to ensure the agent can see all affected repositories:

codex --cd ~/code/payments-api \
  --add-dir ~/code/api-gateway \
  --add-dir ~/code/shared-types \
  --sandbox workspace-write \
  "Rename PaymentStatus enum values from snake_case to PascalCase across all services"

Then add a Stop hook that runs cross-repository validation before the agent declares completion:

# config.toml — cross-repo verification hook
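# grep -E with a plain | is used because backslash-pipe is not a valid escape in a TOML basic string
[[hooks]]
event = "on_agent_stop"
command = "bash"
args = ["-c", """
echo 'Checking for incomplete refactoring...'
grep -rnE 'payment_status|PAYMENT_STATUS' src/ ~/code/api-gateway ~/code/shared-types --include='*.go' &&
  echo 'INCOMPLETE: snake_case references remain' && exit 1
echo 'No snake_case references remain in the checked directories'
"""]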
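
When the check grows beyond a one-line grep, it can live in a small script that the hook calls instead. The following is a minimal sketch under stated assumptions: the pattern and the `.go` suffix come from the rename example above, while the function names and the idea of passing every affected directory as an argument are illustrative.

```python
import os
import re

# Pattern for the old snake_case names from the rename example above.
STALE = re.compile(r"payment_status|PAYMENT_STATUS")


def find_stale_references(roots, pattern=STALE, suffix=".go"):
    """Return (path, line_number, text) for every matching line under the given roots."""
    findings = []
    for root in roots:
        for dirpath, dirnames, filenames in os.walk(root):
            # Skip Git metadata; prune in place so os.walk does not descend into it.
            dirnames[:] = [d for d in dirnames if d != ".git"]
            for name in sorted(filenames):
                if not name.endswith(suffix):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            findings.append((path, number, line.strip()))
    return findings


def main(roots):
    """Print each stale reference and return a shell-style exit code."""
    results = find_stale_references(roots)
    for path, number, text in results:
        print(f"{path}:{number}: {text}")
    return 1 if results else 0
```

A script entry point would call `sys.exit(main(sys.argv[1:]))` with the same directories that were passed to --add-dir, so the check covers every repository the agent was allowed to touch.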

Pattern 4: Tool Thrashing

This is the most striking finding. One benchmark task saw a baseline agent make 96 tool calls over 84 minutes; the same task with proper tooling took five calls in under five minutes ². Across the full dataset, proper context infrastructure delivered a 30% cost reduction and 38% speed improvement ². The compounding effect is insidious: failed searches accumulate in the conversation history, consuming context tokens and limiting output quality in later turns.

graph LR
    A[Agent receives task] --> B{Has code intelligence?}
    B -->|No| C[grep/glob search]
    C --> D[Hundreds of matches]
    D --> E[Read wrong files]
    E --> F[Context filled with noise]
    F --> G[96 tool calls / 84 min]
    B -->|Yes| H[Structural query]
    H --> I[Precise references]
    I --> J[Read correct files]
    J --> K[5 tool calls / 5 min]
    style G fill:#f66,color:#fff
    style K fill:#6f6,color:#fff

Codex CLI Defence: Compact Proactively and Limit Tool Output

Two configuration levers directly attack tool thrashing:

# config.toml — anti-thrashing settings
model_auto_compact_token_limit = 160000   # Compact at 80% of 200K, not 95%
tool_output_token_limit = 4000            # Truncate verbose tool output

Lowering model_auto_compact_token_limit to 80-85% of context capacity provides headroom for the post-compaction re-read cycle and prevents the cascading compaction problem ⁷. Setting tool_output_token_limit prevents a single large file read from consuming disproportionate context.
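
The arithmetic behind the limit is simple enough to script when context windows differ between models. This sketch assumes the 200K-token window used in the config comment above; the helper name `compact_limit` is hypothetical.

```python
def compact_limit(context_window_tokens, percent):
    """Token count at which to trigger compaction, using integer arithmetic."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1-100")
    return context_window_tokens * percent // 100


for percent in (80, 85, 95):
    print(percent, compact_limit(200_000, percent))
# 80 160000
# 85 170000
# 95 190000
```

At 80% the agent keeps 40,000 tokens of headroom for the post-compaction re-read cycle; at 95% it keeps only 10,000.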

Additionally, use profiles to switch reasoning effort based on task complexity:

[profiles.explore]
model = "o4-mini"
model_reasoning_effort = "low"

[profiles.implement]
model = "o3"
model_reasoning_effort = "high"

Use the explore profile for navigation and the implement profile when the agent has located the correct files and is ready to write code.

Pattern 5: More Tools Does Not Mean Better Performance

The study found a paradox: agents given additional search tools sometimes performed worse ². Excessive retrieval dilutes context with irrelevant files. Retrieval quality matters more than volume: a small set of precisely selected files outperforms the same files buried in noise ².

Codex CLI Defence: Curate Your MCP Stack

Resist the temptation to connect every available MCP server. Each tool definition consumes system prompt tokens ⁸. For a large codebase, a curated stack typically looks like:

# config.toml — curated MCP for large codebases
[mcp_servers.code-search]
command = "npx"
args = ["-y", "@anthropic-ai/sourcegraph-mcp"]
env = { SOURCEGRAPH_ACCESS_TOKEN = "sgp_..." }

[mcp_servers.git-context]
command = "npx"
args = ["-y", "mcp-git"]

Two servers — one for code intelligence, one for Git history — often outperform five or six miscellaneous tools. Every additional tool definition inflates the system prompt and increases the chance of the agent selecting the wrong tool for a given query.
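
A rough way to see the prompt overhead is to serialise the tool definitions and estimate their size. The sketch below rests on loud assumptions: the four-characters-per-token figure is a rule of thumb rather than a real tokenizer, and the tool definitions are invented placeholders, not the schema of any actual MCP server.

```python
import json

CHARS_PER_TOKEN = 4  # assumed rule of thumb; a real tokenizer will give different numbers


def estimate_schema_tokens(tool_definitions):
    """Approximate the prompt tokens consumed by a list of tool definitions."""
    serialized = json.dumps(tool_definitions, separators=(",", ":"))
    return len(serialized) // CHARS_PER_TOKEN


# Invented placeholder definition, used only to compare stack sizes.
SAMPLE_TOOL = {
    "name": "search_code",
    "description": "Search indexed repositories for a symbol or pattern.",
    "parameters": {"query": {"type": "string"}, "repo": {"type": "string"}},
}

two_tools = [SAMPLE_TOOL] * 2
six_tools = [SAMPLE_TOOL] * 6
print(estimate_schema_tokens(two_tools), estimate_schema_tokens(six_tools))
```

The absolute numbers matter less than the trend: the overhead grows with every definition added, and it is paid on every turn whether or not the tool is used.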

The Economics: Context Infrastructure vs. Model Upgrades

The Sourcegraph findings carry a clear cost implication. Organisations cannot improve agent performance by upgrading to a more expensive frontier model if the bottleneck is retrieval infrastructure ⁶. Nor does a cheaper model remove the waste: moving from GPT-5.5 ($5/$30 per million input/output tokens) ⁹ to o4-mini ($1.10/$4.40 per million tokens) ¹⁰ lowers the price of each token but leaves the number of tokens spent on failed searches unchanged, whereas the 30% cost reduction from proper tooling ² applies at either price point.
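
The interaction between token prices and the reported 30% reduction can be worked through with a few lines of arithmetic. The prices are those quoted above; the token counts for a single run are hypothetical and chosen only to make the numbers easy to follow.

```python
# (input, output) prices in USD per million tokens, as quoted in the article.
PRICES_PER_MILLION = {
    "GPT-5.5": (5.00, 30.00),
    "o4-mini": (1.10, 4.40),
}


def run_cost(model, input_tokens, output_tokens, reduction=0.0):
    """Cost in USD of one agent run, optionally reduced by a fractional saving."""
    input_price, output_price = PRICES_PER_MILLION[model]
    cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return cost * (1.0 - reduction)


# Hypothetical run: 2,000,000 input tokens and 100,000 output tokens.
for model in PRICES_PER_MILLION:
    baseline = run_cost(model, 2_000_000, 100_000)
    with_context = run_cost(model, 2_000_000, 100_000, reduction=0.30)
    print(f"{model}: ${baseline:.2f} -> ${with_context:.2f}")
```

Under these assumed token counts the GPT-5.5 run falls from about $13.00 to about $9.10 and the o4-mini run from about $2.64 to about $1.85. The saving from better context scales with whatever model is in use.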

graph TD
    subgraph "Cost Drivers in Large Codebases"
        A[Model Inference] -->|Most volatile| B[Agent retries & tool calls]
        C[Retrieval Infrastructure] -->|Fixed cost| D[Indexed code intelligence]
        E[Human Review Time] -->|Reduced by| F[Higher completion quality]
    end
    B -->|"With context: fewer retries"| G[30% cost reduction]
    D -->|"Precise results"| G
    style G fill:#6f6,color:#fff

The practical formula: invest in context infrastructure first, model selection second. A well-indexed codebase with Sourcegraph MCP and curated AGENTS.md file maps will outperform a frontier model navigating blind.

Configuration Checklist for Large Codebases

| Failure Pattern | Codex CLI Mitigation | Config Key |
| --- | --- | --- |
| 400K LOC threshold | AGENTS.md file maps, scoped `writable_roots` | `writable_roots` under `[sandbox_workspace_write]` |
| Wrong code found | Sourcegraph or LSP MCP server | `[mcp_servers.sourcegraph]` |
| Half-finished refactoring | `--add-dir`, Stop hook verification | `[[hooks]]` event `on_agent_stop` |
| Tool thrashing | Early compaction, truncated tool output | `model_auto_compact_token_limit`, `tool_output_token_limit` |
| Tool overload | Curated MCP stack (2-3 servers max) | Remove unnecessary `[mcp_servers.*]` tables |

Conclusion

The CodeScaleBench data makes the case unambiguously: above 400,000 lines of code, agent success depends on infrastructure, not intelligence. For Codex CLI users operating in enterprise-scale repositories, the highest-leverage investment is not a model upgrade — it is deterministic code search, architectural file maps in AGENTS.md, proactive context compaction, and a curated MCP server stack. The agents that navigate 22,000 files in five tool calls are not smarter; they are better equipped.

Citations

1. Stephanie Jarmak, quoted in Paul Sawers, “What 1,281 agent runs reveal about coding agent failure in large codebases,” Tessl.io, 20 May 2026. https://tessl.io/blog/coding-agent-failure-patterns-large-codebases/
2. Paul Sawers, “What 1,281 agent runs reveal about coding agent failure in large codebases,” Tessl.io, 20 May 2026. https://tessl.io/blog/coding-agent-failure-patterns-large-codebases/
3. Sourcegraph, “CodeScaleBench: Testing coding agents on large codebases and multi-repo software engineering tasks,” Sourcegraph Blog, 2026. https://sourcegraph.com/blog/codescalebench-testing-coding-agents-on-large-codebases-and-multi-repo-software-engineering-tasks
4. Sourcegraph, “MCP Server,” Sourcegraph, 2026. https://sourcegraph.com/mcp
5. Sourcegraph, “Context Engineering: A Practical Guide for AI Agents (2026),” Sourcegraph Blog, 2026. https://sourcegraph.com/blog/context-engineering
6. Sourcegraph, “Agentic Coding in 2026: A Practical Guide for Big Code,” Sourcegraph Blog, 2026. https://sourcegraph.com/blog/agentic-coding
7. OpenAI, “Best practices — Codex,” OpenAI Developers, 2026. https://developers.openai.com/codex/learn/best-practices
8. Daniel Vaughan, “MCP Schema Bloat and the System Prompt Tax,” Codex Knowledge Base, April 2026. https://codex.danielvaughan.com/2026/04/23/mcp-schema-bloat-system-prompt-tax-tool-definition-performance/
9. AWS, “Get started with OpenAI GPT-5.5, GPT-5.4 models, and Codex on Amazon Bedrock,” AWS News Blog, June 2026. https://aws.amazon.com/blogs/aws/get-started-with-openai-gpt-5-5-gpt-5-4-models-and-codex-on-amazon-bedrock/
10. OpenAI, “Pricing,” OpenAI, June 2026. https://openai.com/api/pricing/