Sketchnote diagram for: Legacy Code Archaeology with Codex CLI: Understanding, Documenting, and Safely Modernising Unfamiliar Codebases

Legacy Code Archaeology with Codex CLI: Understanding, Documenting, and Safely Modernising Unfamiliar Codebases

Every senior developer has faced it: a critical system written by people who left years ago, sparse documentation, no tests, and a business requirement to change it without breaking anything. AI coding agents have transformed this challenge. Gartner predicted that 40% of legacy modernisation projects would incorporate AI-assisted reverse engineering by 2026, up from under 10% in 2023¹. Codex CLI is particularly well-suited for this work because it operates directly in your terminal, reads your actual codebase, and can run commands to verify its understanding — all within a sandboxed environment that prevents accidental damage.

This article presents a structured methodology for using Codex CLI to explore, document, and incrementally modernise legacy codebases, drawing on OpenAI’s official code modernisation cookbook² and the codebase onboarding workflow³.

The Three Phases of Code Archaeology

Legacy code work divides into three distinct phases, each with different Codex CLI techniques. Attempting to skip phases — jumping straight to refactoring without understanding, for example — is the primary cause of modernisation failures⁴.

graph LR
    A[Phase 1: Exploration] --> B[Phase 2: Documentation]
    B --> C[Phase 3: Incremental Modernisation]
    A -->|"@ mentions, /plan"| A
    B -->|"ExecPlan, AGENTS.md"| B
    C -->|"Parity tests, /review"| C

Phase 1: Exploration — Mapping the Unknown

Initial Orientation

Start with a broad exploration prompt. Codex reads the project structure, identifies key entry points, and provides a high-level map. The official codebase onboarding workflow recommends this pattern³:

codex --model gpt-5.5 --sandbox read-only

Using read-only sandbox mode is critical during exploration — it prevents Codex from modifying anything while it investigates. Once inside the session:

Explain how the request flows through this codebase. Include:
which modules own what, where data is validated, and top gotchas
to watch for before making changes. End with the files I should
read next.

Targeted Deep Dives with @ Mentions

Once you have the high-level map, use @ mentions to drill into specific modules without wasting context tokens on irrelevant files⁵:

@src/billing/invoice_generator.py @src/billing/tax_calculator.py
Trace the invoice generation flow end to end. Identify:
1. Where business rules are encoded (vs configuration)
2. Which external services are called and what happens when they fail
3. Any implicit state dependencies between these modules

Plan Mode for Complex Systems

For particularly tangled codebases, /plan mode helps structure the exploration before committing to any approach⁵:

/plan
Map the dependency graph of the payment processing subsystem.
Identify circular dependencies, God classes, and modules that
violate single responsibility. Produce a prioritised list of
modules ranked by coupling and change frequency.

Plan mode presents Codex’s exploration strategy before execution, letting you redirect it if it is heading down an unproductive path. Since v0.124.0, you can start implementation in a fresh context after planning, preserving your token budget for the actual work⁶.

Running Existing Tests for Context

Codex can execute existing test suites to understand what the codebase actually does versus what the (possibly outdated) documentation claims:

Run the existing test suite and summarise:
- Which modules have test coverage and which do not
- What the tests actually verify (happy path only? edge cases?)
- Any tests that are currently failing or skipped

This is safe under workspace-write sandbox mode since test execution is read-only in practice. The results tell you which parts of the codebase have a safety net and which are flying blind.

Phase 2: Documentation — Creating the Missing Map

The AGENTS.md Bootstrap

Before generating documentation, create an AGENTS.md that encodes what you have learned so far. This prevents Codex from losing context across sessions and ensures other team members (or agents) benefit from your exploration⁵:

codex /init

Then refine it with legacy-specific guidance:

Read the directory structure and refine AGENTS.md so it covers:
- The legacy stack and its conventions (naming, file layout, build system)
- Known tribal knowledge: which modules are fragile, which are stable
- Build, test, and lint commands
- Files and directories that must not be modified without approval

The ExecPlan Pattern

OpenAI’s code modernisation cookbook² introduces the ExecPlan — a lightweight planning document that structures the entire modernisation effort. Create .agent/PLANS.md to define the format, then ask Codex to produce the first plan:

Create pilot_execplan.md following .agent/PLANS.md. Scope it to
the billing module. The plan should cover four outcomes:
- Inventory and diagrams
- Modernisation Technical Report content
- Target design and spec
- Test plan for parity
Use the ExecPlan skeleton with concrete references to actual
legacy files.

Generating the Inventory

The inventory phase is where AI comprehension delivers its largest return. Thoughtworks’ CodeConcise tool demonstrated a 66% reduction in reverse engineering time for COBOL modules — from six weeks to two per module⁷. Codex CLI achieves similar acceleration for any language it can read:

Create billing_overview.md with "Inventory" and "Modernisation
Technical Report" sections.

Inventory includes:
1. All source files grouped by type (models, services, utilities, config)
2. Data sources read and written (databases, files, APIs)
3. External service dependencies with failure modes

Technical Report describes:
1. Business scenario in plain English
2. Detailed behaviour of each core module
3. Data model with field names and types
4. Technical risks: date handling, rounding, error codes, encoding

Mermaid Diagrams for Architecture Visualisation

Ask Codex to produce architecture diagrams in Mermaid format. These render directly in GitHub, GitLab, and most documentation systems:

graph TD
    subgraph Legacy Billing
        A[invoice_generator.py] -->|reads| B[(orders_db)]
        A -->|calls| C[tax_calculator.py]
        C -->|calls| D[Tax API v2]
        A -->|writes| E[(invoices_db)]
        A -->|calls| F[pdf_renderer.py]
        F -->|writes| G[/shared/invoices/]
    end
    subgraph Modern Target
        H[BillingService] -->|reads| I[(orders)]
        H -->|delegates| J[TaxService]
        J -->|calls| K[Tax API v3]
        H -->|writes| L[(invoices)]
        H -->|emits| M[InvoiceCreated event]
    end

Phase 3: Incremental Modernisation — The Strangler Fig Approach

Why Incremental Matters

The strangler fig pattern — coined by Martin Fowler in 2004 — replaces legacy systems incrementally rather than through risky big-bang rewrites⁸. A façade routes traffic between old and new implementations, allowing each migrated module to be validated independently. This approach maps naturally onto Codex CLI workflows.

Validation-First Implementation

The cookbook emphasises scaffolding tests before implementation². This is the single most important discipline for safe legacy modernisation:

Using billing_validation.md, create tests/billing_parity_test.py.
Include placeholder assertions referencing test scenarios from the
validation plan. Do not assume the modern implementation exists yet.

Once tests are in place, ask Codex to generate the initial implementation:

Using billing_design.md and legacy programs from billing_overview.md,
generate implementation code under modern/billing/ that:
- Defines domain models for invoice records
- Implements core business logic in service classes
- Preserves legacy behaviour exactly (parity first, improvements later)
- Comments reference original legacy modules and line numbers

The Parity Test Loop

After initial generation, use an iterative loop to converge on correct behaviour:

graph LR
    A[Run parity tests] -->|Failures| B[Show failing test + legacy code to Codex]
    B --> C[Codex explains difference]
    C --> D[Codex proposes minimal fix]
    D --> A
    A -->|All pass| E[Review with /review]
    E --> F[Commit increment]

The prompt for each iteration:

Here is a failing parity test and the relevant legacy and modern code.
Explain why outputs differ and propose the smallest change to align
modern code with legacy behaviour. Show updated code and test adjustments.

Using /review for Safety Gates

Before committing any modernised module, run Codex’s built-in review⁵:

/review Focus on: behavioural parity with legacy, error handling
completeness, and any assumptions about data formats that may not hold.

The v0.124.0 auto-review system can route eligible approval prompts through a reviewer agent automatically, adding a second pair of AI eyes before changes land⁶.

Configuration for Legacy Work

config.toml for Exploration Sessions

[codex]
model = "gpt-5.5"
model_reasoning_effort = "high"
approval_mode = "suggest"

# Legacy work benefits from high reasoning for accurate comprehension
[profiles.legacy-explore]
model_reasoning_effort = "high"
sandbox = "read-only"

[profiles.legacy-implement]
model_reasoning_effort = "medium"
sandbox = "workspace-write"

Use codex -p legacy-explore for Phase 1–2 and codex -p legacy-implement for Phase 3⁹.

AGENTS.md Template for Legacy Projects

# AGENTS.md — Legacy Modernisation

## Project Context
This is a legacy [language/framework] codebase dating from [year].
Original documentation is [sparse/missing/outdated].

## Conventions
- [Language-specific conventions observed in the codebase]
- [Naming patterns, file layout, build system details]

## Safety Rules
- NEVER modify files under src/legacy/ without explicit approval
- ALWAYS run parity tests after any change to modern/ code
- NEVER remove legacy code until parity tests pass at 100%

## Build & Test
- Legacy build: [command]
- Modern build: [command]
- Parity tests: [command]
- Full test suite: [command]

## Known Risks
- [Module X uses implicit global state — changes cascade unpredictably]
- [Date handling mixes timezone-aware and naive datetimes]
- [Tax calculations use banker's rounding — do not change to standard rounding]

Accuracy Considerations

AI-assisted code comprehension is not infallible. Research indicates 85–95% accuracy for straightforward CRUD operations, dropping to 80–85% for complex business logic⁷. Financial and regulatory code requires 100% subject-matter expert validation regardless of AI confidence.

Practical guardrails:

Always verify AI-generated documentation against runtime behaviour — run the code, inspect database queries, check API calls
Use /compact strategically — long exploration sessions accumulate context, but compaction may discard nuances discovered early in the session⁵
Cross-reference with git history — git log --follow on key files reveals why code was written this way, not just what it does
Mark uncertain claims — instruct Codex to flag low-confidence assertions with ⚠️ so reviewers know where to focus validation effort

Scaling the Pattern

Once one module is successfully modernised, the ExecPlan becomes a reusable template²:

Using the billing pilot files, create template_modernisation_execplan.md
that teams can copy when modernising other modules. Include:
- Compliance with .agent/PLANS.md
- Placeholders for Overview, Inventory, MTR, Design, Validation
- Assumption of the same pattern: overview → design → validation → tests → implementation

This is where the compound value of Codex CLI’s memory system becomes apparent. With memories = true in config.toml, stable patterns — naming conventions, common failure modes, preferred test structures — persist across sessions without re-prompting¹⁰.

When to Use Codex CLI vs. Codex Cloud

Scenario	Surface	Rationale
Initial exploration of a private codebase	CLI	Code stays local, sandbox prevents accidents
Parallel inventory of multiple modules	Cloud	Use `--attempts` for best-of-N module summaries
Implementing parity-tested replacements	CLI	Tight feedback loop with local test execution
Reviewing modernised code before merge	CLI or GitHub	`/review` locally, or `@codex review` on PRs

Key Takeaways

Legacy code archaeology with Codex CLI follows a disciplined three-phase approach: explore in read-only mode to understand before changing anything, document exhaustively with ExecPlans and inventories, and modernise incrementally with parity tests as the safety net. The temptation to skip straight to rewriting is strong — resist it. The documentation phase is where AI comprehension delivers its greatest value, turning months of reverse engineering into days of structured exploration.

⚠️ AI-generated code comprehension should always be validated against actual runtime behaviour, particularly for business-critical logic involving financial calculations, regulatory compliance, or implicit state dependencies.

Citations

Gartner, “Predicts 2024: AI Will Transform Software Engineering”, cited in Aspire Systems, “AI in Reverse Engineering Legacy Code” ↩
OpenAI, “Modernizing your Codebase with Codex”, OpenAI Cookbook, 2026 ↩ ↩² ↩³ ↩⁴
OpenAI, “Understand large codebases — Codebase Onboarding”, Codex Developer Docs, 2026 ↩ ↩²
Coder, “AI-Assisted Legacy Code Modernization: A Developer’s Guide”, 2026 ↩
OpenAI, “Best practices”, Codex Developer Docs, 2026 ↩ ↩² ↩³ ↩⁴ ↩⁵
OpenAI, “Changelog — CLI 0.124.0”, April 2026 ↩ ↩²
SoftwareSeni, “Cutting Legacy Reverse Engineering Time by 66% with AI Code Comprehension”, 2026 ↩ ↩²
Martin Fowler, “StranglerFigApplication”, 2004 ↩
OpenAI, “Command line options”, Codex CLI Reference, 2026 ↩
OpenAI, “Memories”, Codex Developer Docs, 2026 ↩