Codex CLI for Automated Test Maintenance: Fixing Broken Tests, Updating Snapshots, and Eliminating Flaky Tests

Test suites decay. A team that starts with 200 tests and 10% maintenance overhead reaches 1,000 tests and 50% maintenance overhead — a ceiling where keeping tests green costs more than the safety they provide¹. QA engineers report spending 20–30% of their working week triaging failures that have nothing to do with production bugs². Flaky tests account for up to 30% of all test failures³, and at scale, a team deploying eight times per month with twelve tests breaking per deploy burns roughly $67,200 annually on maintenance alone².

Codex CLI transforms test maintenance from a manual grind into an agent-driven workflow. This article covers three patterns: automated test repair in CI, intelligent snapshot management, and flaky test detection with quarantine — all using Codex CLI v0.135’s current tooling.

The Test Maintenance Problem

Test maintenance is not test writing. Writing new tests is creative work. Maintenance is reactive drudgery: a renamed API field breaks 40 assertions; a CSS refactor invalidates 15 snapshots; a timing-dependent test passes locally but fails in CI. The fix is usually trivial, but finding it requires context-switching away from feature work.

flowchart TD
    A[Test Failure in CI] --> B{Root Cause?}
    B -->|Code Change| C[Intentional Behaviour Change]
    B -->|Environment| D[Flaky / Non-deterministic]
    B -->|Dependency| E[Upstream API/Schema Drift]
    C --> F[Update Test Assertions]
    D --> G[Quarantine + Fix Root Cause]
    E --> H[Update Mocks/Contracts]
    F --> I[PR with Fix]
    G --> I
    H --> I

Codex CLI handles all three branches. The key insight: AGENTS.md constraints ensure the agent fixes the test, not the implementation, unless explicitly instructed otherwise.
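
The triage flow above can be sketched as a small lookup table. This is an illustrative sketch only: the `RootCause` labels and the `nextAction` helper are hypothetical names chosen for this example, not part of Codex CLI.

```typescript
// Hypothetical labels for the three branches of the triage flowchart.
type RootCause = "code-change" | "environment" | "dependency";

interface TriageAction {
  category: string;
  action: string;
}

// Map each root cause to the remediation named in the flowchart.
const TRIAGE: Record<RootCause, TriageAction> = {
  "code-change": {
    category: "Intentional behaviour change",
    action: "Update test assertions",
  },
  environment: {
    category: "Flaky / non-deterministic",
    action: "Quarantine and fix root cause",
  },
  dependency: {
    category: "Upstream API/schema drift",
    action: "Update mocks/contracts",
  },
};

// Every branch ends in a pull request, so only the remediation differs.
function nextAction(cause: RootCause): TriageAction {
  return TRIAGE[cause];
}
```

Keeping the mapping in data rather than in the prompt makes it easy to reuse the same categories in CI reporting.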

Pattern 1: Automated Test Repair in CI

The OpenAI Cookbook documents a GitHub Actions workflow that triggers Codex when CI fails, generates a minimal fix, and opens a pull request⁴. The version below adapts it with stricter instructions for test-only repairs:

name: Codex Test Autofix
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]

jobs:
  autofix:
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.workflow_run.head_branch }}

      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - run: npm ci

      - uses: openai/codex-action@v1
        with:
          # Input names follow the codex-action README; confirm them against the release you pin.
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          sandbox: workspace-write
          prompt: |
            Run the test suite. Identify failing tests. For each failure:
            1. Determine whether the test expectation is outdated or the implementation is wrong.
            2. If the test expectation is outdated, update the test.
            3. If the implementation is wrong, stop and report; do not fix production code.
            4. Never delete tests. Never weaken assertions.
            Implement only the minimal change needed. Stop after all tests pass.

      - uses: peter-evans/create-pull-request@v6
        with:
          commit-message: "fix(tests): auto-repair failing tests via Codex"
          branch: codex/autofix-${{ github.event.workflow_run.head_branch }}
          title: "fix(tests): automated test repair"
          body: |
            Codex CLI identified and repaired failing tests.
            **Review carefully.** The agent was instructed not to modify production code; confirm that the diff touches only test files.

The critical constraint is the instruction: never fix production code. Without this, agents will happily “fix” a failing test by changing the function under test⁵.
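
Because a prompt instruction is not a guarantee, the same constraint can also be checked mechanically before the pull request is opened. The sketch below is one way to do that, under assumptions: the path patterns and the `findNonTestChanges` helper are hypothetical and should be adjusted to the repository layout.

```typescript
// Paths treated as test code. These patterns are assumptions for illustration.
const TEST_PATH_PATTERNS: RegExp[] = [
  /(^|\/)tests\//,
  /(^|\/)__tests__\//,
  /(^|\/)__snapshots__\//,
  /\.(test|spec)\.[cm]?[jt]sx?$/,
];

// Return every changed path that is NOT recognised as test code.
function findNonTestChanges(changedPaths: string[]): string[] {
  return changedPaths
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .filter((p) => !TEST_PATH_PATTERNS.some((re) => re.test(p)));
}
```

In a workflow step, the input would typically be the output of `git diff --name-only` split on newlines; a non-empty result is a reason to fail the job instead of opening the pull request.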

AGENTS.md for Test Repair

Encode test maintenance boundaries permanently in your repository:

<!-- tests/AGENTS.md -->
# Test Maintenance Rules

## Boundaries
- NEVER modify files outside `tests/` or `__tests__/` directories
- NEVER weaken assertions (e.g. replacing `.toBe(42)` with `.toBeTruthy()`)
- NEVER delete test cases — quarantine them with `.skip` if genuinely obsolete
- If a test failure indicates a production bug, STOP and report the finding

## Test Commands
- Unit tests: `npm test` (Jest) or `npx vitest run`
- E2E tests: `npx playwright test`
- Type checks: `npx tsc --noEmit`

## Conventions
- Snapshot updates require the `--updateSnapshot` flag (Jest) or `--update` flag (Vitest); both accept `-u`
- Flaky tests must be tagged with `// @flaky` comment before quarantining
- All test fixes must preserve the original test's intent

AGENTS.md guidance alone achieves roughly 25–40% compliance from agents; the same rules enforced as runtime hooks hit closer to 95%⁶.
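
As an example of what mechanical enforcement of the "never weaken assertions" rule might look like, the sketch below scans a unified diff for a strong matcher that was removed and immediately replaced by a weak one. It is a heuristic under stated assumptions: the matcher lists are incomplete, and `findWeakenedAssertions` is a hypothetical helper, not a Codex feature.

```typescript
// Matchers treated as "strong" and "weak" by this heuristic (assumed, incomplete lists).
const STRONG_MATCHER = /\.(toBe|toEqual|toStrictEqual|toMatchObject)\(/;
const WEAK_MATCHER = /\.(toBeTruthy|toBeDefined|toBeFalsy)\(\)/;

interface Weakening {
  removed: string;
  added: string;
}

// Flag a removed strong assertion that is directly followed by an added weak one.
function findWeakenedAssertions(diff: string): Weakening[] {
  const findings: Weakening[] = [];
  let pendingRemoved: string[] = [];
  for (const line of diff.split("\n")) {
    if (line.startsWith("---") || line.startsWith("+++")) {
      pendingRemoved = []; // file header, not a content change
    } else if (line.startsWith("-")) {
      if (STRONG_MATCHER.test(line)) pendingRemoved.push(line.slice(1).trim());
    } else if (line.startsWith("+")) {
      const removed = pendingRemoved[0];
      if (WEAK_MATCHER.test(line) && removed !== undefined) {
        pendingRemoved.shift();
        findings.push({ removed, added: line.slice(1).trim() });
      }
    } else {
      pendingRemoved = []; // a context or hunk-header line ends the pairing
    }
  }
  return findings;
}
```

A check like this could run in CI or in a hook on the agent's diff; any finding is a signal for human review rather than proof of a weakened test.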

Pattern 2: Intelligent Snapshot Management

Snapshot tests are the highest-maintenance category. A single component refactor can invalidate dozens of .snap files. The naive approach, running jest --updateSnapshot (or vitest run --update), accepts every change blindly. Codex CLI enables a selective approach:

# codex exec is read-only by default; snapshot updates need write access to the workspace
codex exec "Run 'npx vitest run' and identify snapshot failures. \
For each failing snapshot: \
1. Read the component source that generates the snapshot. \
2. Determine if the new output is correct given the source. \
3. If correct, update that specific snapshot. \
4. If incorrect, report the discrepancy without updating. \
Output a summary of updated vs flagged snapshots." \
  --sandbox workspace-write \
  --output-schema ./schemas/snapshot-report.json

The --output-schema flag constrains the agent's final message to a JSON Schema, which gives downstream tooling machine-readable output⁷. The schema file for this report:

{
  "type": "object",
  "properties": {
    "updated": {
      "type": "array",
      "items": { "type": "string" }
    },
    "flagged": {
      "type": "array",
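      "items": {
        "type": "object",
        "properties": {
          "file": { "type": "string" },
          "reason": { "type": "string" }
        },
        "required": ["file", "reason"],
        "additionalProperties": false
      }
    },
    "total_failures": { "type": "integer" }
  },
  "required": ["updated", "flagged", "total_failures"],
  "additionalProperties": false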
}
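
Downstream tooling still benefits from validating the report before acting on it. The following sketch mirrors the schema above as TypeScript types with a small runtime check; `parseSnapshotReport` is a hypothetical helper written for this example.

```typescript
interface FlaggedSnapshot {
  file: string;
  reason: string;
}

interface SnapshotReport {
  updated: string[];
  flagged: FlaggedSnapshot[];
  total_failures: number;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Parse the agent's final message and throw on any shape mismatch.
function parseSnapshotReport(raw: string): SnapshotReport {
  const data: unknown = JSON.parse(raw);
  if (!isRecord(data)) throw new Error("report must be a JSON object");

  const updated = data["updated"];
  if (!Array.isArray(updated) || !updated.every((u) => typeof u === "string")) {
    throw new Error("updated must be an array of strings");
  }

  const flagged = data["flagged"];
  if (!Array.isArray(flagged)) throw new Error("flagged must be an array");
  const checkedFlagged: FlaggedSnapshot[] = flagged.map((entry: unknown) => {
    if (!isRecord(entry)) throw new Error("flagged entries must be objects");
    const file = entry["file"];
    const reason = entry["reason"];
    if (typeof file !== "string" || typeof reason !== "string") {
      throw new Error("flagged entries need string file and reason");
    }
    return { file, reason };
  });

  const total = data["total_failures"];
  if (typeof total !== "number" || !Number.isInteger(total)) {
    throw new Error("total_failures must be an integer");
  }

  return { updated: updated as string[], flagged: checkedFlagged, total_failures: total };
}
```

A CI step could then fail the job whenever `flagged` is non-empty, so that questionable snapshots always reach a human reviewer.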

PostToolUse Hook for Snapshot Validation

Make snapshot updates visible with a hook that prints a diff summary whenever an update command runs:

# config.toml
[[hooks]]
event = "PostToolUse"
tool_name = "shell_command"
command = """
if echo "$TOOL_INPUT" | grep -q "updateSnapshot\\|--update"; then
  echo "⚠️ Snapshot update detected — verifying diff is intentional" >&2
  git diff --stat "**/*.snap" >&2
fi
"""

This surfaces snapshot changes during the agent’s execution, making unintended updates visible in the session log.

Pattern 3: Flaky Test Detection and Quarantine

Flaky tests — those that pass or fail non-deterministically — are the most insidious maintenance burden. The 2026 consensus is clear: quarantine, do not retry⁸. Retries discard signal; quarantine preserves it.

Detection with Codex Exec

Run a flakiness scan as a scheduled automation:

codex exec "Analyse the test suite for flakiness indicators: \
1. Find tests using setTimeout, Date.now(), Math.random(), or network calls without mocks. \
2. Find tests that reference shared mutable state across test cases. \
3. Find tests with race conditions (async operations without proper awaits). \
4. Check git history for tests whose skip or retry annotations were added or changed in the last 30 days. \
Report each finding with file, line, and recommended fix." \
  --model gpt-5.4-mini \
  -o /tmp/flaky-report.json \
  --output-schema ./schemas/flaky-report.json

Using gpt-5.4-mini keeps costs low for this analytical task — no code generation required⁹.
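
The first item in that prompt is largely pattern matching, so a cheap deterministic pre-filter can narrow the files worth sending to the model. The sketch below is a heuristic under assumptions: the pattern list is illustrative, `scanForFlakinessIndicators` is a hypothetical helper, and it does not detect whether a call is mocked.

```typescript
interface FlakyIndicator {
  line: number; // 1-based line number in the scanned source
  pattern: string; // label of the matching heuristic
  text: string; // trimmed source line
}

// Labels on the left are arbitrary; the regexes cover item 1 of the prompt only.
const INDICATORS: Array<[string, RegExp]> = [
  ["timer", /\bsetTimeout\s*\(/],
  ["wall-clock", /\bDate\.now\s*\(/],
  ["randomness", /\bMath\.random\s*\(/],
  ["network", /\bfetch\s*\(/],
];

function scanForFlakinessIndicators(source: string): FlakyIndicator[] {
  const findings: FlakyIndicator[] = [];
  source.split("\n").forEach((text, index) => {
    for (const [pattern, re] of INDICATORS) {
      if (re.test(text)) {
        findings.push({ line: index + 1, pattern, text: text.trim() });
      }
    }
  });
  return findings;
}
```

Shared mutable state and missing awaits (items 2 and 3) need more than line-level regexes, which is where the model-driven scan earns its cost.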

Quarantine Workflow

Once detected, quarantine flaky tests without losing visibility:

flowchart LR
    A[Flaky Test Detected] --> B[Tag with @flaky]
    B --> C[Move to Quarantine Suite]
    C --> D[Non-blocking CI Job]
    D --> E{Passes 10/10?}
    E -->|Yes| F[Promote Back to Main Suite]
    E -->|No| G[Create Fix Ticket]
    G --> H[Codex Attempts Repair]
    H --> I{Fixed?}
    I -->|Yes| F
    I -->|No| J[Escalate to Human]
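
The "Passes 10/10?" gate in the flow above can be expressed as a small decision function. This is a sketch under assumptions: the `keep-observing` outcome for tests with fewer than ten recorded runs is an addition made here, and `decideQuarantine` is a hypothetical helper.

```typescript
type QuarantineDecision = "promote" | "create-fix-ticket" | "keep-observing";

// `runs` holds pass (true) / fail (false) results of the quarantine job, newest last.
function decideQuarantine(runs: boolean[], requiredStreak = 10): QuarantineDecision {
  if (runs.length < requiredStreak) {
    return "keep-observing"; // assumption: not enough history to judge yet
  }
  const recent = runs.slice(-requiredStreak);
  return recent.every((passed) => passed) ? "promote" : "create-fix-ticket";
}
```

Only the most recent runs are considered, so a test that was fixed and then passed ten times in a row is promoted even if its older history is noisy.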

For Playwright, the quarantine pattern uses tag-based filtering⁸:

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    {
      name: 'stable',
      grep: /^(?!.*@flaky)/,  // exclude @flaky tagged tests
    },
    {
      name: 'quarantine',
      grep: /@flaky/,
      retries: 3,  // only retry quarantined tests
    },
  ],
});
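
To see how the two `grep` patterns split a suite, the sketch below applies the same regular expressions to a few made-up test titles. Plain strings stand in for the longer title string Playwright actually matches against, and `partitionTitles` is a hypothetical helper.

```typescript
const STABLE_PATTERN = /^(?!.*@flaky)/; // matches only when @flaky appears nowhere
const QUARANTINE_PATTERN = /@flaky/; // matches when @flaky appears anywhere

function partitionTitles(titles: string[]): { stable: string[]; quarantine: string[] } {
  return {
    stable: titles.filter((t) => STABLE_PATTERN.test(t)),
    quarantine: titles.filter((t) => QUARANTINE_PATTERN.test(t)),
  };
}

// Invented titles for illustration.
const exampleTitles = [
  "checkout totals",
  "uploads avatar @flaky",
  "@flaky login redirect",
];
const partition = partitionTitles(exampleTitles);
```

Every title lands in exactly one project, because the negative lookahead is the exact complement of the quarantine pattern.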

Blocking Retry Workarounds with a PreToolUse Hook

Block agents from adding retries to flaky tests instead of fixing them:

[[hooks]]
event = "PreToolUse"
tool_name = "shell_command"
command = """
if echo "$TOOL_INPUT" | grep -qE "retryTimes|--retries[= ]+[1-9]"; then
  echo "BLOCKED: Do not add retries to mask flakiness. Fix the root cause or quarantine the test." >&2
  exit 2
fi
"""

Exit code 2 blocks the tool call and feeds the reason back to the model¹⁰.

Combining Patterns: The Test Health Automation

Wire all three patterns into a weekly scheduled automation:

#!/bin/bash
# scripts/test-health.sh — run via cron or GitHub Actions schedule

set -euo pipefail

# 1. Run full suite, capture failures
npx vitest run --reporter=json --outputFile=/tmp/test-results.json || true

# 2. Let Codex analyse and fix
codex exec "Read /tmp/test-results.json. \
Categorise each failure as: intentional-change, flaky, or upstream-drift. \
For intentional-change: update the test assertion. \
For flaky: add a @flaky tag and move to quarantine config. \
For upstream-drift: update mocks to match current API contracts. \
Never modify production source files. \
Commit each category as a separate commit with conventional commit messages." \
  --config sandbox_mode="workspace-write"

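# 3. Push results to a dedicated branch so the changes go through review
git push origin "HEAD:refs/heads/codex/test-health-$(date +%Y-%m-%d)"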
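
Handing the agent a trimmed list of failures instead of the full report keeps the prompt small. The sketch below extracts failing tests from a Jest-style JSON report; the field names are assumed from Jest's JSON output format, which Vitest's json reporter is designed to follow, and `listFailures` is a hypothetical helper.

```typescript
// Minimal subset of the Jest-style JSON report (assumed field names).
interface AssertionResult {
  fullName: string;
  status: string;
  failureMessages?: string[];
}

interface FileResult {
  name: string;
  assertionResults: AssertionResult[];
}

interface JsonReport {
  testResults: FileResult[];
}

interface FailureSummary {
  file: string;
  test: string;
  firstMessage: string; // first line of the first failure message, if any
}

function listFailures(report: JsonReport): FailureSummary[] {
  const failures: FailureSummary[] = [];
  for (const file of report.testResults) {
    for (const assertion of file.assertionResults) {
      if (assertion.status !== "failed") continue;
      const message = assertion.failureMessages?.[0] ?? "";
      failures.push({
        file: file.name,
        test: assertion.fullName,
        firstMessage: message.split("\n")[0] ?? "",
      });
    }
  }
  return failures;
}
```

The summary could be written to a second file and referenced in the prompt in place of the raw report.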

Cost Considerations

Test maintenance tasks are typically low-reasoning-effort work — pattern matching against error messages and updating string literals. Route these to gpt-5.4-mini with model_reasoning_effort = "low" to minimise spend⁹. Reserve gpt-5.5 for complex flakiness root-cause analysis where the agent must reason about concurrency or timing.

A typical maintenance run fixing 5–10 broken assertions consumes 8,000–15,000 tokens (including test file context), costing approximately $0.02–0.05 at current rates¹¹.
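
The arithmetic behind such an estimate is simple enough to script. In the sketch below, the implied blended rate (roughly $2.50 to $3.33 per million tokens) is derived from the figures quoted above rather than taken from a price list, and both helper names are hypothetical.

```typescript
// Cost of a run given a blended price per million tokens.
function estimateCostUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1e6) * usdPerMillionTokens;
}

// Inverse: the blended rate implied by a quoted token count and cost.
function impliedRateUsdPerMillion(tokens: number, costUsd: number): number {
  return (costUsd / tokens) * 1e6;
}

// Rates implied by the two ends of the quoted range.
const lowEndRate = impliedRateUsdPerMillion(8_000, 0.02);
const highEndRate = impliedRateUsdPerMillion(15_000, 0.05);
```

Substituting the current per-token prices for the chosen model into `estimateCostUsd` gives a budget figure for a weekly run; input and output tokens are usually priced differently, so a blended rate is only a rough guide.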

Limitations and Safety

False confidence: An agent updating a test assertion may mask a genuine regression. Always review auto-generated PRs before merging.
Context ceiling: Large test files (more than roughly 500 lines) consume enough of the context window that a session may be compacted mid-task. Split monolithic test files for better agent performance.
Snapshot semantics: Codex cannot evaluate visual correctness of rendered component snapshots — it reasons about structure, not pixels. Use visual regression tools (Percy, Chromatic) for visual assertions¹².
Shared state: Agents struggle with test isolation bugs caused by global singletons or database state leakage between tests.

Citations

1. Ali El-Shayeb, “The hidden test automation maintenance cost consuming 50% of QA time,” QA meets AI (Medium), May 2026. https://medium.com/qa-flow/the-hidden-test-automation-maintenance-cost-consuming-50-of-qa-time-a8a462cd9084
2. Diffie, “The True Cost of Test Maintenance (And How to Cut It),” 2026. https://diffie.ai/blog/true-cost-of-test-maintenance
3. ACCELQ, “Flaky Tests in 2026 – How to Identify, Fix, and Prevent Them,” 2026. https://www.accelq.com/blog/flaky-tests/
4. OpenAI, “Use Codex CLI to automatically fix CI failures,” OpenAI Cookbook, 2026. https://developers.openai.com/cookbook/examples/codex/autofix-github-actions
5. Claude Skills Hub, “Auto Fix Tests Skill,” 2026. https://claudeskills.info/skill/fix-tests/ (documents the pattern of fixing tests without modifying business logic)
6. OpenAI, “Custom instructions with AGENTS.md,” Codex Developers, 2026. https://developers.openai.com/codex/guides/agents-md
7. OpenAI, “Non-interactive mode,” Codex Developers, 2026. https://developers.openai.com/codex/noninteractive (documents --output-schema for structured JSON output)
8. Trunk.io, “How to avoid and detect flaky tests in Vitest,” 2026. https://trunk.io/blog/how-to-avoid-and-detect-flaky-tests-in-vitest
9. OpenAI, “Models – Codex,” Codex Developers, 2026. https://developers.openai.com/codex/models (model selection guidance: gpt-5.4-mini for lighter tasks)
10. OpenAI, “Hooks,” Codex Developers, 2026. https://developers.openai.com/codex/hooks (exit code 2 blocks tool execution and provides feedback)
11. OpenAI, “Pricing – Codex,” Codex Developers, 2026. https://developers.openai.com/codex/pricing
12. BrowserStack, “How to Detect and Avoid Playwright Flaky Tests in 2026,” 2026. https://www.browserstack.com/guide/playwright-flaky-tests