Scored Improvement Loops with Codex CLI: Eval-Driven Iterative Problem-Solving
Some problems refuse to yield to a single prompt. Generating a production-quality SVG illustration, tuning a complex regex pipeline, or optimising a layout algorithm all share a common trait: they need many passes, each scored against concrete criteria, before the result is good enough to ship. Codex CLI’s scored improvement loop pattern addresses this head-on — combining deterministic evaluation scripts with LLM-as-judge rubric scoring to let the agent iterate autonomously until both quantitative and qualitative thresholds are met 1.
This article walks through the pattern end-to-end: designing the evaluation harness, configuring Codex CLI for long-running scored sessions, interpreting the telemetry, and avoiding the pitfalls that cause loops to stall or regress.
Why Scored Loops Matter
The standard agent loop — plan, act, observe, reflect — works well for tasks with a clear binary success signal such as “tests pass” or “build succeeds” 2. But many real-world problems have gradient success: a generated chart might score 62% on data accuracy while scoring only 40% on visual clarity. A refactored module might pass all tests yet violate half the team’s style conventions.
Scored improvement loops make that gradient explicit. Instead of the agent guessing whether its work is “done,” it runs an evaluation command after every iteration and compares the numbers against predefined thresholds 1. The agent continues until both the deterministic score and the LLM-judge average exceed the target — commonly 90% 1.
flowchart TD
A[Read AGENTS.md + evaluation script] --> B[Run eval: capture baseline scores]
B --> C{Both scores >= threshold?}
C -- Yes --> D[Log final scores & stop]
C -- No --> E[Identify lowest-scoring dimension]
E --> F[Make one focused improvement]
F --> G[Re-run eval command]
G --> H[Log scores + delta]
H --> C
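The decision points in that flowchart reduce to a few small functions. The sketch below is illustrative, not part of any Codex API: it assumes the `eval.sh` / `eval-results.json` contract described later in this article, and the score field names (`overall_score`, `llm_judge_avg`, `dimensions`) are hypothetical.

```python
import json
import subprocess

THRESHOLD = 90  # dual stopping bar used throughout this article

def thresholds_met(scores: dict, threshold: int = THRESHOLD) -> bool:
    """Dual stopping criterion: deterministic score AND judge average."""
    return (scores["overall_score"] >= threshold
            and scores["llm_judge_avg"] >= threshold)

def weakest_dimension(scores: dict) -> str:
    """Pick the lowest-scoring rubric dimension to target next."""
    return min(scores["dimensions"], key=scores["dimensions"].get)

def run_iteration() -> dict:
    """Run the eval command and parse its machine-readable results.
    Assumes eval.sh writes eval-results.json, per the harness below."""
    subprocess.run(["bash", "eval.sh"], check=True)
    with open("eval-results.json") as f:
        return json.load(f)
```

In practice the agent itself drives this loop; the helpers just make the stopping rule and "one focused improvement" target explicit.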
Designing the Evaluation Harness
The evaluation harness is the backbone of the loop. OpenAI’s guidance recommends splitting checks into two complementary layers 1:
Deterministic Checks
These are scripts that produce machine-readable scores for objectively measurable properties. Examples include constraint violation counts, pixel-distance metrics, test pass rates, and performance benchmarks.
#!/usr/bin/env bash
# eval.sh — deterministic scoring for a layout optimisation task
set -euo pipefail
SCORE_FILE="eval-results.json"
# Run layout validator
node validate-layout.js --output "$SCORE_FILE"
# Run performance benchmark
node benchmark.js --append "$SCORE_FILE"
echo "Deterministic eval complete. Results in $SCORE_FILE"
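The validator behind such a script can itself be tiny. Here is a hedged Python sketch of a constraint-based scorer; `score_constraints` and the results-file shape are illustrative stand-ins for whatever `validate-layout.js` would compute.

```python
import json

def score_constraints(violations: int, total_constraints: int) -> int:
    """Map a violation count onto the 0-100 scale the loop compares
    against its threshold (100 = no violations)."""
    if total_constraints == 0:
        return 100  # nothing to violate
    satisfied = total_constraints - violations
    return round(100 * satisfied / total_constraints)

def write_results(path: str, scores: dict) -> None:
    """Persist machine-readable scores for the agent to parse."""
    with open(path, "w") as f:
        json.dump(scores, f, indent=2)
```

The key design property is that the output is a plain number on a known scale, so the agent can compare iterations without interpretation.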
LLM-as-Judge Checks
For subjective qualities — readability, visual resemblance to a reference, prose quality — a second model scores the output against a rubric 1 3. The rubric should define explicit 0–100 scales for each dimension:
{
"rubric": {
"visual_fidelity": "How closely does the output match the reference image? 0=unrecognisable, 100=pixel-perfect",
"code_readability": "How easy is the generated code to understand? 0=obfuscated, 100=exemplary",
"constraint_adherence": "Does the output respect all stated constraints? 0=none, 100=all"
}
}
The OpenAI eval-skills blog post recommends structuring LLM-judge output with --output-schema to guarantee parseable JSON 4:
{
"type": "object",
"properties": {
"overall_pass": { "type": "boolean" },
"score": { "type": "integer", "minimum": 0, "maximum": 100 },
"checks": {
"type": "array",
"items": {
"type": "object",
"properties": {
"dimension": { "type": "string" },
"score": { "type": "integer" },
"reasoning": { "type": "string" }
},
"required": ["dimension", "score", "reasoning"],
"additionalProperties": false
}
}
},
"required": ["overall_pass", "score", "checks"],
"additionalProperties": false
}
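With that schema enforced, parsing the judge's verdict is mechanical. A minimal sketch, assuming the judge's JSON conforms to the schema above:

```python
import json

def judge_average(judge_json: str) -> float:
    """Average the per-dimension scores from the judge's structured output.
    The schema above guarantees 'checks' is a list of scored dimensions."""
    result = json.loads(judge_json)
    checks = result["checks"]
    return sum(c["score"] for c in checks) / len(checks)
```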
Four-Category Success Model
The eval-skills framework suggests evaluating across four goal categories 4:
| Category | What it measures | Example check |
|---|---|---|
| Outcome | Did the task complete? | Output file exists, valid JSON |
| Process | Did the agent follow the intended path? | Correct tools invoked, no unnecessary file writes |
| Style | Does the output follow conventions? | Naming, formatting, documentation |
| Efficiency | Was the path economical? | Command count, token usage, iteration count |
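One way to operationalise the table is a weighted composite score. The weights below are illustrative; the eval-skills post does not prescribe specific values, so tune them to your task.

```python
# Hypothetical per-category weights -- not prescribed by the eval-skills
# framework; adjust to reflect what matters for your task.
WEIGHTS = {"outcome": 0.4, "process": 0.2, "style": 0.2, "efficiency": 0.2}

def composite_score(category_scores: dict) -> float:
    """Combine 0-100 scores from the four goal categories into one number."""
    return sum(WEIGHTS[cat] * score for cat, score in category_scores.items())
```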
Configuring Codex CLI for Scored Sessions
AGENTS.md Setup
Encode the evaluation contract directly in your AGENTS.md so the agent knows where the eval script lives and what the thresholds are:
## Evaluation Loop
- **Eval command:** `bash eval.sh`
- **Results file:** `eval-results.json`
- **Deterministic target:** overall_score >= 90
- **LLM-judge target:** average rubric score >= 90
- **Iteration discipline:** one focused change per cycle
- **Running log:** append each iteration to `PROGRESS.md`
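The `PROGRESS.md` running log can be any append-only format the agent can re-read; this sketch uses a hypothetical one-line-per-iteration record, with the field names chosen for illustration.

```python
from datetime import datetime, timezone

def log_iteration(path: str, iteration: int, scores: dict, change: str) -> None:
    """Append one iteration record so later turns can recover context
    even after the session's own history is compacted."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = (f"- [{stamp}] iter {iteration}: "
            f"det={scores['overall_score']} judge={scores['llm_judge_avg']} "
            f"change: {change}\n")
    with open(path, "a") as f:
        f.write(line)
```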
Config Profile for Long-Running Loops
Scored loops tend to run for many iterations. Use a dedicated config profile with appropriate model and reasoning settings:
[profiles.eval-loop]
model = "gpt-5.5"
model_reasoning_effort = "high"
approval_policy = "on-request"
[profiles.eval-loop.features]
unified_exec = true
Launch with:
codex --profile eval-loop
For particularly long sessions, consider pairing with a PLANS.md file that defines milestones and checkpoints 5, enabling the agent to maintain coherence across dozens of iterations.
The Starter Prompt
OpenAI documents a recommended prompt pattern for initiating the loop 1:
I have a difficult task in this workspace and I want you to run it as an eval-driven improvement loop. Before changing anything: Read AGENTS.md. Find the script or command that scores the current output. Then iterate: make one focused improvement, re-run the eval, log the scores and what changed, and inspect the generated artifacts. Continue until both the overall score and the LLM average are above 90%.
The key elements are: explicit reference to AGENTS.md, single-change discipline, mandatory logging, and dual-threshold stopping criteria.
Tracking Progress with JSONL Telemetry
When running scored loops via codex exec --json, each iteration emits structured events that you can pipe into monitoring dashboards 6:
codex exec --json \
--profile eval-loop \
"Run the eval-driven improvement loop as described in AGENTS.md" \
2>telemetry.jsonl
The turn.completed events include token usage data — input_tokens, cached_input_tokens, and output_tokens — making it straightforward to track cost accumulation across iterations 6. With the v0.125.0 update, reasoning tokens are also reported, giving full visibility into where compute is spent 7.
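Summing usage across iterations takes only a few lines. The event and field names below follow the description above; treat the exact event shape as an assumption to verify against your CLI version.

```python
import json

def total_tokens(jsonl_path: str) -> dict:
    """Sum token usage across turn.completed events in a telemetry file.
    Assumes each event is one JSON object per line, with usage fields
    named as described above."""
    totals = {"input_tokens": 0, "cached_input_tokens": 0, "output_tokens": 0}
    with open(jsonl_path) as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue
            event = json.loads(raw)
            if event.get("type") != "turn.completed":
                continue  # skip non-turn events
            usage = event.get("usage", {})
            for key in totals:
                totals[key] += usage.get(key, 0)
    return totals
```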
sequenceDiagram
participant Dev as Developer
participant Codex as Codex CLI
participant Eval as Eval Script
participant Log as PROGRESS.md
Dev->>Codex: Start scored loop prompt
loop Until both scores >= 90%
Codex->>Eval: Run eval.sh
Eval-->>Codex: Scores + feedback
Codex->>Log: Append iteration record
Codex->>Codex: Identify weakest dimension
Codex->>Codex: Make one focused change
end
Codex-->>Dev: Final scores + artefacts
Common Pitfalls and Mitigations
| Pitfall | Symptom | Mitigation |
|---|---|---|
| Multi-change iterations | Scores fluctuate unpredictably | Enforce single-change discipline in AGENTS.md |
| Score plateau | Same score for 3+ consecutive iterations | Instruct agent to try a fundamentally different approach |
| Metric gaming | Deterministic score rises but output quality drops | Ensure LLM-judge rubric covers qualitative dimensions |
| Context exhaustion | Agent loses track of progress in long sessions | Use PROGRESS.md running log + PLANS.md milestones 5 |
| Eval script fragility | Eval crashes on unexpected output format | Wrap eval in error handling; fail gracefully with score 0 |
| Runaway cost | Token spend spirals over many iterations | Set a max_turns limit or monitor reasoning tokens via telemetry 7 |
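Plateau detection in particular is easy to automate outside the agent: keep the logged deterministic scores in a list and check the most recent window. A window of three matches the symptom in the table above.

```python
def plateaued(score_history: list, window: int = 3) -> bool:
    """True if the last `window` iterations produced the same score,
    signalling the agent should try a fundamentally different approach."""
    if len(score_history) < window:
        return False  # not enough data to call a plateau
    recent = score_history[-window:]
    return len(set(recent)) == 1
```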
When to Use Scored Loops
This pattern shines for specific problem shapes 1:
- Gradient-success tasks — where “better” is measurable but “done” is subjective
- Visual or creative outputs — image generation, CSS layouts, chart design
- Optimisation problems — performance tuning, compression ratios, algorithm refinement
- Multi-constraint satisfaction — outputs must balance competing requirements
It is overkill for binary-outcome tasks (tests pass/fail, build succeeds/fails) where the standard agent loop already provides clear termination signals.
Headless CI Integration
Scored loops integrate naturally with CI pipelines via codex exec. A GitHub Actions workflow might run the loop on a schedule to optimise a performance-critical module:
- name: Run scored improvement loop
  uses: openai/codex-action@v1
  with:
    codex-args: "--profile eval-loop"
    prompt-file: ".codex/prompts/perf-optimise.md"
    sandbox: "workspace-write"
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Pair this with a step that parses eval-results.json and fails the job if scores regress below a floor, creating a ratchet that prevents quality from slipping 4.
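That ratchet step might look like the following sketch. The results file follows the harness described earlier; `baseline-scores.json` is a hypothetical floor file committed to the repo, and the metric names are illustrative.

```python
import json

def check_ratchet(results_path: str, floor_path: str) -> bool:
    """Return False if any tracked metric regressed below its floor.
    A missing metric in the results counts as a regression."""
    with open(results_path) as f:
        results = json.load(f)
    with open(floor_path) as f:
        floors = json.load(f)
    return all(results.get(metric, 0) >= floor
               for metric, floor in floors.items())
```

A CI step would call this and exit non-zero when it returns `False`, failing the job so the regressed output never merges.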
Model Selection for Scored Loops
| Phase | Recommended model | Rationale |
|---|---|---|
| Implementation changes | gpt-5.5 | Complex reasoning for targeted improvements 8 |
| LLM-as-judge scoring | gpt-5.5 or gpt-5.4 | Balanced accuracy for rubric evaluation 3 |
| Quick deterministic checks | gpt-5.3-codex-spark | Fast, low-cost for script execution 8 |
⚠️ Using the same model as both implementer and judge can introduce self-preference bias. Where budget allows, consider using a different model for the judge step 3.
Conclusion
Scored improvement loops transform Codex CLI from a single-shot code generator into a persistent optimisation engine. The pattern is straightforward: define what “good” looks like with deterministic checks and LLM rubrics, enforce single-change iteration discipline, and let the agent grind through cycles until both scores clear the bar. For gradient-success problems that resist one-shot solutions, this is the most reliable approach available in the current agentic coding toolkit.
Citations
1. OpenAI, “Iterate on difficult problems,” Codex Use Cases, https://developers.openai.com/codex/use-cases/iterate-on-difficult-problems
2. OpenAI, “Unrolling the Codex agent loop,” OpenAI Blog, https://openai.com/index/unrolling-the-codex-agent-loop/
3. SurePrompts, “LLM-as-Judge: A Practical Guide to Automating Prompt Evaluation (2026),” https://sureprompts.com/blog/llm-as-judge-prompting-guide
4. OpenAI, “Testing Agent Skills Systematically with Evals,” OpenAI Developers Blog, https://developers.openai.com/blog/eval-skills
5. OpenAI, “Using PLANS.md for multi-hour problem solving,” OpenAI Cookbook, https://developers.openai.com/cookbook/articles/codex_exec_plans
6. OpenAI, “Non-interactive mode,” Codex Documentation, https://developers.openai.com/codex/noninteractive
7. OpenAI, “Codex CLI Changelog — v0.125.0,” Codex Changelog, https://developers.openai.com/codex/changelog
8. OpenAI, “Models,” Codex Documentation, https://developers.openai.com/codex/models