Codex Cloud Exec Best-of-N: Running Multiple Solution Attempts and Picking the Winner
One of the quieter but most impactful features in Codex CLI’s cloud offering is the --attempts flag on codex cloud exec. First shipped in the June 2025 batch of cloud upgrades[1], it lets you ask Codex Cloud to run two, three, or four independent solution attempts for a single prompt — then pick the best result. Think of it as best-of-N sampling applied to entire agentic coding sessions, not just token generation.
This article covers the mechanics, the CLI workflow, the architecture behind parallel attempts, and practical patterns for integrating best-of-N into your development and CI/CD pipelines.
Why Best-of-N Matters for Agentic Coding
LLM-driven code generation is inherently stochastic. The same prompt can produce a clean, idiomatic refactor on one run and an over-engineered mess on the next. In interactive sessions you can steer the agent; in cloud exec — where you fire and forget — you cannot. Best-of-N addresses this directly: run the task multiple times in parallel, then select the attempt that best fits your criteria[2].
The approach is well-established in ML inference (rejection sampling, best-of-N RLHF reward scoring)[3], but Codex Cloud applies it at the session level. Each attempt gets its own containerised environment, its own agent loop, and its own independent chain of tool calls. The results are genuinely independent, not just resampled final tokens.
The --attempts Flag
The syntax is straightforward[4]:
codex cloud exec --env ENV_ID --attempts 3 "Refactor the auth middleware to use JWTs"
| Flag | Range | Default | Description |
|---|---|---|---|
| `--attempts` | 1–4 | 1 | Number of independent solution attempts |
| `--env` | string | — | Target Codex Cloud environment identifier (required) |
Each attempt runs in an isolated container based on your environment configuration — Ubuntu 24.04 by default, with support for Python, Node.js, Rust, Go, Java, and a dozen other runtimes[5]. Your setup scripts execute independently in each container, so all attempts start from the same clean baseline.
The task metadata returned by codex cloud list --json includes an attempt_total field alongside the familiar id, status, url, and summary fields[4].
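As a sketch of how that field can drive scripted selection, the snippet below filters for multi-attempt tasks with jq. The payload is a hypothetical hand-written sample shaped after those fields, not captured CLI output:

```shell
# Hypothetical sample of `codex cloud list --json` output, shaped after the
# fields named above (id, status, attempt_total, summary).
payload='{"tasks":[{"id":"task_a","status":"completed","attempt_total":3,"summary":"Refactor auth"},{"id":"task_b","status":"completed","attempt_total":1,"summary":"Bump deps"}]}'

# Keep only tasks that ran more than one attempt.
echo "$payload" | jq -r '.tasks[] | select(.attempt_total > 1) | .id'
# → task_a
```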
The Review Workflow
After the attempts complete, you compare results using Codex Cloud’s diff and review commands:
flowchart LR
A["codex cloud exec\n--attempts N"] --> B["N parallel\ncontainers"]
B --> C1["Attempt 1\ncompletes"]
B --> C2["Attempt 2\ncompletes"]
B --> C3["Attempt N\ncompletes"]
C1 --> D["codex cloud diff\n<TASK_ID>"]
C2 --> D
C3 --> D
D --> E["Pick best\nattempt"]
E --> F["codex cloud apply\n<TASK_ID>"]
F --> G["Local working\ntree updated"]
Step 1 — Check Status
codex cloud list --env $ENV_ID --limit 5
Or for scripting:
codex cloud list --env $ENV_ID --json | jq '.tasks[] | {id, status, attempt_total, summary}'
Step 2 — Preview Each Diff
codex cloud diff <TASK_ID>
This previews the patch generated by a specific attempt before you apply anything locally[6]. For multi-attempt tasks, each attempt produces its own diff. You can review them in the web dashboard or via the CLI.
Step 3 — Apply the Winner
Once you have identified the best attempt:
codex cloud apply <TASK_ID>
There is also a top-level shortcut:
codex apply <TASK_ID>
This applies the selected diff to your local working tree[6], letting you run your own test suite, linters, and integration checks before committing.
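The apply-then-verify gate can be sketched as below. The `codex` function is a local stub standing in for the real CLI so the control flow runs offline, the task id is hypothetical, and `true` is a placeholder for your actual test suite:

```shell
# Offline sketch only: stub standing in for the real `codex apply <TASK_ID>`.
codex() { echo "applied diff for $2"; }

TASK_ID="task_a1b2"   # hypothetical task id
codex apply "$TASK_ID"

# Gate the commit on local checks; `true` is a placeholder for your test suite.
if true; then
  echo "checks passed, safe to commit"
else
  echo "checks failed, discard and review another attempt" >&2
fi
```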
Architecture: Parallel Containerised Execution
Each attempt spins up an independent container in Codex Cloud’s infrastructure. The key architectural properties are:
- Isolation: Attempts share no state. Each gets a fresh clone of your repository at the environment’s configured ref, with setup scripts run independently[5].
- Concurrency: Codex Cloud supports concurrent agent threads — currently up to six per environment[7] — so multiple attempts can run in parallel rather than sequentially.
- Deterministic baseline: All attempts start from the same commit and environment configuration, making the diffs genuinely comparable.
- Independent agent loops: Each attempt runs its own full agent loop — reading files, executing shell commands, calling tools — so the approaches can diverge significantly[2].
graph TB
subgraph "Codex Cloud"
ENV["Environment\n(repo + setup scripts)"]
ENV --> C1["Container 1\nAttempt 1"]
ENV --> C2["Container 2\nAttempt 2"]
ENV --> C3["Container 3\nAttempt 3"]
C1 --> R1["Diff 1"]
C2 --> R2["Diff 2"]
C3 --> R3["Diff 3"]
end
R1 --> REV["Review &\nSelection"]
R2 --> REV
R3 --> REV
REV --> LOCAL["Local\nWorking Tree"]
Practical Patterns
Pattern 1: Exploratory Refactoring
When you are unsure which approach is best for a non-trivial refactor, use --attempts 3 and let the agent explore different decompositions:
codex cloud exec --env $ENV_ID --attempts 3 \
"Refactor the payment processing module to separate Stripe and PayPal into strategy classes. Ensure all existing tests pass."
Each attempt may choose different class hierarchies, different levels of abstraction, or different approaches to backwards compatibility. Review all three diffs, pick the cleanest one, apply it, and run your local test suite.
Pattern 2: CI Gate with Automated Selection
For CI/CD pipelines, you can script attempt selection based on objective criteria:
#!/usr/bin/env bash
set -euo pipefail
TASK_ID=$(codex cloud exec --env "$ENV_ID" --attempts 4 \
"Fix the failing integration tests in tests/integration/" \
--json | jq -r '.id')
# Wait for completion (bail out of terminal states; the "failed" status name is assumed)
while true; do
  STATUS=$(codex cloud status "$TASK_ID" --json | jq -r '.status')
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && { echo "Task failed" >&2; exit 1; }
  sleep 30
done
# Apply the task's selected diff, then gate on the integration suite
codex cloud apply "$TASK_ID"
if make test-integration; then
echo "Attempt passed integration tests"
git add -A && git commit -m "fix: integration tests (cloud exec best-of-N)"
else
echo "No passing attempt found"
exit 1
fi
Pattern 3: Morning Batch with Best-of-N
Zack Proser’s workflow of queuing 3–5 Codex tasks before morning coffee[2] becomes more powerful with --attempts. For each task, request two attempts. By the time you are reviewing PRs, you have a choice of approaches for each:
# Queue morning batch
for task in "Update CHANGELOG for v3.2" "Add rate limiting to /api/users" "Migrate user table to new schema"; do
codex cloud exec --env "$ENV_ID" --attempts 2 "$task" &
done
wait
echo "All tasks submitted — review after coffee"
Pattern 4: Plan Locally, Execute with Attempts in Cloud
The official Codex workflow documentation recommends a local-then-cloud pattern[8]: design and negotiate the plan interactively in your IDE, then delegate execution to the cloud. Adding --attempts to this pattern gives you insurance against poor execution of a good plan:
- Open Codex locally in plan mode (`/plan` or `Shift+Tab`)
- Negotiate the implementation approach until satisfied
- Click the cloud icon or use `codex cloud exec` with the refined prompt
- Add `--attempts 2` so you get a backup if the first execution stumbles
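The last two steps can be sketched in a few lines. The `codex` function below is a stub echoing its arguments so the flow can be exercised offline; `ENV_ID` and the prompt are hypothetical:

```shell
# Offline sketch: stub echoing what would be submitted to Codex Cloud.
codex() { echo "submitted: $*"; }

ENV_ID="env_123"   # hypothetical environment id
PROMPT="Implement the agreed plan: split PaymentService into strategy classes"

# Delegate the locally negotiated plan, with a backup attempt for insurance.
codex cloud exec --env "$ENV_ID" --attempts 2 "$PROMPT"
```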
Cost Considerations
Each attempt consumes its own compute and token budget. With --attempts 4, you are paying roughly four times the cloud cost of a single attempt. ⚠️ OpenAI has not published granular per-attempt pricing at the time of writing — costs are bundled into your Codex Cloud credit consumption, which varies by model (GPT-5.4 is recommended for most tasks)[9] and task duration.
The trade-off is straightforward: for low-stakes tasks, a single attempt suffices. For high-value refactors or reliability-critical CI gates, the cost of two or three extra attempts is trivial compared to the developer time saved reviewing and manually fixing a poor first attempt.
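The scaling is easy to sanity-check with shell arithmetic. The per-attempt figure below is a made-up placeholder, since no per-attempt price is published:

```shell
# Placeholder per-attempt cost in credits (not a real published price).
per_attempt=10
attempts=4

# Assumption: cost scales roughly linearly with attempts (no shared caching).
total=$((per_attempt * attempts))
echo "estimated credits for --attempts $attempts: $total"
# → estimated credits for --attempts 4: 40
```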
Comparison with Other Approaches
| Feature | Codex Cloud Best-of-N | Claude Code | Cursor |
|---|---|---|---|
| Parallel independent attempts | `--attempts 1-4` | Not available | Not available |
| Cloud-delegated execution | Yes — fire and forget | No (local only) | No (local only) |
| Diff preview before apply | `codex cloud diff` | N/A | N/A |
| Scriptable selection | JSON output + `jq` | N/A | N/A |
| Isolation level | Per-container | N/A | N/A |
⚠️ This comparison reflects the state as of April 2026. Other tools may add similar capabilities.
Limitations and Known Issues
- Maximum four attempts: The `--attempts` flag caps at 4[4]. For tasks where you want more diversity, you would need to submit multiple separate tasks.
- No automatic ranking: Codex Cloud does not currently score or rank attempts for you — selection is manual (or scripted by you). The preview system in the Codex web dashboard surfaces all attempts side by side, but there is no built-in “pick the best” heuristic[2].
- Environment warmup: Each attempt runs setup scripts independently, which can add latency for environments with heavy dependency installation. ⚠️ It is unclear whether Codex Cloud caches environment layers across attempts within the same task.
- Model selection: You cannot currently choose which model handles your cloud task — Codex picks internally based on task complexity[2]. The recommended model is GPT-5.4[9].
Conclusion
Best-of-N via codex cloud exec --attempts is one of those features that quietly changes how you think about delegating work to an AI agent. It shifts the model from “hope the agent gets it right” to “let the agent explore and I will curate.” For senior developers already comfortable reviewing diffs, the workflow is natural: submit, compare, pick, apply. Combined with codex cloud diff and codex cloud apply, it integrates cleanly into existing development and CI/CD workflows without requiring changes to your local tooling.
The cap of four attempts keeps costs bounded, while the parallel containerised execution ensures you get genuinely independent approaches rather than minor variations. If you are using Codex Cloud for anything beyond trivial tasks, --attempts 2 should probably be your default.
Citations

1. OpenAI, “Introducing upgrades to Codex” (June 2025), openai.com/index/introducing-upgrades-to-codex
2. Zack Proser, “OpenAI Codex Review 2026 — Updated from Daily Use”, zackproser.com/blog/openai-codex-review-2026
3. Nakano et al., “WebGPT: Browser-assisted question-answering with human feedback” (2022), demonstrating best-of-N sampling for reward model selection
4. OpenAI, “Command line options – Codex CLI”, developers.openai.com/codex/cli/reference
5. OpenAI, “CLI – Codex”, developers.openai.com/codex/cli
6. Blake Crosley, “Codex CLI: The Definitive Technical Reference”, blakecrosley.com/guides/codex; Toolsbase, “Codex CLI Cheat Sheet 2026”, toolsbase.dev/en/reference/codex-commands
7. OpenAI, “Features – Codex CLI”, developers.openai.com/codex/cli/features
8. OpenAI, “Workflows – Codex”, developers.openai.com/codex/workflows
9. OpenAI, “Models – Codex”, developers.openai.com/codex/models