Sketchnote diagram for: The Silent Guessing Problem: Why AI Coding Agents Don't Ask Clarifying Questions and What AMBIG-SWE Means for Codex CLI

The Silent Guessing Problem: Why AI Coding Agents Don’t Ask Clarifying Questions and What AMBIG-SWE Means for Codex CLI

A Carnegie Mellon research team has published one of the most practically important findings for agentic coding in 2026: when given ambiguous instructions, AI coding agents almost never ask for clarification. They guess silently — and get it wrong far more often than they should. The AMBIG-SWE paper, accepted at ICLR 2026¹, quantifies this problem and offers a framework that every Codex CLI user should understand.

The Core Finding: Agents Default to Silence

Vijayvargiya et al. constructed a benchmark of 500 issues derived from SWE-Bench Verified, synthetically reducing detail to simulate the kind of underspecified bug reports and feature requests that arrive in real-world repositories¹. The results are striking: when agents were given these ambiguous instructions in a non-interactive setting, resolve rates dropped to as low as 37.94% for Claude Sonnet 3.5, compared to 59.52% when the same model could ask clarifying questions — a 57% relative recovery¹.

The most damaging finding is behavioural, not architectural: models default to non-interactive behaviour unless explicitly prompted to ask questions¹. Without strong encouragement, agents almost never interact, even when instructions are severely underspecified. Qwen 3 Coder achieved a 100% false negative rate across all prompt configurations — it never once detected that it lacked sufficient information¹.

The Three-Capability Framework

AMBIG-SWE decomposes the problem into three capabilities that agents need but largely lack¹:

flowchart LR
    A["1. Detection\nRecognise ambiguity"] --> B["2. Questioning\nAsk targeted questions"]
    B --> C["3. Integration\nUse answers effectively"]
    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#bfb,stroke:#333

Detection — can the agent identify when instructions are incomplete? Claude Sonnet 4 achieved 89% accuracy with strong encouragement, but dropped to 74% without it¹.
Questioning — can it ask the right questions? Claude Sonnet 4 obtained equivalent information gains to Qwen 3 Coder using 50% fewer questions, by employing an “exploration-first” strategy — examining the codebase before asking the user¹.
Integration — can it actually use the answers? Even when models received navigation hints, Qwen 3 Coder’s resolve rate declined from 47.60% to 40.20%, suggesting some models cannot effectively integrate new information¹.

Why This Matters for Codex CLI Users

Every Codex CLI session operates on a spectrum from fully interactive (the TUI) to fully non-interactive (codex exec). The AMBIG-SWE findings map directly onto this spectrum.

The Interactive Gap

In a standard interactive session, Codex CLI can theoretically pause and ask you questions. In practice, GPT-5.3-Codex and GPT-5.4 share the same behavioural bias the paper identifies: they prefer to guess and proceed rather than interrupt your flow with clarifying questions². This is partly by design — the official Codex system prompt includes an “autonomy directive” that biases the agent toward action³ — and partly an artefact of RLHF training that rewards task completion over collaborative disambiguation.

The Non-Interactive Cliff

For codex exec users — CI/CD pipelines, background automation, scheduled tasks — the problem is structural. Non-interactive mode cannot ask questions at all⁴. Every ambiguous instruction becomes a silent guess. The AMBIG-SWE data suggests this costs you roughly 20 percentage points of resolve rate compared to what an interactive, clarification-seeking agent could achieve¹.

Practical Mitigation Patterns

The good news: Codex CLI already provides mechanisms to address each capability in the AMBIG-SWE framework, even if they were not designed with this paper in mind.

Pattern 1: Plan Mode as Forced Disambiguation

OpenAI’s own best practices guide explicitly recommends plan mode for ambiguous tasks: “Plan mode lets Codex gather context, ask clarifying questions, and build a stronger plan before implementation”⁵. Toggling /plan or pressing Shift+Tab activates a mode where the agent must articulate its understanding before acting.

This directly addresses AMBIG-SWE’s detection and questioning capabilities. By forcing the agent to produce a plan, you create a natural checkpoint where missing information becomes visible:

# In the TUI, start with plan mode for ambiguous tasks
codex
# Then type: /plan
# Then provide your ambiguous task description

The plan output reveals the agent’s assumptions. If it assumed the wrong database, the wrong branch, or the wrong test framework, you catch it before any code is written.

Pattern 2: The Interview Pattern

The best practices guide also recommends a pattern that mirrors AMBIG-SWE’s ideal interactive agent: “Ask Codex to interview you. If you have a rough idea of what you want but aren’t sure how to describe it well, ask Codex to question you first”⁵.

In practice, this means prefixing your prompt:

Before implementing anything, ask me 3-5 clarifying questions about
the requirements. I want to fix the authentication flow in the
user service, but I'm not sure about the exact scope.

This inverts the default behaviour: instead of the agent guessing silently, it must explicitly surface its uncertainties.

Pattern 3: AGENTS.md as Ambient Disambiguation

The AMBIG-SWE paper found that exploration-first strategies — examining the codebase before asking — produced more efficient questioning¹. AGENTS.md serves as pre-emptive disambiguation by encoding the answers to common questions before the agent even starts:

# AGENTS.md

## Architecture
- Authentication uses OAuth 2.0 PKCE flow via `/auth` service
- Database is PostgreSQL 16 with pgvector extension
- All API responses follow RFC 7807 Problem Details format

## Conventions
- Tests use pytest with Testcontainers — never mock the database
- Branch naming: `feat/JIRA-123-short-description`
- PRs require at least one approval before merge

## Constraints
- NEVER modify migration files after they have been applied
- All new endpoints MUST have OpenAPI annotations
- If a task is ambiguous, output your assumptions and stop

That last line — “If a task is ambiguous, output your assumptions and stop” — is the AGENTS.md equivalent of AMBIG-SWE’s “strong encouragement” prompt. It explicitly instructs the agent to detect and surface ambiguity rather than guessing through it⁶.

Pattern 4: Structured Prompts for codex exec

For non-interactive pipelines where the agent cannot ask questions, the solution is to eliminate ambiguity at the prompt level. OpenAI recommends a four-component prompt structure⁵:

codex exec --full-auto "$(cat <<'PROMPT'
Goal: Add rate limiting to the /api/users endpoint
Context: The rate limiter should use Redis (already running on
  localhost:6379). See src/middleware/rateLimit.ts for the existing
  pattern used on /api/auth.
Constraints: Use the sliding window algorithm. Maximum 100 requests
  per minute per API key. Follow the existing middleware pattern exactly.
Done when: All existing tests pass AND new tests in
  src/__tests__/rateLimit.test.ts cover the happy path, over-limit
  response (429), and window reset.
PROMPT
)"

Each component maps to AMBIG-SWE’s framework: Context provides the information the agent would otherwise need to ask about; Constraints eliminate the ambiguity space; Done when provides verification criteria that remove subjective interpretation.

Pattern 5: MCP Elicitation for Custom Servers

Codex CLI v0.119.0 introduced MCP elicitation — the ability for custom MCP servers to interrupt an agent turn and request structured user input via mcpServer/elicitation/request⁷. This is the closest Codex CLI comes to the interactive clarification loop that AMBIG-SWE demonstrates:

sequenceDiagram
    participant Agent as Codex Agent
    participant MCP as Custom MCP Server
    participant User as Developer
    Agent->>MCP: Tool call (ambiguous parameters)
    MCP->>User: Elicitation request (structured form)
    User->>MCP: Response (selected option)
    MCP->>Agent: Tool result (disambiguated)

For teams building custom MCP servers, this provides a mechanism to enforce clarification at tool boundaries rather than relying on the model’s self-awareness of ambiguity.

Pattern 6: Steer Mode for Mid-Turn Correction

When an agent has already started executing but you notice it has silently guessed wrong, steer mode (press Enter to interrupt immediately, Tab to queue for next turn) provides real-time correction⁸. This is a reactive version of AMBIG-SWE’s interaction loop — you supply the missing information after observing the agent’s behaviour rather than before.

The combination of plan mode (proactive) and steer mode (reactive) covers both ends of the disambiguation timeline.

The Exploration-First Lesson

AMBIG-SWE’s most actionable finding for Codex CLI practitioners is the exploration-first strategy: Claude Sonnet 4 achieved the same information gains as Qwen 3 Coder using half the questions by first examining the codebase and then asking targeted questions about what remained unclear¹.

This maps directly to Codex CLI’s Auto approval mode, where the agent can read files freely before asking permission to write⁹. An AGENTS.md instruction that encourages this pattern:

## Workflow
When given an ambiguous task:
Read all relevant files in the affected directories first
Check existing tests for behavioural expectations
Review recent git log for context on recent changes
THEN propose your plan — listing any assumptions you are making
Wait for confirmation before implementing

This encodes the exploration-first strategy as a repeatable workflow. The agent gathers context from the codebase (which costs no human time) and only then surfaces its remaining uncertainties (which requires human attention).

The Bigger Picture: Interactive Agents as a Design Goal

The AMBIG-SWE paper concludes that dedicated training approaches are needed rather than relying on prompt engineering alone¹. Current models were not trained to detect and act on ambiguity — they were trained to complete tasks. Until model providers address this at the training level, the burden falls on practitioners to create environments that compensate for the agent’s bias toward silent guessing.

For Codex CLI users, this means:

Default to plan mode for any task you cannot specify in four precise sentences
Write AGENTS.md files that answer the questions your agent would ask if it were designed to ask them
Structure codex exec prompts with the four-component framework to eliminate ambiguity before submission
Use steer mode when you see the agent heading down the wrong path
Build MCP servers with elicitation support for domain-specific disambiguation

The 74% improvement that AMBIG-SWE demonstrates is not an incremental gain — it is the difference between an agent that silently produces plausible-looking wrong code and one that collaborates with you to produce correct code. The tools exist in Codex CLI today. The research tells us we need to use them more deliberately.

Citations

Vijayvargiya, S., Zhou, X., Yerukola, A., Sap, M., & Neubig, G. (2026). “AMBIG-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering.” ICLR 2026. arXiv:2502.13069 ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷ ↩⁸ ↩⁹ ↩¹⁰ ↩¹¹ ↩¹² ↩¹³
OpenAI. “Codex CLI Features.” developers.openai.com/codex/cli/features ↩
OpenAI. “Codex Prompting Guide.” OpenAI Cookbook, 2026. developers.openai.com/codex/learn/prompting ↩
OpenAI. “Non-interactive mode — Codex.” developers.openai.com/codex/noninteractive ↩
OpenAI. “Best Practices — Codex.” developers.openai.com/codex/learn/best-practices ↩ ↩² ↩³
Blake Crosley. “AGENTS.md Patterns: What Actually Changes Agent Behavior.” blakecrosley.com/blog/agents-md-patterns ↩
OpenAI. “Codex CLI Changelog — v0.119.0.” developers.openai.com/codex/changelog ↩
SmartScope. “Codex Plan Mode: Stop Code Drift with Plan→Execute (2026).” smartscope.blog ↩
OpenAI. “Custom Instructions with AGENTS.md.” developers.openai.com/codex/guides/agents-md ↩