Session Resilience at Last: How Codex CLI v0.141–v0.142 Stopped Exec-Server and MCP Sessions Dying on Transient Disconnects

For months, long-running Codex CLI sessions had a notorious failure mode: a transient network hiccup (a Wi-Fi blip, a VPN reconnection, a cloud proxy rotating its signed URL) would silently kill exec-server processes and stdio MCP server connections. The session was left permanently degraded, with every subsequent tool call returning "Transport closed" ¹. Fresh codex exec subprocesses on the same machine continued working perfectly, confirming that the problem lay in session-level connection management rather than in the servers themselves ².

Versions 0.141.0 (18 June 2026) and 0.142.0 (22 June 2026) shipped a coordinated set of fixes that finally make exec-server processes and stdio MCP sessions survive transient disconnects, including signed-URL refresh and retry-safe stdin writes ³⁴. This article examines the problem, the architectural response, and what you need to configure to benefit.

The Problem: One-Shot Initialisation and No Recovery Path

The root cause was architectural. Codex’s MCP client used AsyncManagedClient with a cached Shared<Future> startup result — initialisation was effectively one-shot ⁵. RmcpClient had Connecting → Ready states but no explicit Disconnected or Reconnecting lifecycle ⁵. When the underlying stdio pipe or WebSocket dropped, the client had nowhere to go.

stateDiagram-v2
    [*] --> Connecting
    Connecting --> Ready : initialise succeeds
    Ready --> Dead : transport drops
    Dead --> Dead : all subsequent calls fail
    note right of Dead : No recovery path existed
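
To make the one-shot behaviour concrete, here is a minimal Python sketch. It is not Codex's Rust code: OneShotClient, FlakyTransport, and connect are hypothetical names, and the only idea taken from the issue report is that the startup result is computed once and cached, so a dead transport keeps being handed back to every later call.

```python
import asyncio


class FlakyTransport:
    """Hypothetical transport that can be switched off to simulate a drop."""

    def __init__(self):
        self.alive = True

    def send(self, tool):
        if not self.alive:
            raise ConnectionError("Transport closed")
        return f"ok:{tool}"


class OneShotClient:
    """Toy client whose startup result is created once and never reset."""

    def __init__(self, connect):
        self._connect = connect   # coroutine function that opens a transport
        self._startup = None      # cached startup task

    async def call(self, tool):
        if self._startup is None:
            self._startup = asyncio.ensure_future(self._connect())
        # Every call awaits the same cached result, even after the
        # transport behind it has died.
        transport = await self._startup
        return transport.send(tool)


async def demo():
    opened = []

    async def connect():
        transport = FlakyTransport()
        opened.append(transport)
        return transport

    client = OneShotClient(connect)
    first = await client.call("search")
    opened[0].alive = False       # simulate a transient disconnect
    errors = []
    for _ in range(2):
        try:
            await client.call("search")
        except ConnectionError as exc:
            errors.append(str(exc))
    return first, errors, len(opened)
```

Running demo() shows one successful call followed by repeated failures, with connect invoked only once: nothing in the client ever discards the cached startup result.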

The model API’s SSE streams already had retry-with-backoff logic, but the MCP transport path had no equivalent resilience ⁵. Practically, recovery meant either issuing a manual RefreshMcpServers command, triggering config/mcpServer/reload, or restarting the entire CLI process ⁵.

For interactive sessions, that was merely annoying. For codex exec pipelines and long-horizon goal threads running unattended, it was catastrophic — a brief network interruption could silently derail hours of autonomous work.

What v0.141 and v0.142 Changed

The fixes spanned at least four pull requests (#28374, #28512, #28546, #28895) across the two releases ⁶. Together they addressed three distinct resilience surfaces.

1. Exec-Server Process Survival

The exec-server — Codex’s sandboxed command execution and filesystem operations layer, which can run in the same process as the app-server or be split across machines ⁷ — now maintains process state across transient connection drops. Previously, a broken WebSocket between the app-server and a remote exec-server would orphan running processes. The new implementation preserves the process handle and output stream, reconnecting the notification channel (process/outputDelta, process/exited) once the transport recovers ⁷.
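
One way to picture "preserving the output stream" is a small buffer that holds notifications while no client is attached and flushes them in order on reconnect. The sketch below is an illustration under that assumption only; ProcessNotifier and the parameter shapes are hypothetical, and the source does not describe how the exec-server actually stores pending output.

```python
from collections import deque


class ProcessNotifier:
    """Queues process notifications while the channel is down."""

    def __init__(self):
        self._pending = deque()
        self._sink = None  # callable(method, params), or None while disconnected

    def attach(self, sink):
        """Reconnect the channel and flush anything queued in the meantime."""
        self._sink = sink
        while self._pending:
            method, params = self._pending.popleft()
            sink(method, params)

    def detach(self):
        """Mark the channel as dropped; the process itself keeps running."""
        self._sink = None

    def emit(self, method, params):
        if self._sink is None:
            self._pending.append((method, params))
        else:
            self._sink(method, params)


def demo():
    received = []
    sink = lambda method, params: received.append((method, params))
    notifier = ProcessNotifier()
    notifier.attach(sink)
    notifier.emit("process/outputDelta", {"chunk": "building"})
    notifier.detach()                      # transport drops mid-run
    notifier.emit("process/outputDelta", {"chunk": "done"})
    notifier.emit("process/exited", {"exitCode": 0})
    notifier.attach(sink)                  # transport recovers
    return received
```

In this model the client sees every notification exactly once and in the original order, regardless of when the drop happened.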

2. Stdio MCP Session Continuity

Stdio MCP servers — launched via command with optional args, env, and cwd configuration ⁸ — now survive the same class of transient disconnects. The fix introduces transport-layer reconnection: when the stdio pipe signals an error, the client recreates the transport, re-runs initialize and list-tools, and retries the failed request ⁵⁶.

stateDiagram-v2
    [*] --> Connecting
    Connecting --> Ready : initialise succeeds
    Ready --> Disconnected : transport drops
    Disconnected --> Reconnecting : detect error
    Reconnecting --> Ready : re-initialise + list-tools
    Reconnecting --> Failed : max retries exceeded
    Ready --> [*] : explicit shutdown
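
The lifecycle in the diagram can be sketched in a few lines of Python. This is a simplified model, not the Rust client: ReconnectingClient, FakeTransport, and the max_retries default are hypothetical (the real retry count is undocumented), while initialize, tools/list, and tools/call are the standard MCP method names.

```python
import enum


class State(enum.Enum):
    CONNECTING = "connecting"
    READY = "ready"
    DISCONNECTED = "disconnected"
    RECONNECTING = "reconnecting"
    FAILED = "failed"


class TransportError(Exception):
    pass


class ReconnectingClient:
    def __init__(self, make_transport, max_retries=3):
        self._make_transport = make_transport  # factory for a fresh transport
        self._max_retries = max_retries        # assumed value, not from the docs
        self.state = State.CONNECTING
        self.history = [self.state]
        self.tools = []
        self._handshake()
        self._set(State.READY)

    def _set(self, state):
        self.state = state
        self.history.append(state)

    def _handshake(self):
        # Recreate the transport, then re-run initialize and tool discovery.
        self._transport = self._make_transport()
        self._transport.request("initialize")
        self.tools = self._transport.request("tools/list")

    def request(self, method):
        if self.state is State.FAILED:
            raise TransportError("client has permanently failed")
        try:
            return self._transport.request(method)
        except TransportError:
            self._set(State.DISCONNECTED)
        self._set(State.RECONNECTING)
        for _ in range(self._max_retries):
            try:
                self._handshake()
            except TransportError:
                continue
            self._set(State.READY)
            # Retry the request that originally failed, once.
            return self._transport.request(method)
        self._set(State.FAILED)
        raise TransportError("max retries exceeded")


class FakeTransport:
    """Test double that serves a fixed number of requests, then breaks."""

    def __init__(self, budget):
        self.budget = budget

    def request(self, method):
        if self.budget <= 0:
            raise TransportError("Transport closed")
        self.budget -= 1
        return ["echo"] if method == "tools/list" else f"ok:{method}"
```

Note that detection here is reactive: the state only leaves Ready when a request actually fails, which matches the behaviour described later under "What Remains Open".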

3. Signed-URL Refresh and Retry-Safe Stdin

Remote exec-server connections authenticated via signed URLs now handle credential rotation transparently ³. When a signed URL expires mid-session, the client refreshes it before retrying. Stdin writes are made retry-safe — idempotent markers prevent duplicate input delivery if a write must be replayed after reconnection ³.
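
The changelog does not say what the idempotency markers look like. A common way to get the same guarantee is to number each write and have the receiver ignore numbers it has already applied. The sketch below assumes that scheme; StdinSender, StdinReceiver, and the drop_ack flag are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StdinReceiver:
    """Receiving side: applies each numbered chunk at most once."""
    last_seq: int = 0
    delivered: list = field(default_factory=list)

    def write(self, seq, data):
        if seq <= self.last_seq:
            return False              # replayed duplicate, already applied
        if seq != self.last_seq + 1:
            raise ValueError(f"gap in stdin sequence at {seq}")
        self.delivered.append(data)
        self.last_seq = seq
        return True


@dataclass
class StdinSender:
    """Sending side: keeps writes until they are known to have landed."""
    next_seq: int = 1
    unacked: dict = field(default_factory=dict)

    def send(self, receiver, data, drop_ack=False):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = data
        receiver.write(seq, data)
        if not drop_ack:              # drop_ack simulates losing the reply
            del self.unacked[seq]

    def replay(self, receiver):
        """After reconnecting, resend everything not yet acknowledged."""
        for seq in sorted(self.unacked):
            receiver.write(seq, self.unacked[seq])
            del self.unacked[seq]


def demo():
    sender, receiver = StdinSender(), StdinReceiver()
    sender.send(receiver, b"ls\n")
    sender.send(receiver, b"make\n", drop_ack=True)  # delivered, ack lost
    sender.replay(receiver)                          # safe to resend
    return receiver.delivered, sender.unacked
```

The key property is that replay can be called blindly after any reconnect: a write that did arrive the first time is recognised by its sequence number and not delivered again.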

The App-Server Architecture Context

Understanding where these fixes sit requires a quick look at Codex’s two-process architecture ⁷:

Component	Responsibility	Transport
App-server	Conversations, model interaction, thread management	stdio (JSONL), WebSocket, Unix socket
Exec-server	Sandboxed command execution, filesystem operations	WebSocket (experimental since v0.119.0)

The app-server implements bidirectional JSON-RPC 2.0 messaging ⁷. When request ingress is full, it rejects new requests with error code -32001 (Server overloaded; retry later) and clients are expected to implement exponential backoff with jitter ⁷. The exec-server exposes a WebSocket endpoint for remote or app-server workflows ⁷.
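
A client-side retry loop for the overload case might look like the following sketch. The error code -32001 comes from the documentation cited above; everything else (the function name, attempt count, base delay, cap, and the choice of "full jitter") is an assumption made for illustration, and send stands in for whatever function delivers a request and returns the decoded response.

```python
import random
import time

OVERLOADED = -32001  # "Server overloaded; retry later"


def call_with_backoff(send, request, max_attempts=5, base=0.25, cap=8.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `request` while the server reports overload.

    Uses exponential backoff with full jitter: each delay is drawn
    uniformly from [0, min(cap, base * 2**attempt)).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send(request)
        error = response.get("error") or {}
        if error.get("code") != OVERLOADED:
            return response           # success, or an error we should not retry
        if attempt + 1 < max_attempts:
            sleep(min(cap, base * 2 ** attempt) * rng())
    return response                   # still overloaded after the last attempt


def demo():
    replies = iter([
        {"id": 1, "error": {"code": OVERLOADED, "message": "Server overloaded; retry later"}},
        {"id": 1, "error": {"code": OVERLOADED, "message": "Server overloaded; retry later"}},
        {"id": 1, "result": "ok"},
    ])
    delays = []
    response = call_with_backoff(lambda req: next(replies), {"id": 1},
                                 sleep=delays.append, rng=lambda: 1.0)
    return response, delays
```

Injecting sleep and rng keeps the function testable without real waiting; in production use the defaults apply.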

Threads have a lifecycle of create → work → compact → archive → restore, with a 30-minute grace period before idle threads with no subscribers are unloaded ⁷. The turn/steer API appends input to an in-flight turn without creating a new one, which matters when a client disconnects and reconnects while work is still in progress ⁷.

flowchart LR
    subgraph Client["CLI / Desktop App"]
        TUI["TUI / UI"]
    end
    subgraph AppServer["App Server"]
        TP["Thread Pool"]
        MCP["MCP Client Manager"]
    end
    subgraph ExecServer["Exec Server"]
        SB["Sandbox"]
        FS["Filesystem"]
    end
    subgraph MCPServers["Stdio MCP Servers"]
        S1["Server A"]
        S2["Server B"]
    end

    TUI -->|"JSON-RPC (stdio/WS)"| TP
    TP -->|"WebSocket"| SB
    MCP -->|"stdio pipe"| S1
    MCP -->|"stdio pipe"| S2

    style MCP fill:#f96,stroke:#333
    style SB fill:#f96,stroke:#333

The orange-highlighted components — the MCP client manager and the sandbox connection — are where the resilience fixes land.

Configuration: Required vs Optional MCP Servers

The required flag in your MCP server configuration now interacts with the resilience layer ⁸. If you set required = true on a server and it fails to initialise, thread/start and thread/resume both fail rather than degrading gracefully ⁸. With the new reconnection logic, a transiently failing required server will be retried before the thread is declared failed.

# .codex/config.toml — MCP server with resilience-aware settings
[mcp_servers.my-tools]
command = "npx"
args = ["-y", "@my-org/mcp-tools"]
required = true
startup_timeout_sec = 15
tool_timeout_sec = 120

The startup_timeout_sec (default 10 seconds) and tool_timeout_sec (default 60 seconds) settings are worth tuning for servers that need warm-up time or perform heavy operations ⁸. The reconnection logic respects both timeouts when it re-initialises a server.

MCP Startup Status Notifications

v0.141 added richer MCP server status visibility for app-server integrations via mcpServer/startupStatus/updated notifications ⁴. Clients can now observe the full lifecycle:

Server starting
Server ready (tools discovered)
Server disconnected (transient)
Server reconnecting
Server failed (permanent)

This is particularly useful for desktop app and IDE integrations that need to surface MCP health in their UI.
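
An integration could track these notifications with a small handler like the one below. Only the method name mcpServer/startupStatus/updated comes from the changelog; the params fields (name, status) and the status strings are assumptions made for this sketch, so check the actual payload before relying on them.

```python
import json

STATUS_METHOD = "mcpServer/startupStatus/updated"


class McpHealthTracker:
    """Keeps the latest reported status for each MCP server."""

    def __init__(self):
        self.status = {}

    def handle_line(self, line):
        """Process one JSON message; return True if it was a status update."""
        message = json.loads(line)
        if message.get("method") != STATUS_METHOD:
            return False
        params = message.get("params", {})
        # Field names below are hypothetical, not taken from the docs.
        self.status[params["name"]] = params["status"]
        return True

    def unhealthy(self):
        """Names of servers whose last reported status is not 'ready'."""
        return sorted(n for n, s in self.status.items() if s != "ready")


def demo():
    tracker = McpHealthTracker()
    lines = [
        '{"method": "mcpServer/startupStatus/updated", "params": {"name": "my-tools", "status": "ready"}}',
        '{"method": "mcpServer/startupStatus/updated", "params": {"name": "docs", "status": "reconnecting"}}',
        '{"method": "turn/started", "params": {}}',
    ]
    handled = [tracker.handle_line(line) for line in lines]
    return handled, tracker.unhealthy()
```

A UI can call unhealthy() after each message to decide whether to show a warning badge.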

What Remains Open

The resilience improvements in v0.141–v0.142 are significant but not complete. Several issues remain open on the repository:

No proactive health monitoring. There is still no heartbeat or periodic health check to detect a dropped MCP transport before the next tool call fails ⁵. Detection remains reactive — the reconnection logic triggers on the first failed operation after a disconnect.
Startup-time failures. Servers that never connect initially (as distinct from those that connect and then drop) still lack an automatic retry mechanism beyond the initial startup_timeout_sec window ⁹. Only the built-in codex_apps server benefits from cached tool snapshots for survivorship ⁹.
WebSocket fallback delays. Reconnection through WebSocket transports can still experience delays, particularly with idle connections and subagent-triggered stream disconnects ¹⁰.

⚠️ The exact retry count and backoff strategy for the new MCP reconnection logic are not documented in the public changelog or configuration reference. The values appear to be hard-coded in the Rust client implementation.

Practical Implications

For Interactive Sessions

The days of manually restarting MCP servers mid-session should be largely over. If you previously ran /mcp reload after every VPN reconnection or laptop lid-close, the new transport-layer reconnection handles this automatically.

For codex exec Pipelines

Unattended codex exec runs are the primary beneficiary. A CI runner experiencing a brief network interruption no longer produces silently degraded results — the exec-server process survives and MCP sessions reconnect.

For Remote Development

Combined with the Noise Protocol relay channels also shipped in v0.141 ⁴, remote exec-server connections now have both encryption and resilience. Cross-platform remote execution preserves executor-native working directories, shells, AGENTS.md discovery, and sandbox behaviour across operating systems ³⁴.

For Goal Threads

v0.142 also restored goal-first thread persistence — threads with goals are once again persisted and returned by thread/list ³. Combined with session resilience, this means long-horizon goal threads can survive both context compaction and transient disconnects without losing their objective.

Upgrading

# Update to the latest stable release
codex update

# Verify your version
codex --version
# Should show 0.142.0 or later

# Check MCP server health
codex doctor
# The Authentication and Background Server sections
# now reflect reconnection-capable MCP state

If you are pinning versions for your team (via requirements.toml), update your minimum:

[codex]
min_version = "0.142.0"
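
If you also gate CI jobs on the installed version, compare version numbers numerically rather than as strings, since "0.99.0" sorts after "0.142.0" in a plain string comparison. The helper below is a sketch: meets_minimum is a hypothetical name, it assumes the version is the last whitespace-separated token of whatever text you pass in, and it does not handle pre-release suffixes.

```python
def parse_version(text):
    """Turn text ending in 'X.Y.Z' into a tuple of ints, e.g. (0, 142, 0)."""
    token = text.strip().split()[-1]
    return tuple(int(part) for part in token.split("."))


def meets_minimum(installed, minimum="0.142.0"):
    """True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)
```

For example, meets_minimum("0.141.0") is False while meets_minimum("0.142.0") is True.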

Citations

1. GitHub Issue #16899: CLI session loses stdio MCP connections after initial successful calls; fresh codex exec still works. https://github.com/openai/codex/issues/16899
2. GitHub Issue #16899, reporter observation: fresh subprocess probes maintain 100% success rate while session-level calls fail. https://github.com/openai/codex/issues/16899
3. OpenAI Codex Changelog, v0.142.0 (22 June 2026): “Exec-server processes and stdio MCP sessions now survive transient disconnects, including signed-URL refresh and retry-safe stdin writes.” https://developers.openai.com/codex/changelog
4. OpenAI Codex Changelog, v0.141.0 (18 June 2026): “Authenticated, end-to-end encrypted Noise relay channels” for remote executors; richer MCP server status visibility. https://developers.openai.com/codex/changelog
5. GitHub Issue #11489: MCP client does not auto-reconnect after disconnect. Documents AsyncManagedClient one-shot initialisation, missing Disconnected/Reconnecting lifecycle, and lack of health monitoring. https://github.com/openai/codex/issues/11489
6. Releasebot, Codex Updates June 2026. PRs #28374, #28512, #28546, #28895 referenced for resilience improvements. https://releasebot.io/updates/openai/codex
7. OpenAI Developer Documentation, App Server. Architecture documentation covering JSON-RPC 2.0 protocol, thread lifecycle, process/spawn API, and turn/steer mechanics. https://developers.openai.com/codex/app-server
8. OpenAI Developer Documentation, Model Context Protocol. MCP server configuration reference including required, startup_timeout_sec, tool_timeout_sec, and transport options. https://developers.openai.com/codex/mcp
9. GitHub Issue #11489, comment by HamburgJ (20 June 2026): startup-time failures lack recovery; only codex_apps benefits from cached tool snapshots. https://github.com/openai/codex/issues/11489
10. SmartScope, Codex CLI Reconnecting Fix: WebSocket Fallback and Recovery Guide (June 2026). https://smartscope.blog/en/generative-ai/chatgpt/codex-cli-reconnecting-issue-2025/