Open-Weight Agentic Models Are Closing the Gap: What OpenThoughts-Agent and Tmax Mean for Codex CLI Custom Provider Workflows

Two papers landed on arXiv within 48 hours of each other in late June 2026 and, taken together, they mark a step change in what open-weight models can do inside agentic coding harnesses. OpenThoughts-Agent (OT-Agent) from a 50-researcher consortium released a six-stage data curation pipeline and a 100K-example training set that pushes a fine-tuned Qwen3-32B to 54.0% on SWE-Bench Verified-100 — a 12.1 percentage point jump over the previous best open agentic model.¹ One day earlier, Allen AI’s Tmax delivered an RL-only recipe that gets a 9B model to 27.2% on Terminal-Bench 2.0, approaching Claude Haiku 4.5’s 29.8% at a fraction of the parameter count.²

For Codex CLI users, the practical question is straightforward: can you wire these models into your terminal agent today, and should you? This article covers the research, the benchmark numbers, the Codex CLI custom provider configuration, and the trade-offs you need to weigh.

The Research: Two Complementary Recipes

OpenThoughts-Agent — Data Diversity as the Bottleneck

The OT-Agent team ran over 100 controlled ablation experiments to isolate what matters when curating training data for agentic models.¹ Their six-stage SFT pipeline covers task sourcing, mixing, augmentation, filtering, teacher model selection, and rollout filtering. Three findings stand out:

Task sourcing strategy selection shifts performance by up to 30 percentage points on SWE-Bench. The team evaluated 95 different task generation strategies; synthetic issue-resolution tasks and human-written infrastructure questions scored highest.¹
The best teacher model is not the strongest model. GLM 4.7 produced higher-quality training traces than newer, more capable models — a counterintuitive result suggesting that teacher selection deserves its own ablation budget.¹
Rollout traces with fewer than five tool-use turns degraded downstream performance. Filtering for depth, not just correctness, matters.¹

The resulting OpenThinkerAgent-32B achieves 44.8% averaged across seven benchmarks — SWE-Bench Verified-100, Terminal-Bench 2.0, OpenThoughts-TBLite, Aider Polyglot, BFCL-Parity, GAIA-127, and FinanceAgent-Terminal — versus 40.9% for the previous state-of-the-art Nemotron-Terminal-32B.¹

Tmax — RL Without Reward Models

Tmax takes a different approach: pure reinforcement learning with outcome-only rewards and no learned reward model.² The recipe uses DPPO (Decoupled Proximal Policy Optimisation) with several stability innovations:

FP32 language model head to minimise train-time vs inference-time logprob mismatches that cause training collapse at 200–300 steps.²
Active sampling that filters zero-gradient groups and uses a group size of 32 with 64 maximum tool calls per episode.²

The TMAX-15K dataset comprises 14,600 RL environment instances — 2.5x larger than any prior open terminal-agent dataset — generated through a compositional taxonomy combining difficulty tiers, personas, and verifier diversification.² Domain balance across nine categories reaches 0.998, and the dataset’s pass@8 score of 53% indicates sustained challenge compared to easier alternatives.²

Key results on Terminal-Bench 2.0:²

Model	Parameters	Score
Tmax-9B	9B	27.2%
Tmax-27B	27B	42.7%
Claude Haiku 4.5	—	29.8%
Kimi K2.5	~1T	43.2%

Tmax-27B at 42.7% approaches Kimi K2.5 (43.2%) despite being 10–40x smaller.² Crucially, improvements transfer: the Tmax recipe applied to Qwen3-8B yields a +9.5 point gain on SWE-Bench and +9–15 points across different evaluation harnesses.²

What This Means for Codex CLI

The Custom Provider Path

Codex CLI supports custom model providers through config.toml.³ Since the Chat Completions API was deprecated in favour of the Responses API, any provider — including local inference servers — must implement the Responses wire protocol.⁴ For open-weight models served through Ollama or vLLM, the built-in --oss flag simplifies configuration.⁵

A minimal provider configuration for a locally served OpenThinkerAgent-32B:

[model_providers.openthoughts]
name = "OpenThoughts-Agent via Ollama"
base_url = "http://localhost:11434/v1"
env_key = "OLLAMA_API_KEY"
wire_api = "responses"

Then create a profile at ~/.codex/openthoughts.config.toml:

model = "openthoughts-agent-32b"
model_provider = "openthoughts"

Activate with:

codex --oss --profile openthoughts

For Tmax-27B via vLLM or LM Studio, the same pattern applies — swap the model name and base URL.⁵

Profile-Based Model Routing

The real power emerges when you combine open-weight models with frontier models in a profile-based routing strategy.⁶ Consider three profiles:

# ~/.codex/frontier.config.toml — complex multi-file tasks
model = "gpt-5.5"

# ~/.codex/open-swe.config.toml — SWE-style issue resolution
model = "openthoughts-agent-32b"
model_provider = "openthoughts"

# ~/.codex/open-terminal.config.toml — terminal/infra tasks
model = "tmax-27b"
model_provider = "local_vllm"

This gives you cost control without sacrificing capability. Terminal tasks where Tmax-27B scores within a point of Kimi K2.5 can run locally at zero marginal token cost, whilst complex multi-file refactors still route through GPT-5.5.⁶

The Capability Gap Remains — But It Is Narrower

graph LR
    subgraph "SWE-Bench Verified (Frontier)"
        A["Claude Opus 4.7<br/>87.6%"]
        B["GPT-5.3 Codex<br/>85.0%"]
    end
    subgraph "SWE-Bench Verified-100 (Open-Weight)"
        C["OT-Agent-32B<br/>54.0%"]
        D["Nemotron-Terminal-32B<br/>41.9%"]
    end
    subgraph "Terminal-Bench 2.0"
        E["Tmax-27B<br/>42.7%"]
        F["Claude Haiku 4.5<br/>29.8%"]
        G["Tmax-9B<br/>27.2%"]
    end

On SWE-Bench Verified, frontier models still lead by 30+ percentage points.⁷ But the trajectory matters: OT-Agent closed 12.1 points in a single release, and Tmax showed that RL-only training on a 9B model can match a frontier lightweight model on terminal tasks.¹²

For Codex CLI workflows, the practical implication is that certain task categories — bounded terminal operations, infrastructure scripting, routine issue triage — are now viable candidates for local open-weight execution.

Configuration Hardening for Open-Weight Providers

Open-weight models lack the safety fine-tuning depth of frontier APIs. When routing through Codex CLI, compensate with harness-level controls:

Sandbox Constraints

# In your open-weight profile
sandbox = "read-only"

Start with read-only sandbox mode and escalate to workspace-write only after validating output quality on your codebase.³

Token Budget Caps

rollout_token_budget = 50000
tool_output_token_limit = 8000

Open-weight models exhibit higher token variance than frontier models.⁸ Hard budget caps prevent runaway sessions. The rollout_token_budget key enforces a ceiling across the entire session.³

PostToolUse Verification Hooks

Add verification hooks in your project’s AGENTS.md to catch quality regressions:

## Quality Gates

- After every file write, run the project's lint and type-check commands
- After every test execution, verify no regressions in the existing suite
- Never commit directly; always create a branch for review

These rules apply regardless of which model is active, but they matter more when the model has lower baseline accuracy.⁹

AGENTS.md as Structural Constraint

The hierarchical AGENTS.md lookup — user-level, project root, per-directory — provides structural constraints that survive model swaps.⁹ A well-written AGENTS.md file compensates for model capability differences by encoding project-specific rules that any model must follow.

Training Your Own: The Data Recipe Takeaways

Both papers offer actionable lessons for teams considering fine-tuning their own agentic models:

Source diversity trumps source volume. OT-Agent found that mixing the top 4–8 task sources outperformed scaling any single source.¹
Skip synthetic augmentation of task descriptions. Every augmentation strategy OT-Agent tested — rephrasing, elaboration, decomposition — failed to improve on leaving the original task description untouched.¹
Filter for depth. Rollouts with fewer than five turns should be discarded. Shallow traces teach the model to give up early.¹
RL stability requires FP32 LM heads. Tmax’s key engineering insight: numerical mismatches between training and inference logprobs cause training collapse. Promoting the language model head to FP32 resolves this.²
Strong models may resist SFT warm-up. Tmax found that Qwen 3.5-9B — already heavily post-trained — actually degraded with SFT initialisation, whilst the less-tuned Qwen3-8B benefited substantially.²

What to Watch

The OT-Agent team released all training sets, pipeline code, and models at openthoughts.ai.¹ Tmax released models, the TMAX-15K dataset, and training code on GitHub and Hugging Face.² Both are Apache 2.0 licensed.

The gap between open-weight and frontier agentic performance is compressing faster than many expected. For Codex CLI users, the custom provider infrastructure is already in place — the question is no longer whether you can run open-weight agents in your terminal, but when the accuracy-cost trade-off tips in their favour for your specific workload.

Citations

Raoof, N., Zhuang, R., Nezhurina, M., et al. (2026). “OpenThoughts-Agent: Data Recipes for Agentic Models.” arXiv:2606.24855. https://arxiv.org/abs/2606.24855 ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷ ↩⁸ ↩⁹ ↩¹⁰ ↩¹¹
Ivison, H., Yin, J.O., Shao, R., Xiao, T., Lambert, N., & Hajishirzi, H. (2026). “Tmax: A Simple Recipe for Terminal Agents.” arXiv:2606.23321. https://arxiv.org/abs/2606.23321 ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷ ↩⁸ ↩⁹ ↩¹⁰ ↩¹¹ ↩¹² ↩¹³
OpenAI. (2026). “Advanced Configuration — Codex CLI.” https://developers.openai.com/codex/config-advanced ↩ ↩² ↩³
OpenAI. (2026). “Configuration Reference — Codex CLI.” https://developers.openai.com/codex/config-reference ↩
OpenAI. (2026). “Codex CLI — Command Line Options.” https://developers.openai.com/codex/cli/reference ↩ ↩²
Vaughan, D. (2026). “Codex CLI Custom Model Providers: The Complete Configuration Guide.” https://codex.danielvaughan.com/2026/04/23/codex-cli-custom-model-providers-configuration-guide/ ↩ ↩²
SWE-bench Verified Leaderboard. (2026). https://www.morphllm.com/swe-bench-pro ↩
Bai, Y. et al. (2026). “Where Do Your Tokens Go? Empirical Analysis of Token Consumption in Coding Agents.” arXiv:2604.22750. https://arxiv.org/abs/2604.22750 ↩
OpenAI. (2026). “Custom Instructions with AGENTS.md — Codex.” https://developers.openai.com/codex/guides/agents-md ↩ ↩²