MiniMax M3: What the First Open-Weight Model to Beat GPT-5.5 on SWE-Bench Pro Means for Codex CLI Model Routing

On 1 June 2026, Shanghai-based MiniMax released M3, a 229.9-billion-parameter Mixture-of-Experts model with 9.8 billion parameters active per token, a one-million-token context window, and native multimodal support for text, image, and video input ¹. Its headline claim is 59.0% on SWE-Bench Pro, edging past both GPT-5.5 (58.6%) and Gemini 3.1 Pro on the benchmark that has become the de facto measure of agentic coding capability ². At $0.30 per million input tokens and $1.20 per million output tokens on OpenRouter, roughly 17x cheaper on input and 25x cheaper on output than GPT-5.5 at the list prices compared below, M3 forces a recalculation of every Codex CLI model routing decision ³.

This article examines the benchmark results in context, unpacks the architectural innovations that make M3’s long-context performance practical, and provides concrete Codex CLI configuration recipes for integrating M3 into a multi-model routing strategy.

The Benchmark Picture: Competitive but Complicated

M3’s SWE-Bench Pro score of 59.0% places it in genuinely frontier territory for an open-weight model ². But the number deserves scrutiny.

Benchmark	MiniMax M3	GPT-5.5	Claude Opus 4.7	Kimi K2.7 Code
SWE-Bench Pro	59.0%	58.6%	—	—
Terminal-Bench 2.1	66.0%	82.0%	—	—
MCP Atlas	74.2%	—	—	—
KernelBench Hard	28.8%	—	—	—
PostTrainBench	0.37	0.39	0.42	—

Three caveats matter for practitioners:

Vendor-reported scores. M3’s SWE-Bench Pro result is self-reported by MiniMax and has not yet appeared on the independently verified leaderboard ⁴. GPT-5.5’s scores are third-party verified.
The noise floor. The 0.4-point gap between M3 (59.0%) and GPT-5.5 (58.6%) on SWE-Bench Pro falls well within the benchmark’s measurement uncertainty. For scale, an ICSE 2026 patch correctness study found that SWE-bench systematically overestimates scores by 3.8-5.2 percentage points, an error roughly ten times the size of that gap ⁵.
Terminal-Bench tells a different story. GPT-5.5 scores 82.0% on Terminal-Bench 2.1 versus M3’s 66.0%, a 16-point gap that likely reflects GPT-5.5’s native advantage when operating inside the Codex CLI harness ².

The practical reading: M3 is a genuine peer on repository-level bug-fixing tasks but remains behind on the interactive terminal workflows where Codex CLI spends most of its time.
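
To put the 0.4-point gap in perspective, the following is a minimal sketch of the sampling error on a pass rate. It assumes a hypothetical task count (N_TASKS = 700, chosen for illustration and not taken from any source cited here) and treats the two scores as independent binomial measurements, which is a simplification because both models are scored on the same tasks.

```python
import math


def standard_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


# ASSUMPTION: illustrative task count, not a figure from the article.
N_TASKS = 700

m3_score, gpt_score = 0.590, 0.586  # SWE-Bench Pro scores quoted above

se_m3 = standard_error(m3_score, N_TASKS)
se_gpt = standard_error(gpt_score, N_TASKS)

# Standard error of the difference of two independent measurements.
se_gap = math.sqrt(se_m3 ** 2 + se_gpt ** 2)
gap_points = (m3_score - gpt_score) * 100

print(f"gap: {gap_points:.1f} points")
print(f"standard error of gap: {se_gap * 100:.1f} points")
```

Under these assumptions the standard error of the difference is several times larger than the gap itself, so the ordering of the two models on this benchmark could plausibly flip on a rerun.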

Architecture: Why MSA Matters for Long Sessions

M3 introduces MiniMax Sparse Attention (MSA), a mechanism that makes one-million-token contexts economically viable ¹. Traditional dense attention scales quadratically with sequence length. MSA uses a “KV outer gather Q” approach — each key-value block is read once, memory access is contiguous, and arithmetic intensity improves substantially over earlier sparse methods such as DSA and MoBA ¹.

The numbers are striking: per-token compute at one million tokens drops to one-twentieth of M3’s predecessor, with more than 9x faster prefilling and more than 15x faster decoding ¹. MiniMax reports MSA is 4x faster than open-source Flash-Sparse-Attention and flash-moba implementations ¹.
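
To make the scaling argument concrete, here is a rough sketch that counts query-key pairs for dense attention and for a generic block-sparse scheme. It is not an implementation of MSA: block_size and blocks_per_query are hypothetical parameters chosen for illustration, and causal masking is ignored.

```python
def dense_attention_pairs(seq_len: int) -> int:
    """Every query attends to every key: cost grows with seq_len squared."""
    return seq_len * seq_len


def block_sparse_pairs(seq_len: int, block_size: int, blocks_per_query: int) -> int:
    """Each query attends to a fixed number of key blocks: cost grows linearly."""
    return seq_len * block_size * blocks_per_query


# ASSUMPTION: illustrative sparsity settings, not MiniMax's actual values.
BLOCK_SIZE = 128
BLOCKS_PER_QUERY = 16

for seq_len in (125_000, 250_000, 500_000, 1_000_000):
    dense = dense_attention_pairs(seq_len)
    sparse = block_sparse_pairs(seq_len, BLOCK_SIZE, BLOCKS_PER_QUERY)
    print(f"{seq_len:>9} tokens: dense/sparse pair ratio = {dense / sparse:.0f}x")
```

Doubling the sequence length quadruples the dense count but only doubles the sparse one, which is why the advantage of any sparse scheme widens as context grows.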

graph LR
    A[Input Tokens] --> B[MoE Router<br/>256 experts]
    B --> C[9.8B Active<br/>Parameters]
    C --> D[MSA Layer<br/>KV outer gather Q]
    D --> E[1M Token<br/>Context Window]
    E --> F[Output]
    style D fill:#f9f,stroke:#333

For Codex CLI users, this matters in two scenarios:

Large-repository exploration. When tool_output_token_limit is set high and the agent reads multiple files in sequence, M3’s larger window lets it hold more of the repository in context, which should mean fewer of the compaction events that interrupt GPT-5.5 sessions.
Long-horizon agentic tasks. MiniMax demonstrated M3 autonomously reproducing an ICLR 2025 Outstanding Paper across 18 commits over twelve hours, and improving a CUDA kernel’s hardware utilisation from 7.6% to 71.3% over 24 hours with 1,959 tool calls ¹.

The Cost Equation

The pricing gap between M3 and GPT-5.5 is the most immediately actionable finding for teams managing Codex CLI budgets.

	MiniMax M3 (OpenRouter)	GPT-5.5 (OpenAI)	Ratio
Input (per 1M tokens)	$0.30	$5.00	16.7x cheaper
Output (per 1M tokens)	$1.20	$30.00	25.0x cheaper

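MiniMax also offers subscription tiers through its own platform: Plus ($20/month, ~1.7B tokens), Max ($50/month, ~5.1B tokens), and Ultra ($120/month, ~9.8B tokens) ¹. For teams running occasional batch operations via codex exec, per-token API pricing through OpenRouter can be more economical, but compare expected monthly volume against these allowances first: at $0.30 per million input tokens, $20 buys only about 67 million input tokens.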

The cost arithmetic is simple. At the list prices above, a typical Codex CLI session consuming 100,000 input tokens and 20,000 output tokens costs approximately $1.10 with GPT-5.5 versus $0.05 with M3. Over 50 sessions per day, that is about $55.00 versus $2.70, a saving of roughly $52 per developer per day, or about $1,570 over a 30-day month.
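
The same calculation can be sketched in a few lines, using the per-million-token list prices from the table above. The session size (100,000 input and 20,000 output tokens) and the 50-sessions-per-day figure are the article's illustrative workload; substitute your own measurements and current prices before relying on the result.

```python
# Per-million-token list prices in USD, copied from the table above.
PRICES_PER_MILLION = {
    "minimax-m3": {"input": 0.30, "output": 1.20},
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one session at the model's list prices."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


SESSIONS_PER_DAY = 50

for model in PRICES_PER_MILLION:
    per_session = session_cost(model, input_tokens=100_000, output_tokens=20_000)
    per_day = per_session * SESSIONS_PER_DAY
    print(f"{model}: ${per_session:.3f} per session, ${per_day:.2f} per day")
```

Keeping prices in a single dictionary makes it easy to rerun the comparison whenever a provider changes its rates.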

Configuring M3 as a Codex CLI Provider

Direct API Access

Add MiniMax as a custom provider in ~/.codex/config.toml:

[model_providers.minimax]
name = "MiniMax"
base_url = "https://api.minimax.io/v1"
env_key = "MINIMAX_KEY"

Set your API key:

export MINIMAX_KEY="<your-key-here>"

Via OpenRouter

If you already use OpenRouter for multi-provider routing:

[model_providers.openrouter]
name = "OpenRouter"
base_url = "https://openrouter.ai/api/v1"
env_key = "OPENROUTER_KEY"

Then reference the model as minimax/minimax-m3 when using the OpenRouter provider ³.

Named Profile for M3

Create a dedicated profile at ~/.codex/minimax.config.toml:

model = "minimax-m3"
model_provider = "minimax"

[model_providers.minimax]
name = "MiniMax"
base_url = "https://api.minimax.io/v1"
env_key = "MINIMAX_KEY"

Activate with:

codex --profile minimax "refactor the authentication module"

A Practical Routing Strategy

The benchmark data suggests a tiered routing approach in which task type and budget determine model selection.

flowchart TD
    A[Incoming Task] --> B{Task Type?}
    B -->|Interactive terminal<br/>multi-step debugging| C[GPT-5.5<br/>Terminal-Bench: 82%]
    B -->|Repository-level<br/>bug fix / feature| D{Budget<br/>Constraint?}
    B -->|Batch codex exec<br/>bulk operations| E[MiniMax M3<br/>17-25x cheaper]
    D -->|Cost-sensitive| E
    D -->|Quality-critical| C
    E --> F[Review Output<br/>PostToolUse hook]
    C --> G[Standard Flow]
    F --> H{Passes<br/>quality gate?}
    H -->|Yes| I[Accept]
    H -->|No| C
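
The decision tree above can be expressed as a small function, which is handy when a wrapper script has to pick a profile. This is a minimal sketch: the profile names default, minimax, and batch are the ones this article uses for its Codex CLI profiles, while TaskType, choose_profile, and escalation_profile are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Optional


class TaskType(Enum):
    INTERACTIVE = "interactive"  # interactive terminal, multi-step debugging
    REPOSITORY = "repository"    # repository-level bug fix or feature
    BATCH = "batch"              # bulk operations via codex exec


# Profiles that run on MiniMax M3 and therefore pass through the quality gate.
M3_PROFILES = {"minimax", "batch"}


def choose_profile(task_type: TaskType, cost_sensitive: bool = False) -> str:
    """Map a task to a profile name following the flowchart's branches."""
    if task_type is TaskType.INTERACTIVE:
        return "default"
    if task_type is TaskType.BATCH:
        return "batch"
    # Repository-level work: budget decides between M3 and GPT-5.5.
    return "minimax" if cost_sensitive else "default"


def escalation_profile(profile: str, passed_quality_gate: bool) -> Optional[str]:
    """Return the profile to retry with, or None if no retry is needed."""
    if profile in M3_PROFILES and not passed_quality_gate:
        return "default"
    return None
```

For example, choose_profile(TaskType.REPOSITORY, cost_sensitive=True) selects the minimax profile, and escalation_profile("minimax", passed_quality_gate=False) sends the task back to default.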

Profile-Based Routing in Practice

Define three profiles that encode the routing decision:

# ~/.codex/default.config.toml — GPT-5.5 for interactive work
model = "gpt-5.5"

# ~/.codex/minimax.config.toml — M3 for cost-sensitive tasks
model = "minimax-m3"
model_provider = "minimax"

[model_providers.minimax]
name = "MiniMax"
base_url = "https://api.minimax.io/v1"
env_key = "MINIMAX_KEY"

# ~/.codex/batch.config.toml — M3 with tighter token budgets
model = "minimax-m3"
model_provider = "minimax"
model_auto_compact_token_limit = 80000

[model_providers.minimax]
name = "MiniMax"
base_url = "https://api.minimax.io/v1"
env_key = "MINIMAX_KEY"

Use them from CI or scripts:

# Batch processing with M3
codex exec --profile batch "update all copyright headers to 2026"

# Interactive debugging stays on GPT-5.5
codex --profile default

AGENTS.md Model Guidance

Record routing preferences in your project’s AGENTS.md so that contributors and the agent share the same guidance on which profile a task calls for:

## Model Routing

- Routine refactoring, linting fixes, and documentation updates: use `--profile minimax`
- Multi-file architectural changes and debugging: use default GPT-5.5
- Batch operations via `codex exec`: use `--profile batch`

The Open-Weight Convergence

M3 is not an isolated event. The open-weight coding model landscape has compressed dramatically in the first half of 2026:

Devstral Small 2 (24B): 68% on SWE-Bench Verified, runs on a single RTX 4090 ⁶
Kimi K2.7 Code: purpose-built for software engineering with dual OpenAI and Anthropic API compatibility ⁷
DeepSeek V4-Pro: competitive on agentic benchmarks at a fraction of proprietary model costs ⁸

The pattern is clear: quality training data and specialised architectures matter more than raw parameter count. Skywork-SWE demonstrated this empirically — a 32B model fine-tuned on 8,209 rigorously validated trajectories achieved 38.0% on SWE-Bench Verified, outperforming 72B and 671B general-purpose models without specialised SWE training ⁹.

For Codex CLI users, this convergence means the model_provider configuration in config.toml is no longer a one-time decision. It is a continuously tuneable parameter that should respond to the evolving price-performance frontier.

Caveats and Risk Mitigation

Before routing production workloads through M3, consider these risks:

Unverified benchmarks. Until M3’s SWE-Bench Pro score appears on an independent leaderboard, treat 59.0% as an upper bound rather than a settled result ⁴.
Tool-call compatibility. M3 supports the OpenAI tool specification ³, but edge cases in complex multi-tool orchestration may behave differently from GPT-5.5. Test your specific MCP server configurations before switching profiles.
Availability and rate limits. MiniMax’s API infrastructure is newer and less battle-tested than OpenAI’s. For mission-critical workflows, configure a fallback in your CI pipeline:

codex exec --profile minimax "task" || codex exec --profile default "task"

Regional considerations. MiniMax offers separate API endpoints for international (api.minimax.io) and Chinese (api.minimaxi.com) users ¹⁰. Ensure your endpoint matches your deployment region.
Weight release timing. MiniMax said it would release open weights within ten days of the 1 June launch ¹. Verify current availability before planning self-hosted deployments.
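
For pipelines that already run Python, the shell fallback shown under availability and rate limits can be sketched as a function that tries profiles in order. The function name run_with_fallback and the injectable runner argument are illustrative choices, and the sketch assumes, as the shell one-liner does, that codex exec exits non-zero when a run fails.

```python
import subprocess
from typing import Callable, Sequence


def run_with_fallback(
    task: str,
    profiles: Sequence[str] = ("minimax", "default"),
    runner: Callable = subprocess.run,
) -> str:
    """Run the task under each profile in turn; return the first that succeeds."""
    for profile in profiles:
        result = runner(["codex", "exec", "--profile", profile, task])
        if result.returncode == 0:
            return profile
    raise RuntimeError(f"task failed under all profiles: {', '.join(profiles)}")
```

Calling run_with_fallback("update all copyright headers to 2026") tries the minimax profile first and falls back to default. Passing the runner as a parameter keeps the function testable without invoking the CLI.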

What to Do This Week

Register for a MiniMax API key at platform.minimax.io or access M3 through your existing OpenRouter account.
Create a minimax.config.toml profile using the configuration above.
Run your existing codex exec batch tasks through M3 for one week and compare output quality against GPT-5.5 results.
Add a PostToolUse hook that logs model, token count, and task outcome to a local CSV — you will want this data when GPT-5.6 arrives and the routing calculation changes again.
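
The logging in the last step needs little more than an append-only CSV writer. The sketch below covers only that part: how a hook passes the model name, token count, and outcome to it is left open, and the function name log_run and the column names are hypothetical.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["timestamp", "model", "total_tokens", "outcome"]


def log_run(csv_path: str, model: str, total_tokens: int, outcome: str) -> None:
    """Append one run record to csv_path, writing a header if the file is new."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(
            {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "model": model,
                "total_tokens": total_tokens,
                "outcome": outcome,
            }
        )
```

A call such as log_run("codex_runs.csv", "minimax-m3", 120_000, "pass") appends one row. A week of these rows is enough to compare pass rates and token usage per model.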

Citations

1. MiniMax, “MiniMax M3: Frontier Coding, 1M Context, Native Multimodality — All in One Model,” MiniMax Blog, 1 June 2026. https://www.minimax.io/blog/minimax-m3
2. MarkTechPost, “MiniMax Releases MiniMax M3 with MSA Architecture Supporting 1M-Token Context, Native Multimodality, and Agentic Coding,” 1 June 2026. https://www.marktechpost.com/2026/06/01/minimax-releases-minimax-m3-with-msa-architecture-supporting-1m-token-context-native-multimodality-and-agentic-coding/
3. OpenRouter, “MiniMax M3 - API Pricing & Benchmarks,” accessed 20 June 2026. https://openrouter.ai/minimax/minimax-m3/api
4. ofox.ai, “MiniMax M3 vs GPT-5.5: SWE-Bench Pro, 8x Price Gap, A/B Both via ofox (2026),” June 2026. https://ofox.ai/blog/minimax-m3-vs-gpt-5-5-coding-benchmark-2026/
5. ICSE 2026 Patch Correctness Study, as cited in ofox.ai comparison. SWE-bench systematic overestimation of 3.8-5.2 percentage points.
6. Pinggy, “Best Open Source Self-Hosted LLMs for Coding in 2026,” June 2026. https://pinggy.io/blog/best_open_source_self_hosted_llms_for_coding/
7. Flowtivity, “Kimi K2.7 Code vs MiniMax M3: Open-Source AI Coding Models Compared (June 2026),” June 2026. https://flowtivity.ai/blog/kimi-k2-7-code-vs-minimax-m3/
8. kilo.ai, “Best Open-Source & Open-Weight Coding Models (2026),” accessed 20 June 2026. https://kilo.ai/open-source-models
9. Zeng et al., “Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs,” arXiv:2506.19290, June 2025. https://arxiv.org/abs/2506.19290
10. MorphLLM, “Codex config.toml (2026): Add Any Custom Provider in 6 Lines,” accessed 20 June 2026. https://www.morphllm.com/codex-provider-configuration