MiniMax M3: What the First Open-Weight Model to Beat GPT-5.5 on SWE-Bench Pro Means for Codex CLI Model Routing
MiniMax M3: What the First Open-Weight Model to Beat GPT-5.5 on SWE-Bench Pro Means for Codex CLI Model Routing
On 1 June 2026, Shanghai-based MiniMax released M3 — a 229.9-billion-parameter Mixture-of-Experts model with 9.8 billion parameters active per token, a one-million-token context window, and native multimodal support for text, image, and video input 1. Its headline claim: 59.0% on SWE-Bench Pro, edging past both GPT-5.5 (58.6%) and Gemini 3.1 Pro on the benchmark that has become the de facto measure of agentic coding capability 2. At $0.30 per million input tokens and $1.20 per million output tokens on OpenRouter — roughly 8-12x cheaper than GPT-5.5 — M3 forces a recalculation of every Codex CLI model routing decision 3.
This article examines the benchmark results in context, unpacks the architectural innovations that make M3’s long-context performance practical, and provides concrete Codex CLI configuration recipes for integrating M3 into a multi-model routing strategy.
The Benchmark Picture: Competitive but Complicated
M3’s SWE-Bench Pro score of 59.0% places it in genuinely frontier territory for an open-weight model 2. But the number deserves scrutiny.
| Benchmark | MiniMax M3 | GPT-5.5 | Claude Opus 4.7 | Kimi K2.7 Code |
|---|---|---|---|---|
| SWE-Bench Pro | 59.0% | 58.6% | — | — |
| Terminal-Bench 2.1 | 66.0% | 82.0% | — | — |
| MCP Atlas | 74.2% | — | — | — |
| KernelBench Hard | 28.8% | — | — | — |
| PostTrainBench | 0.37 | 0.39 | 0.42 | — |
Three caveats matter for practitioners:
-
Vendor-reported scores. M3’s SWE-Bench Pro result is self-reported by MiniMax and has not yet appeared on the independently verified leaderboard 4. GPT-5.5’s scores are third-party verified.
-
The noise floor. The 0.4-point gap between M3 (59.0%) and GPT-5.5 (58.6%) on SWE-Bench Pro falls within the benchmark’s measurement uncertainty. An ICSE 2026 patch correctness study found SWE-bench systematically overestimates scores by 3.8-5.2 percentage points 5.
-
Terminal-Bench tells a different story. GPT-5.5 scores 82% on Terminal-Bench 2.1 versus M3’s 66.0% — a 16-point gap that reflects GPT-5.5’s native advantage when operating inside the Codex CLI harness 2.
The practical reading: M3 is a genuine peer on repository-level bug-fixing tasks but remains behind on the interactive terminal workflows where Codex CLI spends most of its time.
Architecture: Why MSA Matters for Long Sessions
M3 introduces MiniMax Sparse Attention (MSA), a mechanism that makes one-million-token contexts economically viable 1. Traditional dense attention scales quadratically with sequence length. MSA uses a “KV outer gather Q” approach — each key-value block is read once, memory access is contiguous, and arithmetic intensity improves substantially over earlier sparse methods such as DSA and MoBA 1.
The numbers are striking: per-token compute at one million tokens drops to one-twentieth of M3’s predecessor, with more than 9x faster prefilling and more than 15x faster decoding 1. MiniMax reports MSA is 4x faster than open-source Flash-Sparse-Attention and flash-moba implementations 1.
graph LR
A[Input Tokens] --> B[MoE Router<br/>256 experts]
B --> C[9.8B Active<br/>Parameters]
C --> D[MSA Layer<br/>KV outer gather Q]
D --> E[1M Token<br/>Context Window]
E --> F[Output]
style D fill:#f9f,stroke:#333
For Codex CLI users, this matters in two scenarios:
- Large-repository exploration. When
tool_output_token_limitis set high and the agent reads multiple files in sequence, M3 can maintain coherence across a broader context without the compaction events that interrupt GPT-5.5 sessions. - Long-horizon agentic tasks. MiniMax demonstrated M3 autonomously reproducing an ICLR 2025 Outstanding Paper across 18 commits over twelve hours, and improving a CUDA kernel’s hardware utilisation from 7.6% to 71.3% over 24 hours with 1,959 tool calls 1.
The Cost Equation
The pricing gap between M3 and GPT-5.5 is the most immediately actionable finding for teams managing Codex CLI budgets.
| MiniMax M3 (OpenRouter) | GPT-5.5 (OpenAI) | Ratio | |
|---|---|---|---|
| Input (per 1M tokens) | $0.30 | $5.00 | 16.7x cheaper |
| Output (per 1M tokens) | $1.20 | $30.00 | 25.0x cheaper |
MiniMax also offers subscription tiers through its own platform: Plus ($20/month, ~1.7B tokens), Max ($50/month, ~5.1B tokens), and Ultra ($120/month, ~9.8B tokens) 1. For teams running batch operations via codex exec, the per-token API pricing through OpenRouter is typically more economical.
The cost arithmetic is simple. A typical Codex CLI session consuming 100,000 input tokens and 20,000 output tokens costs approximately $0.17 with GPT-5.5 versus $0.05 with M3. Over 50 sessions per day, that is $8.50 versus $2.50 — a saving of $180 per developer per month.
Configuring M3 as a Codex CLI Provider
Direct API Access
Add MiniMax as a custom provider in ~/.codex/config.toml:
[model_providers.minimax]
name = "MiniMax"
base_url = "https://api.minimax.io/v1"
env_key = "MINIMAX_KEY"
Set your API key:
export MINIMAX_KEY="<your-key-here>"
Via OpenRouter
If you already use OpenRouter for multi-provider routing:
[model_providers.openrouter]
name = "OpenRouter"
base_url = "https://openrouter.ai/api/v1"
env_key = "OPENROUTER_KEY"
Then reference the model as minimax/minimax-m3 when using the OpenRouter provider 3.
Named Profile for M3
Create a dedicated profile at ~/.codex/minimax.config.toml:
model = "minimax-m3"
model_provider = "minimax"
[model_providers.minimax]
name = "MiniMax"
base_url = "https://api.minimax.io/v1"
env_key = "MINIMAX_KEY"
Activate with:
codex --profile minimax "refactor the authentication module"
A Practical Routing Strategy
The benchmark data suggests a tiered routing approach where task complexity determines model selection.
flowchart TD
A[Incoming Task] --> B{Task Type?}
B -->|Interactive terminal<br/>multi-step debugging| C[GPT-5.5<br/>Terminal-Bench: 82%]
B -->|Repository-level<br/>bug fix / feature| D{Budget<br/>Constraint?}
B -->|Batch codex exec<br/>bulk operations| E[MiniMax M3<br/>8-12x cheaper]
D -->|Cost-sensitive| E
D -->|Quality-critical| C
E --> F[Review Output<br/>PostToolUse hook]
C --> G[Standard Flow]
F --> H{Passes<br/>quality gate?}
H -->|Yes| I[Accept]
H -->|No| C
Profile-Based Routing in Practice
Define three profiles that encode the routing decision:
# ~/.codex/default.config.toml — GPT-5.5 for interactive work
model = "gpt-5.5"
# ~/.codex/minimax.config.toml — M3 for cost-sensitive tasks
model = "minimax-m3"
model_provider = "minimax"
[model_providers.minimax]
name = "MiniMax"
base_url = "https://api.minimax.io/v1"
env_key = "MINIMAX_KEY"
# ~/.codex/batch.config.toml — M3 with tighter token budgets
model = "minimax-m3"
model_provider = "minimax"
model_auto_compact_token_limit = 80000
[model_providers.minimax]
name = "MiniMax"
base_url = "https://api.minimax.io/v1"
env_key = "MINIMAX_KEY"
Use them from CI or scripts:
# Batch processing with M3
codex exec --profile batch "update all copyright headers to 2026"
# Interactive debugging stays on GPT-5.5
codex --profile default
AGENTS.md Model Guidance
Encode routing preferences in your project’s AGENTS.md so the agent itself can inform model selection:
## Model Routing
- Routine refactoring, linting fixes, and documentation updates: use `--profile minimax`
- Multi-file architectural changes and debugging: use default GPT-5.5
- Batch operations via `codex exec`: use `--profile batch`
The Open-Weight Convergence
M3 is not an isolated event. The open-weight coding model landscape has compressed dramatically in the first half of 2026:
- Devstral Small 2 (24B): 68% on SWE-Bench Verified, runs on a single RTX 4090 6
- Kimi K2.7 Code: purpose-built for software engineering with dual OpenAI and Anthropic API compatibility 7
- DeepSeek V4-Pro: competitive on agentic benchmarks at a fraction of proprietary model costs 8
The pattern is clear: quality training data and specialised architectures matter more than raw parameter count. Skywork-SWE demonstrated this empirically — a 32B model fine-tuned on 8,209 rigorously validated trajectories achieved 38.0% on SWE-Bench Verified, outperforming 72B and 671B general-purpose models without specialised SWE training 9.
For Codex CLI users, this convergence means the model_provider configuration in config.toml is no longer a one-time decision. It is a continuously tuneable parameter that should respond to the evolving price-performance frontier.
Caveats and Risk Mitigation
Before routing production workloads through M3, consider these risks:
-
Unverified benchmarks. Until M3’s SWE-Bench Pro score appears on an independent leaderboard, treat 59.0% as an upper bound estimate 4.
-
Tool-call compatibility. M3 supports the OpenAI tool specification 3, but edge cases in complex multi-tool orchestration may behave differently from GPT-5.5. Test your specific MCP server configurations before switching profiles.
-
Availability and rate limits. MiniMax’s API infrastructure is newer and less battle-tested than OpenAI’s. For mission-critical workflows, configure a fallback in your CI pipeline:
codex exec --profile minimax "task" || codex exec --profile default "task"
-
Regional considerations. MiniMax offers separate API endpoints for international (
api.minimax.io) and Chinese (api.minimaxi.com) users 10. Ensure your endpoint matches your deployment region. -
Weight release timing. MiniMax announced open-source weights within ten days of the 1 June launch 1. Verify current availability before planning self-hosted deployments.
What to Do This Week
- Register for a MiniMax API key at platform.minimax.io or access M3 through your existing OpenRouter account.
- Create a
minimax.config.tomlprofile using the configuration above. - Run your existing
codex execbatch tasks through M3 for one week and compare output quality against GPT-5.5 results. - Add a PostToolUse hook that logs model, token count, and task outcome to a local CSV — you will want this data when GPT-5.6 arrives and the routing calculation changes again.
Citations
-
MiniMax, “MiniMax M3: Frontier Coding, 1M Context, Native Multimodality — All in One Model,” MiniMax Blog, 1 June 2026. https://www.minimax.io/blog/minimax-m3 ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7 ↩8
-
MarkTechPost, “MiniMax Releases MiniMax M3 with MSA Architecture Supporting 1M-Token Context, Native Multimodality, and Agentic Coding,” 1 June 2026. https://www.marktechpost.com/2026/06/01/minimax-releases-minimax-m3-with-msa-architecture-supporting-1m-token-context-native-multimodality-and-agentic-coding/ ↩ ↩2 ↩3
-
OpenRouter, “MiniMax M3 - API Pricing & Benchmarks,” accessed 20 June 2026. https://openrouter.ai/minimax/minimax-m3/api ↩ ↩2 ↩3
-
ofox.ai, “MiniMax M3 vs GPT-5.5: SWE-Bench Pro, 8x Price Gap, A/B Both via ofox (2026),” June 2026. https://ofox.ai/blog/minimax-m3-vs-gpt-5-5-coding-benchmark-2026/ ↩ ↩2
-
ICSE 2026 Patch Correctness Study, as cited in ofox.ai comparison. SWE-bench systematic overestimation of 3.8-5.2 percentage points. ↩
-
Pinggy, “Best Open Source Self-Hosted LLMs for Coding in 2026,” June 2026. https://pinggy.io/blog/best_open_source_self_hosted_llms_for_coding/ ↩
-
Flowtivity, “Kimi K2.7 Code vs MiniMax M3: Open-Source AI Coding Models Compared (June 2026),” June 2026. https://flowtivity.ai/blog/kimi-k2-7-code-vs-minimax-m3/ ↩
-
kilo.ai, “Best Open-Source & Open-Weight Coding Models (2026),” accessed 20 June 2026. https://kilo.ai/open-source-models ↩
-
Zeng et al., “Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs,” arXiv:2506.19290, June 2025. https://arxiv.org/abs/2506.19290 ↩
-
MorphLLM, “Codex config.toml (2026): Add Any Custom Provider in 6 Lines,” accessed 20 June 2026. https://www.morphllm.com/codex-provider-configuration ↩