Running GPT-OSS with Codex CLI: A Complete Guide to Local Inference via Ollama, LM Studio, and MLX

OpenAI’s release of GPT-OSS — two open-weight models under the Apache 2.0 licence — changed the economics of agentic coding overnight1. The 120-billion-parameter variant scores 62.4% on SWE-Bench Verified and 2,622 Elo on Codeforces (with tools), approaching o4-mini territory while running on a single 80 GB GPU2. The 20-billion-parameter variant fits in 16 GB of memory and still manages 60.7% on SWE-Bench Verified2. Codex CLI supports both models natively through its --oss flag and provider configuration, giving practitioners a zero-API-cost, privacy-preserving agentic coding workflow.

This article covers the architecture of both models, walks through three local inference backends — Ollama, LM Studio, and MLX — and provides production-ready config.toml recipes for daily use.

GPT-OSS Architecture at a Glance

Both models use a Mixture-of-Experts (MoE) Transformer architecture with MXFP4 quantisation on expert weights12. The efficiency gains come from activating only a fraction of total parameters per token.

Specification                 gpt-oss-120b           gpt-oss-20b
Total parameters              116.8B                 20.9B
Active parameters per token   5.1B                   3.6B
Transformer layers            36                     24
Experts per MoE block         128 (top-4 routing)    32
Checkpoint size               60.8 GiB               12.8 GiB
Minimum memory                80 GB (single H100)    16 GB
Context window                131,072 tokens (YaRN)  131,072 tokens (YaRN)

The models use the o200k_harmony tokeniser and support configurable reasoning effort levels (low, medium, high) with full chain-of-thought access2. Both support function calling, structured outputs, and agentic tool use out of the box — the capabilities Codex CLI depends on.

How Codex CLI Discovers Local Models

When you pass --oss, Codex CLI reads the oss_provider key from ~/.codex/config.toml to determine which local backend to use³. If oss_provider is not set, the CLI prompts you to choose between the built-in ollama and lmstudio providers. The default model for --oss is gpt-oss:20b⁴.

flowchart LR
    A["codex --oss"] --> B{oss_provider set?}
    B -- Yes --> C["Use configured provider"]
    B -- No --> D["Prompt: Ollama or LM Studio"]
    C --> E["Connect to local endpoint"]
    D --> E
    E --> F["Start agent loop with local model"]

You can override the model on any invocation with -m:

codex --oss -m gpt-oss:120b

For anything more sophisticated — custom ports, multiple backends, or cloud-local switching — you need named profiles in config.toml.

Backend 1: Ollama

Ollama is the simplest path. It handles model downloads, quantisation, and serves an OpenAI-compatible API on port 11434⁴.

Setup

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the model
ollama pull gpt-oss:120b    # or gpt-oss:20b for smaller hardware

# Verify it is running
ollama list
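
To confirm the OpenAI-compatible endpoint is reachable before involving Codex, query it directly — the same check the troubleshooting section uses later in this article:

# List the models Ollama currently serves on its OpenAI-compatible API
curl http://localhost:11434/v1/models

The response should include gpt-oss:120b (or gpt-oss:20b) in its model list.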

Context Window Configuration

Codex requires a minimum of 64,000 tokens of context5. Ollama defaults to a smaller window, so you must increase it:

# Inside an interactive ollama run session, set the context length for the current session
/set parameter num_ctx 65536

For persistent configuration, create a Modelfile:

FROM gpt-oss:120b
PARAMETER num_ctx 65536

Then build and use it:

ollama create gpt-oss-codex -f Modelfile
codex --oss -m gpt-oss-codex
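
To confirm the override stuck, ollama show prints a model's metadata; in recent Ollama versions the num_ctx setting should appear among the listed parameters:

ollama show gpt-oss-codex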

Config.toml Profile

oss_provider = "ollama"

[profiles.local-120b]
model_provider = "ollama"
model = "gpt-oss:120b"

Launch with:

codex --profile local-120b

Alternatively, Ollama offers a one-command setup:

ollama launch codex

This pulls the default model, configures the context window, and starts Codex in one step5.
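
If Ollama is not on its default endpoint — say it runs on another machine or behind a reverse proxy — you can declare an explicit provider block instead of relying on the built-in one. A sketch, with a made-up host and port:

[model_providers.ollama-lan]
name = "Ollama (custom endpoint)"
base_url = "http://192.168.1.50:11434/v1"

[profiles.local-lan]
model_provider = "ollama-lan"
model = "gpt-oss:120b"

Launch it with codex --profile local-lan, exactly as with the default profile above.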

Backend 2: LM Studio

LM Studio provides a desktop GUI for model management and serves an OpenAI-compatible endpoint on port 1234⁶.

Setup

# Load the model (from LM Studio CLI)
lms load gpt-oss-120b
lms server start

Ensure the context length is set to at least 65,536 tokens via the --context-length flag or the GUI settings.
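
For example, loading the model with an explicit context window from the CLI before starting the server (a sketch using the --context-length flag mentioned above):

# Load with a Codex-sized context window, then serve
lms load gpt-oss-120b --context-length 65536
lms server start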

Config.toml Profile

[model_providers.lms]
name = "LM Studio"
base_url = "http://localhost:1234/v1"

[profiles.lms-120b]
model_provider = "lms"
model = "openai/gpt-oss-120b"

[profiles.lms-qwen]
model_provider = "lms"
model = "qwen/qwen3-coder-30b"

LM Studio’s advantage is switching between models without re-pulling — useful when comparing GPT-OSS against alternatives like Qwen3-Coder-30B⁶.
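
To see which models the running server actually exposes — and the exact identifiers to put in config.toml — query the endpoint directly, in the same style as the Ollama check:

curl http://localhost:1234/v1/models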

Backend 3: MLX on Apple Silicon

For Mac users, MLX provides native Apple Silicon inference without the overhead of an emulation layer6.

Setup

pip install mlx-lm
mlx_lm.server --model SuperagenticAI/gpt-oss-20b-8bit-mlx --port 8888

The 8-bit quantised gpt-oss-20b fits comfortably on a MacBook Pro with 32 GB of unified memory. The 120b variant requires substantially more — realistically an Ultra-class Mac (e.g. M2 Ultra) with 192 GB of unified memory⁶.
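
Before pointing Codex at the server, it is worth confirming the model loads and generates on its own. A minimal sketch using mlx-lm's generation CLI (the prompt and token count are arbitrary):

mlx_lm.generate --model SuperagenticAI/gpt-oss-20b-8bit-mlx \
  --prompt "Write a one-line hello world in Python" --max-tokens 64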

Config.toml Profile

[model_providers.mlx]
name = "MLX LM"
base_url = "http://localhost:8888/v1"

[profiles.mlx-20b]
model_provider = "mlx"
model = "SuperagenticAI/gpt-oss-20b-8bit-mlx"

A Multi-Profile Config.toml

In practice, you want profiles that let you switch between cloud and local with a flag. Here is a consolidated configuration:

# Default: cloud model for daily work
model = "gpt-5.5"
model_provider = "openai"

# Local fallback
oss_provider = "ollama"

# --- Profiles ---

[profiles.cloud]
model = "gpt-5.5"
model_provider = "openai"

[profiles.local]
model_provider = "ollama"
model = "gpt-oss:120b"

[profiles.local-small]
model_provider = "ollama"
model = "gpt-oss:20b"

[profiles.mlx]
model_provider = "mlx"
model = "SuperagenticAI/gpt-oss-20b-8bit-mlx"

# --- Providers ---

[model_providers.mlx]
name = "MLX LM"
base_url = "http://localhost:8888/v1"

Switch contexts without touching the file:

codex --profile cloud "explain this function"
codex --profile local "refactor the auth module"
codex --profile mlx "add unit tests for utils.ts"

Benchmark Reality Check

The headline benchmarks are impressive, but coding-agent performance in the real world depends heavily on tool-use reliability and context management. Here is how GPT-OSS stacks up on the benchmarks that matter for Codex CLI workflows:

Benchmark                gpt-oss-120b   gpt-oss-20b   o4-mini (reference)
SWE-Bench Verified       62.4%          60.7%         ⚠️ ~65% (estimated)
Codeforces Elo (tools)   2,622          2,516         ⚠️ ~2,700 (estimated)
MMLU                     90.0%          85.3%         —
GPQA Diamond             80.1%          71.5%         81.4%
Tau-Bench Retail         67.8%          54.8%         —
Scores from the GPT-OSS model card with high reasoning effort2. Direct o4-mini comparisons on identical benchmarks are limited; estimates marked with ⚠️.

The 120b model’s 62.4% on SWE-Bench Verified puts it ahead of every other open-weight model and within striking distance of o4-mini2. For Codex CLI, the critical capability is function calling — both models support it natively through the harmony response format1.

When to Use Local vs Cloud

flowchart TD
    A["New Codex task"] --> B{Privacy sensitive?}
    B -- Yes --> C["Local: gpt-oss"]
    B -- No --> D{Budget constrained?}
    D -- Yes --> C
    D -- No --> E{Complex multi-file refactor?}
    E -- Yes --> F["Cloud: gpt-5.5"]
    E -- No --> G{Offline or air-gapped?}
    G -- Yes --> C
    G -- No --> H["Cloud: gpt-5.5 or gpt-5.4"]

Choose local GPT-OSS when:

  • Working with proprietary code that cannot leave your network
  • Running in air-gapped or regulated environments
  • Iterating on small-to-medium tasks where latency tolerance is higher
  • Optimising cost on high-volume codex exec batch jobs (a sketch follows these lists)

Stick with cloud models when:

  • Tackling complex multi-file refactors where the extra capability of GPT-5.5’s million-token context matters
  • Using features that require server-side infrastructure (WebSocket mode, extended prompt cache retention)
  • Running CI pipelines where consistent latency matters more than cost
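
As an illustration of the batch-job case in the local list above, here is a minimal sketch of running the same codex exec prompt over several files with the local profile (the file names and prompt are hypothetical):

# Hypothetical batch: review a handful of files with the local model
for f in src/auth.ts src/session.ts src/tokens.ts; do
  codex exec --profile local "review $f for error-handling issues"
done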

Known Limitations and Sharp Edges

Context window mismatch. Ollama’s default context window is far below what Codex needs. If you see truncation errors or degraded output quality, the first thing to check is num_ctx⁵.

No WebSocket transport. Local providers serve HTTP only. You lose the ~40% latency reduction that WebSocket mode provides with the cloud Responses API7.

No prompt caching. The server-side prompt cache that reduces cloud API costs does not exist for local inference. Every turn pays full compute cost.

Reasoning effort calibration. The model_reasoning_effort setting in config.toml maps to the model’s reasoning levels, but local inference does not benefit from the same optimisation passes that the cloud API applies. Expect higher latency at high effort, particularly on the 120b model2.
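
If you want to cap the effort for local runs, the key can sit in a profile alongside the model selection — a sketch, assuming model_reasoning_effort is honoured inside a profile like the other keys used in this article:

[profiles.local]
model_provider = "ollama"
model = "gpt-oss:120b"
model_reasoning_effort = "medium"   # trade reasoning depth for latency on local hardware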

Tool-use reliability. While both models support function calling, community reports suggest the 20b variant occasionally misformats tool calls on complex multi-step tasks. The 120b model is significantly more reliable here6. ⚠️ This observation is from community reports, not systematic benchmarking.

Verifying Your Setup

After configuring a local provider, run a quick diagnostic:

# Check the active configuration
codex --profile local /debug-config

# Run a simple task to verify tool use
codex --profile local "list the files in this directory and explain the project structure"

# Test non-interactive mode
codex exec --profile local "echo hello from local GPT-OSS"

If the agent loop stalls or produces empty responses, check that:

  1. The provider is running (curl http://localhost:11434/v1/models for Ollama)
  2. The context window is large enough (64K+ tokens)
  3. The model supports function calling (GPT-OSS does; not all Ollama models do) — a smoke test is sketched below
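
A quick way to exercise points 1 and 3 together is to send the endpoint a request that includes a tool definition and check that the reply contains a tool call rather than plain text. A sketch against Ollama's OpenAI-compatible API (the tool itself is made up; adjust the port for LM Studio or MLX):

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "List the files in this repo using the tool."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "list_files",
        "description": "List files in the working directory",
        "parameters": {"type": "object", "properties": {}}
      }
    }]
  }'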

Citations