Codex CLI and Docker Model Runner: Containerised Local Inference for Private, Cost-Free Coding Agents
Introduction
Running Codex CLI against the OpenAI API is the default path — and for good reason. GPT-5.5’s 400K context window, server-side compaction, and prompt caching make it formidable1. But not every task warrants an API call. Internal refactors on proprietary code, air-gapped enterprise environments, personal projects where you would rather not burn credits — these all benefit from local inference.
Docker Model Runner (DMR) provides a Docker-native route to local model serving that many developers overlook. Unlike Ollama, which runs as a standalone daemon, DMR is embedded in Docker Desktop and Docker Engine, manages models as OCI artifacts pulled from Docker Hub or Hugging Face, and exposes an OpenAI-compatible API that Codex CLI can target with a few lines of TOML2. It even ships first-party Codex CLI skills via docker model skills --codex3.
This article covers the end-to-end workflow: enabling DMR, pulling a coding model, configuring Codex CLI as a provider, installing Docker-aware skills, and building profiles for a hybrid local-plus-cloud workflow.
Why Docker Model Runner?
Developers already have Docker installed. DMR leverages that existing investment rather than introducing another daemon. Key differentiators:
| Feature | Docker Model Runner | Ollama |
|---|---|---|
| Installation | Built into Docker Desktop / docker-model-plugin for Engine2 | Separate binary |
| Model format | OCI artifacts (GGUF, Safetensors)2 | Modelfile + GGUF |
| Registry | Docker Hub, any OCI registry, Hugging Face2 | ollama.com library |
| API endpoint | http://localhost:12434/engines/v1 (host) / http://model-runner.docker.internal/engines/v1 (container)4 | http://localhost:11434/v1 |
| Inference engines | llama.cpp (all platforms), vLLM (NVIDIA Linux/WSL2), Diffusers (image gen)2 | llama.cpp only |
| GPU support | CUDA, ROCm, Vulkan, Apple Silicon2 | CUDA, ROCm, Apple Silicon |
| Coding agent skills | docker model skills --codex3 | None shipped |
| Resource management | Models loaded on-demand, unloaded when idle2 | Models loaded on-demand |
The Vulkan backend is particularly notable — it means DMR runs on virtually any modern GPU, including integrated Intel and AMD graphics2.
Architecture
graph LR
subgraph Developer Machine
CLI["Codex CLI<br/>--profile docker-local"]
DMR["Docker Model Runner<br/>localhost:12434"]
Model["GPT-OSS-120B<br/>(GGUF, llama.cpp)"]
Skills["~/.codex/skills/<br/>docker-model-runner/"]
end
CLI -->|"OpenAI-compatible<br/>POST /engines/v1/chat/completions"| DMR
DMR -->|"llama.cpp inference"| Model
Skills -.->|"Loaded at session start"| CLI
subgraph Docker Hub
OCI["ai/gpt-oss:120b<br/>(OCI Artifact)"]
end
OCI -->|"docker model pull"| Model
The flow is straightforward. DMR serves the model behind an OpenAI-compatible API on port 12434. Codex CLI connects to it via a custom model provider in config.toml. The DMR skills teach Codex how to manage models, troubleshoot inference, and optimise context windows for local hardware constraints.
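You can exercise that endpoint directly before touching any Codex configuration. A minimal smoke test against the OpenAI-compatible API (assumes a model has already been pulled; substitute any tag from docker model ls):
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/gpt-oss:20b",
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}]
  }'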
Setup
1. Enable Docker Model Runner
Docker Desktop (macOS / Windows / Linux):
Navigate to Settings > AI and toggle Docker Model Runner on. For NVIDIA GPU users on Windows, also enable GPU-backed inference5.
Docker Engine (Linux headless):
# Ubuntu / Debian
sudo apt-get install docker-model-plugin
# RHEL / Fedora
sudo dnf install docker-model-plugin
Verify the installation:
docker model version
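You can also confirm the host-side TCP endpoint is answering (assumes the default port 12434 is enabled, which happens once DMR is toggled on):
curl http://localhost:12434/engines/v1/models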
2. Pull a Coding Model
GPT-OSS models are optimised for Codex CLI’s tool-calling protocol. The 120B parameter variant is the recommended choice for local coding work; the 20B variant works on machines with 16 GB RAM6.
# Full-size model (requires ~80 GB RAM or GPU VRAM for Q4 quantisation)
docker model pull ai/gpt-oss:120b
# Smaller variant for constrained hardware
docker model pull ai/gpt-oss:20b
You can also pull third-party coding models from Hugging Face:
docker model pull hf.co/bartowski/Qwen3-Coder-Next-30B-GGUF
List downloaded models:
docker model ls
3. Test the Model
docker model run ai/gpt-oss:120b
This opens an interactive chat. Type a coding question to confirm inference works, then exit with Ctrl+C.
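For a scripted check instead of an interactive session, docker model run also accepts a one-shot prompt argument:
docker model run ai/gpt-oss:120b "Write a Python function that reverses a linked list."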
4. Install Codex CLI Skills
DMR ships skills that teach Codex how to interact with the Docker model ecosystem:
docker model skills --codex
This installs skill files to ~/.codex/skills/. You can install for multiple agents simultaneously:
docker model skills --codex --claude
Use --force to overwrite existing skills during upgrades3.
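The installed skills are plain files, so it is worth a quick look at what Codex will pick up (the exact file layout depends on the DMR release, so list the directory rather than assuming names):
ls -R ~/.codex/skills/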
Codex CLI Configuration
Provider Definition
Add the DMR provider to ~/.codex/config.toml:
[model_providers.docker-runner]
name = "Docker Model Runner"
base_url = "http://localhost:12434/engines/v1"
No env_key is needed — DMR runs locally without authentication4.
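Before committing to profiles, you can point a one-off session at the provider with inline overrides (assumes your Codex CLI build supports the -c key=value override flag; the keys mirror the TOML above):
codex -c model_provider="docker-runner" -c model="ai/gpt-oss:120b" "Describe this repository's build system"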
Profiles
Create profiles for different local models and a hybrid workflow:
# Local full-size model for deep coding work
[profiles.docker-local]
model_provider = "docker-runner"
model = "ai/gpt-oss:120b"
reasoning_effort = "medium"
# Local small model for quick iterations
[profiles.docker-fast]
model_provider = "docker-runner"
model = "ai/gpt-oss:20b"
reasoning_effort = "low"
# Cloud model for complex, long-horizon tasks
[profiles.cloud]
model = "gpt-5.5"
reasoning_effort = "high"
Switch profiles from the command line:
# Private local inference
codex --profile docker-local "Refactor the auth module to use RBAC"
# Quick local iteration
codex --profile docker-fast "Add a unit test for the parseConfig function"
# Cloud power when needed
codex --profile cloud "Migrate the codebase from Express to Fastify"
The --oss Shortcut
If you run Ollama or LM Studio, Codex CLI supports --oss as a convenience flag. For DMR, however, use the explicit profile approach above: the built-in --oss flag targets the reserved ollama and lmstudio provider IDs, not custom providers7.
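For reference, the Ollama path looks like this (assumes an Ollama daemon on its default port; this does not route to DMR):
codex --oss -m gpt-oss:20b "Explain the Makefile targets"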
Hybrid Workflow: Local for Privacy, Cloud for Power
The most practical pattern is not “local or cloud” but “both, depending on the task.” Context window size is the key decision variable.
flowchart TD
A["New Codex Task"] --> B{"Sensitive code?"}
B -->|Yes| C{"Task complexity?"}
B -->|No| D["Cloud: --profile cloud<br/>GPT-5.5, 400K context"]
C -->|"Simple (< 32K context)"| E["Local: --profile docker-fast<br/>GPT-OSS-20B"]
C -->|"Medium (32K-64K)"| F["Local: --profile docker-local<br/>GPT-OSS-120B"]
C -->|"Complex (> 64K)"| G["Local with compaction<br/>or accept trade-offs"]
Local models top out at 64K tokens of effective context for most hardware configurations8. Beyond that, you either accept degraded quality or switch to the cloud. The GPT-5.5 cloud path offers 400K tokens in Codex CLI and up to 1M via the API1.
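If you make this choice many times a day, a small wrapper makes it mechanical. A sketch (profile names match this article; the sensitivity and size arguments are whatever convention your team adopts):
#!/usr/bin/env bash
# codex-route: choose a Codex profile from task sensitivity and rough context size.
# Usage: codex-route <private|public> <small|large> "task description"
set -euo pipefail

sensitivity="$1"; size="$2"; shift 2

if [ "$sensitivity" = "public" ]; then
  profile="cloud"            # non-sensitive: use GPT-5.5's 400K context
elif [ "$size" = "small" ]; then
  profile="docker-fast"      # sensitive, < 32K context: local 20B
else
  profile="docker-local"     # sensitive, up to ~64K context: local 120B
fi

exec codex --profile "$profile" "$@"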
Practical Decision Framework
| Scenario | Recommended Profile | Rationale |
|---|---|---|
| Proprietary algorithm refactoring | docker-local | Code never leaves the machine |
| Quick test generation | docker-fast | Low latency, no cost |
| Multi-file migration (> 20 files) | cloud | Needs large context window |
| Air-gapped environment | docker-local | No network required |
| Security audit with GPT-5.2-Codex | cloud (GPT-5.2-Codex) | Purpose-built for cybersecurity9 |
| Weekend personal project | docker-fast | Zero API spend |
Performance Tuning
Context Window Configuration
DMR defaults to conservative context sizes. For coding work, increase the context window in your Docker Compose configuration or via the DMR API:
# Check current context configuration
docker model inspect ai/gpt-oss:120b
Codex CLI recommends at least 64K tokens for effective agent operation8. On Apple Silicon Macs with 64 GB+ unified memory, the 120B model comfortably serves 64K context. On 32 GB machines, the 20B model is a better fit.
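If you run DMR through Compose, the top-level models element is where the context size lives. A sketch (assumes a Compose version with the models element and its context_size option; the service and variable names are illustrative):
# compose.yaml
models:
  coder:
    model: ai/gpt-oss:120b
    context_size: 65536      # ~64K tokens, matching Codex CLI's recommendation

services:
  dev:
    image: alpine            # placeholder consumer of the model
    models:
      coder:
        endpoint_var: OPENAI_BASE_URL   # illustrative env var names
        model_var: OPENAI_MODEL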
GPU Backend Selection
DMR automatically selects the best available backend, but you can influence the choice:
- Apple Silicon: Metal is used automatically — no configuration needed
- NVIDIA: Ensure CUDA drivers are installed; enable GPU inference in Docker Desktop settings
- AMD: ROCm support on Linux; requires compatible drivers
- Other GPUs: Vulkan backend provides broad compatibility2
Model Unloading
DMR unloads models from memory when idle, freeing resources for other work2. This is particularly useful when switching between local and cloud profiles — the model is not consuming VRAM while you are using the cloud path.
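You can watch this lifecycle from the CLI (docker model ps is documented; an explicit unload subcommand ships in newer plugin releases, so check docker model --help if it is missing from yours):
docker model ps                        # models currently resident in memory
docker model unload ai/gpt-oss:120b    # evict now rather than waiting for the idle timeout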
Security Considerations
CVE-2026-33990
In April 2026, Docker patched CVE-2026-33990, an SSRF vulnerability in DMR’s OCI Registry Client10. Update Docker Desktop to the latest version to ensure you have the fix. Run docker model version and verify you are on a patched release.
Network Isolation
DMR serves on localhost by default. The TCP endpoint (port 12434) is not exposed to external networks unless explicitly configured4. For additional hardening in enterprise environments:
# In Codex CLI config.toml — restrict to loopback
[model_providers.docker-runner]
name = "Docker Model Runner (local only)"
base_url = "http://127.0.0.1:12434/engines/v1"
Code Privacy
The primary motivation for local inference is keeping proprietary code off third-party servers. With DMR, inference happens entirely on your hardware. No telemetry is sent to Docker or model publishers during inference2.
Comparison with Ollama
Both Ollama and DMR serve as local model providers for Codex CLI. The choice depends on your existing toolchain:
Choose DMR when:
- Docker Desktop is already part of your workflow
- You want OCI-native model management (push/pull/tag like container images)
- You need vLLM for higher-throughput production-like inference
- You want first-party Codex CLI skills via docker model skills3
- You need Vulkan GPU support for non-NVIDIA hardware
Choose Ollama when:
- You prefer a standalone, minimal installation
- You use ollama launch codex for zero-config startup8
- Your team has standardised on Ollama's model library
- You want the built-in --oss flag without custom provider configuration
Both can coexist. You can define providers for both and switch via profiles:
[model_providers.docker-runner]
name = "Docker Model Runner"
base_url = "http://localhost:12434/engines/v1"
[model_providers.local_ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
[profiles.dmr]
model_provider = "docker-runner"
model = "ai/gpt-oss:120b"
[profiles.ollama]
model_provider = "local_ollama"
model = "gpt-oss:120b"
Limitations
- Context window ceiling: Local models are constrained by available RAM/VRAM. The practical ceiling is 64K tokens for most developer hardware, versus 400K for GPT-5.5 in the cloud18.
- No server-side compaction: Codex CLI’s compaction endpoint is an OpenAI Responses API feature. Local providers cannot use it, so long sessions degrade faster1.
- No prompt caching: OpenAI’s prompt caching is server-side. Local inference pays full compute cost on every request.
- Tool-calling fidelity: GPT-OSS models support Codex's tool protocol, but community models (Qwen, Gemma, DeepSeek) may have lower tool-calling reliability, particularly with apply_patch6.
- wire_api compatibility: DMR serves the Chat Completions API. Codex CLI's built-in OpenAI provider uses the Responses API (wire_api = "responses"). Custom providers default to "openai" (Chat Completions), which works with DMR but lacks some Responses API features7.
Putting It All Together
A complete ~/.codex/config.toml for a hybrid local-cloud workflow:
# Default: cloud model
model = "gpt-5.5"
# Docker Model Runner provider
[model_providers.docker-runner]
name = "Docker Model Runner"
base_url = "http://localhost:12434/engines/v1"
# Profiles
[profiles.private]
model_provider = "docker-runner"
model = "ai/gpt-oss:120b"
reasoning_effort = "medium"
[profiles.quick]
model_provider = "docker-runner"
model = "ai/gpt-oss:20b"
reasoning_effort = "low"
[profiles.deep]
model = "gpt-5.5"
reasoning_effort = "high"
Daily workflow:
# Morning: quick local test generation on private code
codex --profile quick "Generate edge-case tests for src/billing/"
# Afternoon: private refactoring session
codex --profile private "Extract the payment gateway into a separate module"
# Complex migration: switch to cloud
codex --profile deep "Migrate the REST API from v2 to v3 spec"
Citations
1. OpenAI, "Models — Codex," April 2026. https://developers.openai.com/codex/models
2. Docker, "Docker Model Runner," April 2026. https://docs.docker.com/ai/model-runner/
3. Docker, "docker model skills — Install Docker Model Runner skills for AI coding assistants," April 2026. https://docs.docker.com/reference/cli/docker/model/skills/
4. Docker, "DMR REST API," April 2026. https://docs.docker.com/ai/model-runner/api-reference/
5. Docker, "Get started with DMR," April 2026. https://docs.docker.com/ai/model-runner/get-started/
6. OpenAI, "Open-Weight Models for Codex CLI," April 2026. https://developers.openai.com/codex/models
7. OpenAI, "Advanced Configuration — Codex," April 2026. https://developers.openai.com/codex/config-advanced
8. Ollama, "Codex — Ollama Integration," April 2026. https://docs.ollama.com/integrations/codex
9. OpenAI, "Introducing GPT-5.2-Codex," 28 April 2026. https://openai.com/index/introducing-gpt-5-2-codex/
10. Docker, "Docker Desktop Release Notes — April 2026," April 2026. https://releasebot.io/updates/docker/docker-desktop