Testing & Code Review
Articles on test strategies, benchmarks, code review workflows and quality assurance with Codex.
95 articles
The Human Review Bottleneck: Practical Code Review Strategies for Agent Output
AI coding agents have solved the wrong half of the problem. Teams using Codex CLI, Claude Code, and similar tools report generating 98% more pull requests.
Codex CLI in GitHub Actions: Best Practices, Limitations, and Gotchas
The openai/codex-action@v1 GitHub Action transforms Codex CLI from an interactive developer tool into a CI/CD workhorse — reviewing pull requests.
Codex CLI Session Patterns: A Decision Framework for Threads, Worktrees, /side, Goals, and Subagents
Codex CLI v0.133 ships with five distinct session patterns, each designed for a different shape of work. Choosing the wrong pattern does not break anything.
Spec-Driven Development Frameworks for Codex CLI: Patterns, Best Practices, and the 2026 Landscape
Spec-driven development has become the dominant methodology for AI-assisted coding in 2026.
Codex CLI Prompt Engineering in the GPT-5.5 Era: Outcome-First Patterns, Anti-Patterns, and the Prompts That Ship Code on the First Turn
The single most common question in the OpenAI developer forum is some variation of "Why does Codex produce garbage for me but magic for everyone else?"
Gemini 3.5 Flash vs GPT-5.5 and codex-mini: Coding Model Benchmark Comparison After Google I/O 2026
Google I/O 2026 dropped Gemini 3.5 Flash on 19 May with a bold claim: it beats Gemini 3.1 Pro on coding benchmarks whilst running four times faster than.
Codex CLI for Consumer-Driven Contract Testing: Pact Generation, Provider Verification, and CI Contract Gates
Consumer-driven contract testing solves one of the thorniest problems in microservice architectures: how do you know your services are compatible before.
Grok Build Enters the Ring: How xAI's Parallel-Agent CLI Compares to Codex CLI
On 14 May 2026, Elon Musk posted a broad call for beta testers of Grok Build, xAI's first terminal-native coding agent. The tool enters a market dominated.
Coverage-Driven Test Generation with Codex CLI: Closing Gaps Using Istanbul, Coverage.py, and Agent Workflows
Every engineering team has coverage gaps — untested error handlers, edge-case branches nobody thought to exercise, and legacy modules with zero assertions.
Building Custom Code Review Pipelines with the Codex SDK: Structured Findings Across GitHub, GitLab, and Azure DevOps
Codex ships with built-in GitHub pull request review: enable it in settings and every PR gets an automatic @codex review pass.
Property-Based Testing and Fuzzing with Codex CLI: Agent-Driven Edge-Case Discovery Using Hypothesis and fast-check
Example-based unit tests verify the cases you thought of. Property-based tests verify the cases you didn't. The difference matters most in parsing.
GPT-5.3-Codex Deep Dive: Benchmarks, CLI Configuration, and Interactive Coding Workflows
GPT-5.3-Codex landed on 5 February 2026 as OpenAI's flagship coding model, promising industry-leading agentic performance alongside a 25% speed improvement.
Codex CLI for Kubernetes Operator Development: Scaffolding CRDs, Writing Reconciliation Loops, and Testing with envtest
Building a Kubernetes operator is one of the most structurally demanding tasks in cloud-native Go development. You need a Custom Resource Definition that.
Google Antigravity vs Codex CLI: Multi-Agent IDE Meets Terminal-First Agent in the 2026 Coding Wars
Google Antigravity landed in public preview on 20 November 2025 and has since grown into the most serious IDE-native challenger to terminal-first agents.
How Developers Actually Configure Agentic Coding Tools: What 2,926 Repositories Reveal About the Codex CLI Adoption Gap
A new empirical study of nearly three thousand GitHub repositories has quantified something most Codex CLI practitioners have sensed intuitively.
Prompting GPT-5.5 in Codex CLI: Outcome-First Instructions, AGENTS.md Patterns, and Reasoning Effort Tuning
GPT-5.5 landed in Codex CLI in late April 2026 as OpenAI's newest frontier model, bringing stronger planning, tool use, and multi-step follow-through.
The AI Coding Agent Quality Crisis: What the Opsera and Sourcery Intel 2026 Reports Reveal — and How to Configure Codex CLI to Stay Ahead of the Data
Two major industry reports landed in early 2026 and painted a sobering picture: AI coding agents demonstrably accelerate delivery, but they also introduce.
Reviewing Agent Pull Requests: What 23,000 PRs Reveal About Description Accuracy and How to Configure Codex CLI for Trustworthy Contributions
More than one in five code reviews on GitHub now involves an AI coding agent. With Codex CLI recording 90 million installs in a single week and the broader.
ProgramBench and the Zero-Percent Problem: What a Cleanroom Benchmark Reveals About Architectural Reasoning in Codex CLI
On 5 May 2026, researchers from Meta Superintelligence Labs, Stanford, and Harvard published ProgramBench.
The Codex CLI Instruction Stack: How Six Configuration Surfaces Shape Agent Behaviour
Codex CLI does not read a single instruction file. It assembles a composite instruction set from six distinct surfaces, each with its own scope, precedence.
Codex CLI Official Workflow Recipes: Nine Patterns That Structure the Developer Loop
OpenAI's developer documentation now includes a dedicated Workflows page that codifies nine canonical patterns for using Codex CLI across the software.
Codex CLI for Ruby on Rails Teams: RuboCop MCP, RSpec Workflows, and Convention-Friendly AGENTS.md Patterns
Rails has always been opinionated about structure. Models live in app/models/, controllers in app/controllers/, views in app/views/.
PRDBench and the PRD-to-Code Gap: Why Building From Specs Is Harder Than Fixing Bugs
Most coding agent benchmarks ask a deceptively narrow question: can the agent fix this bug? SWE-bench and its variants hand the model a failing test and a.
ProdCodeBench and Production-Derived Evaluation: Why Synthetic Benchmarks Mislead and How to Evaluate Codex CLI Against Real Workloads
Most teams selecting a coding agent rely on public leaderboards — SWE-bench Verified, Terminal-Bench 2.0, Aider Polyglot — to inform their choice. These.
Codex CLI for Visual Regression Testing: Integrating Percy, Chromatic, and Playwright via MCP
Visual regression testing — the practice of capturing screenshots and comparing them pixel-by-pixel against approved baselines — has traditionally required.
Codex CLI Skills for OSS Maintenance: Lessons from OpenAI's Own Agents SDK Repositories
OpenAI practises what it preaches. In March 2026 the company published a detailed case study showing how Codex CLI skills transformed maintenance of its two.
Terminal Agent Showdown: Codex CLI vs Claude Code vs Gemini CLI in May 2026
The terminal agent race has intensified since the three-way contest crystallised in late 2025. OpenAI's Codex CLI (v0.128.0, Rust-native), Anthropic's.
Anatomy of a Production AGENTS.md: What the openai/codex Repository Teaches About Agent-Aware Codebase Configuration
Most AGENTS.md guides tell you what sections to include. Few show you a battle-tested file from a codebase where agents write production code daily.
Codex CLI Multi-File Editing Strategies: Coordinating Changes Across Large Pull Requests with apply_patch and Subagents
Every senior developer knows the pain: a rename that touches forty files, an API migration that ripples through three service boundaries, a framework.
Codex CLI Daily Driver Setup for May 2026: An Opinionated Configuration Guide
Codex CLI v0.128 is the most configurable release yet. Between named profiles, persistent memories, configurable keymaps, goal workflows.
Specification Drift and SLUMP: Why Codex CLI Loses Faithfulness in Long-Horizon Sessions and How to Fight Back
Every developer who has used a coding agent for a multi-hour session has felt it: somewhere around the thirtieth turn, the agent starts building something.
The Code Review Agent Benchmark: What CR-bench Reveals and How to Configure Codex CLI for Higher-Quality Reviews
Every team that has enabled automated code review — whether through Codex's GitHub integration, Claude Code, Devin, or the open-source PR-Agent.
Do Agent-Written Tests Actually Help? What Six LLMs on SWE-bench Reveal and How to Rethink Your Codex CLI Testing Strategy
The instinct to make coding agents write tests is strong — and understandable. Test-driven development has been a pillar of professional software.
The Over-Mocking Problem: What 1.2 Million Commits Reveal About Agent-Generated Tests and How to Configure Codex CLI for Realistic Test Output
A new empirical study accepted at MSR 2026 analysed 1.2 million commits across 2,168 repositories and found that coding agents generate mocks in 36% of their.
Agent-Generated Code Churns Faster: What 110,000 Pull Requests Reveal and How to Configure Codex CLI for Durable Output
A new MSR 2026 study of 110,000 open-source pull requests across five coding agents finds that agent-generated code is rewritten and deleted significantly.
The AI Coding Productivity Paradox: What Three Major Studies Reveal and How to Configure Codex CLI for Genuine Speed Gains
Ninety-three per cent of developers now use AI coding tools. Adoption is near-universal. Yet three independent research programmes — METR's randomised.
Agentic Harness Engineering: What Observability-Driven Evolution Means for Your Codex CLI Configuration
A paper published on 29 April 2026 by Lin et al. introduces Agentic Harness Engineering (AHE), a closed-loop framework that automatically evolves.
Interaction Smells in Codex CLI Sessions: Recognising and Fixing Multi-Turn Prompt Anti-Patterns
Every senior developer knows about code smells — structural patterns that hint at deeper problems. A March 2026 empirical study from Zhang et al. introduces.
Agent Psychometrics: Predicting Which Tasks Your Codex CLI Agent Will Ace and Which It Will Botch
Not every coding task is created equal, and neither is every agent. A new framework out of the ICLR 2026 Workshop on Agents in the Wild formalises something.
GPT-5.2-Codex: What the New Agentic Coding Model Means for Your Codex CLI Workflows
On 28 April 2026, OpenAI released GPT-5.2-Codex, a variant of GPT-5.2 purpose-built for agentic coding workflows. Unlike GPT-5.5, which targets breadth.
Self-Hosted Code Review Pipelines with Codex CLI: Structured Output Across GitHub Actions, GitLab CI, Azure DevOps, and Jenkins
Codex Cloud's built-in PR review is convenient if your team lives on GitHub. But enterprise teams running GitLab, Azure Repos, Bitbucket, or on-premises.
Task Decomposition for Codex CLI: Right-Sizing Agent Work for Reliability, Speed, and Cost
The single biggest determinant of whether a Codex CLI session succeeds or spirals into wasted tokens is not the model you pick.
Evaluation Exploitation in Codex CLI Workflows: Why Your Agent Games the Score and How to Stop It
Yesterday's article on scored improvement loops showed how Codex CLI can iterate autonomously against an evaluation harness until quantitative and.
Git Hooks Powered by Codex CLI: Pre-Commit Review, Commit Message Generation, and Pre-Push Validation
Git hooks are the last line of defence before code leaves your machine. Most teams wire them up to linters, formatters, and type-checkers — fast.
Codex CLI for Flutter and Dart Teams: MCP Server, DCM, and Agent-Driven Cross-Platform Development
Flutter's widget-based architecture, Dart's strong type system, and the framework's rapid feedback loop (hot reload.
Codex CLI for Open Source Maintainers: Issue Triage, PR Review, and Contributor Automation at Scale
Open source maintainers face a compounding problem: issue volumes grow faster than review capacity. A popular project with 50 open issues per week and three.
Codex CLI for GraphQL Development: Apollo MCP Server, Schema-First Workflows, and Type-Safe Agent Patterns
GraphQL APIs occupy an unusual position in the coding agent landscape. The typed schema, introspection capabilities, and operation-level granularity that.
Test-Driven Development with Codex CLI: Agent-Driven Red-Green-Refactor Workflows
The single most reliable technique for getting consistently correct output from a coding agent is also one of the oldest ideas in software engineering.
Error Recovery and Rollback Patterns for Codex CLI: Git Safety Nets for Agentic Workflows
Coding agents move fast. A single Codex CLI session can touch thirty files in under a minute, and when something goes wrong.
Debugging with Codex CLI: Systematic Bug-Hunting Patterns for GPT-5.5
Debugging is one of the highest-leverage uses of Codex CLI, yet most practitioners treat it as an afterthought.
Codex CLI for Django and FastAPI Teams: AGENTS.md Templates, Sandbox Configuration, and Python Web Development Workflows
Python web frameworks remain the backbone of backend development for millions of teams, yet Codex CLI's documentation and community guides lean heavily.
Codex CLI for PHP and Laravel Teams: Boost MCP, Pest Workflows, and Composer Sandbox Patterns
PHP powers roughly 75% of websites with a known server-side language, and Laravel remains the dominant framework — Laravel 13 shipped on 17 March 2026 with.
Codex CLI for React Native and Expo: First-Party Skills, Plugins, and Mobile Development Workflows
React Native and Expo have always attracted developers who want to move fast. In 2026, that ethos extends to AI-assisted development.
Community Workflow Frameworks for Codex CLI: Superpowers, GSD, gstack, Spec Kit, OMX, and Compound Engineering Compared
Codex CLI ships with a deliberately minimal orchestration layer: an agent loop, a sandbox, hooks, and skills.
DeepSeek V4 as a Codex CLI Provider: Frontier-Class Coding at a Fraction of the Cost
DeepSeek V4 landed today — 24 April 2026 — and the numbers deserve attention. V4-Pro scores 80.6% on SWE-bench Verified while charging $3.48 per million.
GPT-5.5 Drops: What Changes for Codex Users
Six weeks. That is the gap between GPT-5.4 and GPT-5.5. OpenAI released its newest frontier model on 23 April 2026, rolling it out simultaneously to ChatGPT.
Contract-Driven API Development with Codex CLI: Using Specmatic MCP for Spec-First Full-Stack Builds
Most agentic coding workflows suffer from the same failure mode: the agent generates code that compiles, passes its own tests.
The AI Codebase Maturity Model: Mapping Five Levels of Agent Autonomy to Codex CLI
Most teams plateau at prompt-and-review. They install Codex CLI, generate a few fixes, manually inspect the diffs.
Benchmarking Your Agentic Pod: What CocoaBench, HiL-Bench, and AAR Tell Us About Agent Limits
Three benchmarks published in April 2026 expose where frontier coding agents actually break down — and the failure modes they reveal map directly onto.
Beyond SWE-bench: Why AI Coding Benchmarks Are Broken and What It Means for Codex CLI Workflows
In April 2026, the AI coding agent ecosystem relies heavily on benchmark scores to signal capability. Marketing pages trumpet SWE-bench Verified.
The Harness Effect: Why the Same Model Scores 16 Points Higher in a Different Tool
Claude Opus running inside Cursor scores 93% on Terminal-Bench 2.0. The same model running inside Claude Code scores 77%. That is a 16-point differential.
Why Coding Agents Fail at Navigation (and How AGENTS.md File Maps Fix It)
Your coding agent can refactor a function, write tests, and call APIs — but ask it to find the right file in a monorepo.
Before and After: 5 Developer Workflows Transformed by Codex CLI
Every developer has workflows they endure rather than enjoy — the 45-minute bug-fix cycle, the mind-numbing PR review backlog, the test coverage debt that.
Why Code Review Agents Produce 60% Noise — and How to Configure Codex CLI Reviews That Don't
A new empirical study accepted at MSR 2026 has quantified what many practitioners already suspected: most code review agents produce predominantly noisy.
What 33,000 Agentic Pull Requests Reveal: Empirical Lessons for Codex CLI Practitioners
AI coding agents are no longer experimental curiosities — they now submit hundreds of thousands of pull requests to real repositories every month.
Codex CLI SWE-Bench Scores and Benchmark Results Explained
OpenAI's Codex models consistently top the SWE-Bench leaderboards, but what do those numbers actually mean? This article breaks down the benchmark variants.
The Automated Review-Fix Loop: CodeRabbit, Cross-Provider Review, and Closing the Quality Gap in Agent-Generated Code
Agent-generated code ships fast, but quality remains the bottleneck. The Hacker News consensus on Codex is blunt: the code can be quite sloppy and.
Codex App Computer Use: Background GUI Automation on macOS Without Surrendering Your Desktop
On 16 April 2026, OpenAI shipped Computer Use in the Codex desktop app (version 26.415), enabling agents to operate macOS applications by seeing the screen.
The Security Decisions AI Agents Make: What Codex and Claude Code Miss When You Don't Ask
Every time you prompt Codex or Claude Code to "build me a web app", the agent silently makes dozens of security decisions on your behalf.
Codex as a GitHub Coding Agent: Agent HQ, Model Selection, and Cloud-Based Code Review
Most coverage of Codex focuses on the CLI — the open-source terminal agent you install with npm install -g @openai/codex.
TDAD and Graph-Based Test Impact Analysis: Cutting Codex CLI Regressions by 70%
Autonomous coding agents resolve issues faster than most developers expect. What they also do — with uncomfortable regularity — is break things that already.
What the ETH Zurich Paper Gets Wrong (and Right) About AGENTS.md
In February 2026, researchers at ETH Zurich published a paper that sent shockwaves through the AI-assisted development community: Evaluating AGENTS.md.
Testing Codex CLI Skills: The Official Eval Pipeline with codex exec, JSONL Traces, and Skillgrade
Skills are becoming the primary unit of reusable workflow in Codex CLI. But a skill without evaluation is a guess — you have no idea whether a SKILL.md.
Gemma 4 on Codex CLI vs Claude Code: Same Model, Different Results
Joe Njenga recently documented his experience running Gemma 4 with Claude Code. I spent the same week running Gemma 4 with Codex CLI on two machines.
The Official Codex CLI Best Practices Decoded: OpenAI's Six-Stage Workflow Maturity Model
OpenAI recently published a canonical best practices guide at developers.openai.com/codex/learn/best-practices.
Evaluating Codex CLI Agents with Promptfoo: Trajectory Assertions, Cost Guards, and Structured Grading
Standard LLM evals check whether a model returns the right text. Agent evals are a different beast entirely: two agents can produce identical final outputs.
From Codex to GPT-5.4: The Complete History of OpenAI's Code Models
In July 2021, OpenAI published a paper describing a GPT-3 model fine-tuned on 159 gigabytes of Python code from 54 million GitHub repositories. They called.
Test-Driven Development with Codex CLI: The Red-Green-Refactor Loop, AGENTS.md Test Gates, and Hook-Based Verification
The TDD AI agent pattern has emerged as the most reliable way to execute autonomous coding in 2026.
Harness Performance on Terminal-Bench: Why Scaffolding Matters More Than Model Choice
Terminal-Bench 2.0 has become the definitive benchmark for evaluating AI coding agents in realistic terminal environments. Published at ICLR 2026, it tests.
Skill Creator V2 and Codex CLI: Scientific Skill Improvement Without the Token Bill
Anthropic's Skill Creator V2 — available at skills.sh — promises scientific evaluation of agent skills. It launches parallel sub-agent executions, runs.
Codex CLI vs Claude Code Multi-Agent: Subagents, Agent Teams and the Protocol Gap
The two dominant terminal-native coding agents — OpenAI's Codex CLI and Anthropic's Claude Code — have each shipped multi-agent capabilities, but with.
Tessl Skill Evaluation Framework: Treating Agent Skills as Production Software
You have written a skill for your coding agent. It looks right. It seems to work when you try it.
Automating the Cross-Model Review Loop: Three Levels from SKILL.md to Multi-AI Pipeline
The cross-model review pattern — where one AI writes code and a structurally different AI reviews it — has become a core quality practice in agentic.
Codex Cloud Exec Best-of-N: Running Multiple Solution Attempts and Picking the Winner
One of the quieter but most impactful features in Codex CLI's cloud offering is the --attempts flag on codex cloud exec. First shipped in the June 2025.
Codex CLI Code Review Workflows: /review, review_model, and the MCP Extension
The /review command is one of Codex CLI's most practical daily-use features, yet it receives surprisingly little attention compared to the agent.
Spec-Driven Development with Codex: Writing Specifications Before Code
Test-Driven Development (TDD) tells the agent when it is done. Spec-Driven Development (SDD) tells it what to build in the first place. The two approaches.
Cross-Model Adversarial Review: Using Multiple AI Models to Catch Agent Blind Spots
The moment your coding agent reviews its own output, you have a problem. Not because the agent is dishonest.
Evaluating Codex Agents: Evals, Long-Horizon Benchmarks, and the 4-File Pattern
How to evaluate whether your Codex agent actually did the right thing — from quick skill evals to 25-hour autonomous runs.
Test-First Development with Codex: Using TDD as the Agent Feedback Loop
The single biggest problem with autonomous agents is knowing when they're done. A human developer can feel when code feels right. An agent cannot.
Planning Mode in Practice: When to Use It and When to Skip It
Most developers activate planning mode once, see an agent propose a numbered list, and then leave it on permanently — or switch it off after a frustrating.
Codex CLI Automatic Code Review: PR Integration and Pre-Commit Workflows
Code review is where most AI coding tools stop short. Codex CLI closes the loop by providing automated review at every stage of the Git workflow.
Compound Engineering with Codex: The 80/20 Plan-Review Model
Based on: every.to/guides/compound-engineering · github.com/EveryInc/compound-engineering-plugin · notes/compound-engineering.md
The Proof of Work Principle: Why Agents Need to Show Their Working
There is a habit that developers have fallen into when working with autonomous coding agents: they read the diff, nod, and merge. The code arrived from.
Staying Engaged with Your Codebase in an Agentic World
There is a specific feeling that sets in about two weeks after you start delegating heavily to Codex.
Codex CLI in Practice: Real-World Benchmarks and What They Mean
Benchmark numbers dominate marketing copy, but most developers lack the context to interpret them critically.