N-Version Programming with Coding Agents: What the Knight–Leveson Replication Means for Codex CLI Multi-Model Resilience
The Forty-Year Experiment, Replayed with Agents
In 1986, Knight and Leveson ran a landmark experiment at the University of Virginia and the University of California, Irvine: twenty-seven programmers independently implemented the same Launch Interceptor Program (LIP) specification, and the results challenged the fundamental assumption that independent development yields statistically independent failures 1. Forty years later, Ron, Baudry, and Monperrus at KTH Royal Institute of Technology have replayed that experiment — replacing human programmers with contemporary AI coding agents 2.
The findings are directly relevant to anyone using Codex CLI in production: diversity helps, but shared training corpora create tighter coupling than shared university curricula ever did.
The Experiment
The researchers took the original Knight–Leveson LIP specification — a four-stage computation comprising fifteen Boolean Launch Interceptor Conditions (LICs), a Preliminary Unlocking Matrix, a Final Unlocking Vector, and a launch decision 1 — and fed it to five agent systems across twenty-three models and three target languages 2.
Agent Systems and Models
| Agent System | Models | Admitted |
|---|---|---|
| Claude Code | Opus 4.6, Opus 4.5, Sonnet 4.6, Sonnet 4.5, Haiku 4.5 | 13/15 (87%) |
| OpenAI Codex | GPT-5.4, 5.4-mini, 5.3-codex, 5.2-codex, 5.2 | 11/15 (73%) |
| Gemini | 3.1-pro-preview, 3-flash-preview, 2.5-pro, 2.5-flash, 2.5-flash-lite | 8/15 (53%) |
| Cursor | composer-2.5, composer-2 | 6/6 (100%) |
| OpenCode | Qwen variants, Gemma variants | 10/18 (56%) |
Target languages were Pascal, Python, and Rust. Of 69 attempted [harness, model, language] triples, 48 passed the 200-case acceptance filter — a 70% admission rate 2.
The Test Campaign
Following the original protocol, each admitted implementation was tested against one million randomised inputs using a shared oracle 2. With 48 admitted versions and one million cases each, this is a large-scale diversity evaluation of AI-generated code rather than a toy benchmark.
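The campaign described above boils down to a seeded loop that compares each version with the oracle. The following is a minimal sketch under stated assumptions: `toy_oracle` and `toy_buggy` are hypothetical stand-ins, not the LIP specification or any implementation from the paper, and the input generator is invented for illustration.

```python
import math
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Program = Callable[[List[Point]], bool]


def count_failures(impl: Program, oracle: Program, n_cases: int, seed: int = 0) -> int:
    """Count inputs on which impl disagrees with the oracle."""
    rng = random.Random(seed)  # fixed seed so every version sees the same inputs
    failures = 0
    for _ in range(n_cases):
        points = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(3)]
        if impl(points) != oracle(points):
            failures += 1
    return failures


def toy_oracle(points: List[Point]) -> bool:
    """Hypothetical predicate: some consecutive pair is more than 15 apart."""
    return any(math.dist(p, q) > 15 for p, q in zip(points, points[1:]))


def toy_buggy(points: List[Point]) -> bool:
    """Hypothetical faulty version: only inspects the first pair."""
    return math.dist(points[0], points[1]) > 15


if __name__ == "__main__":
    print(count_failures(toy_buggy, toy_oracle, n_cases=2000, seed=1))
```

Reusing one seed for every version matters: coincident failures can only be counted if all versions are evaluated on identical inputs.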
Results: Diversity Helps, But Common-Mode Failure Persists
The Good News
Across all possible three-version majority-voting units (C(48,3) = 17,296 triples), the mean failure count dropped from 387.44 for single versions to 130.99 for triples — a 66.2% reduction 2. More striking: 11,844 triples (68.48%) exhibited zero observed failures across one million test inputs, compared with 27 single versions (56.25%) achieving zero failures 2.
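The headline figures in this paragraph can be rechecked from the numbers quoted in the text. The sketch below uses only those reported values and recomputes nothing from the underlying data.

```python
from math import comb

versions = 48
triples = comb(versions, 3)                  # all three-version voting units
mean_single, mean_triple = 387.44, 130.99    # mean failure counts quoted above
reduction = 1 - mean_triple / mean_single
zero_triples, zero_singles = 11_844, 27      # units with zero observed failures

print(triples)                               # 17296
print(f"{reduction:.1%}")                    # 66.2%
print(f"{zero_triples / triples:.2%}")       # 68.48%
print(f"{zero_singles / versions:.2%}")      # 56.25%
```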
The upper-tail compression is dramatic:
| Percentile | Single Version | Triple (Majority Vote) |
|---|---|---|
| P95 | 429 | 419 |
| P99 | 6,004 | 419 |
| Maximum | 10,469 | 419 |
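To see why voting compresses the upper tail, it helps to count failures per voting unit. The sketch below assumes a Boolean output, so any two wrong versions necessarily agree and outvote the third. The failure sets are invented toy data, not results from the paper.

```python
from itertools import combinations
from typing import Dict, Set


def triple_failure_count(a: Set[int], b: Set[int], c: Set[int]) -> int:
    """Inputs on which at least two of the three versions fail."""
    return len((a & b) | (a & c) | (b & c))


def summarise(failures: Dict[str, Set[int]]) -> Dict[str, float]:
    """Aggregate failure counts over every three-version voting unit."""
    counts = [
        triple_failure_count(failures[x], failures[y], failures[z])
        for x, y, z in combinations(sorted(failures), 3)
    ]
    return {
        "triples": len(counts),
        "mean": sum(counts) / len(counts),
        "max": max(counts),
        "zero_failure_triples": sum(c == 0 for c in counts),
    }


# Hypothetical failing-input IDs per version.
toy = {"v1": {1, 2, 3}, "v2": {2, 3}, "v3": set(), "v4": {3, 9}}
print(summarise(toy))
```

In the toy data the worst single version fails on three inputs, while no triple fails on more than two: an outlier's unique failures are outvoted, and only shared failures survive.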
The Bad News
The Knight–Leveson z-statistic — which tests whether coincident failures exceed what statistical independence would predict — came in at z = 29.20 with a p-value of approximately 1.765 × 10⁻¹⁸⁷ 2. Independence is comprehensively rejected. The common-mode failure factor was 3.7×: 429 observed coincident failures against 115.36 expected under independence 2.
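The z-statistic and the 3.7× factor can be reproduced from the figures in this paragraph. This sketch assumes the paper follows the original study's normal approximation to a binomial count of coincident failures over one million inputs; the two-sided tail convention is likewise an assumption.

```python
import math

n = 1_000_000          # test inputs
k_observed = 429       # coincident failures observed
mu_expected = 115.36   # coincident failures expected under independence

p_multi = mu_expected / n                      # per-input probability under independence
sigma = math.sqrt(n * p_multi * (1 - p_multi))
z = (k_observed - mu_expected) / sigma
ratio = k_observed / mu_expected
p_two_sided = math.erfc(z / math.sqrt(2))      # normal tail probability

print(f"z = {z:.2f}")           # z = 29.20
print(f"ratio = {ratio:.1f}x")  # ratio = 3.7x
print(f"p = {p_two_sided:.3e}")
```

Under these assumptions the computed tail probability lands on the same order of magnitude as the reported p-value, which is all a back-of-envelope check can promise.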
Per-language analysis reveals the problem intensifies with less common training data:
| Language | z-statistic | K/μ ratio |
|---|---|---|
| Python | 80.69 | 17.5× |
| Rust | 186.93 | 155.1× |
| Pascal | 253.30 | Highest |
Python, with the richest training corpus, shows the weakest coupling. Pascal, with the thinnest, shows the strongest 2.
Where Failures Concentrate
Failures clustered around two specification ambiguities 2:
- LICs #9 and #14 (Minimum Enclosing Circle): Most implementations incorrectly computed the circumcircle instead of the minimum enclosing circle for three-point predicates, a subtle geometric distinction that the specification defines but does not emphasise.
- LICs #3 and #10 (Angle Predicates): Ambiguity between interior and complementary angle representation, compounded by relative-tolerance comparison semantics.
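The circumcircle and the minimum enclosing circle coincide for acute triangles and diverge for obtuse ones, which is why the confusion can survive casual testing. The sketch below illustrates the geometry only; the function names and the radius threshold are illustrative and are not taken from the LIP specification.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def circumradius(a: Point, b: Point, c: Point) -> float:
    """Radius of the circle through all three points (infinite if collinear)."""
    ab, bc, ca = math.dist(a, b), math.dist(b, c), math.dist(c, a)
    area = abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2
    if area == 0:
        return math.inf
    return ab * bc * ca / (4 * area)


def min_enclosing_radius(a: Point, b: Point, c: Point) -> float:
    """Radius of the smallest circle that contains all three points."""
    s1, s2, longest = sorted([math.dist(a, b), math.dist(b, c), math.dist(c, a)])
    # Right, obtuse, or collinear: the longest side is a diameter.
    if longest ** 2 >= s1 ** 2 + s2 ** 2:
        return longest / 2
    return circumradius(a, b, c)


obtuse = ((0.0, 0.0), (4.0, 0.0), (2.0, 0.5))
radius = 3.0  # illustrative threshold
print(min_enclosing_radius(*obtuse) <= radius)  # True: the points fit
print(circumradius(*obtuse) <= radius)          # False: the circumcircle does not
```

A version that uses the circumcircle gives the wrong Boolean answer only on obtuse triples whose two radii straddle the threshold, a small slice of the input space.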
The pairwise correlation analysis is sobering: of 146 pairs with defined Pearson φ correlation, 81 showed perfect correlation (φ = 1) — identical failure patterns across language boundaries 2. Language and agent diversity did not eliminate correlated failures rooted in specification ambiguity.
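For two versions, the φ coefficient is the Pearson correlation of their binary fail/pass vectors. It is undefined when either version never fails or always fails, which is consistent with the text counting only pairs "with defined" correlation. A minimal sketch, with failure sets as hypothetical input IDs:

```python
import math
from typing import Optional, Set


def phi(fail_a: Set[int], fail_b: Set[int], n: int) -> Optional[float]:
    """Phi coefficient between two versions' failure indicators over n inputs."""
    n11 = len(fail_a & fail_b)       # both fail
    n10 = len(fail_a) - n11          # only A fails
    n01 = len(fail_b) - n11          # only B fails
    n00 = n - n11 - n10 - n01        # neither fails
    denom = math.sqrt(
        len(fail_a) * (n - len(fail_a)) * len(fail_b) * (n - len(fail_b))
    )
    if denom == 0:
        return None                  # undefined: a version with constant outcome
    return (n11 * n00 - n10 * n01) / denom


print(phi({1, 2}, {1, 2}, 10))   # 1.0: identical failure patterns
print(phi(set(), {1}, 10))       # None: first version never fails
```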
Why This Matters More for Agents Than for Humans
The original Knight–Leveson experiment challenged N-version programming orthodoxy by showing human programmers fail dependently. But the agent replication reveals an amplification of this effect: shared training corpora create even tighter coupling than shared curricula 2.
This has a precise mechanism. When five agents all train on (or distil from) overlapping code corpora, they inherit the same algorithmic biases. The circumcircle-versus-minimum-enclosing-circle confusion is not random — it reflects a systematic gap in how this geometric concept is represented across GitHub, Stack Overflow, and textbook code.
```mermaid
flowchart TB
    subgraph "Shared Training Corpora"
        TC[Overlapping Code Datasets]
    end
    subgraph "Agent Systems"
        A1[Claude Code]
        A2[Codex]
        A3[Gemini]
        A4[Cursor]
        A5[OpenCode]
    end
    subgraph "Common-Mode Failures"
        F1[LIC #9/#14: Circumcircle vs MEC]
        F2[LIC #3/#10: Angle Ambiguity]
    end
    TC --> A1 & A2 & A3 & A4 & A5
    A1 & A2 & A3 & A4 & A5 --> F1 & F2
    subgraph "Mitigation"
        MV[Majority Voting]
        SP[Specification Hardening]
    end
    F1 & F2 --> MV
    F1 & F2 --> SP
    style TC fill:#f9d71c,color:#000
    style MV fill:#4caf50,color:#fff
    style SP fill:#2196f3,color:#fff
```
Mapping to Codex CLI: Practical N-Version Patterns
The research validates a specific engineering strategy: generate diverse implementations and vote. Codex CLI provides the primitives to operationalise this.
Pattern 1: Multi-Model codex exec with Majority Voting
Use named profiles in config.toml to define model-specific configurations, then script parallel generation:
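```toml
# ~/.codex/config.toml
# Assumes providers named "anthropic" and "google" are defined
# under [model_providers] elsewhere in this file.
[profiles.diverse-gpt]
model = "gpt-5.4"

[profiles.diverse-sonnet]
model = "sonnet-4.6"
model_provider = "anthropic"

[profiles.diverse-gemini]
model = "gemini-3.1-pro-preview"
model_provider = "google"
```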
Then generate three implementations in parallel:
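```bash
#!/usr/bin/env bash
# n-version-generate.sh: generate diverse implementations in parallel
SPEC="Implement the following specification exactly: $(cat spec.md)"

for profile in diverse-gpt diverse-sonnet diverse-gemini; do
  mkdir -p "versions/${profile}"
  # --cd gives each run its own working directory so versions stay isolated
  # (confirm the flag with `codex exec --help` for your installed version).
  codex exec --profile "$profile" \
    --cd "versions/${profile}" \
    "$SPEC" &
done
wait
echo "Three versions generated. Running oracle tests..."
```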
The research shows cross-agent diversity (different harnesses, not just different models) produces slightly less correlated failures than same-agent diversity 2. Named profiles change only the model behind a single harness, so to approximate the paper’s cross-agent setup, run Claude Code, Codex, and Gemini through their own CLIs where that is practical.
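Once the versions exist, the voting step itself is small. The following is a minimal sketch of a runtime voter, assuming each version's output is hashable and that outputs are compared by equality. The profile names reuse those from the configuration above, and `majority_vote` is a hypothetical helper, not part of Codex CLI.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def majority_vote(outputs: Dict[str, Hashable]) -> Tuple[Hashable, List[str]]:
    """Return (winning output, dissenting versions); raise without a strict majority."""
    tally = Counter(outputs.values())
    winner, votes = tally.most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise ValueError(f"no strict majority among {len(outputs)} versions")
    dissenters = sorted(name for name, out in outputs.items() if out != winner)
    return winner, dissenters


decision, dissenters = majority_vote(
    {"diverse-gpt": True, "diverse-sonnet": True, "diverse-gemini": False}
)
print(decision, dissenters)  # True ['diverse-gemini']
```

Logging the dissenters is worth the extra line: a version that is repeatedly outvoted points at either a bug in that version or an ambiguity in the specification. The voter cannot detect the case the paper warns about, where all versions agree on the same wrong answer.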
Pattern 2: Cloud --attempts for Best-of-N
For simpler N-version setups within a single model family, `codex cloud exec --attempts N` generates multiple candidates for the same task, which you then review and choose between 3:

```bash
codex cloud exec --env my-env --attempts 3 \
  "Implement the date parser from spec.md with full edge-case coverage"
# Review candidates, apply the best
codex cloud apply TASK_ID --attempt 2
```
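This is single-model diversity, which the paper’s pairwise correlation data suggests is less effective than cross-model diversity 2. Choosing the best of N candidates is also selection rather than voting, so the 66% reduction measured for three-version voting across all 48 versions should not be assumed to carry over unchanged.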
Pattern 3: Language Diversity via AGENTS.md
The paper shows Python implementations have the weakest common-mode coupling (z = 80.69) while Pascal shows the strongest (z = 253.30) 2. For safety-critical components, specify multiple target languages:
```markdown
<!-- AGENTS.md -->
## N-Version Protocol

For any module in `critical/`:

1. Generate the primary implementation in Python
2. Generate a verification implementation in Rust
3. Both must pass the shared oracle in `tests/oracle/`
4. Discrepancies trigger human review before merge

Do NOT use the same algorithmic approach in both versions.
Explicitly instruct: "Use a different algorithm or library than the Python version."
```
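Step 4 of the protocol needs a way to surface discrepancies. One option is differential testing: feed both versions the same seeded inputs and collect every disagreement for review. In the sketch below two Python callables stand in for the Python and Rust versions (in practice the Rust build would be called through a subprocess or binding), and both functions are hypothetical examples of a "round to nearest" requirement read two ways.

```python
import math
import random
from typing import Any, Callable, List, Tuple


def find_discrepancies(
    primary: Callable[[Any], Any],
    verifier: Callable[[Any], Any],
    make_input: Callable[[random.Random], Any],
    n_cases: int = 1000,
    seed: int = 0,
) -> List[Tuple[Any, Any, Any]]:
    """Return (input, primary_output, verifier_output) for every disagreement."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(n_cases):
        x = make_input(rng)
        got_a, got_b = primary(x), verifier(x)
        if got_a != got_b:
            mismatches.append((x, got_a, got_b))
    return mismatches


def round_half_up(x: float) -> int:
    """Reading A: ties round towards positive infinity."""
    return math.floor(x + 0.5)


def round_half_even(x: float) -> int:
    """Reading B: ties round to the nearest even integer (Python's round)."""
    return round(x)


def halves(rng: random.Random) -> float:
    """Multiples of 0.5 in [-10, 10], so that ties actually occur."""
    return rng.randrange(-20, 21) / 2


for x, a, b in find_discrepancies(round_half_up, round_half_even, halves, n_cases=200)[:3]:
    print(x, a, b)
```

Every reported mismatch sits exactly on a tie, which is the kind of signal that belongs in the specification rather than in either implementation.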
Pattern 4: PostToolUse Voting Gate
Automate the voting step with a PostToolUse hook that runs all N versions against a shared test oracle and blocks the change when fewer than a majority pass:
```bash
#!/usr/bin/env bash
# .codex/hooks/post-tool-use-vote.sh
# Assumes the hook runner exports TOOL_NAME and FILE_PATH, and that the oracle
# suite defines a custom --impl pytest option (for example in conftest.py).
if [[ "$TOOL_NAME" == "write_file" && "$FILE_PATH" == critical/* ]]; then
  PASS_COUNT=0
  TOTAL=0
  for version_dir in versions/*/; do
    TOTAL=$((TOTAL + 1))
    # version_dir already ends with a slash
    if pytest "tests/oracle/" --impl="${version_dir}${FILE_PATH}" --quiet; then
      PASS_COUNT=$((PASS_COUNT + 1))
    fi
  done
  MAJORITY=$(( (TOTAL / 2) + 1 ))
  if [[ $PASS_COUNT -lt $MAJORITY ]]; then
    echo "BLOCK: Only ${PASS_COUNT}/${TOTAL} versions pass oracle. Majority required: ${MAJORITY}"
    exit 1
  fi
fi
```
Specification Hardening: The Real Defence
The paper’s most actionable insight is not about voting: it is about where failures concentrate. Both the 1986 and 2026 experiments trace the majority of common-mode failures to specification ambiguity, not implementation difficulty 1, 2.
For Codex CLI workflows, this means:
- Encode edge cases explicitly in `AGENTS.md`. Do not assume the model will infer that “minimum enclosing circle” differs from “circumscribed circle.” Spell out the distinction with examples.
- Provide oracle tests alongside specifications. The paper’s 200-case acceptance filter caught 30% of implementations before the main test campaign. Your `AGENTS.md` should reference test files: "All implementations must pass tests/oracle/ before proceeding." 2
- Flag known ambiguity zones. The paper identifies geometric predicates and tolerance comparisons as systematic failure attractors 2. In your domain, catalogue the equivalent traps (date parsing edge cases, timezone handling, Unicode normalisation) and encode them as explicit test cases.
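An ambiguity catalogue can be as simple as a named list of inputs and expected outputs that every version must satisfy. The sketch below uses leap-year rules as a stand-in for the date-handling traps mentioned above. `EdgeCase`, `LEAP_YEAR_TRAPS`, and `failed_cases` are hypothetical names for illustration, not an existing convention.

```python
import calendar
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EdgeCase:
    name: str       # human-readable description of the trap
    arg: int        # input passed to the implementation
    expected: bool  # answer the specification requires


LEAP_YEAR_TRAPS = [
    EdgeCase("century not divisible by 400", 1900, False),
    EdgeCase("century divisible by 400", 2000, True),
    EdgeCase("ordinary leap year", 2024, True),
    EdgeCase("ordinary common year", 2023, False),
]


def failed_cases(impl: Callable[[int], bool], catalogue: List[EdgeCase]) -> List[str]:
    """Names of catalogue entries the implementation gets wrong."""
    return [case.name for case in catalogue if impl(case.arg) != case.expected]


def naive_is_leap(year: int) -> bool:
    """Plausible but incomplete rule of the kind a catalogue should catch."""
    return year % 4 == 0


print(failed_cases(calendar.isleap, LEAP_YEAR_TRAPS))  # []
print(failed_cases(naive_is_leap, LEAP_YEAR_TRAPS))    # ['century not divisible by 400']
```

Naming each case means a failure report reads as a specification gap ("century not divisible by 400") rather than as a bare failing input.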
```mermaid
flowchart LR
    subgraph "Specification Layer"
        S1[AGENTS.md with<br/>explicit edge cases]
        S2[Oracle test suite<br/>in tests/oracle/]
        S3[Known ambiguity<br/>catalogue]
    end
    subgraph "Generation Layer"
        G1[Model A via<br/>Profile 1]
        G2[Model B via<br/>Profile 2]
        G3[Model C via<br/>Profile 3]
    end
    subgraph "Voting Layer"
        V1[PostToolUse<br/>voting gate]
        V2[Majority decision]
    end
    S1 & S2 & S3 --> G1 & G2 & G3
    G1 & G2 & G3 --> V1
    V1 --> V2
    style S1 fill:#2196f3,color:#fff
    style V2 fill:#4caf50,color:#fff
```
Limitations and Open Questions
The paper acknowledges several constraints that practitioners should weigh 2:
- Cost scaling. Three-version generation triples token consumption. The paper attempted 69 `codex exec`-equivalent generation runs (48 of them admitted) for a single specification. For routine development, reserve N-version patterns for safety-critical or high-value components. ⚠️
- Oracle dependency. Majority voting requires a shared oracle or at minimum a shared test suite. Without one, you are voting on outputs you cannot evaluate. ⚠️
- Specification coverage. The 200-case acceptance filter admitted 70% of implementations, but the failures that admitted versions still showed in the main campaign clustered around specification ambiguities the filter did not cover. Acceptance testing is necessary but not sufficient 2.
- Model lineage coupling. Many “different” models share training data, fine-tuning pipelines, or distillation parents. GPT-5.4-mini and GPT-5.4, for instance, showed tighter failure correlation than GPT-5.4 and Sonnet 4.6 2. True diversity requires cross-provider, cross-architecture generation.
The Bottom Line
Ron, Baudry, and Monperrus have provided the strongest evidence to date that N-version programming with coding agents is a useful engineering strategy — but with caveats that map directly to how you configure Codex CLI 2:
- Cross-model diversity beats same-model repetition. Use named profiles pointing to different providers.
- 66% failure reduction is real. Three-version majority voting compresses the failure distribution dramatically.
- Common-mode failures are systematic, not random. They trace to specification ambiguity and shared training data, not to stochastic sampling.
- Specification hardening is more effective than adding versions. Fix the spec before scaling the vote.
For Codex CLI users, the practical takeaway is: for any code path where correctness matters more than speed, generate at least three implementations using different models, run them against a shared oracle, and vote. The infrastructure (named profiles, `codex exec`, PostToolUse hooks) already exists. The research now supports the approach, provided the specification is hardened first.
Citations
1. Knight, J.C. and Leveson, N.G. (1986). “An Experimental Evaluation of the Assumption of Independence in Multiversion Programming.” IEEE Transactions on Software Engineering, SE-12(1), pp. 96–109. https://ieeexplore.ieee.org/document/1702140
2. Ron, J., Baudry, B., and Monperrus, M. (2026). “N-Version Programming with Coding Agents.” arXiv:2606.20158. https://arxiv.org/abs/2606.20158
3. OpenAI (2026). “Codex CLI Features — Exec Mode.” OpenAI Developer Documentation. https://developers.openai.com/codex/cli/features