Codex CLI for Data Engineering: dbt, Airflow, and Pipeline Generation
Data engineering workflows—building dbt models, orchestrating Airflow DAGs, validating data quality—are repetitive, context-heavy, and ripe for agentic automation. With the dbt MCP server, dbt agent skills, and Codex CLI’s non-interactive execution mode, you can transform a generalist coding agent into a capable data engineering partner that understands your warehouse schema, lineage, and transformation conventions.
This article covers the practical integration of Codex CLI with dbt Core 1.11 [1], Apache Airflow 3 [2], and the dbt MCP server [3] for automated pipeline generation, testing, and review.
Architecture Overview
```mermaid
graph TB
    subgraph "Developer Workstation"
        A[Codex CLI] --> B[dbt MCP Server]
        A --> C[dbt Agent Skills]
        A --> D[AGENTS.md]
    end
    subgraph "dbt Project"
        B --> E[Models]
        B --> F[Sources]
        B --> G[Tests]
        B --> H[Semantic Layer]
    end
    subgraph "Orchestration"
        A --> I[Airflow DAGs]
        I --> J[dbt Operators]
        J --> E
    end
    subgraph "CI/CD"
        K[codex exec] --> L[Data Quality Gates]
        K --> M[Schema Validation]
        K --> N[Lineage Impact Check]
    end
```
Setting Up the dbt MCP Server
The dbt MCP server exposes your project’s models, sources, macros, tests, and lineage to Codex CLI via the Model Context Protocol [3]. Two deployment modes exist:
Local MCP (dbt Core / dbt Fusion)
The local server runs alongside your dbt project and requires no platform account:
```shell
# Install and run the local dbt MCP server
uvx dbt-mcp
```
Configure it in your Codex CLI config (`~/.codex/config.toml`), where each MCP server is a named table under `mcp_servers`:

```toml
[mcp_servers.dbt]
command = "uvx"
args = ["dbt-mcp"]

[mcp_servers.dbt.env]
DBT_MCP_ENABLE_DISCOVERY = "true"
DBT_MCP_ENABLE_EXECUTE_SQL = "true"
DBT_MCP_ENABLE_SEMANTIC_LAYER = "true"
```
Remote MCP (dbt Platform)
For teams using dbt Cloud, the remote server provides hosted access without local installation [4]:

```toml
[mcp_servers.dbt-remote]
url = "https://mcp.cloud.getdbt.com"

[mcp_servers.dbt-remote.env]
DBT_CLOUD_API_KEY = "${DBT_CLOUD_API_KEY}"
DBT_MCP_ENVIRONMENT_ID = "12345"
```
Controlling Exposed Toolsets
Use `DBT_MCP_ENABLE_*` environment variables to restrict which capabilities the agent can access [3]. For model generation workflows, enable discovery and SQL execution; for read-only review, disable execute and limit to discovery and semantic layer access.
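For instance, a read-only review setup might keep discovery and the semantic layer on while disabling execution. A minimal sketch, reusing the local-server `env` table and the variable names shown above:

```toml
# Read-only review: discovery + semantic layer enabled, SQL execution off
[mcp_servers.dbt.env]
DBT_MCP_ENABLE_DISCOVERY = "true"
DBT_MCP_ENABLE_SEMANTIC_LAYER = "true"
DBT_MCP_ENABLE_EXECUTE_SQL = "false"
```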
Installing dbt Agent Skills
dbt Labs released a curated collection of agent skills that encode dbt best practices—naming conventions, materialisation strategy, testing patterns, and macro usage [5]. These transform Codex CLI from a generalist into a data-aware agent:
```shell
# Install dbt agent skills into your project
cd your-dbt-project
git clone https://github.com/dbt-labs/dbt-agent-skills.git .skills/dbt
```
Reference them in your AGENTS.md:
```markdown
## Data Engineering Conventions

This project uses dbt Core 1.11. Follow the conventions in `.skills/dbt/`:

- Model naming: `stg_` → `int_` → `fct_`/`dim_` (see `.skills/dbt/naming-conventions.md`)
- All models require at minimum `unique` and `not_null` tests on primary keys
- Use incremental materialisation for any model processing >1M rows daily
- Sources must be declared in `_sources.yml` before use
- SQL style: trailing commas, lowercase keywords, CTEs over subqueries
```
Generating dbt Models with Codex CLI
With the MCP server providing schema context and agent skills guiding conventions, Codex can generate production-quality dbt models:
```shell
codex "Create a staging model for the raw_orders table in our Snowflake
source. Include schema tests, a freshness check, and document all columns."
```
The agent will:
- Query the MCP server’s discovery tools to inspect the source schema
- Apply naming conventions from the dbt agent skills
- Generate the model SQL, schema YAML, and freshness configuration
- Run `dbt compile` to validate syntax
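As a sketch of what step 3 produces, a hypothetical schema file for the generated model might look like this (model and column names are illustrative; note that in dbt, freshness is configured on the source entry in `_sources.yml` via `loaded_at_field` and `freshness:`, not on the staging model itself):

```yaml
# models/staging/_stg_raw_orders.yml (hypothetical generated output)
version: 2

models:
  - name: stg_raw_orders
    description: "Staged orders from the raw_orders Snowflake source."
    columns:
      - name: order_id
        description: "Primary key, one row per order."
        tests:
          - unique
          - not_null
```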
Batch Model Generation with codex exec
For generating multiple models non-interactively—useful in CI or migration projects:
```bash
#!/bin/bash
# generate-staging-models.sh
SOURCES=$(dbt ls --resource-type source --output json | jq -r '.name')

for source in $SOURCES; do
  codex exec "Generate a staging model for source '${source}'.
    Follow .skills/dbt/naming-conventions.md.
    Include unique and not_null tests on the primary key.
    Output only the model SQL and schema YAML." \
    --approval-mode full-auto \
    --sandbox-permissions read-write
done
```
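One detail worth knowing about the script above: `dbt ls --output json` emits one JSON object per line, which is why the `jq -r '.name'` filter yields a plain list of names. The equivalent parsing in Python, with two sample output lines assumed:

```python
import json

# Two sample lines, as emitted by `dbt ls --resource-type source --output json`
# (one JSON object per line; real output carries more fields than shown)
raw_output = (
    '{"name": "raw_orders", "resource_type": "source"}\n'
    '{"name": "raw_payments", "resource_type": "source"}\n'
)

# Parse each non-empty line and collect the source names
sources = [json.loads(line)["name"] for line in raw_output.splitlines() if line.strip()]
print(sources)  # -> ['raw_orders', 'raw_payments']
```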
Airflow DAG Generation
Apache Airflow 3 introduced the Common AI Provider with native LLM and agent operator support [2]. Codex CLI can generate DAGs that leverage both the new `@task.llm` decorator and traditional dbt operators:
```shell
codex "Create an Airflow 3 DAG that:
1. Runs dbt source freshness checks at 06:00 UTC
2. Executes the staging → intermediate → marts layers in dependency order
3. Triggers data quality checks after each layer completes
4. Sends a Slack alert on any test failure
Use the astronomer-cosmos DbtTaskGroup for dbt execution."
```
AGENTS.md for Airflow Conventions
```markdown
## Airflow DAG Conventions

- Use Airflow 3.x TaskFlow API with type annotations
- dbt execution via astronomer-cosmos `DbtTaskGroup`
- Schedule intervals as cron expressions, never `timedelta`
- All DAGs must have `owner`, `retries: 2`, `retry_delay: 300`
- Data quality gates use the `@task.branch` pattern
- Connections referenced by ID only; never hardcode credentials
```
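A skeleton DAG following these conventions might look like the sketch below. It assumes astronomer-cosmos is installed and importable as `cosmos`; the project and profiles paths, profile names, and the freshness task body are placeholders to adapt:

```python
from datetime import datetime

from airflow.sdk import dag, task  # Airflow 3 TaskFlow imports
from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig

default_args = {"owner": "data-eng", "retries": 2, "retry_delay": 300}

@dag(
    schedule="0 6 * * *",  # cron expression, never timedelta
    start_date=datetime(2026, 1, 1),
    catchup=False,
    default_args=default_args,
)
def dbt_daily_pipeline():
    @task
    def check_source_freshness() -> None:
        # Stub: run `dbt source freshness` here (e.g. via subprocess or a sensor)
        ...

    # Cosmos expands the dbt project into one Airflow task per model,
    # preserving staging -> intermediate -> marts dependency order
    dbt_models = DbtTaskGroup(
        group_id="dbt_models",
        project_config=ProjectConfig("/opt/airflow/dbt/my_project"),  # placeholder path
        profile_config=ProfileConfig(
            profile_name="my_project",  # placeholder profile
            target_name="prod",
            profiles_yml_filepath="/opt/airflow/dbt/profiles.yml",  # placeholder path
        ),
    )

    check_source_freshness() >> dbt_models

dbt_daily_pipeline()
```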
Data Quality Gates in CI
The `codex exec` command enables automated data quality validation as part of your CI pipeline [6]:
```yaml
# .github/workflows/data-quality.yml
name: Data Quality Gate
on:
  pull_request:
    paths: ['models/**', 'tests/**']
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate model changes
        run: |
          codex exec "Review the changed dbt models in this PR. For each:
          1. Check naming conventions match stg_/int_/fct_/dim_ patterns
          2. Verify all primary keys have unique+not_null tests
          3. Check for missing documentation on new columns
          4. Validate no direct table references (must use ref/source)
          5. Report any issues as a structured JSON array" \
            --approval-mode full-auto \
            --sandbox-permissions read-only
```
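Checks 1 and 2 are deterministic enough to enforce without the agent as well; a minimal sketch of the naming check, with the layer prefixes taken from this project's conventions and illustrative file paths:

```python
import re
from pathlib import PurePosixPath

# Layer prefixes from the project conventions: stg_ -> int_ -> fct_/dim_
VALID_PREFIX = re.compile(r"^(stg|int|fct|dim)_")

def check_model_name(path: str) -> bool:
    """Return True if the model file name carries a valid layer prefix."""
    return bool(VALID_PREFIX.match(PurePosixPath(path).stem))

# Hypothetical list of changed model files from the PR diff
changed = ["models/staging/stg_orders.sql", "models/marts/revenue.sql"]
violations = [p for p in changed if not check_model_name(p)]
print(violations)  # -> ['models/marts/revenue.sql']
```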
SQL Review Patterns
Codex CLI can perform warehouse-aware SQL review by combining MCP server access with static analysis:
```shell
codex exec "Using the dbt MCP server, analyse the SQL in models/marts/fct_revenue.sql:
1. Check for full table scans on tables >1B rows
2. Identify missing partition pruning on date columns
3. Flag any cross-database joins
4. Suggest materialisation strategy based on query complexity
Report findings as actionable review comments." \
  --approval-mode full-auto
```
The text_to_sql Pipeline
The dbt MCP server’s `text_to_sql` tool generates SQL from natural language using your project’s semantic layer context [3]. This enables a powerful pattern for ad-hoc data exploration within Codex sessions:
```mermaid
sequenceDiagram
    participant Dev as Developer
    participant Codex as Codex CLI
    participant MCP as dbt MCP Server
    participant SL as Semantic Layer
    participant WH as Warehouse
    Dev->>Codex: "What's our monthly revenue by region?"
    Codex->>MCP: text_to_sql(question)
    MCP->>SL: Resolve metrics + dimensions
    SL-->>MCP: Compiled SQL
    MCP-->>Codex: Generated query
    Codex->>MCP: execute_sql(query)
    MCP->>WH: Run query
    WH-->>MCP: Results
    MCP-->>Codex: Data
    Codex-->>Dev: Formatted answer + SQL
```
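Stubbed out in plain Python, that round trip looks like the following sketch. The tool names `text_to_sql` and `execute_sql` come from the dbt MCP server; everything else here (the canned SQL, the sample row, the client plumbing) is simulated for illustration:

```python
# Simulated MCP round trip; a real session would invoke these tools over MCP.
def text_to_sql(question: str) -> str:
    # The server compiles metrics + dimensions through the semantic layer;
    # a canned query stands in for that compiled SQL here.
    return (
        "select region, date_trunc('month', order_date) as month, "
        "sum(revenue) as revenue from fct_revenue group by 1, 2"
    )

def execute_sql(query: str) -> list[dict]:
    # The server would run the query against the warehouse; canned row here.
    return [{"region": "EMEA", "month": "2026-01", "revenue": 1250000}]

question = "What's our monthly revenue by region?"
query = text_to_sql(question)
rows = execute_sql(query)
print(rows[0])  # -> {'region': 'EMEA', 'month': '2026-01', 'revenue': 1250000}
```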
Practical Workflow: End-to-End Pipeline Creation
Here’s a complete workflow combining all components to build a new data pipeline from scratch:
```shell
# 1. Discover available sources
codex "List all sources in our dbt project that don't have staging models yet.
Use the dbt MCP server discovery tools."

# 2. Generate the full pipeline
codex "For the raw_payments source:
- Create stg_payments with column renaming, type casting, and deduplication
- Create int_payments_pivoted that pivots payment_method into columns
- Create fct_payment_summary with daily aggregations by customer
- Add schema tests, documentation, and freshness checks
- Generate an Airflow DAG that orchestrates the full pipeline
Follow all conventions in AGENTS.md and .skills/dbt/"

# 3. Validate the pipeline
codex exec "Run dbt compile on all new models, execute dbt test --select
stg_payments int_payments_pivoted fct_payment_summary, and report any
failures with suggested fixes." \
  --approval-mode suggest
```
Permission Profiles for Data Engineering
Data engineering work often requires write access to warehouse metadata but should never modify production data directly. Configure a scoped permission profile:
```toml
[profiles.data-engineering]
sandbox = "permissive"

[profiles.data-engineering.permissions]
# Allow dbt commands and SQL compilation
allowed_commands = ["dbt", "uvx", "python"]

# Read warehouse credentials but never write them
file_policy = [
  { path = "~/.dbt/profiles.yml", access = "read" },
  { path = "./models/**", access = "read-write" },
  { path = "./tests/**", access = "read-write" },
  { path = "./dags/**", access = "read-write" },
  { path = "./target/**", access = "read" },
]

# Allow MCP server and warehouse connections
network_policy = [
  { domain = "localhost", ports = [8580], access = "allow" },
  { domain = "*.snowflakecomputing.com", access = "allow" },
]
```
Combining with Existing Codex Patterns
Agentic Pod for Data Teams
For larger data teams, deploy a multi-agent configuration:
- Schema Agent: Monitors source freshness, proposes new staging models
- Quality Agent: Runs data quality checks on every PR via `codex exec`
- Lineage Agent: Validates impact analysis before mart-layer changes merge
ExecPlan for Migration Projects
When migrating legacy ETL (stored procedures, SSIS packages) to dbt, use the ExecPlan pattern [7]:
```markdown
# PLANS.md — Legacy SP Migration

## Objective
Migrate 47 stored procedures from SQL Server to dbt models on Snowflake.

## Phases
1. Inventory: Catalog all SPs with dependencies (complete)
2. MTR: Map each SP to target dbt model with lineage
3. Generate: Use Codex to create initial dbt models from SP logic
4. Validate: Parity testing between SP output and dbt model output
5. Cutover: Swap Airflow DAG references from SP tasks to dbt tasks
```
Key Considerations
Model hallucination risk: Codex may generate plausible-looking SQL that references non-existent columns or applies incorrect join logic. Always validate generated models with `dbt compile` and run `dbt test` before merging [8].
Semantic layer accuracy: The `text_to_sql` tool relies on correctly configured metrics and dimensions. If your semantic layer definitions are incomplete, generated SQL may silently produce incorrect aggregations.
Cost management: Each MCP server call consumes tokens for schema serialisation. For large warehouses (500+ tables), consider enabling only the relevant toolsets via the `DBT_MCP_ENABLE_*` variables to reduce context overhead.
Citations

1. dbt Labs, “Upgrading to v1.10,” dbt Developer Hub, 2026. https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-v1.10
2. Apache Airflow, “Introducing the Common AI Provider: LLM and AI Agent Support for Apache Airflow,” 2026. https://airflow.apache.org/blog/common-ai-provider/
3. dbt Labs, “About dbt Model Context Protocol (MCP) server,” dbt Developer Hub, 2026. https://docs.getdbt.com/docs/dbt-ai/about-mcp
4. dbt Labs, “Set up remote MCP,” dbt Developer Hub, 2026. https://docs.getdbt.com/docs/dbt-ai/setup-remote-mcp
5. dbt Labs, “Make your AI better at data work with dbt’s agent skills,” dbt Developer Blog, 2026. https://docs.getdbt.com/blog/dbt-agent-skills
6. OpenAI, “Non-interactive mode – Codex,” OpenAI Developers, 2026. https://developers.openai.com/codex/noninteractive
7. OpenAI Cookbook, “Using PLANS.md for multi-hour problem solving,” Aaron Friel, October 2025. Referenced via Codex CLI documentation.
8. DataCamp, “Codex CLI For Data Workflow Automation: A Complete Guide,” 2026. https://www.datacamp.com/tutorial/codex-cli-for-data-workflow-automation