Sketchnote diagram for: Tessl Skill Evaluation Framework: Treating Agent Skills as Production Software

Tessl Skill Evaluation Framework: Treating Agent Skills as Production Software

You have written a skill for your coding agent. It looks right. It seems to work when you try it. But does it actually change the agent’s behaviour, or would the base model have done just as well without it? Tessl exists to answer that question with data rather than intuition.

What Tessl Is

Tessl is an agent enablement platform that treats skills, the structured instruction files that guide AI coding agents, as first-class software artefacts with a proper lifecycle¹. Where most developers write a SKILL.md, copy it into their repository, and hope for the best, Tessl provides the infrastructure to build, evaluate, distribute, and maintain skills over time.

The platform has three layers:

A CLI (npm i -g @tessl/cli) for creating, linting, evaluating, and publishing skills².
A registry containing over 2,000 evaluated skills, searchable and installable like npm packages¹.
An evaluation engine that measures whether a skill genuinely improves agent performance on real tasks.

Tessl supports multiple agents out of the box. Running tessl init lets you configure for Claude Code, Cursor, Gemini, Codex CLI, OpenClaw, Copilot, or any agent that reads the Agent Skills specification².

The Evaluation Problem

Traditional software testing assumes determinism: the same input produces the same output. AI agents break this assumption in two ways³:

Model volatility. Providers continuously release new model versions that change agent behaviour without any configuration change on your part.
Inference stochasticity. Even identical tasks produce varying results due to temperature, sampling, and context window variations.

‘It worked when I tried it’ provides almost no validation signal for non-deterministic systems³. A skill that appears to help today might become redundant after a model update, or worse, actively contradict the model’s improved capabilities. You need a repeatable, statistical evaluation methodology.

Two Evaluation Methods

Tessl provides two complementary approaches to skill evaluation⁴.

Review Evaluations

A review evaluation tests the structure of your skill against Anthropic’s best practices for writing agent instructions. It checks whether your skill is well-formed, whether models and agents can parse it correctly, and whether the instructions are clear enough to influence behaviour reliably⁴.

Run a review with:

tessl skill review ./my-skill

The review score runs on every publish, before anyone in the registry sees your skill. Think of it as a linter for instruction quality.

For iterative improvement, add the --optimize flag:

tessl skill review ./my-skill --optimize --max-iterations 5

This runs up to five improvement loops, each refining the skill file against the specification².

Task Evaluations

Task evaluations measure whether a skill changes agent behaviour in practice. The methodology is straightforward⁵:

Tessl analyses your skill and generates evaluation scenarios automatically.
For each scenario, the agent attempts the task without the skill (baseline run).
The same agent attempts the same task with the skill injected.
Tessl compares the two results and scores the gap.

The gap between baseline and with-skill scores tells you what you need to know⁵:

Gap	Interpretation
Large positive	The skill effectively steers agent behaviour
Small positive	The skill helps, but the agent already performs the task reasonably well
Near zero	The skill has no measurable effect
Negative	The skill’s instructions may contradict the model’s training

Scenario-Based Evaluation in Practice

Scenarios live in an evals/ directory alongside your skill⁶:

my-skill/
  SKILL.md
  tile.json
  evals/
    instructions.json
    scenario-1/
      task.md
      criteria.json
      capability.txt
    scenario-2/
      task.md
      criteria.json
      capability.txt
    summary.json

Each scenario has three files⁶:

task.md contains the task brief presented to the agent, potentially including sample data.
capability.txt declares which skill capability the scenario tests.
criteria.json defines the scoring rubric, the specific metrics that determine success.

You can generate scenarios in three ways:

# Automatic generation from your skill
tessl scenario generate ./my-skill --count=5

# From specific git commits (for codebase-aware scenarios)
tessl scenario generate ./my-skill --commits abc123,def456

# From pull requests
tessl scenario generate ./my-skill --prs 42,43

Then run the evaluation:

tessl eval run ./my-skill --agent claude:claude-sonnet-4-6 --label "v2 comparison"

View results with:

tessl eval view --last
tessl eval compare --breakdown

Harbor: The Open-Source Foundation

Underneath Tessl’s evaluation engine sits Harbor, an open-source framework for evaluating AI agents in containerised environments³. Harbor emerged from Terminal-Bench, the collaborative benchmark developed by Laude Institute and Stanford researchers.

Harbor uses statistical evaluation rather than binary pass/fail³:

Runs tasks multiple times across configurations.
Computes pass rates over N trials instead of relying on single runs.
Reports mean performance and variance.
Enables hypothesis testing for configuration changes.

This statistical approach is essential because a single evaluation run tells you almost nothing about a non-deterministic system. You need repeated trials to separate genuine skill impact from random variation.

Harbor serves three functions within the broader ecosystem³:

Evaluation gates. Pre-deployment validation, similar to CI/CD testing for traditional software.
Regression detection. Identifying whether prompt or configuration changes improve or degrade performance over time.
Training signals. Enabling Reinforcement Learning from Verifiable Rewards, where successful task trajectories feed back into model improvement.

Skill Creator V2: The Authoring Side

While Tessl focuses on the evaluation and distribution lifecycle, Anthropic’s Skill Creator V2 provides complementary tooling for the authoring phase⁷. The two systems address different parts of the same problem.

Skill Creator V2 introduces several capabilities relevant to evaluation⁷:

Parallel sub-agent execution. Independent agents execute evaluations in parallel, each operating in isolated contexts with separate token and timing metrics. This eliminates context bleeding between test runs.

Comparator agents. A/B comparison functionality tests two skill versions, or skill versus baseline, using comparator agents that evaluate outputs without knowing which variant they are assessing. This ensures unbiased quality judgement⁷.

Benchmark mode. A standardised assessment tracks three metrics: evaluation pass rate, elapsed execution time, and token consumption. Teams can deploy benchmarks after model updates or during iterative skill refinement.

In one documented example, the baseline skill scored 81 on evaluation metrics while the improved version achieved 97.5, a 15.7 per cent overall improvement across test cases⁸. Two of four evaluations showed measurable gains while two tied with the baseline.

Description optimisation. Skill Creator V2 analyses skill descriptions against sample prompts and suggests refinements to reduce both false positives and false negatives in skill triggering. Testing across six public document-creation skills showed triggering improvements in five of them⁷.

The Cost Question

A recurring concern in the community is whether the token cost of evaluation is justified. Running parallel executions with comparator agents can cost between $12 and $15 per evaluation cycle, depending on the model and number of scenarios⁸.

The answer depends on what the skill does:

Skill type	Evaluation spend	Justification
Personal formatting helper	Probably not worth it	The skill is cheap to rewrite if it breaks
Team-shared code review skill	Worth a single evaluation run	Affects multiple developers daily
Enterprise workflow automation	Worth regular scheduled evaluation	A Jira-to-deployment pipeline that saves 100 engineer-hours per month easily justifies $15 in evaluation tokens
Registry-published skill	Essential	You owe it to consumers to prove the skill works

Tessl’s review evaluations are free and run automatically on publish. Task evaluations consume agent tokens but provide the only reliable signal of real-world effectiveness.

What This Means for Codex CLI Users

Codex CLI’s skill system uses AGENTS.md with a three-tier knowledge architecture: constitution files for project-wide rules, specialist files for domain-specific guidance, and skill files for reusable task patterns. Tessl integrates with Codex CLI through tessl init --agent codex².

The practical workflow for a Codex CLI user looks like this:

Write your skill as a SKILL.md following the Agent Skills specification.
Lint with tessl skill lint to catch structural issues.
Review with tessl skill review --optimize to improve instruction quality.
Generate scenarios with tessl scenario generate to create test cases.
Evaluate with tessl eval run to measure real behavioural impact.
Publish with tessl skill publish to share on the registry.
Monitor with scheduled evaluation runs after model updates.

This lifecycle mirrors what Daniel has described as the aspiration to ‘treat skills like a production artefact.’ Tessl provides the missing infrastructure to make that practical.

The Durability Problem

Skills degrade. Models improve. A skill that provided a 30 per cent boost six months ago may now contradict the model’s native capabilities. Tessl’s scheduled evaluation pattern addresses this: run tessl eval run on a weekly cadence and compare scores against the historical baseline with tessl eval compare⁶.

When a skill’s with-skill score drops to match or fall below the baseline, you know it is time to retire or rewrite the skill rather than letting it silently degrade your agent’s performance.

This is the difference between treating skills as static configuration files and treating them as living software with a maintenance lifecycle.

Summary

Tessl is a package manager and evaluation platform for AI agent skills, with a CLI, registry, and evaluation engine.
Review evaluations check skill structure against best practices. Task evaluations measure real behavioural impact by comparing agent performance with and without the skill.
Harbor provides the open-source statistical evaluation foundation, running multiple trials to account for inference stochasticity.
Skill Creator V2 complements Tessl with parallel sub-agent execution, comparator agents, and benchmark mode for the authoring phase.
Codex CLI users can integrate Tessl into their skill lifecycle with tessl init --agent codex.
The cost of evaluation ($12 to $15 per cycle) is justified for team-shared and enterprise skills, but overkill for simple personal helpers.
Skills degrade over time as models improve. Scheduled evaluation runs detect this before it causes silent performance loss.