skill-up is a CLI evaluation framework for Agent Skill developers. Declare your eval environment, dependencies, test cases, and grading strategy in evals/eval.yaml and evals/cases/*.yaml, then run evaluations locally or in CI to generate structured reports.
Warning
This project is at an early stage of development. The code is not yet fully stable, and some CLI commands, configuration fields, and public APIs may change in future releases. Review the CHANGELOG and verify compatibility before using it in production.
- Declarative Eval Config: Define evaluation environment, engine, model, and cases through YAML (`eval.yaml` + `cases/*.yaml`).
- Multi-Engine Support: Works with Qoder CLI, Claude Code, and Codex as Agent Engines.
- Flexible Judging: Supports `rule_based`, `script`, and `agent_judge` evaluation strategies.
- Structured Reports: Outputs Anthropic-compatible `grading.json`, `benchmark.json`, `benchmark.md`, plus `result.json`, JUnit XML, and HTML reports.
- Anthropic Compatible: Import `evals.json` via `skill-up import`, or auto-detect with `--auto`.
- CI-Ready: Designed for local development and continuous integration pipelines.
The official Agent Skills evaluation guide describes the right evaluation loop: write realistic cases, run with and without the Skill, grade outputs, aggregate results, and iterate. skill-up turns that workflow into a reusable CLI:
- Replaces ad hoc run folders with a declarative `eval.yaml` + `cases/*.yaml` format.
- Automates workspace setup, Skill installation, Agent Engine invocation, judging, and report generation.
- Supports multiple engines (`claude_code`, `codex`, `qodercli`) instead of tying the workflow to one client.
- Keeps compatibility with Anthropic-style `evals.json` while adding richer judges, CI-friendly commands, and structured reports.
For the best experience, use skill-upper — the Agent Skill shipped in this repository. It lets you ask an AI agent to scaffold, validate, run, and explain evals instead of hand-writing every YAML file first.
Recommended: install it with the skills CLI:

    # Codex, global install
    npx skills add https://github.com/alibaba/skill-up/tree/main/skills/skill-upper -g -a codex -y
    # Claude Code, global install
    npx skills add https://github.com/alibaba/skill-up/tree/main/skills/skill-upper -g -a claude-code -y

You do not need to install skill-up before installing this Skill. skill-upper checks whether the `skill-up` command is available when it runs and guides the agent through installation if it is missing.
Open the target Skill project in your AI agent. The target project should have this shape:

    my-skill/
      SKILL.md

Then ask the agent something concrete:

    Use skill-upper to add evals for this Skill.

    Add this evaluation case:
    - Input: write a hello world program.
    - Evaluation: check that the output contains hello and world.

    After that run skill-up to validate and run.

The agent should create files like:

    my-skill/
      SKILL.md
      evals/
        eval.yaml
        cases/
          basic.yaml
    my-skill-workspace/
      iteration-1/
        result.json

When evals/eval.yaml lives under a directory containing SKILL.md,
skill-up automatically installs that local Skill for the run, so you usually
do not need to list the Skill path manually in eval.yaml.
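
The auto-install rule depends on where `evals/eval.yaml` sits relative to `SKILL.md`. As a rough way to check which directory would count as the Skill root, the sketch below walks upward from a starting directory until it finds `SKILL.md`. The function name `find_skill_root` is hypothetical, and the upward search is an assumption about the behavior described here, not skill-up's actual implementation.

```shell
# Hypothetical helper: print the nearest ancestor directory (including the
# starting directory itself) that contains SKILL.md.
# Returns 1 if no such directory exists up to the filesystem root.
find_skill_root() {
  dir=$(cd "${1:-.}" && pwd) || return 1
  while :; do
    if [ -f "$dir/SKILL.md" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    # Stop once the filesystem root has been checked.
    [ "$dir" = "/" ] && return 1
    dir=$(dirname "$dir")
  done
}
```

Running `find_skill_root evals` from inside `my-skill/` would print the absolute path of `my-skill/` when `SKILL.md` is present there.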
Install with the script:

    curl -fsSL https://raw.githubusercontent.com/alibaba/skill-up/main/install.sh | bash

The installer downloads the matching binary from GitHub Releases.

To build locally from a checkout, install Go 1.25 or later:

    make build
    # or
    go build -o bin/skill-up ./cmd/skill-up

In your Skill directory, create `evals/eval.yaml`:

    schema_version: v1alpha1
    environment:
      type: none
    engine:
      name: claude_code
    cases:
      files:
        - evals/cases/hello-world.yaml

When `evals/eval.yaml` lives under a directory that contains `SKILL.md`, skill-up installs the current Skill automatically. The omitted fields use defaults: JSON report output, `timeout_seconds: 300`, `max_turns: 10`, and `parallelism: 1`.
For the full eval.yaml schema, see Writing Evals.
Create `evals/cases/hello-world.yaml`:

    input:
      prompt: |
        Please generate a Hello World program
    expect:
      must_contain:
        - "Hello"
        - "World"

The case id defaults to the filename (`hello-world`). Add a `judge` block only when you need script-based or agent-based grading.
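
To make the `expect.must_contain` check concrete, the sketch below approximates it as a plain substring test over the agent's output: the check passes only if every listed string appears. The function name is hypothetical, and the exact matching rules skill-up applies (case sensitivity, for example) are not stated here, so treat this as an illustration rather than the actual judge.

```shell
# Hypothetical approximation of a must_contain check.
# Usage: must_contain "<output text>" "<needle>" ["<needle>" ...]
# Succeeds only if every needle occurs in the output as a literal substring.
must_contain() {
  output=$1
  shift
  for needle in "$@"; do
    # grep -F treats the needle as a fixed string, not a regular expression.
    if ! printf '%s\n' "$output" | grep -qF -- "$needle"; then
      printf 'missing: %s\n' "$needle" >&2
      return 1
    fi
  done
  return 0
}
```

For the case above, `must_contain "$agent_output" "Hello" "World"` would succeed only when both strings are present in `$agent_output` (a placeholder variable for the captured output).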

    skill-up validate

This step is optional, but useful before the first run: it checks `eval.yaml` and all referenced case files without starting an Agent Engine.

    skill-up run

Results are written to `<skill-name>-workspace/iteration-1/`.
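
In CI, the validate and run commands can be chained so that a config error stops the job before any Agent Engine is started. The sketch below is one way to do that. The function name `run_evals` is hypothetical, and it assumes that `skill-up validate` and `skill-up run` exit with a non-zero status on failure, which this README does not state.

```shell
# Hypothetical CI wrapper: validate first, then run the evaluation.
# Assumes both commands signal failure through a non-zero exit status.
run_evals() {
  skill_dir=${1:-.}
  (
    # The subshell keeps the directory change local to this function.
    cd "$skill_dir" || exit 1
    skill-up validate || exit 1
    skill-up run
  )
}
```

A pipeline step could then call `run_evals path/to/my-skill` and rely on its exit status.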
For engineering conventions (Conventional Commits, Git hooks, golangci-lint), see CONTRIBUTING.md.
skill-up auto-loads an optional user-level config that supplies default OpenTelemetry env vars and per-environment runtime kwargs. The embedded defaults are empty; downstream consumers maintain their own config file.

    embed (empty) < user (~/.config/skill-up/config.yaml) < project ($PWD/.skill-up.yaml) < explicit (--config)

| Source | Path |
|---|---|
| embed | empty `Config{}`, no vendor defaults baked in |
| user | `$SKILL_UP_CONFIG`, else `$XDG_CONFIG_HOME/skill-up/config.yaml`, else `~/.config/skill-up/config.yaml` |
| project | `$PWD/.skill-up.yaml` |
| explicit | `--config <path>` (must exist) |

Missing files at the user and project layers are silently skipped; a missing --config path is a hard error. A corrupt config at any layer also fails the run.
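
The user-layer lookup order can be reproduced in a few lines of shell, which is handy when checking which file a given machine would load. This is a sketch of the documented order only. The function name `skill_up_user_config` is hypothetical, and treating an empty variable the same as an unset one is an assumption.

```shell
# Hypothetical helper mirroring the documented user-config lookup order:
#   $SKILL_UP_CONFIG, else $XDG_CONFIG_HOME/skill-up/config.yaml,
#   else ~/.config/skill-up/config.yaml.
# Assumption: an empty variable is treated the same as an unset one.
skill_up_user_config() {
  if [ -n "${SKILL_UP_CONFIG:-}" ]; then
    printf '%s\n' "$SKILL_UP_CONFIG"
  elif [ -n "${XDG_CONFIG_HOME:-}" ]; then
    printf '%s\n' "$XDG_CONFIG_HOME/skill-up/config.yaml"
  else
    printf '%s\n' "$HOME/.config/skill-up/config.yaml"
  fi
}
```

The function only prints the candidate path; it does not check that the file exists, which matches the rule that a missing user-layer file is skipped silently.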

    skill-up init                            # writes a template to ~/.config/skill-up/config.yaml (XDG-aware)
    skill-up init --local                    # writes a template to $PWD/.skill-up.yaml
    skill-up init --print                    # prints the template to stdout
    skill-up init --force                    # overwrite an existing file
    skill-up init --config foo.yaml          # reads foo.yaml, writes it to ~/.config/skill-up/config.yaml
    skill-up init --config foo.yaml --local  # reads foo.yaml, writes it to $PWD/.skill-up.yaml

With `--config <path>`, init reads that file (validating it as a skill-up config) and writes its raw bytes to the target, so comments and formatting are preserved. Without `--config`, init writes a commented YAML template.

    schema_version: v1alpha1
    kind: SkillUpConfig
    telemetry:
      service_name: skill-up                # OTEL_SERVICE_NAME
      traces_exporter: otlp                 # OTEL_TRACES_EXPORTER
      traces:
        endpoint: http://localhost:4317     # OTEL_EXPORTER_OTLP_TRACES_ENDPOINT (4317 for grpc, 4318/v1/traces for http/protobuf)
        protocol: grpc                      # OTEL_EXPORTER_OTLP_TRACES_PROTOCOL (grpc | http/protobuf); skill-up defaults to grpc
      resource_attributes:                  # serialized into OTEL_RESOURCE_ATTRIBUTES
        deployment.environment: local
      verbose: false                        # if true, also enables OTEL_LOG_* payload capture
    env:                                    # arbitrary defaults, applied only-if-unset
      OTEL_EXPORTER_OTLP_HEADERS: authorization=${OTLP_TOKEN}
    runtime_kwargs:                         # keyed by environment.type
      opensandbox:
        base_url: http://localhost:8080
        # extensions: '{}'

For environment variables: any value already set in the process environment wins; the config only fills in missing keys.
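
The only-if-unset rule for environment variables behaves like the following shell sketch: a default is exported only when the variable is absent from the environment. The helper name `default_env` is hypothetical, and whether skill-up treats an empty-but-set variable as set is an assumption here.

```shell
# Hypothetical helper: export NAME=VALUE only if NAME is not already present
# in the environment.
# Assumption: a variable exported with an empty value still counts as set.
default_env() {
  name=$1
  value=$2
  if ! printenv "$name" >/dev/null 2>&1; then
    export "$name=$value"
  fi
}
```

With this helper, `default_env OTEL_SERVICE_NAME skill-up` leaves an existing `OTEL_SERVICE_NAME` untouched and fills it in otherwise.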
For runtime_kwargs: explicit --runtime-kwarg on run > eval.yaml environment.kwargs > user-config runtime_kwargs[environment.type].
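
The `runtime_kwargs` precedence amounts to taking the first source that defines a key. The sketch below resolves a single key that way. The function name and the convention of passing an empty string for "not defined at this layer" are illustrative assumptions, not skill-up internals.

```shell
# Hypothetical resolver for one runtime kwarg.
# Arguments, highest precedence first:
#   $1  value from --runtime-kwarg on `run`
#   $2  value from eval.yaml environment.kwargs
#   $3  value from user-config runtime_kwargs[environment.type]
# An empty argument means "not defined at this layer" (an assumption).
resolve_runtime_kwarg() {
  for candidate in "$1" "$2" "$3"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  # No layer defined the key.
  return 1
}
```

For example, `resolve_runtime_kwarg "" "" "http://localhost:8080"` falls through to the user-config value because neither the command line nor `eval.yaml` supplies one.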
Prefer `${ENV_VAR}` references inside the config file over hard-coded secret literals. The redaction mechanism (`userconfig.Redact`) masks fields tagged `secret:"true"` when printing; currently no Config field carries the tag, but the mechanism is in place for future fields.
Use `skill-up import` to migrate an Anthropic-compatible `evals.json` into the YAML layout used by this repo:

    skill-up import ./evals/evals.json --output ./evals

| Command | Description |
|---|---|
| `skill-up run [path]` | Run evaluation cases and produce reports |
| `skill-up validate [path]` | Validate `eval.yaml` and case files |
| `skill-up list-cases [path]` | List all cases referenced by the config |
| `skill-up report <result.json>` | Generate reports from a previous run |
| `skill-up import <evals.json>` | Import Anthropic `evals.json` to YAML cases |
| `skill-up debug judge <input.json>` | Debug judge module with a JSON input |
| `skill-up debug report <input.json>` | Debug report module with a JSON input |

Apache License 2.0 — see LICENSE.
