From Skill Generation to Skill Qualification: An Overview of Waza and skill-validator

Skill Review Is the New Code Review

We’ve normalized code review. Most teams won’t merge a PR without at least one pair of eyes on it. But here’s what many teams haven’t caught up to yet: today it’s not just code that shapes what ends up in production. It’s the skills, prompts, and instructions that generate the code, review it, and document it. The output is only as good as the instructions behind it.

Skill review can and should become a standard part of the development process, just like code review. And just like code review depends on linters, CI checks, and test suites to work at scale, skill review needs its own tooling. That’s exactly what this post is about.

One Step Back

Before getting into the tools, a quick refresher: what is an Agent Skill and why should you care?

Every time an AI agent, like Claude Code, Copilot, or Cursor, is working on your codebase, it operates within a limited context window. That context can include conversation history, project files, and skills defined in advance. An Agent Skill is a documented package that tells the agent: “when you encounter this type of task, behave this way.” Think of it as an SOP (Standard Operating Procedure) for the agent. Something like “write idiomatic Java,” “add meaningful comments to code,” or “follow our team’s coding conventions.”

The problem is that writing a skill is much easier than validating and testing it. That’s exactly where tools like Waza and skill-validator come in.

Waza: A Unified Platform for the Skill Lifecycle

Waza (技, Japanese for “skill” or “technique”) is a new CLI tool built by Microsoft in Go. Single binary, no extra dependencies, runs on Linux, macOS, and Windows. The goal is to consolidate the entire skill development lifecycle into one tool.

The Problem Waza Solves

Say you’ve just written a new skill. A few questions come up immediately:

Does the agent actually recognize it and use it at the right moment?

Does it behave the same way on GPT and Claude?

Does it manage the token budget well?

How do I connect this to a CI/CD pipeline?

Did I structure the frontmatter correctly, and is the overall skill structure sound?

Before Waza, answering these questions meant juggling separate tools and a lot of manual testing. Waza tries to eliminate that fragmentation.

The Four-Phase Workflow

Scaffold: Generates a spec-compliant structure from the start so you don’t build on a broken foundation.

Develop: Gives you real-time compliance scoring as you write content, so you end up with a properly formed skill.

Test: Runs the skill in a loop against real LLMs, checking that the agent behaves correctly at each step.

Evaluate: Cross-model comparison with comprehensive metrics.

Practical Example: Comparing Two Models

Say you’ve written a skill for generating PostgreSQL migration scripts. You want to see how differently Claude Sonnet 4.6 and GPT-4.4 interact with it:

waza run eval.yaml --model claude-sonnet-4-6 -o claude.json
waza run eval.yaml --model gpt-4-4 -o gpt4.json
waza compare claude.json gpt4.json

Instead of manually testing each model and comparing results in your head, you get structured output that tells you which model aligns better with your skill’s intent.

11 Validator Types

Waza ships with 11 validator types, including simple text matching, Python assertions, JSON Schema validation, and LLM-powered evaluation where another model acts as judge. The most architecturally interesting ones are Action Sequence and Skill Invocation, which let you verify exactly which tools the agent used and in what order.
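To make the Action Sequence idea concrete, here is a minimal Python sketch of an ordered tool-call check. It is not Waza's implementation: the function name, the tool names, and the rule that extra calls may appear between expected steps are all assumptions made for illustration.

```python
from typing import Sequence


def follows_sequence(observed: Sequence[str], expected: Sequence[str]) -> bool:
    """Return True if `expected` appears within `observed` in order (gaps allowed)."""
    remaining = iter(observed)
    # `step in remaining` advances the iterator until it finds `step`,
    # so later steps can only match calls that happened afterwards.
    return all(step in remaining for step in expected)


# Hypothetical tool-call log from one agent run (names are illustrative).
calls = ["read_file", "search", "write_file", "run_tests"]
print(follows_sequence(calls, ["read_file", "write_file", "run_tests"]))  # True
print(follows_sequence(calls, ["write_file", "read_file"]))  # False
```

A stricter variant could require an exact match with no extra calls. Which rule is right depends on how much freedom you want to leave the agent.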

Where the Roadmap Stands

The project is still in active development. E2 (Sensei Engine for compliance scoring) and E3 (Evaluation Framework with statistical analysis) are not complete yet. If you want to adopt Waza today, E1 (the CLI foundation) and E6 (CI/CD integration) are finished and stable.

skill-validator: When Compliance Is Not Enough

skill-validator is a standalone CLI tool, not affiliated with Microsoft and not tied to Waza. Also written in Go. Its job is measuring the health and validity of skills, making sure a skill is both structurally correct and content-sound. For example, frontmatter structure has changed between early and current versions of the spec, and getting it wrong silently breaks things.

On the other hand, a skill can be perfectly spec-compliant and still be a disaster in practice. A skill with broken links or a 60,000-token reference file technically passes spec validation but performs terribly.

This tool goes beyond structural checks and measures content quality too.

What It Checks

Structure: Does SKILL.md exist? Are the frontmatter fields valid? Is the token budget respected? The SKILL.md body shouldn’t exceed 5,000 tokens, and each reference file shouldn’t go past 25,000 tokens. Note that tokens are not words: the same text usually comes out to more tokens than words.
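The structural checks can be approximated in a few lines of Python. This is a rough sketch and not how skill-validator works internally. It assumes that name and description are the required frontmatter fields, parses frontmatter as naive "key: value" lines instead of real YAML, and estimates tokens at about four characters each in place of a real tokenizer.

```python
import re

BODY_TOKEN_LIMIT = 5_000  # SKILL.md body budget quoted above
REQUIRED_FIELDS = ("name", "description")  # assumed required fields


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return (len(text) + 3) // 4


def check_skill(markdown: str) -> list[str]:
    """Return the problems found in a SKILL.md string (an empty list means OK)."""
    match = re.match(r"\A---\n(.*?)\n---\n?(.*)\Z", markdown, re.DOTALL)
    if match is None:
        return ["missing frontmatter block"]
    header, body = match.groups()
    # Naive "key: value" parsing; real frontmatter is YAML and needs a YAML parser.
    fields = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not fields.get(name)]
    if estimate_tokens(body) > BODY_TOKEN_LIMIT:
        problems.append("body exceeds token budget")
    return problems


skill = "---\nname: my-skill\ndescription: Writes idiomatic Java\n---\nPrefer records.\n"
print(check_skill(skill))  # []
```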

Links: Do external URLs actually resolve? Worth noting: HTTP 403 responses are flagged as warnings, not errors, because many sites inspect request headers and block requests that don’t come from a browser. The validator’s requests come from a CLI tool rather than a browser, so a 403 doesn’t necessarily mean the link is dead.
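The 403 rule boils down to a small classification step. The sketch below only illustrates that policy. The function name is hypothetical, and treating redirects as "ok" and every other failure status as an error are assumptions not confirmed by the tool.

```python
def classify_link_status(status: int) -> str:
    """Map an HTTP status code to "ok", "warning", or "error"."""
    if 200 <= status < 400:
        return "ok"  # success or redirect (assumed to count as resolving)
    if status == 403:
        return "warning"  # likely bot-blocking, not necessarily a dead link
    return "error"  # any other status is treated as a broken link here


for code in (200, 403, 404):
    print(code, classify_link_status(code))
```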

Content quality: Several metrics: code-to-text ratio, imperative sentence ratio, instruction specificity (the ratio of directive language like “must/always/never” to advisory language like “may/consider”), and information density.
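As an illustration of instruction specificity, the following sketch counts directive and advisory words and reports the directive share. The word lists contain only the examples quoted above. The exact formula is an assumption: it uses directive / (directive + advisory) so that a text with no advisory words does not divide by zero, and the tool itself may compute the metric differently.

```python
import re

DIRECTIVE = {"must", "always", "never"}  # examples quoted in the text
ADVISORY = {"may", "consider"}  # examples quoted in the text


def instruction_specificity(text: str) -> float:
    """Share of directive words among all directive and advisory words (0.0 to 1.0)."""
    words = re.findall(r"[a-z]+", text.lower())
    directive = sum(word in DIRECTIVE for word in words)
    advisory = sum(word in ADVISORY for word in words)
    total = directive + advisory
    return directive / total if total else 0.0


sample = "You must always run the tests. You may consider caching results."
print(instruction_specificity(sample))  # 0.5
```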

Contamination detection: One of the more interesting features. If a skill written for MongoDB contains Python, JavaScript, and Shell examples side by side, it can confuse the agent. This tool detects that cross-language contamination.
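One simple way to approximate this check is to count the language tags on a skill's fenced code blocks. The sketch below is only a heuristic under stated assumptions: the function names are hypothetical, the threshold of one language is arbitrary, and the real tool's detection logic is not documented here.

```python
from collections import Counter

FENCE = "`" * 3  # built programmatically so the example can sit inside a fence


def code_languages(markdown: str) -> Counter:
    """Count fenced code blocks per declared language tag."""
    counts: Counter = Counter()
    inside = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside:
            inside = False  # this fence closes the current block
        else:
            inside = True  # this fence opens a block; read its tag, if any
            tag = stripped[len(FENCE):].strip().lower()
            if tag:
                counts[tag.split()[0]] += 1
    return counts


def is_contaminated(markdown: str, max_languages: int = 1) -> bool:
    """Flag a skill whose examples span more languages than allowed (assumed threshold)."""
    return len(code_languages(markdown)) > max_languages


sample = "\n".join([
    "Query:", FENCE + "javascript", "db.users.find({})", FENCE,
    "Script:", FENCE + "python", "print('hi')", FENCE,
    "Shell:", FENCE + "shell", "mongosh", FENCE,
])
print(sorted(code_languages(sample)))  # ['javascript', 'python', 'shell']
print(is_contaminated(sample))  # True
```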

LLM-as-judge scoring: An LLM judge (Claude or GPT-4o) scores your skill across 6 dimensions: Clarity, Actionability, Token Efficiency, Scope Discipline, Directive Precision, and Novelty. Research identifies Novelty as the most important one: skills that give the model information it doesn’t already know from training data add the most value to the agent.

Practical Example: Running a Full Check

skill-validator check --strict --emit-annotations ./my-skill/

Output looks something like this:

Validating skill: my-skill/

Structure
  ✓ SKILL.md found

Frontmatter
  ✓ name: "my-skill" (valid)
  ✓ description: (54 chars)

Tokens
  SKILL.md body:        1,250 tokens
  references/guide.md:    820 tokens
  ─────────────────────────────────────
  Total:                2,070 tokens

Result: passed

CI/CD Integration

A complete GitHub Actions workflow looks like this:

name: Validate Skills
on:
  pull_request:
    paths:
      - "skills/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install skill-validator
        run: brew install agent-ecosystem/tap/skill-validator
      - name: Validate skills
        run: |
          skill-validator check --strict --emit-annotations skills/
          skill-validator check --strict -o markdown skills/ >> "$GITHUB_STEP_SUMMARY"

Important: --strict means both errors and warnings will fail the pipeline. For skills still in draft, remove --strict so warnings stay non-blocking.
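The strict rule can be stated as a one-line predicate. This is a sketch of the policy described above, not the tool's code, and the function and parameter names are hypothetical.

```python
def pipeline_fails(errors: int, warnings: int, strict: bool) -> bool:
    """Errors always fail the run; warnings fail it only in strict mode."""
    return errors > 0 or (strict and warnings > 0)


print(pipeline_fails(errors=0, warnings=2, strict=True))   # True
print(pipeline_fails(errors=0, warnings=2, strict=False))  # False
```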

When to Use Which

These two tools are not competing. They complement each other.

skill-validator is built for validation and quality checks. It’s fast, works without an API key (except for the LLM scoring part), and works great as a pre-commit hook. If you want to make sure your skill is structurally and content-wise sound, this is where you start.

Waza is built for execution, evaluation, and cross-model comparison. If you want to know how a skill actually behaves alongside a live agent, or how GPT and Claude handle it differently, that’s Waza’s territory.

In a mature pipeline, both have a place: skill-validator at pre-commit and PR validation, Waza at regression testing and benchmark tracking.

Closing Thoughts

If you’re serious about integrating skills into your development process, which is increasingly just table stakes, this toolchain is directly relevant. A bad skill is like bad code, except its impact can be much broader.

There’s a bigger picture here too. We’re entering an era where agent skills are becoming independent artifacts, versioned and consumed like packages or libraries. Just as we have linters and test suites for npm packages, we need equivalent infrastructure for skills. Waza and skill-validator are the first generation of that infrastructure.