Free & Open Source

Auto Skill Improver — Benchmark-Driven Prompt Optimisation for Claude Code & OpenClaw

Inspired by Andrej Karpathy's autoresearch — stop guessing whether your prompts got better. Measure it.

How It Works

  1. Classify: the tool identifies your skill type (coding, research, workflow, etc.)
  2. Benchmark: it builds a test suite that distinguishes success from failure
  3. Mutate: one change at a time, each tested against the benchmark
  4. Keep or discard: only mutations that measurably improve the score survive
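
The four steps above can be read as a simple loop. The sketch below is one possible shape for it, not the tool's actual implementation: `Scorer`, `Mutation`, `improve`, and the toy scoring rule are all hypothetical names invented for illustration, and a real benchmark would run the skill against a model rather than inspect its text.

```typescript
// Hypothetical types: none of these names come from the Auto Skill Improver codebase.
type Scorer = (skillText: string) => number;   // runs the benchmark, returns a score
type Mutation = (skillText: string) => string; // applies exactly one change

interface LoopResult {
  skill: string; // best skill text found
  score: number; // its benchmark score
  kept: number;  // how many mutations survived
}

function improve(skill: string, score: Scorer, mutations: Mutation[]): LoopResult {
  let best = skill;
  let bestScore = score(best); // baseline, measured before any change
  let kept = 0;

  for (const mutate of mutations) {
    const candidate = mutate(best); // one change at a time
    const candidateScore = score(candidate);
    if (candidateScore > bestScore) {
      // Keep only measurable gains.
      best = candidate;
      bestScore = candidateScore;
      kept += 1;
    }
    // Otherwise the candidate is discarded and `best` is unchanged.
  }
  return { skill: best, score: bestScore, kept };
}

// Toy stand-in for a real test suite: rewards two specific instructions.
const toyScore: Scorer = (s) =>
  Number(s.includes("Return JSON")) + Number(s.includes("On failure"));

const result = improve("Summarise the file.", toyScore, [
  (s) => s + " Return JSON.",
  (s) => s + " Be friendly.", // no score change, so it is discarded
  (s) => s + " On failure, report the error.",
]);
```

In this toy run two of the three mutations survive, and the "Be friendly" change is dropped because it sounds better without scoring better.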

Get the Guide File

Enter your email to download auto-skill-improver-quickstart.md and get access to the GitHub repository.

No spam, ever. Your email is stored securely so we can send you updates about new use cases and workflows.

Setup Instructions

1

Choose your platform

Pick Claude Code, Claude Cowork, or OpenClaw — each has a tailored quickstart file.

2

Download the quickstart file

Enter your email to get the platform-specific guide.

3

Upload the file to your AI tool

The tool reads the instructions and installs Auto Skill Improver automatically.

4

Point it at any skill or instruction file

Run: npm run improve -- --skill path/to/your-skill.md

5

Review baseline, run mutations, keep improvements

The tool measures each change against a real benchmark and only keeps gains.

6

Done when the benchmark saturates

When successive mutations stop producing gains, your skill file has reached the ceiling of the current benchmark.

Why Most Prompt Iteration Fails

You change a prompt. It sounds better. You keep it. But nothing measurably improved. Most prompt engineering is editorial — rewriting based on intuition rather than evidence. Auto Skill Improver treats skill iteration as empirical, not editorial.

Prompt Vibes

  • Change wording, hope it helps
  • No baseline — no way to know if things got better
  • Multiple changes at once hide what actually worked
  • Subjective evaluation: 'this feels right'

Prompt Science

  • Establish a measurable baseline before any changes
  • Mutate one variable at a time
  • Run the same benchmark before and after
  • Keep only what scores higher — discard the rest

What Auto Skill Improver Finds

The tool surfaces structural problems that are invisible during manual prompt editing — issues that silently degrade performance across runs.

📝

Ambiguous Output Contracts

Vague success criteria that let the model produce wildly different outputs on each run.
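
To make this concrete, an explicit output contract can be written down and checked mechanically. The sketch below assumes a skill that is supposed to return a JSON object; the `OutputContract` shape and the `violations` helper are hypothetical and only illustrate the idea.

```typescript
// Hypothetical contract shape, invented for illustration.
interface OutputContract {
  requiredKeys: string[]; // keys every output must contain
  maxLength: number;      // upper bound on output length, in characters
}

// Returns a list of human-readable problems; an empty list means the output conforms.
function violations(output: string, contract: OutputContract): string[] {
  const problems: string[] = [];
  if (output.length > contract.maxLength) {
    problems.push(`longer than ${contract.maxLength} characters`);
  }

  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch {
    return [...problems, "not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return [...problems, "not a JSON object"];
  }

  for (const key of contract.requiredKeys) {
    if (!(key in parsed)) problems.push(`missing key "${key}"`);
  }
  return problems;
}

const contract: OutputContract = { requiredKeys: ["summary", "risks"], maxLength: 200 };
const conforming = violations('{"summary":"ok","risks":[]}', contract);
const incomplete = violations('{"summary":"ok"}', contract);
const freeText = violations("looks fine to me", contract);
```

With a contract like this, "wildly different outputs" become countable failures instead of impressions.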

🔄

Missing Fallback Behaviour

No defined recovery path when a tool call fails or returns unexpected data.
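
A defined recovery path might look like the following sketch. It is synchronous to keep the example short (real tool calls would normally be asynchronous), and `ToolResult`, `attempt`, and `callWithFallback` are hypothetical names, not part of any real tool API.

```typescript
// Hypothetical result type: either a usable value or a reason for failure.
type ToolResult<T> = { ok: true; value: T } | { ok: false; reason: string };

// Runs one call and treats both exceptions and invalid data as failures.
function attempt<T>(call: () => T, isValid: (value: T) => boolean): ToolResult<T> {
  try {
    const value = call();
    return isValid(value) ? { ok: true, value } : { ok: false, reason: "unexpected data" };
  } catch (err) {
    return { ok: false, reason: String(err) };
  }
}

// Tries the primary call, then the fallback, and reports both reasons if neither works.
function callWithFallback<T>(
  primary: () => T,
  fallback: () => T,
  isValid: (value: T) => boolean,
): ToolResult<T> {
  const first = attempt(primary, isValid);
  if (first.ok) return first;
  const second = attempt(fallback, isValid);
  if (second.ok) return second;
  return { ok: false, reason: `primary: ${first.reason}; fallback: ${second.reason}` };
}

const flaky = (): string => {
  throw new Error("timeout");
};
const cached = (): string => "cached result";
const nonEmpty = (v: string): boolean => v.length > 0;

const recovered = callWithFallback(flaky, cached, nonEmpty);
const exhausted = callWithFallback(flaky, () => "", nonEmpty);
```

The point is that every failure mode ends in a named outcome, so a benchmark can assert on it.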

Conflicting Instruction Layers

Contradictory directives spread across system prompts, skills, and CLAUDE.md files.

🔗

Dependency & Portability Problems

Hard-coded paths, missing imports, or assumptions that break on different machines.
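
One way to catch the hard-coded path case is a line-by-line scan of the skill file. The patterns below are a rough heuristic chosen for this example (they are assumptions, not the tool's actual rules) and will produce false positives, for instance on URLs that contain `/home/`.

```typescript
// Heuristic patterns, assumed for illustration only.
const ABSOLUTE_PATH_PATTERNS: RegExp[] = [
  /\/(?:Users|home)\/[^\s/'"`]+/, // macOS or Linux home directories
  /[A-Za-z]:\\[^\s'"`]+/,         // Windows drive paths
];

interface PathFinding {
  line: number; // 1-based line number in the skill file
  text: string; // the offending line, trimmed
}

function findHardCodedPaths(skillText: string): PathFinding[] {
  const findings: PathFinding[] = [];
  skillText.split("\n").forEach((text, index) => {
    if (ABSOLUTE_PATH_PATTERNS.some((pattern) => pattern.test(text))) {
      findings.push({ line: index + 1, text: text.trim() });
    }
  });
  return findings;
}

const sampleSkill = [
  "Read config from /Users/alice/project/config.json",
  "Write results to ./out/report.md",
  "Tools live in C:\\tools\\bin",
].join("\n");

const pathFindings = findHardCodedPaths(sampleSkill);
```

Here the first and third lines are flagged, while the relative path on the second line passes.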

📊

Weak Evidence Discipline

Claims without citations, assertions without data, decisions without reasoning chains.

🏗️

Structural Formatting Issues

Inconsistent heading levels, broken markdown, or output that doesn't match the stated format.
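
Inconsistent heading levels are among the easier issues to check mechanically. The sketch below flags Markdown headings that skip a level (for example `##` followed directly by `####`); `headingLevelJumps` is a hypothetical helper and handles only `#`-style headings.

```typescript
// Built with repeat() so this example does not contain a literal fence marker.
const FENCE = "`".repeat(3);

// Returns the 1-based line numbers of headings that skip one or more levels.
function headingLevelJumps(markdown: string): number[] {
  const jumps: number[] = [];
  let previousLevel = 0;
  let insideFence = false;

  markdown.split("\n").forEach((line, index) => {
    if (line.trimStart().startsWith(FENCE)) {
      insideFence = !insideFence; // ignore "#" comments inside code blocks
      return;
    }
    if (insideFence) return;

    const match = /^(#{1,6})\s+\S/.exec(line);
    if (!match) return;

    const level = match[1].length;
    if (previousLevel !== 0 && level > previousLevel + 1) {
      jumps.push(index + 1);
    }
    previousLevel = level;
  });
  return jumps;
}

const jumpLines = headingLevelJumps(
  ["# Title", "## Setup", "#### Deep detail", "## Usage"].join("\n"),
);
```

In the sample, only line 3 is reported, because it jumps from level 2 to level 4.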

The Karpathy-Inspired Method

Andrej Karpathy's autoresearch applies measurement discipline to research iteration. Auto Skill Improver applies the same principle to prompt and skill engineering — a controlled loop where every change is accountable.

1

Classify the Skill Type

The tool analyses your skill file and determines its category — coding assistant, research agent, workflow orchestrator, or something else. Classification informs what benchmarks make sense.
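
The page does not describe how classification is performed, so the following is only a toy illustration of the idea: a keyword count over the skill text. The categories mirror the ones named above; the keyword lists and `classifySkill` are invented for this example, and the substring matching is deliberately naive.

```typescript
type SkillType = "coding" | "research" | "workflow" | "other";

// Assumed keyword lists; a real classifier would need something more robust.
const KEYWORDS: Record<Exclude<SkillType, "other">, string[]> = {
  coding: ["function", "refactor", "test", "compile"],
  research: ["cite", "source", "evidence", "paper"],
  workflow: ["step", "approve", "schedule", "handoff"],
};

function classifySkill(skillText: string): SkillType {
  const lower = skillText.toLowerCase();
  let best: SkillType = "other";
  let bestHits = 0;

  for (const type of Object.keys(KEYWORDS) as Array<keyof typeof KEYWORDS>) {
    const hits = KEYWORDS[type].filter((keyword) => lower.includes(keyword)).length;
    if (hits > bestHits) {
      best = type;
      bestHits = hits;
    }
  }
  return best; // "other" when no keyword matches at all
}

const codingSkill = classifySkill("Refactor the function and run the test suite.");
const researchSkill = classifySkill("Cite every source and weigh the evidence.");
const unknownSkill = classifySkill("Write a poem about autumn.");
```

Whatever the mechanism, the output matters because it decides which kind of benchmark gets built next.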

2

Build a Real Benchmark

Not a vibes check. A structured test suite with pass/fail criteria that exercises the skill's actual capabilities against representative inputs.
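
As a rough sketch, a benchmark of this kind can be modelled as a list of test cases, each with its own pass/fail check, and a score equal to the fraction that pass. `TestCase`, `SkillRunner`, and `runBenchmark` are hypothetical names, and the runner here is a stub; in practice it would invoke the model with the skill loaded.

```typescript
interface TestCase {
  name: string;
  input: string;                       // representative input for the skill
  passes: (output: string) => boolean; // explicit pass/fail criterion
}

// Stand-in for "run the skill on this input and return the output".
type SkillRunner = (skillText: string, input: string) => string;

interface BenchmarkResult {
  score: number;    // fraction of cases passed, between 0 and 1
  failed: string[]; // names of failing cases
}

function runBenchmark(skill: string, cases: TestCase[], run: SkillRunner): BenchmarkResult {
  if (cases.length === 0) throw new Error("benchmark needs at least one test case");
  const failed = cases
    .filter((testCase) => !testCase.passes(run(skill, testCase.input)))
    .map((testCase) => testCase.name);
  return { score: (cases.length - failed.length) / cases.length, failed };
}

// Stub runner: pretends the model emits a bullet only when the skill asks for one.
const stubRunner: SkillRunner = (skill, input) =>
  skill.includes("bullet") ? `- ${input}` : input;

const cases: TestCase[] = [
  { name: "starts with a bullet", input: "ship v2", passes: (o) => o.startsWith("- ") },
  { name: "keeps the input text", input: "ship v2", passes: (o) => o.includes("ship v2") },
];

const baseline = runBenchmark("Summarise the input.", cases, stubRunner);
const improved = runBenchmark("Summarise the input as a bullet.", cases, stubRunner);
```

Running the same cases on the unmodified skill gives the baseline (0.5 here), which is the number every later mutation has to beat.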

3

Establish a Baseline

Run the benchmark on the unmodified skill. Record the score. This is your point of comparison — every mutation is measured against this baseline.

4

Mutate One Thing at a Time

Change a single instruction, add one constraint, remove one ambiguity. Never change multiple variables simultaneously — otherwise you can't attribute improvement.
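
One way to enforce the single-variable rule is to generate candidates that each differ from the current skill by exactly one line. The sketch below covers two kinds of change, adding one constraint or removing one line; `Candidate` and `singleChangeCandidates` are hypothetical names for illustration.

```typescript
interface Candidate {
  description: string; // what changed, for attribution in the results
  text: string;        // the full mutated skill text
}

function singleChangeCandidates(skill: string, additions: string[]): Candidate[] {
  const lines = skill.split("\n");
  const candidates: Candidate[] = [];

  // Variant 1: append exactly one new constraint.
  for (const addition of additions) {
    candidates.push({
      description: `add: ${addition}`,
      text: [...lines, addition].join("\n"),
    });
  }

  // Variant 2: remove exactly one existing non-blank line.
  lines.forEach((line, removeIndex) => {
    if (line.trim() === "") return;
    candidates.push({
      description: `remove: ${line}`,
      text: lines.filter((_, index) => index !== removeIndex).join("\n"),
    });
  });

  return candidates;
}

const candidates = singleChangeCandidates(
  "Summarise the diff.\nBe concise.",
  ["Return JSON only."],
);
```

Because each candidate carries a description of its one change, any score difference can be attributed to that change alone.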

5

Keep Only What Improves

Re-run the benchmark after each mutation. If the score goes up, the change stays. If it doesn't improve — or regresses — it gets discarded. No exceptions.

6

Stop When the Benchmark Saturates

When successive mutations stop producing gains, the skill has reached its current ceiling. Further changes are noise, not signal. Move on or build a harder benchmark.
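
A simple way to operationalise saturation is a patience rule: stop once a set number of consecutive mutations have failed to beat the best score so far. The `patience` parameter and `isSaturated` function are assumptions made for this sketch; the page does not say which stopping rule the tool uses.

```typescript
// Returns true when the last `patience` attempts in a row produced no new best score.
function isSaturated(baseline: number, attemptScores: number[], patience: number): boolean {
  let best = baseline;
  let attemptsSinceGain = 0;

  for (const score of attemptScores) {
    if (score > best) {
      best = score;          // a real gain resets the counter
      attemptsSinceGain = 0;
    } else {
      attemptsSinceGain += 1; // ties and regressions both count as "no gain"
    }
  }
  return attemptsSinceGain >= patience;
}

const plateaued = isSaturated(0.5, [0.6, 0.6, 0.55, 0.6], 3);
const stillClimbing = isSaturated(0.5, [0.6, 0.6, 0.7], 3);
```

A larger `patience` value spends more benchmark runs before giving up; a smaller one stops sooner but may miss a late gain.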

When to Use It — and When Not To

Best For

  • Claude Code skills and CLAUDE.md configurations
  • OpenClaw multi-agent skill files
  • Cowork skills that need measurable improvement
  • Any prompt where you need evidence that changes actually help

Not the Right Fit

  • One-off prompts you'll use once and discard
  • Creative writing where there's no objective success metric
  • Prompts that are already performing at ceiling
  • Situations where you can't define what 'better' means

Frequently Asked Questions

What is Auto Skill Improver?

Auto Skill Improver is an open-source tool that applies benchmark-driven iteration to AI skill files. It classifies your skill, builds a test suite, establishes a baseline score, then systematically mutates one thing at a time, keeping only changes that measurably improve performance. It works with Claude Code skills, OpenClaw agent configurations, and Cowork skill files.

How does it relate to Karpathy's autoresearch?

Andrej Karpathy's autoresearch methodology applies measurement discipline to research iteration: it runs controlled experiments instead of making intuitive changes. Auto Skill Improver applies the same principle to prompt engineering: every change is tested against a benchmark, and only measurable improvements survive. It's the difference between prompt vibes and prompt science.

Stop Guessing. Start Measuring.

Download the quickstart guide, clone the repo, and run your first benchmark-driven improvement loop in under 10 minutes.