Auto Skill Improver — Benchmark-Driven Prompt Optimisation for Claude Code & OpenClaw
Inspired by Andrej Karpathy's autoresearch — stop guessing whether your prompts got better. Measure it.
Choose Your Platform
Auto Skill Improver works across the Claude ecosystem. Choose the version tailored to your workflow.
For Claude Code
Benchmark and improve CLAUDE.md files, project instructions, and coding skill configurations.
For Claude Cowork
Test and optimise Cowork skill files, team-facing workflows, and project instructions.
For OpenClaw
Improve SKILL.md files, agent persona configurations, and multi-agent coordination skills.
How It Works
1. Classify: the tool identifies your skill type (coding, research, workflow, etc.)
2. Benchmark: it builds a test suite that distinguishes success from failure
3. Mutate: one change at a time, each tested against the benchmark
4. Keep or discard: only mutations that measurably improve the score survive
Get the Guide File
Enter your email to download auto-skill-improver-quickstart.md and get access to the GitHub repository.
No spam, ever. Your email is stored securely so we can send you updates about new use cases and workflows.
Setup Instructions
Choose your platform above
Pick Claude Code, Claude Cowork, or OpenClaw — each has a tailored quickstart file.
Download the quickstart file
Enter your email to get the platform-specific guide.
Upload the file to your AI tool
Your AI tool reads the instructions and installs Auto Skill Improver automatically.
Point it at any skill or instruction file
Run: npm run improve -- --skill path/to/your-skill.md
Review baseline, run mutations, keep improvements
The tool measures each change against a real benchmark and only keeps gains.
Done when the benchmark saturates
When successive mutations stop producing gains, your skill file has reached the ceiling of what the current benchmark can measure.
Why Most Prompt Iteration Fails
You change a prompt. It sounds better. You keep it. But nothing measurably improved. Most prompt engineering is editorial — rewriting based on intuition rather than evidence. Auto Skill Improver treats skill iteration as empirical, not editorial.
Prompt Vibes
- ✗ Change wording, hope it helps
- ✗ No baseline, so no way to know if things got better
- ✗ Multiple changes at once hide what actually worked
- ✗ Subjective evaluation: 'this feels right'
Prompt Science
- ✓ Establish a measurable baseline before any changes
- ✓ Mutate one variable at a time
- ✓ Run the same benchmark before and after
- ✓ Keep only what scores higher and discard the rest
What Auto Skill Improver Finds
The tool surfaces structural problems that are invisible during manual prompt editing — issues that silently degrade performance across runs.
Ambiguous Output Contracts
Vague success criteria that let the model produce wildly different outputs on each run.
Missing Fallback Behaviour
No defined recovery path when a tool call fails or returns unexpected data.
Conflicting Instruction Layers
Contradictory directives spread across system prompts, skills, and CLAUDE.md files.
Dependency & Portability Problems
Hard-coded paths, missing imports, or assumptions that break on different machines.
Weak Evidence Discipline
Claims without citations, assertions without data, decisions without reasoning chains.
Structural Formatting Issues
Inconsistent heading levels, broken markdown, or output that doesn't match the stated format.
The Karpathy-Inspired Method
Andrej Karpathy's autoresearch applies measurement discipline to research iteration. Auto Skill Improver applies the same principle to prompt and skill engineering — a controlled loop where every change is accountable.
Classify the Skill Type
The tool analyses your skill file and determines its category — coding assistant, research agent, workflow orchestrator, or something else. Classification informs what benchmarks make sense.
Build a Real Benchmark
Not a vibes check. A structured test suite with pass/fail criteria that exercises the skill's actual capabilities against representative inputs.
Establish a Baseline
Run the benchmark on the unmodified skill. Record the score. This is your point of comparison — every mutation is measured against this baseline.
Mutate One Thing at a Time
Change a single instruction, add one constraint, remove one ambiguity. Never change multiple variables simultaneously — otherwise you can't attribute improvement.
Keep Only What Improves
Re-run the benchmark after each mutation. If the score goes up, the change stays. If it doesn't improve — or regresses — it gets discarded. No exceptions.
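The keep-or-discard rule can be sketched in a few lines of Python. This is a simplified illustration, not the tool's real code: `benchmark` is a deterministic toy stand-in that rewards three hypothetical rules, and `improve` and the sample mutations are invented for the example. A real benchmark would run the skill against test cases and would likely be noisier.

```python
from typing import Callable

Mutation = Callable[[str], str]


def benchmark(skill_text: str) -> float:
    """Toy stand-in for a real benchmark run: rewards explicit rules."""
    required = ["output format:", "on tool failure:", "cite sources"]
    text = skill_text.lower()
    return sum(rule in text for rule in required) / len(required)


def improve(skill_text: str, mutations: list[Mutation]) -> tuple[str, float, list[bool]]:
    """Apply mutations one at a time, keeping only those that raise the score."""
    best_text = skill_text
    best_score = benchmark(best_text)  # baseline on the unmodified skill
    kept: list[bool] = []
    for mutate in mutations:
        candidate = mutate(best_text)  # exactly one change per trial
        candidate_score = benchmark(candidate)
        if candidate_score > best_score:  # ties and regressions are discarded
            best_text, best_score = candidate, candidate_score
            kept.append(True)
        else:
            kept.append(False)
    return best_text, best_score, kept


mutations = [
    lambda t: t + "\nOutput format: JSON with keys `summary` and `sources`.",
    lambda t: t + "\nBe really, really helpful.",  # sounds nice, scores the same
    lambda t: t + "\nOn tool failure: retry once, then report the error.",
]
final_text, final_score, kept = improve("You are a research agent.", mutations)
print(kept)  # [True, False, True]
```

The second mutation reads well but does not move the score, so it never reaches the final skill text. The strict `>` comparison is what enforces "no exceptions".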
Stop When the Benchmark Saturates
When successive mutations stop producing gains, the skill has reached its current ceiling. Further changes are noise, not signal. Move on or build a harder benchmark.
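One simple way to decide that a benchmark has saturated is a patience rule: stop once a fixed number of consecutive mutations have been discarded. The Python sketch below assumes that rule. The function name `saturated` and the default `patience` of 3 are illustrative choices, not values taken from the tool.

```python
def saturated(kept_history: list[bool], patience: int = 3) -> bool:
    """Return True when the last `patience` mutations were all discarded.

    `kept_history` holds one flag per tested mutation, in order:
    True if the mutation improved the score and was kept, False otherwise.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")
    if len(kept_history) < patience:
        return False  # not enough evidence yet
    return not any(kept_history[-patience:])


print(saturated([True, False, False, False]))  # True: three misses in a row
print(saturated([False, False, True]))         # False: the latest mutation helped
```

A larger `patience` spends more benchmark runs before stopping but is less likely to quit just before a useful mutation.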
When to Use It — and When Not To
Best For
- Claude Code skills and CLAUDE.md configurations
- OpenClaw multi-agent skill files
- Cowork skills that need measurable improvement
- Any prompt where you need evidence that changes actually help
Not the Right Fit
- One-off prompts you'll use once and discard
- Creative writing where there's no objective success metric
- Prompts that are already performing at ceiling
- Situations where you can't define what 'better' means
Frequently Asked Questions
Auto Skill Improver is an open-source tool that applies benchmark-driven iteration to AI skill files. It classifies your skill, builds a test suite, establishes a baseline score, then systematically mutates one thing at a time — keeping only changes that measurably improve performance. It works with Claude Code skills, OpenClaw agent configurations, and Cowork skill files.
Andrej Karpathy's autoresearch methodology applies measurement discipline to research iteration — running controlled experiments instead of making intuitive changes. Auto Skill Improver applies the same principle to prompt engineering: every change is tested against a benchmark, and only measurable improvements survive. It's the difference between prompt vibes and prompt science.
Any skill file that Claude Code, Cowork, or OpenClaw uses — coding assistants, research agents, workflow orchestrators, code reviewers, and more. The tool classifies the skill type automatically and builds appropriate benchmarks for that category.
A benchmark is a structured test suite with defined inputs and pass/fail criteria. For a coding skill, this might be 'generate a function that passes these unit tests'. For a research skill, it might be 'find and correctly cite three sources on this topic'. The benchmark runs the same tests before and after each change, producing a numerical score that makes improvement objective.
The tool is designed for Claude Code, Cowork, and OpenClaw skill files — structured prompt files that define agent behaviour. It's not a general-purpose prompt optimiser for one-shot prompts. The value comes from iterating on skills that are used repeatedly, where small improvements compound across hundreds of runs.
Benchmark saturation occurs when successive mutations stop producing score improvements. It means the skill has reached the ceiling of what the current benchmark can measure. At this point, further changes are noise rather than signal. You can either accept the current performance or build a harder benchmark that tests more advanced capabilities.
Manual iteration is editorial: you rewrite a prompt, it sounds better, you keep it. But 'sounds better' isn't evidence. Auto Skill Improver is empirical: it establishes a baseline, changes one variable at a time, re-runs the same benchmark, and keeps only what scores higher. It also catches structural problems — contradictory instructions, missing fallback behaviour, ambiguous output contracts — that are invisible during manual editing.
Yes. Auto Skill Improver is fully open source and free to use. The source code is available on GitHub at github.com/mlobo2012/auto-skill-improver. There are no usage limits, no API keys required for the tool itself, and no premium tiers.
Stop Guessing. Start Measuring.
Download the quickstart guide, clone the repo, and run your first benchmark-driven improvement loop in under 10 minutes.