Auto Skill Improver for Claude Code — Benchmark-Driven Skill Optimisation
Test and improve your CLAUDE.md files, project instructions, and coding skills with empirical measurement. Inspired by Karpathy's autoresearch.
How It Works with Claude Code
1. Classify: the tool identifies your Claude Code skill type (coding assistant, reviewer, orchestrator, etc.)
2. Benchmark: it builds a test suite that exercises your CLAUDE.md instructions against real scenarios
3. Mutate: it makes one instruction change at a time, each tested against the benchmark
4. Keep or discard: only mutations that measurably improve Claude Code output survive
Get the Guide File
Enter your email to download auto-skill-improver-claude-code-quickstart.md and get access to the GitHub repository.
Step-by-Step: Set Up Auto Skill Improver in Claude Code
1. Download the quickstart file: enter your email in the form above to get the Claude Code quickstart file.
2. Open Claude Code: launch it in your terminal or IDE.
3. Upload the quickstart file: Claude Code reads it and clones the repository automatically.
4. Point it at your CLAUDE.md or any instruction file: the tool targets your project instructions and begins setup.
5. Review the baseline score: this is your starting point before any changes.
6. Let it run mutations: each change is benchmarked and kept only if it improves the score.
7. Stop when the benchmark saturates: when no more gains are possible, your improved skill file is ready.
Why Most Claude Code Skill Iteration Fails
You tweak your CLAUDE.md. The instructions sound clearer. You keep them. But nothing shows that Claude Code's actual output improved. Most skill editing is editorial: rewriting based on intuition rather than evidence. Auto Skill Improver treats Claude Code skill iteration as empirical, not editorial.
Instruction Vibes
- ✗ Reword CLAUDE.md, hope Claude Code performs better
- ✗ No baseline, so no way to know if instructions actually improved output
- ✗ Multiple instruction changes at once hide what actually helped
- ✗ Subjective evaluation: 'the output looks right'
Instruction Science
- ✓ Establish a measurable baseline before any CLAUDE.md changes
- ✓ Mutate one instruction at a time
- ✓ Run the same benchmark before and after each change
- ✓ Keep only what scores higher; discard the rest
What It Finds in Claude Code Skills
The tool surfaces structural problems in your CLAUDE.md and project instructions that are invisible during manual editing — issues that silently degrade Claude Code performance across runs.
Ambiguous Output Contracts
Vague success criteria that let Claude Code produce wildly different outputs on each run.
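One rough way to put a number on run-to-run variation is to repeat the same prompt several times and measure how often the outputs agree. The sketch below is an illustration only, not the tool's own check; the function name `agreement_rate` and the whitespace/case normalisation are assumptions.

```python
from collections import Counter


def agreement_rate(outputs: list[str]) -> float:
    """Fraction of runs whose normalised output matches the most common one."""
    if not outputs:
        raise ValueError("need at least one output")
    # Collapse whitespace and case so trivial differences do not count as drift.
    normalised = [" ".join(text.split()).lower() for text in outputs]
    most_common_count = Counter(normalised).most_common(1)[0][1]
    return most_common_count / len(normalised)


# Hypothetical outputs from four runs of the same prompt.
runs = ["Result: OK", "result:  ok", "Done, all good", "Result: OK"]
print(agreement_rate(runs))  # 0.75
```

A low rate on a prompt that should have one right answer suggests the instructions leave the output format open.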
Missing Fallback Behaviour
No defined recovery path when a tool call fails or returns unexpected data.
Conflicting Instruction Layers
Contradictory directives spread across system prompts, skills, and CLAUDE.md files.
Dependency & Portability Problems
Hard-coded paths, missing imports, or assumptions that break on different machines.
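As an illustration of what a portability check could look like, the sketch below scans instruction text for machine-specific absolute paths. The patterns and the name `find_hardcoded_paths` are assumptions made for this example, not the tool's actual rules, and the heuristic can produce false positives (for instance a URL containing `/home/`).

```python
import re

# Assumed patterns: macOS/Linux user directories and Windows drive paths.
ABSOLUTE_PATH = re.compile(r"(?:/Users/|/home/|[A-Za-z]:\\)[^\s`'\"]+")


def find_hardcoded_paths(text: str) -> list[tuple[int, str]]:
    """Return (line_number, path) for each machine-specific path found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ABSOLUTE_PATH.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = (
    "Run tests with pytest.\n"
    "Config lives in /Users/alice/project/config.yaml\n"
    "Use ./scripts/build.sh to build."
)
print(find_hardcoded_paths(sample))
# [(2, '/Users/alice/project/config.yaml')]
```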
Weak Evidence Discipline
Claims without citations, assertions without data, decisions without reasoning chains.
Structural Formatting Issues
Inconsistent heading levels, broken markdown, or output that doesn't match the stated format.
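Inconsistent heading levels are one of the easier issues to detect mechanically. The following is a minimal sketch, assuming ATX-style `#` headings and ignoring anything inside fenced code blocks; `skipped_heading_levels` is a hypothetical helper, not part of the tool.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"^(#{1,6})\s+\S")


def skipped_heading_levels(markdown: str) -> list[tuple[int, int, int]]:
    """Return (line_number, previous_level, level) where a heading jumps more than one level down."""
    problems = []
    previous = 0
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if previous and level > previous + 1:
            problems.append((lineno, previous, level))
        previous = level
    return problems


doc = "# Title\n### Too deep\n## Fine\n#### Too deep again"
print(skipped_heading_levels(doc))  # [(2, 1, 3), (4, 2, 4)]
```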
The Karpathy-Inspired Method
Andrej Karpathy's autoresearch applies measurement discipline to research iteration. Auto Skill Improver applies the same principle to Claude Code skill engineering — a controlled loop where every CLAUDE.md change is accountable.
Classify the Skill Type
The tool analyses your CLAUDE.md file and determines its category — coding assistant, code reviewer, workflow orchestrator, or something else. Classification informs what benchmarks make sense.
Build a Real Benchmark
Not a vibes check. A structured test suite with pass/fail criteria that exercises your Claude Code instructions against representative coding scenarios.
Establish a Baseline
Run the benchmark on the unmodified CLAUDE.md. Record the score. This is your point of comparison — every mutation is measured against this baseline.
Mutate One Thing at a Time
Change a single instruction, add one constraint, remove one ambiguity. Never change multiple CLAUDE.md directives simultaneously — otherwise you can't attribute improvement.
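The one-change rule can be enforced structurally by generating candidates that each differ from the current instructions by exactly one edit. The sketch below assumes instructions are held as a list of lines and only models removals and additions; `single_line_mutations` is an illustrative name, and how the real tool proposes mutations is not specified here.

```python
def single_line_mutations(
    instructions: list[str], additions: list[str]
) -> list[tuple[str, list[str]]]:
    """Return (description, candidate) pairs, each one edit away from the original."""
    candidates = []
    # Remove exactly one existing instruction.
    for i, line in enumerate(instructions):
        removed = instructions[:i] + instructions[i + 1:]
        candidates.append((f"remove: {line}", removed))
    # Add exactly one new constraint.
    for extra in additions:
        candidates.append((f"add: {extra}", instructions + [extra]))
    return candidates


base = ["Use type hints.", "Write tests first."]
for description, candidate in single_line_mutations(base, ["Return JSON only."]):
    print(description, "->", candidate)
```

Since each candidate carries a description of its single edit, any score change can be attributed to that edit alone.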
Keep Only What Improves
Re-run the benchmark after each mutation. If the score goes up, the change stays. If it doesn't improve — or regresses — it gets discarded. No exceptions.
Stop When the Benchmark Saturates
When successive mutations stop producing gains, your CLAUDE.md has reached its current ceiling. Further changes are noise, not signal. Move on or build a harder benchmark.
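Taken together, the baseline, keep-or-discard rule, and saturation stop form a simple greedy loop. The sketch below is a minimal version under stated assumptions: `improve`, the `patience` threshold for declaring saturation, and the toy scoring function are all invented for illustration, and a real run would score candidates by executing the benchmark against Claude Code.

```python
from typing import Callable, Iterable

Mutation = Callable[[str], str]


def improve(
    skill: str,
    mutations: Iterable[Mutation],
    score: Callable[[str], float],
    patience: int = 3,
) -> tuple[str, float]:
    """Greedy loop: keep a mutation only if it beats the best score so far."""
    best, best_score = skill, score(skill)  # baseline on the unmodified file
    stale = 0
    for mutate in mutations:
        candidate = mutate(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score  # keep
            stale = 0
        else:
            stale += 1  # discard: equal or worse
            if stale >= patience:
                break  # assumed saturation rule: too many misses in a row
    return best, best_score


REQUIRED = ("type hints", "tests")


def toy_score(text: str) -> float:
    # Toy stand-in for a benchmark: share of required topics mentioned.
    return sum(phrase in text for phrase in REQUIRED) / len(REQUIRED)


mutations = [
    lambda s: s + "\nAlways add type hints.",
    lambda s: s + "\nBe awesome.",
    lambda s: s + "\nWrite tests for new code.",
]
best, best_score = improve("You are a coding assistant.", mutations, toy_score)
print(best_score)            # 1.0
print("Be awesome" in best)  # False
```

The "Be awesome." edit leaves the score unchanged, so it is discarded and the next mutation is applied to the last kept version instead.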
When to Use It — and When Not To
Best For
- CLAUDE.md files that define project-wide coding behaviour
- Custom Claude Code skill configurations
- Project instructions that need measurable improvement
- Any Claude Code setup where you need evidence that changes actually help
Not the Right Fit
- One-off prompts you'll use once and discard
- Creative writing where there's no objective success metric
- Skills that are already performing at ceiling
- Situations where you can't define what 'better' means
Frequently Asked Questions
What is Auto Skill Improver for Claude Code?
Auto Skill Improver for Claude Code is an open-source tool that applies benchmark-driven iteration to your CLAUDE.md files and project instructions. It classifies your skill type, builds a test suite, establishes a baseline score, then systematically mutates one instruction at a time, keeping only changes that measurably improve Claude Code performance.
How does it work with CLAUDE.md?
It treats your CLAUDE.md as a testable artefact. The tool generates scenarios that exercise your instructions, measures Claude Code's output quality against defined criteria, then makes targeted changes, one at a time, to find which instruction tweaks produce measurably better results.
What types of Claude Code skills can it improve?
Any skill type: coding assistants, code reviewers, test generators, documentation writers, refactoring agents, and more. The tool classifies the skill type automatically from your CLAUDE.md and builds appropriate benchmarks for that category.
What is a benchmark in this context?
A benchmark is a structured test suite with defined inputs and pass/fail criteria. For a coding skill, this might be 'generate a function that passes these unit tests'. For a reviewer skill, it might be 'identify the three bugs in this code'. The benchmark runs the same tests before and after each CLAUDE.md change.
Do I need to write my own benchmarks?
No. Auto Skill Improver generates benchmarks automatically based on your skill type and CLAUDE.md contents. It analyses what your instructions are trying to achieve and builds appropriate test scenarios. You can also provide custom benchmarks if you have specific requirements.
What is benchmark saturation?
Benchmark saturation occurs when successive CLAUDE.md mutations stop producing score improvements. Your instructions have reached the ceiling of what the current benchmark can measure. You can either accept the current performance or create a harder benchmark that tests more advanced scenarios.
How is this different from editing CLAUDE.md manually?
Manual editing is editorial: you rewrite instructions, they sound clearer, you keep them. But 'sounds clearer' isn't evidence. Auto Skill Improver is empirical: it establishes a baseline, changes one instruction at a time, re-runs the same benchmark, and keeps only what scores higher. It catches problems invisible during manual editing.
Is Auto Skill Improver free and open source?
Yes. Auto Skill Improver is fully open source and free to use. The source code is available on GitHub at github.com/mlobo2012/auto-skill-improver. There are no usage limits, no API keys required for the tool itself, and no premium tiers.
Stop Guessing. Start Measuring.
Download the quickstart guide, clone the repo, and run your first Claude Code benchmark-driven improvement loop in under 10 minutes.