Question 1

What is Auto Skill Improver?

Accepted Answer

Auto Skill Improver is an open-source tool that applies benchmark-driven iteration to AI skill files. It classifies your skill, builds a test suite, establishes a baseline score, then systematically mutates one thing at a time — keeping only changes that measurably improve performance. It works with Claude Code skills, OpenClaw agent configurations, and Cowork skill files.

Question 2

How does it relate to Karpathy's autoresearch?

Accepted Answer

Andrej Karpathy's autoresearch methodology applies measurement discipline to research iteration — running controlled experiments instead of making intuitive changes. Auto Skill Improver applies the same principle to prompt engineering: every change is tested against a benchmark, and only measurable improvements survive. It's the difference between prompt vibes and prompt science.

Question 3

What types of skills can it improve?

Accepted Answer

Any skill file that Claude Code, Cowork, or OpenClaw uses — coding assistants, research agents, workflow orchestrators, code reviewers, and more. The tool classifies the skill type automatically and builds appropriate benchmarks for that category.

Question 4

How do benchmarks work in prompt engineering?

Accepted Answer

A benchmark is a structured test suite with defined inputs and pass/fail criteria. For a coding skill, this might be 'generate a function that passes these unit tests'. For a research skill, it might be 'find and correctly cite three sources on this topic'. The benchmark runs the same tests before and after each change, producing a numerical score, so improvement can be measured rather than judged by feel.
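As a rough illustration of "defined inputs and pass/fail criteria", a benchmark can be sketched as a list of cases, each pairing a fixed prompt with a check on the output. This is a minimal sketch and not the tool's actual implementation: `BenchmarkCase`, `score`, `stub_skill`, and the three cases are hypothetical names invented for the example, and the stub stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkCase:
    """One test: a fixed input plus a pass/fail check on the output."""
    name: str
    prompt: str
    passes: Callable[[str], bool]


def score(run_skill: Callable[[str], str], cases: list[BenchmarkCase]) -> float:
    """Return the fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("benchmark needs at least one case")
    passed = sum(1 for case in cases if case.passes(run_skill(case.prompt)))
    return passed / len(cases)


# Hypothetical cases for a coding skill (assumed for illustration only).
CASES = [
    BenchmarkCase("defines_function", "Write add(a, b).", lambda out: "def add" in out),
    BenchmarkCase("returns_sum", "Write add(a, b).", lambda out: "return a + b" in out),
    BenchmarkCase("has_docstring", "Write add(a, b).", lambda out: '"""' in out),
]


def stub_skill(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same canned answer.
    return "def add(a, b):\n    return a + b\n"


if __name__ == "__main__":
    # Two of the three checks pass, so the baseline is 2/3.
    print(f"baseline score: {score(stub_skill, CASES):.2f}")
```

Because the cases are fixed, re-running `score` after a change to the skill gives a number that is directly comparable with the baseline.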

Question 5

Can it improve any LLM prompt or just Claude Code?

Accepted Answer

The tool is designed for Claude Code, Cowork, and OpenClaw skill files — structured prompt files that define agent behaviour. It's not a general-purpose prompt optimiser for one-shot prompts. The value comes from iterating on skills that are used repeatedly, where small improvements compound across hundreds of runs.

Question 6

What does 'benchmark saturation' mean?

Accepted Answer

Benchmark saturation occurs when successive mutations stop producing score improvements. It means the skill has reached the ceiling of what the current benchmark can measure. At this point, any further score movement is more likely noise than signal. You can either accept the current performance or build a harder benchmark that tests more advanced capabilities.
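One way to make "stop producing score improvements" concrete is a simple patience rule over the score history. The sketch below is an assumption about how such a check could look, not the tool's documented stopping criterion; `is_saturated`, `patience`, and `min_gain` are hypothetical names and defaults chosen for the example.

```python
def is_saturated(history: list[float], patience: int = 3, min_gain: float = 0.01) -> bool:
    """Return True when the last `patience` rounds failed to beat the
    earlier best score by at least `min_gain`.

    `history` holds one benchmark score per round, baseline first.
    """
    if len(history) <= patience:
        return False  # not enough rounds yet to judge
    best_before = max(history[:-patience])
    best_recent = max(history[-patience:])
    return best_recent - best_before < min_gain


if __name__ == "__main__":
    print(is_saturated([0.5, 0.7, 0.7, 0.7, 0.7]))  # three flat rounds after 0.7
    print(is_saturated([0.5, 0.6, 0.6, 0.6, 0.8]))  # a recent round still improved
```

The `min_gain` threshold is there because small score differences between runs can come from model variability rather than from the mutation itself.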

Question 7

How is this different from manual prompt iteration?

Accepted Answer

Manual iteration is editorial: you rewrite a prompt, it sounds better, you keep it. But 'sounds better' isn't evidence. Auto Skill Improver is empirical: it establishes a baseline, changes one variable at a time, re-runs the same benchmark, and keeps only what scores higher. It also catches structural problems — contradictory instructions, missing fallback behaviour, ambiguous output contracts — that are invisible during manual editing.
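The empirical loop described above (baseline, one change at a time, keep only what scores higher) can be sketched as a small greedy search. Everything here is illustrative: the `Skill` dictionary, the three mutation functions, and `toy_benchmark` are hypothetical stand-ins, and a real benchmark would run the skill against test cases rather than inspect its sections.

```python
from typing import Callable

Skill = dict[str, str]  # hypothetical shape: section name -> instruction text
Mutation = Callable[[Skill], Skill]


def improve(
    skill: Skill,
    mutations: list[Mutation],
    benchmark: Callable[[Skill], float],
) -> tuple[Skill, float, list[str]]:
    """Apply mutations one at a time, keeping each only if the score rises."""
    best, best_score = skill, benchmark(skill)  # establish the baseline
    kept: list[str] = []
    for mutate in mutations:
        candidate = mutate(dict(best))  # mutate a copy, never the current best
        candidate_score = benchmark(candidate)
        if candidate_score > best_score:  # strictly better, ties are discarded
            best, best_score = candidate, candidate_score
            kept.append(mutate.__name__)
    return best, best_score, kept


def toy_benchmark(skill: Skill) -> float:
    # Toy scoring: fraction of required sections that are present and non-empty.
    required = ("output_format", "fallback")
    return sum(1 for key in required if skill.get(key)) / len(required)


def add_output_format(skill: Skill) -> Skill:
    skill["output_format"] = "Reply with a JSON object: {verdict, reasons}."
    return skill


def reword_intro(skill: Skill) -> Skill:
    skill["intro"] = "You are a meticulous, world-class code reviewer."
    return skill


def add_fallback(skill: Skill) -> Skill:
    skill["fallback"] = "If the diff is empty, say so and stop."
    return skill


if __name__ == "__main__":
    start = {"intro": "You review code."}
    final, final_score, kept = improve(
        start, [add_output_format, reword_intro, add_fallback], toy_benchmark
    )
    print(final_score, kept)
```

In this toy run the reworded intro "sounds better" but does not move the score, so it is discarded, while the two structural additions are kept.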

Question 8

Is it free and open source?

Accepted Answer

Yes. Auto Skill Improver is fully open source and free to use. The source code is available on GitHub at github.com/mlobo2012/auto-skill-improver. There are no usage limits, no API keys required for the tool itself, and no premium tiers.

Auto Skill Improver — Benchmark-Driven Prompt Optimisation for Claude Code & OpenClaw

Choose Your Platform

For Claude Code

For Claude Cowork

For OpenClaw

How It Works

Get the Guide File

Setup Instructions

Choose your platform above

Download the quickstart file

Upload the file to your AI tool

Point it at any skill or instruction file

Review baseline, run mutations, keep improvements

Done when the benchmark saturates

Why Most Prompt Iteration Fails

Prompt Vibes

Prompt Science

What Auto Skill Improver Finds

Ambiguous Output Contracts

Missing Fallback Behaviour

Conflicting Instruction Layers

Dependency & Portability Problems

Weak Evidence Discipline

Structural Formatting Issues

The Karpathy-Inspired Method

Classify the Skill Type

Build a Real Benchmark

Establish a Baseline

Mutate One Thing at a Time

Keep Only What Improves

Stop When the Benchmark Saturates

When to Use It — and When Not To

Best For

Not the Right Fit

Frequently Asked Questions

Stop Guessing. Start Measuring.