Auto Skill Improver for Claude Cowork — Benchmark-Driven Skill Optimisation
Test and improve your Cowork skill files, team workflows, and project instructions with empirical measurement. Inspired by Karpathy's autoresearch.
How It Works with Cowork
- 1. Classify: the tool identifies your Cowork skill type (team workflow, project assistant, reviewer, etc.)
- 2. Benchmark: it builds a test suite that exercises your Cowork instructions against real team scenarios
- 3. Mutate: one instruction change at a time, each tested against the benchmark
- 4. Keep or discard: only mutations that measurably improve Cowork output survive
Get the Guide File
Enter your email to download auto-skill-improver-cowork-quickstart.md and get access to the GitHub repository.
Step-by-Step: Set Up Auto Skill Improver in Cowork
Download the quickstart file
Enter your email in the form above to download the Cowork quickstart file.
Open your Claude Cowork project
Navigate to the project where you want to improve a skill.
Upload the file to the project
Cowork reads the instructions and clones the repository automatically.
Point it at your project skill file or workflow instructions
The tool targets your skill file and begins setup.
Review baseline score, run mutations
Approve improvements, discard regressions — only gains are kept.
Export the improved skill back to your project
Once the benchmark saturates, your optimised skill file is ready to use.
Why Most Cowork Skill Iteration Fails
You update your Cowork skill file. The instructions sound more precise. You deploy them. But the team's actual experience doesn't measurably improve. Most skill editing is editorial: rewriting based on intuition rather than evidence. Auto Skill Improver treats Cowork skill iteration as empirical, not editorial.
Skill Vibes
- ✗ Reword skill instructions, hope Cowork performs better for the team
- ✗ No baseline, so no way to know if the team experience actually improved
- ✗ Multiple instruction changes at once hide what actually helped
- ✗ Subjective evaluation: 'the responses feel more helpful'
Skill Science
- ✓ Establish a measurable baseline before any skill file changes
- ✓ Mutate one instruction at a time
- ✓ Run the same benchmark before and after each change
- ✓ Keep only what scores higher and discard the rest
What It Finds in Cowork Skills
The tool surfaces structural problems in your Cowork skill files that are invisible during manual editing — issues that silently degrade team-facing performance across sessions.
Ambiguous Output Contracts
Vague success criteria that let Cowork produce inconsistent responses for team members.
Missing Fallback Behaviour
No defined recovery path when a tool call fails or returns unexpected data in a team context.
Conflicting Instruction Layers
Contradictory directives spread across system prompts, skill files, and project configurations.
Dependency & Portability Problems
Hard-coded paths, missing context, or assumptions that break across different team environments.
Weak Evidence Discipline
Claims without citations, assertions without data, decisions without reasoning chains.
Structural Formatting Issues
Inconsistent heading levels, broken markdown, or output that doesn't match the stated format.
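Two of the categories above, hard-coded paths and inconsistent heading levels, are mechanical enough to illustrate with a small check. The sketch below is only an illustration of the idea, not the tool's actual detection logic: the function name `lint_skill`, the regular expressions, and the sample skill text are all assumptions made for this example.

```python
import re
from typing import List

# Assumed patterns for this sketch: common absolute-path prefixes and markdown headings.
ABSOLUTE_PATH = re.compile(r"(?:/Users/|/home/|[A-Za-z]:\\)\S+")
HEADING = re.compile(r"^(#{1,6})\s")

def lint_skill(text: str) -> List[str]:
    """Flag hard-coded absolute paths and headings that skip a level."""
    findings = []
    previous_level = 0
    for number, line in enumerate(text.splitlines(), start=1):
        if ABSOLUTE_PATH.search(line):
            findings.append(f"line {number}: hard-coded path")
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # A jump of more than one level (for example h1 to h3) is flagged.
            if previous_level and level > previous_level + 1:
                findings.append(
                    f"line {number}: heading jumps from h{previous_level} to h{level}"
                )
            previous_level = level
    return findings

# Hypothetical skill file content used only for the demo.
sample = """# Reviewer skill
### Output format
Read the style guide at /Users/alex/notes/style.md before replying.
## Fallbacks
"""

findings = lint_skill(sample)
for finding in findings:
    print(finding)
```

Checks like these catch only the surface of the problem. Categories such as ambiguous output contracts or conflicting instruction layers need scenario-based testing rather than pattern matching.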
The Karpathy-Inspired Method
Andrej Karpathy's autoresearch applies measurement discipline to research iteration. Auto Skill Improver applies the same principle to Cowork skill engineering — a controlled loop where every skill file change is accountable.
Classify the Skill Type
The tool analyses your Cowork skill file and determines its category — team assistant, project orchestrator, code reviewer, or something else. Classification informs what benchmarks make sense.
Build a Real Benchmark
Not a vibes check. A structured test suite with pass/fail criteria that exercises your Cowork instructions against representative team scenarios.
Establish a Baseline
Run the benchmark on the unmodified skill file. Record the score. This is your point of comparison — every mutation is measured against this baseline.
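As a rough sketch of what a benchmark with pass/fail criteria and a baseline score might look like, the example below scores a skill as the fraction of scenarios whose output passes its check. Everything named here is hypothetical: `Scenario`, `score`, and `fake_skill` are illustrative stand-ins, and `fake_skill` returns canned text rather than calling Cowork.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Scenario:
    name: str
    prompt: str                      # the team scenario given to the skill
    passes: Callable[[str], bool]    # pass/fail check applied to the output

def score(run_skill: Callable[[str], str], scenarios: List[Scenario]) -> float:
    """Return the fraction of scenarios whose output passes its check."""
    if not scenarios:
        raise ValueError("benchmark needs at least one scenario")
    passed = sum(1 for s in scenarios if s.passes(run_skill(s.prompt)))
    return passed / len(scenarios)

# Hypothetical stand-in for a real skill run: it only returns canned replies.
def fake_skill(prompt: str) -> str:
    return "Summary: " + prompt if "summarise" in prompt else "I don't know."

scenarios = [
    Scenario("has summary header", "summarise the standup notes",
             lambda out: out.startswith("Summary:")),
    Scenario("admits missing data", "what is the Q9 budget?",
             lambda out: "don't know" in out),
    Scenario("cites a source", "summarise the design doc",
             lambda out: "[source]" in out),
]

baseline = score(fake_skill, scenarios)
print(f"baseline: {baseline:.2f}")  # two of three checks pass, so 0.67
```

Because the checks are plain pass/fail functions, the same scenarios can be re-run unchanged after every mutation, which is what makes the before-and-after scores comparable.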
Mutate One Thing at a Time
Change a single instruction, add one constraint, remove one ambiguity. Never change multiple skill directives simultaneously — otherwise you can't attribute improvement.
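One way to enforce the one-change rule is to generate candidate variants that each differ from the original skill in exactly one line. The sketch below assumes a skill is a list of instruction lines. The function `single_mutations` and the proposed rewrite are hypothetical, and a real tool may propose mutations quite differently.

```python
from typing import Dict, Iterator, List, Tuple

def single_mutations(
    instructions: List[str], rewrites: Dict[int, str]
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (label, variant) pairs, each differing from the original in one line."""
    for index, replacement in sorted(rewrites.items()):
        variant = list(instructions)      # copy so the original stays untouched
        variant[index] = replacement
        yield f"rewrite line {index}", variant
    for index in range(len(instructions)):
        variant = instructions[:index] + instructions[index + 1:]
        yield f"remove line {index}", variant

# Hypothetical skill and proposed rewrite, for illustration only.
skill = [
    "Answer project questions.",
    "Be helpful.",
    "Use the wiki.",
]
proposed = {1: "If the wiki has no answer, say so and name who to ask."}

variants = list(single_mutations(skill, proposed))
for label, variant in variants:
    print(label, "->", variant)
```

Each variant carries a label, so when a score changes the cause is unambiguous.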
Keep Only What Improves
Re-run the benchmark after each mutation. If the score goes up, the change stays. If it doesn't improve — or regresses — it gets discarded. No exceptions.
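The keep-or-discard rule amounts to a greedy comparison against the best score so far. The sketch below is a minimal illustration under stated assumptions: `keep_or_discard` is a hypothetical name, and `toy_benchmark` scores a skill by counting required phrases, standing in for a real benchmark run against Cowork output.

```python
from typing import Callable, List, Tuple

Skill = List[str]
Mutation = Tuple[str, Callable[[Skill], Skill]]

def keep_or_discard(
    skill: Skill, benchmark: Callable[[Skill], float], mutations: List[Mutation]
) -> Tuple[Skill, float, List[Tuple[str, str, float]]]:
    best, best_score = skill, benchmark(skill)   # baseline on the unmodified skill
    log = []
    for label, mutate in mutations:
        candidate = mutate(best)                 # one change applied to the current best
        candidate_score = benchmark(candidate)
        if candidate_score > best_score:         # ties and regressions are discarded
            best, best_score = candidate, candidate_score
            log.append((label, "kept", candidate_score))
        else:
            log.append((label, "discarded", candidate_score))
    return best, best_score, log

# Toy benchmark (an assumption for the demo): one point per required phrase present.
REQUIRED = ["cite", "fallback", "format"]

def toy_benchmark(skill: Skill) -> float:
    text = " ".join(skill).lower()
    return sum(phrase in text for phrase in REQUIRED) / len(REQUIRED)

mutations = [
    ("add fallback rule", lambda s: s + ["Fallback: if a tool call fails, report it."]),
    ("add politeness", lambda s: s + ["Be friendly."]),
    ("drop citation rule", lambda s: [line for line in s if "cite" not in line.lower()]),
]

start = ["Cite the source for every claim."]
best, best_score, log = keep_or_discard(start, toy_benchmark, mutations)
for entry in log:
    print(entry)
```

In this run the fallback rule raises the score and is kept, the politeness line leaves the score unchanged and is discarded, and dropping the citation rule lowers the score and is discarded. The comparison is strict on purpose: a change that does not score higher adds length without evidence of benefit.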
Stop When the Benchmark Saturates
When successive mutations stop producing gains, your skill file has reached its current ceiling. Further changes are noise, not signal. Move on or build a harder benchmark.
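Saturation can be approximated with a patience counter: stop once a set number of consecutive mutations fail to beat the best score. The text above does not say how many failed rounds count as saturation, so the `patience` value, the function name `saturation_round`, and the score sequence below are all assumptions for illustration.

```python
from typing import Optional, Sequence

def saturation_round(
    candidate_scores: Sequence[float], baseline: float, patience: int = 3
) -> Optional[int]:
    """Return the 1-based round at which to stop, or None if not yet saturated."""
    best = baseline
    rounds_without_gain = 0
    for round_number, candidate in enumerate(candidate_scores, start=1):
        if candidate > best:
            best = candidate
            rounds_without_gain = 0          # a gain resets the counter
        else:
            rounds_without_gain += 1
            if rounds_without_gain >= patience:
                return round_number
    return None

# Hypothetical scores from seven successive mutations.
scores = [0.55, 0.60, 0.60, 0.72, 0.70, 0.72, 0.71]
stop = saturation_round(scores, baseline=0.50, patience=3)
print(stop)  # rounds 5, 6 and 7 all fail to beat 0.72, so the loop stops at 7
```

A larger `patience` spends more benchmark runs before giving up, while a smaller one risks stopping just before a useful mutation.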
When to Use It — and When Not To
Best For
- Cowork skill files that define team-facing behaviour
- Project instructions for team workflows
- Cowork configurations that need measurable improvement
- Any Cowork setup where you need evidence that changes actually help the team
Not the Right Fit
- One-off prompts you'll use once and discard
- Creative tasks where there's no objective success metric
- Skills that are already performing at ceiling
- Situations where you can't define what 'better' means for the team
Frequently Asked Questions
What is Auto Skill Improver for Claude Cowork?
Auto Skill Improver for Claude Cowork is an open-source tool that applies benchmark-driven iteration to your Cowork skill files and project instructions. It classifies your skill type, builds a test suite, establishes a baseline score, then systematically mutates one instruction at a time, keeping only changes that measurably improve Cowork performance for your team.
How does it work with Cowork skill files?
It treats your Cowork skill file as a testable artefact. The tool generates team scenarios that exercise your instructions, measures Cowork's output quality against defined criteria, then makes targeted changes, one at a time, to find which instruction tweaks produce measurably better results for team workflows.
Which Cowork skill types does it support?
Any Cowork skill type: team assistants, project orchestrators, code reviewers, documentation generators, onboarding helpers, and more. The tool classifies the skill type automatically and builds appropriate benchmarks for that category.
What counts as a benchmark?
A benchmark is a structured test suite with defined inputs and pass/fail criteria. For a team assistant skill, this might be 'provide accurate answers to common project questions'. For a reviewer skill, it might be 'identify the key issues in this pull request'. The benchmark runs the same tests before and after each skill file change.
Can I use it on skills shared across a team?
Absolutely. Auto Skill Improver is particularly valuable for shared Cowork skills because improvements compound across every team member's experience. A 10% improvement in a skill used by 20 people reaches 20 times as many users as the same improvement to a personal prompt.
What is benchmark saturation?
Benchmark saturation occurs when successive skill file mutations stop producing score improvements. Your instructions have reached the ceiling of what the current benchmark can measure. You can either accept the current performance or create a harder benchmark that tests more advanced team scenarios.
How is this different from editing a skill file by hand?
Manual editing is editorial: you rewrite instructions, they sound more helpful, you deploy them. But 'sounds more helpful' isn't evidence. Auto Skill Improver is empirical: it establishes a baseline, changes one instruction at a time, re-runs the same benchmark, and keeps only what scores higher.
Is it free and open source?
Yes. Auto Skill Improver is fully open source and free to use. The source code is available on GitHub at github.com/mlobo2012/auto-skill-improver. There are no usage limits, no API keys required for the tool itself, and no premium tiers.
Stop Guessing. Start Measuring.
Download the quickstart guide, clone the repo, and run your first Cowork benchmark-driven improvement loop in under 10 minutes.