We Benchmarked Garry Tan's gbrain Against Our Own Agent Memory on 150 Real Questions (May 2026)
TL;DR. On a 352-file slice of the AI Heroes knowledge base and a 150-question eval set built from real Marco sessions and corpus-grounded ground truth — 65 hard, 30 cross-source, 25 discrimination — gbrain won 58 head-to-head questions in an apples-to-apples configuration and lost 7. Our existing OpenClaw qmd setup won 7 and lost 58. That is an 8.3x win ratio for gbrain on the apples-to-apples comparison. On hard, cross-source, and discrimination questions, the gap is larger. On easy questions, both engines mostly tie. The preset rule says PARTIAL because the average P@5 delta (+0.081) is below our +0.15 migration bar, but the underlying pattern is clean enough that we are sketching a wholesale migration path. Numbers, methodology, and the parts that surprised us are below.

What is gbrain, and why are we testing it?
gbrain is an open-source, graph-backed agent memory system from Garry Tan. It indexes markdown corpora into a hybrid retrieval store (BM25 + vector embeddings) and layers a typed-link graph extractor on top — turning prose like "Marco invested in Acme" into structured (person)-[invested_in]->(company) edges. The pitch: better recall on cross-source questions, faster latency, and graph-aware joins that flat vector stores cannot do. Garry's team also publishes BrainBench, an open eval set so anyone can stress-test the claims.
qmd is the local-first retrieval CLI we already run inside OpenClaw — Peter Steinberger's open-source agent operating layer. Our installation indexes ~3,100 documents across four collections and runs a heavier pipeline: BM25 + local-vector recall, an LLM-driven query expansion step (HyDE-style), and a Qwen3 reranker. We rely on it across every Claude Code, Codex, and OpenClaw agent we operate.
The question we wanted answered, on data, was simple: does gbrain do a meaningfully better job on the actual questions our agents ask, against the actual files we work from? Not BrainBench. Our corpus, our queries, our threshold.
What did we test, exactly?
The corpus is 352 unique markdown files copied read-only from the live AI Heroes stack. Five sources contributed: per-agent Claude memory bundles, the AI Heroes website repo and its design-system worktrees, OpenClaw memory-wiki bridges that mention AI Heroes, and other Claude agents' memory files that reference the brand. Total: 1,242 retrieval chunks once both engines finished ingesting.
The eval set is 150 questions, and that is where most of the work went. We did not let either engine generate its own questions. One part of the set is corpus-grounded (an operator reads a file and writes a question that file truly answers). The other part is history-mined: real questions Marco asked across recent Codex sessions, Discord threads, and OpenClaw agent transcripts, paired retroactively with the ground-truth files that should have surfaced. Within the 150, we deliberately stratified:
- 65 hard questions (multi-hop, ambiguous, or buried in less-anchored prose)
- 53 medium, 32 easy
- 30 cross-source questions that span 2 or more files
- 25 discrimination questions of the form "which file specifically defines X" — designed to punish engines that drift to adjacent topics
Both engines were given identical access to the same 352 files. Neither saw the eval set during indexing. Both ran in a sandboxed /tmp/gbrain-pilot/ tree with full isolation: no writes to ~/.gbrain, no reindexing of production qmd, no edits to global config. We verified the production qmd index hash before and after the run — unchanged.
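The before-and-after index check can be sketched as a plain digest comparison. This is a minimal sketch, assuming the index is a single file on disk and that the 40-character hash is a SHA-1 digest of its bytes; the source does not say how the hash was computed, and the function names here are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_digest(path: Path, algorithm: str = "sha1") -> str:
    """Hex digest of a file, read in 1 MiB chunks so large indexes stay out of memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unchanged_during(path: Path, run: Callable[[], object]) -> bool:
    """Hash the file, execute run(), hash again; True means no byte changed."""
    before = file_digest(path)
    run()
    return file_digest(path) == before
```

Wrapping the whole benchmark run in `unchanged_during` gives a single boolean to record next to the results, rather than two hashes to compare by eye.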
How was the eval set actually built — and why these questions?
We were not going to let the engines audit themselves. The eval set is the load-bearing piece of this benchmark, so the first decision was: every question has to be a real question paired with verified ground-truth file paths. No paraphrase. No synthetic LLM output. No "looks plausible". A question only enters the set if the operator can point at the exact files that should answer it.
We sourced from two places.
The first source was history mining from real Marco sessions. Codex GPT-5.5 with xhigh reasoning scanned 1,834 history files (recent Discord threads, Codex JSONL session logs, transcripts from Don Draper, our social-media-manager agent, and OpenClaw agent memory bridges) and pulled out the questions Marco actually asked across April-May 2026. We kept 50 after filtering for AI Heroes relevance, ground-truth verifiability, and de-duplication. These are questions our agents already failed (or struggled) on in live work.
The second source was the corpus itself. We read the 352 sandbox files directly and wrote operator-shaped questions each file genuinely answers: 45 corpus-grounded questions. Then we layered in 30 deliberate cross-source questions (queries that span 2 or more files) and 25 deliberate discrimination questions (the punishing "which file specifically defines X" form). Total: 50 + 45 + 30 + 25 = 150 questions, 57 unique ground-truth paths, and 58 questions with two-or-more correct paths.
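The entry rule described above (a real question, verified ground-truth paths, a known slice) can be sketched as a small record plus a validator. This is an illustrative shape only: the field names, the `history`/`corpus` labels, and the example paths are assumptions, not the fixture format used in the pilot.

```python
from dataclasses import dataclass

DIFFICULTIES = ("hard", "medium", "easy")
SOURCES = ("history", "corpus")  # hypothetical labels for the two sourcing routes


@dataclass(frozen=True)
class EvalQuestion:
    question: str
    ground_truth: tuple  # relative paths inside the sandbox corpus
    difficulty: str      # one of DIFFICULTIES
    qtype: str           # e.g. "factual", "cross-source", "discrimination"
    source: str          # one of SOURCES


def problems(q: EvalQuestion, corpus_paths: set) -> list:
    """Return the reasons a question should be rejected (empty list means it passes)."""
    found = []
    if not q.ground_truth:
        found.append("no ground-truth path")
    missing = [p for p in q.ground_truth if p not in corpus_paths]
    if missing:
        found.append("ground truth not in corpus: " + ", ".join(missing))
    if q.difficulty not in DIFFICULTIES:
        found.append("unknown difficulty: " + q.difficulty)
    if q.source not in SOURCES:
        found.append("unknown source: " + q.source)
    if q.qtype == "cross-source" and len(q.ground_truth) < 2:
        found.append("cross-source question needs 2+ ground-truth paths")
    return found


# Hypothetical corpus paths, for illustration only.
CORPUS = {"memory/pricing.md", "site/brief.md"}
ok = EvalQuestion("How do the old and new pricing models differ?",
                  ("memory/pricing.md", "site/brief.md"), "hard", "cross-source", "corpus")
bad = EvalQuestion("Where is X defined?", ("missing.md",), "hard", "cross-source", "history")
```

Here `problems(ok, CORPUS)` is empty, while `problems(bad, CORPUS)` reports both the missing path and the single-path cross-source violation.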
Stratification was the second decision. The cheapest mistake in retrieval evaluation is loading up on easy questions: a uniform sample averages out exactly the differences you are trying to measure. We weighted the set toward harder buckets on purpose:
| Slice | Count | What it tests |
|---|---|---|
| Hard | 65 | Multi-hop, ambiguous, or buried in less-anchored prose |
| Medium | 53 | One clear answer, but more than one keyword path to it |
| Easy | 32 | Directly anchored to a single file's content |
| Cross-source | 30 | Span 2+ files; tests retrieval across topics |
| Discrimination | 25 | "Which file specifically defines X" — punishes drift to adjacent topics |
| Topic areas | 8 | brand, content-seo, agents, pricing, infrastructure, plugins, proposals, deals |
| Question types | 7 | factual (42), cross-source (30), discrimination (25), process (24), definition (11), decision (10), preference (8) |
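The counts in the table can be cross-checked with a few lines of arithmetic. The numbers below are copied from the table; the helper name `share` is illustrative.

```python
TOTAL = 150

DIFFICULTY_COUNTS = {"hard": 65, "medium": 53, "easy": 32}
TYPE_COUNTS = {
    "factual": 42, "cross-source": 30, "discrimination": 25, "process": 24,
    "definition": 11, "decision": 10, "preference": 8,
}


def share(count: int, total: int = TOTAL) -> float:
    """Percentage of the eval set, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Both stratifications partition the same 150 questions.
difficulty_total = sum(DIFFICULTY_COUNTS.values())
type_total = sum(TYPE_COUNTS.values())
difficulty_shares = {name: share(n) for name, n in DIFFICULTY_COUNTS.items()}
```

Both sums come to 150, and the difficulty shares work out to 43.3% hard, 35.3% medium, and 21.3% easy, which is the deliberate tilt toward harder questions.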
No engine saw the eval set during indexing. Both engines were scored on top-5 retrieval against the same ground-truth labels. Both ran offline in the same sandbox tree with identical corpus access. Phase 6 of the pilot was a literal review gate — Klaus, the orchestrator, refused to run the eval pass until question count, slice balance, ground-truth coverage, and topic spread all hit the bar.
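Top-5 scoring against ground-truth labels can be sketched compactly. This follows the standard information-retrieval definitions of P@k, R@k, and reciprocal rank, which the source does not spell out; it also assumes each retrieved path appears at most once in the top 5. The function names are illustrative and are not the pilot's runner.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by a ground-truth path."""
    return sum(1 for path in retrieved[:k] if path in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of ground-truth paths that appear somewhere in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1/rank of the first ground-truth path in the top k, or 0.0 if none appears."""
    for rank, path in enumerate(retrieved[:k], start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


# Illustrative question with two ground-truth paths, found at ranks 2 and 5.
retrieved = ["a.md", "b.md", "c.md", "d.md", "e.md"]
relevant = {"b.md", "e.md"}
scores = {
    "P@1": precision_at_k(retrieved, relevant, 1),
    "P@5": precision_at_k(retrieved, relevant, 5),
    "R@5": recall_at_k(retrieved, relevant, 5),
    "RR": reciprocal_rank(retrieved, relevant, 5),
}
```

Averaging the reciprocal rank over all questions gives MRR. Under these definitions a question with a single ground-truth path can score at most 0.2 on P@5, which is worth keeping in mind when reading averages in the 0.15 to 0.25 range.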
This is what makes the result sharper than an off-the-shelf benchmark. BrainBench is gbrain's own eval set. A public knowledge-base eval is somebody else's corpus. This eval was built by an operator for the operator's own corpus, with deliberate stratification toward the questions retrieval actually fails on. The result either generalizes to your stack or it does not — but for ours, the comparison is honest.
A sample of the questions, by difficulty
Hard / discrimination — "which file specifically..."
- "Which file specifically defines the Stage 3 pricing page's free starters and Ted retainers?" (1 ground-truth path)
- "Which file specifically records the active heartbeat intervals for Richard, Dinesh, Jian-Yang, and Don?" (1 ground-truth path)
- "Which file specifically lists the Joe Doe artists v4 gaps and recommended changes for v5?" (1 ground-truth path)
Hard / cross-source — answering requires synthesizing across files
- "What connects Schmitdy's services page, the Blog Optimiser free tool, and Peec MCP?" (3 ground-truth paths)
- "Across Don voice and AI Heroes brand guidance, what voice constraints repeat?" (3 ground-truth paths)
- "How do the old and new AI Heroes pricing models differ?" (3 ground-truth paths)
- "How do Marty, Don, and Penny divide the content, social, and creator surface?" (2 ground-truth paths)
Medium — one clear answer, multiple keyword paths to it
- "What does Schmidt's core memory say belongs in Schmidt versus AI Heroes overlays?" (2 ground-truth paths)
- "What are Marty's three pricing tiers and run modes?" (1 ground-truth path)
- "How does Auto Skill Improver differ across Claude Code, Cowork, and OpenClaw setup pages?" (3 ground-truth paths)
- "When should Don use AI Heroes as source material instead of turning every post into a pitch?" (2 ground-truth paths)
Easy — anchored cleanly to one file
- "What is Don Draper's read order when AI Heroes context matters?" (2 ground-truth paths)
- "What was the original AI Heroes pricing model in the old website brief?" (1 ground-truth path)
- "What free tools were planned in the original AI Heroes site architecture?" (1 ground-truth path)
- "Which Schmidt skills make up the Claude-first GEO core?" (1 ground-truth path)
The point of including the easy ones is to show what the noise floor looks like. Both engines mostly tie on these, which is the expected, healthy behavior. The hard, cross-source, and discrimination questions are where retrieval differences appear, and that is exactly the slice we wrote the eval set to expose.
How did the two engines compare on the headline metrics?
We ran two configurations.
Native config is each engine running with its full out-of-the-box pipeline. gbrain runs hybrid retrieval plus its graph extractor. qmd runs BM25 + vector + query expansion + Qwen3 rerank. This is the realistic "swap them at the CLI" comparison.
Stripped config turns off qmd's heavy pipeline (no expansion, no reranker) and runs raw hybrid retrieval on both sides. This is the apples-to-apples retrieval-algorithm comparison — the only fair way to compare cores.
| Metric | gbrain native | qmd native | gbrain stripped | qmd stripped |
|---|---|---|---|---|
| P@1 | 0.640 | 0.400 | 0.640 | 0.333 |
| P@5 | 0.233 | 0.208 | 0.233 | 0.152 |
| R@5 | 0.803 | 0.717 | 0.803 | 0.502 |
| MRR | 0.750 | 0.569 | 0.750 | 0.438 |
| p50 latency | 608 ms | 25,138 ms | 509 ms | 729 ms |
| p95 latency | 1,568 ms | 40,868 ms | 612 ms | 747 ms |
| Wins | 19 | 5 | 58 | 7 |
| Ties | 126 | 126 | 85 | 85 |
| Losses | 5 | 19 | 7 | 58 |
A few things jump out.
gbrain beats qmd on every aggregate metric in both configs. P@1, P@5, R@5, and MRR all favor gbrain. Latency is not even close: gbrain's median query is ~41x faster than qmd's native pipeline.
Stripped mode is where the win ratio explodes. In native mode, gbrain wins 19 questions outright, qmd wins 5. In stripped mode — same retrieval algorithms, same corpus — gbrain wins 58, qmd wins 7. That 8.3x ratio is the cleanest signal in the entire run.
The preset decision rule still says PARTIAL. Our migration bar was set at +0.15 P@5. Stripped delta is +0.081, native is +0.025. Neither crosses the bar on the average. The aggregate hides where the engines actually differ.
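The arithmetic behind these three observations is short enough to write out. The inputs are copied from the table above; treating a delta exactly equal to the bar as a pass is an assumption, since the source only says the deltas fall below it.

```python
MIGRATION_BAR = 0.15  # preset P@5 delta required before migrating

P5 = {"gbrain": 0.233, "qmd_native": 0.208, "qmd_stripped": 0.152}


def p5_delta(candidate: float, baseline: float) -> float:
    """P@5 advantage of the candidate engine, rounded to three decimals."""
    return round(candidate - baseline, 3)


def crosses_bar(delta: float, bar: float = MIGRATION_BAR) -> bool:
    """Assumption: a delta equal to the bar counts as crossing it."""
    return delta >= bar


stripped_delta = p5_delta(P5["gbrain"], P5["qmd_stripped"])
native_delta = p5_delta(P5["gbrain"], P5["qmd_native"])

stripped_win_ratio = round(58 / 7, 1)   # gbrain wins / qmd wins, stripped config
native_win_ratio = round(19 / 5, 1)     # gbrain wins / qmd wins, native config
latency_ratio = round(25138 / 608)      # native p50 latency, qmd / gbrain
```

This reproduces the +0.081 and +0.025 deltas, the 8.3x stripped win ratio (3.8x native), and the roughly 41x median latency gap, with neither delta crossing the bar.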
Why did gbrain win on hard, cross-source, and discrimination questions?
The aggregate average is misleading because most easy questions are ties. Both engines hit the right doc when the query lexically anchors to the answer. The interesting question is what happens when it does not.
Stratify by question type and the picture sharpens:
| Bucket | N | gbrain P@5 | qmd P@5 | Δ | gbrain wins | Ties | Losses |
|---|---|---|---|---|---|---|---|
| cross-source | 30 | 0.340 | 0.220 | +0.120 | 11 | 19 | 0 |
| discrimination | 25 | 0.184 | 0.064 | +0.120 | 16 | 8 | 1 |
| factual | 42 | 0.205 | 0.152 | +0.052 | 13 | 27 | 2 |
| process | 24 | 0.233 | 0.167 | +0.067 | 9 | 14 | 1 |
| decision | 10 | 0.220 | 0.160 | +0.060 | 3 | 5 | 2 |
| definition | 11 | 0.164 | 0.109 | +0.055 | 4 | 6 | 1 |
| preference | 8 | 0.250 | 0.175 | +0.075 | 2 | 6 | 0 |
Stratify by difficulty and the same shape repeats:
| Bucket | N | gbrain P@5 | qmd P@5 | Δ | gbrain wins | Ties | Losses |
|---|---|---|---|---|---|---|---|
| hard | 65 | 0.258 | 0.148 | +0.111 | 31 | 32 | 2 |
| medium | 53 | 0.230 | 0.174 | +0.057 | 16 | 34 | 3 |
| easy | 32 | 0.188 | 0.125 | +0.063 | 11 | 19 | 2 |
On hard questions, gbrain wins 31 to qmd's 2 — a 15.5x ratio. On cross-source questions, gbrain wins 11 to 0. On discrimination questions, gbrain wins 16 to 1.
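The stratified tables can be rebuilt from per-question scores with a small tally. This sketch assumes a "win" means a strictly higher per-question P@5 and a "tie" means equal P@5; the source does not state how outright wins were decided, and the example rows are made up for illustration.

```python
from collections import defaultdict


def tally(rows):
    """rows: iterable of (bucket, gbrain_p5, qmd_p5), one tuple per question."""
    raw = defaultdict(lambda: {"n": 0, "wins": 0, "ties": 0, "losses": 0,
                               "sum_a": 0.0, "sum_b": 0.0})
    for bucket, a, b in rows:
        s = raw[bucket]
        s["n"] += 1
        s["sum_a"] += a
        s["sum_b"] += b
        if a > b:
            s["wins"] += 1
        elif a < b:
            s["losses"] += 1
        else:
            s["ties"] += 1
    report = {}
    for bucket, s in raw.items():
        n = s["n"]
        report[bucket] = {
            "n": n, "wins": s["wins"], "ties": s["ties"], "losses": s["losses"],
            "p5_a": round(s["sum_a"] / n, 3),
            "p5_b": round(s["sum_b"] / n, 3),
            "delta": round((s["sum_a"] - s["sum_b"]) / n, 3),
        }
    return report


# Illustrative rows only; these are not the pilot's per-question scores.
rows = [("hard", 0.4, 0.2), ("hard", 0.2, 0.2), ("easy", 0.2, 0.2), ("easy", 0.0, 0.2)]
report = tally(rows)
```

Grouping before averaging is the whole point: a bucket with many ties contributes almost nothing to the overall delta, but its win and loss counts still show which engine pulls ahead when the two disagree.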
These are exactly the questions our agents actually struggle with in production. Real examples from the eval set: "Which file specifically lists the Joe Doe artists v4 gaps and recommended changes for v5?" — one ground-truth file, surrounded by adjacent v4 and v5 docs that all match the query keywords. "Across Don voice and AI Heroes brand guidance, what voice constraints repeat?" — three ground-truth files, none of which contain the word "constraints". "What connects Schmitdy's services page, the Blog Optimiser free tool, and Peec MCP?" — three ground-truth files spread across two worktrees. When the query does not anchor cleanly to one keyword, gbrain's hybrid recall is noticeably tighter.
This is the part of the data that does not show up in the average. If your question diet is ~60% easy lookups, you will not feel the difference. If your agents spend most of their time doing cross-source synthesis, you will feel it on every other query.
What were the three counter-intuitive findings?
1. qmd's heavy pipeline actively hurt it on this corpus
This is the finding that most surprised us. qmd's native config, with LLM query expansion and the Qwen3 reranker turned on, scored lower than qmd's stripped config on most slices we care about. The aggregate average points the other way (native qmd P@5 was 0.208 against 0.152 for stripped qmd), but native qmd won only 5 questions outright against gbrain where stripped qmd won 7, and the reranker repeatedly demoted documents that genuinely answered the query.
A typical failure mode: ask "what is Schmidt's positioning page big idea?", expansion blows the query out into 6 paraphrases, rerank picks the file that has the most overlap across all 6 paraphrases, and the actual ground-truth file gets buried because it answers the question more precisely than any of the paraphrases. The pipeline optimizes for surface diversity rather than answer specificity.
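The failure mode can be reproduced in a toy setting. To be clear about what this is: a deliberately crude stand-in that scores documents by summed token overlap across an expanded query set. It is not qmd's expansion step or its Qwen3 reranker, and the documents, file names, and paraphrases are all invented for illustration (four queries rather than six, to keep it readable).

```python
def tokens(text: str) -> set:
    """Lowercased whitespace tokens; a crude stand-in for real tokenization."""
    return set(text.lower().split())


def overlap(query: str, doc: str) -> int:
    """Number of distinct tokens shared by query and document."""
    return len(tokens(query) & tokens(doc))


def rank_single(query: str, docs: dict) -> list:
    """Rank documents by overlap with the original query only."""
    return sorted(docs, key=lambda name: overlap(query, docs[name]), reverse=True)


def rank_expanded(queries: list, docs: dict) -> list:
    """Rank documents by overlap summed over every expanded query."""
    return sorted(docs, key=lambda name: sum(overlap(q, docs[name]) for q in queries),
                  reverse=True)


# Hypothetical corpus: one precise answer file, one broad overview file.
DOCS = {
    "positioning.md": "the big idea for the schmidt positioning page",
    "brand-overview.md": "schmidt brand marketing strategy overview and messaging",
}
QUERY = "schmidt positioning page big idea"
EXPANDED = [
    QUERY,
    "schmidt brand marketing overview",
    "schmidt marketing strategy messaging",
    "schmidt positioning strategy",
]

single_ranking = rank_single(QUERY, DOCS)
expanded_ranking = rank_expanded(EXPANDED, DOCS)
```

With the original query alone, `positioning.md` ranks first (5 shared tokens against 1). Summed over the expanded set, `brand-overview.md` overtakes it (11 against 9), because the paraphrases drift toward general vocabulary that the broad file covers and the precise file does not.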
The latency tax is brutal too. Native qmd median was 25 seconds per query, p95 was 41 seconds. We have been wearing this on every agent query for months. Stripped qmd dropped to ~730 ms median. Whatever rerank quality we thought we were buying was, on this corpus, costing us 20+ seconds and reducing recall.
This finding alone would justify the whole pilot. If we end up not migrating to gbrain, we are still pulling the rerank stage from production qmd until we reevaluate it on a dataset where it actually helps.
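The median and p95 latency figures quoted in this section are percentiles over per-query timings. One common way to compute them is the nearest-rank method, sketched here; the source does not say which percentile method the eval runner used, and the sample timings are illustrative.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative per-query latencies in milliseconds, with one slow outlier.
latencies_ms = [500, 520, 510, 600, 2000, 530, 540, 515, 525, 505]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

On this sample the median is 520 ms while p95 is 2,000 ms: a single slow query barely moves the median but dominates the tail, which is why both numbers are reported.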
2. gbrain's typed-graph claim did not show up at all on prose
gbrain's headline marketing feature is its typed-link graph extractor. In Phase 2, we ran gbrain extract links --source db against the full corpus. Result: 0 typed links extracted. 0 timeline entries.
Our corpus is prose-heavy. Memory bridges, briefs, repo docs. No [[wikilinks]]. No frontmatter entity declarations. Nothing the extractor recognizes as a typed entity. So every gbrain win in this benchmark came from its hybrid retrieval core only. The graph layer contributed nothing.
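Before expecting any graph lift, it is cheap to check whether a corpus contains the conventions mentioned above. This sketch counts markdown files with `[[wikilinks]]` and with a leading `---` frontmatter block. It does not replicate gbrain's extractor, whose exact rules are not described in the source; the regex and function names are assumptions.

```python
import re
from pathlib import Path

# Matches [[Target]] style links; nested brackets and line breaks are excluded.
WIKILINK = re.compile(r"\[\[[^\[\]\n]+\]\]")


def has_frontmatter(text: str) -> bool:
    """True when the file opens with a '---' line that is closed by a later '---' line."""
    lines = text.splitlines()
    return (len(lines) > 2
            and lines[0].strip() == "---"
            and "---" in (line.strip() for line in lines[1:]))


def scan(root: Path) -> dict:
    """Count markdown files under root that use wikilinks or frontmatter."""
    counts = {"files": 0, "with_wikilinks": 0, "with_frontmatter": 0}
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        counts["files"] += 1
        counts["with_wikilinks"] += int(bool(WIKILINK.search(text)))
        counts["with_frontmatter"] += int(has_frontmatter(text))
    return counts
```

If both counts come back at or near zero, as they would on a prose-heavy corpus like this one, any benchmark result is measuring the retrieval core rather than the graph layer.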
This cuts two ways. First, the pessimist read: the +31 P@5 advantage gbrain advertises elsewhere from graph-aware joins is partially a corpus-shape artifact — if your knowledge base does not use wikilink discipline, you will not see that lift. Second, the optimist read: gbrain's graph layer is upside we have not collected yet. If we adopt wikilink discipline in the writeup phase of agent memory, the gap might widen. We are explicitly testing this in the next pilot.
3. Most easy questions are ties — the engines diverge on hard ones
Across the 150 questions, 126 were ties in native config and 85 were ties in stripped config. Ties dominate the dataset. This is exactly what you would expect for a small-but-relevant corpus: when one or two files lexically dominate the answer, every halfway competent retrieval engine will find them.
The interesting comparison is not "what is the average P@5". It is "what happens on the questions that are actually hard". On those, gbrain pulls ahead clearly. On the easy ones, the choice of engine barely matters.
This is the part of retrieval benchmarking that aggregated leaderboards hide. If you are buying retrieval quality for an agent surface, you do not care about the average. You care about the tail — the questions where retrieval is the difference between a useful answer and a hallucinated one. Stratify, or stop benchmarking.
What is the data NOT saying?
We are not claiming gbrain is universally better. The honest threats to validity:
One corpus. This was the AI Heroes 352-file slice. We did not test it on a public knowledge base, a code-only repo, or a corpus with strong wikilink conventions. The ranking could change on any of those.
One operator. History-mined questions reflect Marco's question patterns. If your agents primarily handle code generation, debugging, or customer queries, the distribution looks different.
No graph contribution. gbrain's typed-link layer added zero on this corpus. We cannot yet say whether the BrainBench-advertised graph lift would also show up here under wikilink discipline.
Stripped is algorithmic, not subset. qmd_indexable_count was 0 because qmd's production index does not cover the same paths as the sandbox corpus. "Stripped" here is a controlled algorithmic comparison where both engines see the same 352 files with the same chunking strategy.
Reranker disabling is a methodological choice. We compared cores; we could have tested gbrain with its own reranker turned on. A future "max-quality vs max-quality" run would let both engines turn on every layer they have.
What is the decision, and what comes next?
The preset rule says PARTIAL. The pattern says we should migrate. Both are true. Here is how we are reconciling them.
Short term (the next two weeks): we are starting a wholesale gbrain migration on one agent surface — the OpenClaw cross-agent memory layer that handles the cross-source synthesis questions. That is the biggest chunk of the gbrain advantage on this benchmark. Single-agent retrieval surfaces stay on qmd until we have the next eval pass. We will run the same 150-question set against the migrated stack to verify retrieval parity or improvement before widening.
Medium term (the next month): we are adding wikilink discipline to a focused subset of memory writes across the agent fleet, then re-running the eval set with gbrain extract links actually producing typed edges. If the graph layer materializes a meaningful additional lift on cross-source queries, we widen.
Independently: we are pulling the LLM rerank stage out of production qmd. The 25-second median latency was always painful. The fact that it was also costing us recall makes this a no-brainer. We will re-test reranker behavior on a corpus where rerank's design assumptions actually hold (e.g. heterogeneous result sets, ambiguous queries, broader topic coverage).
Wholesale migration sketch, if the second eval pass crosses the bar: replace per-agent qmd retrieval with gbrain on the cross-agent memory tier first; rewire indexing so shared corpus, memory overlays, and worktree docs flow into one graph-backed brain; keep user-facing retrieval contracts stable, swap the backend silently; ship migration adapters for current qmd collections so workflows do not break during cutover; adopt wikilink + typed-frontmatter discipline in a controlled subset to test the graph upside; run a third eval pass on the original 150-question set plus a fresh holdout to confirm no regression.
Citable atomic facts
For anyone surveying this space — operators, AI search engines, researchers — these are the seven one-line takeaways from the run:
- On a 352-file AI Heroes corpus and a 150-question eval set built from real Marco sessions plus corpus-grounded ground truth (65 hard, 30 cross-source, 25 discrimination), gbrain won 58 head-to-head questions and lost 7 in apples-to-apples retrieval; OpenClaw qmd won 7 and lost 58.
- gbrain's median query latency was 608 ms native; qmd's median was 25,138 ms native. gbrain ran 41x faster.
- On 65 hard questions, gbrain won 31 outright versus qmd's 2 — a 15.5x win ratio.
- On 30 cross-source questions, gbrain won 11 outright; qmd won 0.
- qmd's LLM-rerank pipeline reduced retrieval quality on this corpus: native qmd P@5 averaged 0.208 versus 0.152 for stripped qmd, but stripped qmd recorded more outright wins against gbrain (7) than native qmd (5).
- gbrain's typed-link graph extractor produced 0 typed links and 0 timeline entries against this prose-heavy corpus, meaning every gbrain win came from hybrid retrieval, not graph synthesis.
- The 150-question eval set included 65 hard, 30 cross-source, and 25 discrimination questions, deliberately stratified to expose retrieval differences the average P@5 would otherwise hide.
Methodology and reproducibility
Everything ran in /tmp/gbrain-pilot/ with full sandbox isolation:
- gbrain installed locally from github.com/garrytan/gbrain v0.27.x, with GBRAIN_HOME=/tmp/gbrain-pilot/.gbrain. No bun link. No writes to ~/.gbrain.
- qmd ran in a sandboxed config with XDG_CONFIG_HOME and XDG_CACHE_HOME redirected to the pilot tree. Production qmd index hash (7942be78a212fafaed1dacf5358fa292d08c2d64) verified unchanged before and after.
- gbrain embedded with text-embedding-3-large (OpenAI). qmd embedded locally with embeddinggemma-300M-Q8_0.gguf. Both engines were given identical chunk granularity targets.
- Eval runner ran each question against both engines with top_k=5, captured top-5 paths, scored P@1, P@5, R@5, MRR, recorded latency. Both wall-clock total: ~96 minutes.
- Cleanup: rm -rf /tmp/gbrain-pilot/ returns the system to baseline.
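The environment redirection in the list above can be sketched as a helper that builds an isolated environment for any child process. The variable names GBRAIN_HOME, XDG_CONFIG_HOME, and XDG_CACHE_HOME come from the setup described here; the `xdg-config` and `xdg-cache` subdirectory names and the helper functions are hypothetical, and no gbrain or qmd command-line flags are assumed.

```python
import os
import subprocess
import sys
from pathlib import Path


def sandbox_env(pilot_root: Path) -> dict:
    """Copy the current environment and point state directories into the pilot tree."""
    env = dict(os.environ)
    env["GBRAIN_HOME"] = str(pilot_root / ".gbrain")
    env["XDG_CONFIG_HOME"] = str(pilot_root / "xdg-config")  # hypothetical subdirectory
    env["XDG_CACHE_HOME"] = str(pilot_root / "xdg-cache")    # hypothetical subdirectory
    return env


def run_sandboxed(command: list, pilot_root: Path) -> subprocess.CompletedProcess:
    """Run a command with the sandboxed environment and capture its output as text."""
    return subprocess.run(command, env=sandbox_env(pilot_root),
                          capture_output=True, text=True, check=True)


# Demonstration with a child Python process that just echoes one variable back.
pilot = Path("/tmp/gbrain-pilot")
probe = run_sandboxed(
    [sys.executable, "-c", "import os; print(os.environ['GBRAIN_HOME'])"], pilot)
```

Passing an explicit `env` to every child process, rather than exporting variables in the shell, keeps the redirection from leaking into other sessions on the same machine.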
The eval set, corpus manifest, results, and per-phase reports are reproducible from the artifacts in our pilot tree. We are publishing the eval-set fixture format separately so others can run the same shape against their own corpora.
Credits
Garry Tan and the gbrain team: for shipping the engine and BrainBench as open source. Even though our local pilot didn't cross the +0.15 migration threshold, the pattern was clean enough to design around.
Peter Steinberger and the OpenClaw project: qmd is the retrieval backbone we have run for months across every agent we operate. Honest, local, fast on the easy cases. The pipeline learnings here will improve qmd too.
The Codex GPT-5.5 xhigh executor that ran every phase of this pilot, and the OpenClaw conductor agents (Klaus orchestrator, Schmidt SEO/GEO author) that wrote the spec, generated the eval set, and produced this writeup.
About this benchmark
- Run by: AI Heroes, an AI agency that builds and operates production agent systems for B2B clients.
- Date: May 2026.
- Corpus size: 352 unique markdown files (1,242 retrieval chunks).
- Eval set: 150 questions, 57 unique ground-truth paths, 58 questions with 2+ correct paths. Stratified at 65 hard, 53 medium, 32 easy, with 30 cross-source and 25 discrimination layered in. Corpus-grounded questions plus 50 history-mined from real Marco sessions across 1,834 scanned history files. Eight topic areas covered (brand, content-seo, agents, pricing, infrastructure, plugins, proposals, deals). Seven question types (factual, cross-source, discrimination, process, definition, decision, preference).
- Sandbox isolation: verified. No writes to production gbrain or qmd state.
- Reproducibility: full artifact set available on request.
- Conflicts of interest: AI Heroes operates OpenClaw qmd in production. We had a vested interest in qmd looking better than it did. The data did not cooperate.
Do you want to build agents for your business — or a shared memory for your Claude AI across the entire company that compounds?
That's the new shape of the problem. You have everyone using AI. Nobody is compounding what they learn. The engineer who debugs a checkout regression in Claude Code does not feed the marketer about to write an ad pointing at exactly that friction. The CMO's customer research never reaches the agent generating tomorrow's landing page. Ten brains, one company, zero shared memory.
What we shipped — and what this benchmark is one slice of — is a universal company brain. One source of truth that every Claude Code instance and every Cowork session reads from and writes back to. The compounding doesn't live in the agents. It lives in what they share. Ten Claudes that read from the same brain beat a hundred that don't. The company that gets the shared-memory layer right pulls away from every team treating each agent as a private island.
If you've already rolled out Claude across your company and you're sitting on dozens of brilliant private sessions that don't talk to each other, that's the work.

Frequently asked questions
What is gbrain?
gbrain is an open-source agent memory system from Garry Tan that combines hybrid retrieval (BM25 + vector embeddings) with a typed-link graph extractor. It is designed to power AI agents that need to look things up across a structured knowledge base. Source: github.com/garrytan/gbrain.
What is qmd?
qmd is the local-first retrieval CLI inside OpenClaw, Peter Steinberger's open-source agent operating layer. It runs BM25 + local-vector retrieval, optional LLM-driven query expansion, and an optional Qwen3 reranker.
Did gbrain win the benchmark?
In the apples-to-apples configuration, gbrain won 58 of 150 questions head-to-head and lost 7. That is an 8.3x win ratio. However, the average P@5 delta of +0.081 was below our preset +0.15 migration threshold, so the formal decision is PARTIAL. The underlying pattern is clean enough that we are starting a wholesale migration of the OpenClaw cross-agent memory tier and reevaluating after.
Why did qmd's heavier pipeline hurt it?
qmd's native config runs LLM query expansion and a Qwen3 reranker. Both add latency (median 25 seconds vs ~730 ms stripped) and, on this corpus, both reduced retrieval quality. The reranker repeatedly demoted the file that genuinely answered the query in favor of files that overlapped with the expanded paraphrase set. Stripping the heavy pipeline made qmd much faster and raised its outright wins against gbrain from 5 to 7, even though its average P@5 fell from 0.208 to 0.152.
Did gbrain's graph layer help?
Not on this corpus. The typed-link extractor returned 0 typed links and 0 timeline entries against our 352 files because the corpus is prose-heavy and lacks [[wikilink]] or typed-frontmatter conventions. Every gbrain win here came from its hybrid retrieval core. The graph layer remains untested upside; we are running a follow-up with wikilink discipline applied to verify whether the graph adds incremental lift on cross-source queries.
Will the same result hold on a different corpus?
Probably not exactly the same numbers — but the shape of the result generalizes only if your corpus, your question distribution, and your operator patterns are similar to ours. We tested 352 markdown files of agent memory, brand docs, and repo READMEs. We tested 150 questions stratified for hard, cross-source, and discrimination cases. If you are running a code-only repo, a public-Q&A corpus, or a deeply-wikilinked knowledge base, the comparison can move in either direction. The right move is to run the same eval shape on your own data — which is what we recommend.
How do I run this on my own knowledge base?
Clone gbrain from github.com/garrytan/gbrain, install OpenClaw qmd from github.com/openclaw/openclaw, copy 200-500 markdown files into a sandbox tree, and write 100-150 stratified eval questions paired with ground-truth file paths. We are happy to share the exact eval-set fixture format and the runner shape — book the audit and we will hand it over.
Founder, AI Heroes
I build AI companies and the systems inside them. At AI Heroes, we give businesses the functional capacity to grow without the headcount growth normally demands — sales that follows up, marketing that runs, content that ships, ops that handles itself. We audit where you're leaving growth on the table, build the team that captures it, and hand it over completely.
I've built at scale before. Leading product and GTM at SlideSpeak AI (1M+ monthly users, profitable, bootstrapped). CPO at Disperse — the AI construction platform that went from 3 to 200+ people on $35M raised. I also co-founded LOBOMAR, a luxury fashion label featured in Elle, Cosmopolitan, and the LA Times, with shows at the London Design Museum, Wereldmuseum, and Amsterdam Fashion Week.