Your PM Loop, Running Itself — With Evidence
Anthropic's product-management plugin is a brilliant advisor. It drafts specs, reviews metrics, and brainstorms ideas — and then it waits for you to act. This plugin does the acting: it moves the Linear card, argues against the spec it just wrote, remembers whether last quarter's bet actually worked, and interrupts you only when a decision is genuinely irreversible.
Built directly on Anthropic's official product-management plugin. The 7 inherited PM skills work from day one. Always-on unattended operation — scheduled signal pulls, autonomous writes to your tools — activates when you connect your own Linear, Notion, or Amplitude. This is a working multi-agent PM system with a real enforcement spine you can run today; it is not zero-setup full autonomy out of the box.
Why Most AI PM Tools Ship the Wrong Things Faster
The documented AI PM failures cited on this page follow the same pattern: the model wrote a confident spec, nothing argued against it, nobody checked whether it shipped and got used, and the team shipped the wrong thing more efficiently than before. Pendo's data shows 80% of features are rarely or never used. Kohavi's Microsoft data shows only about one-third of well-designed experiments actually move the target metric. Cemri et al. found naive multi-agent systems failing 41–86.7% of the time. The failure is not the model; it is the absence of a gate, an adversary, and a closed loop.
How AI PM tools break in production
- ✗ The assistant writes a confident RICE score from numbers a stakeholder asserted. Nobody can trace why that item is at the top of the backlog.
- ✗ A spec gets written. Nothing argues it will fail. The build ships. Six months later, adoption is 3% of target and there is no record of what was predicted.
- ✗ Metrics are reviewed when someone remembers to run the command, not when a roadmap assumption just got falsified by last week's data.
- ✗ An autonomous action runs on a live system. Nobody knows on what evidence, or what the dissenting view was.
How this plugin is engineered differently
- ✓ Every priority score, spec verdict, and launch decision must cite a real Evidence Store record, or it is blocked at the hook level, not softened.
- ✓ Four adversarial verifiers are structurally opposed to the agents that produce confident outputs. They cannot be skipped to hit a date.
- ✓ Every shipped bet pre-registers its predicted metric and target. A Victory-Bias Auditor makes 'success' impossible in code without clearing that pre-registered bar.
- ✓ Signal ingestion is scheduled, not on-demand. The Drift Sentinel flags falsified assumptions proactively, before the next quarterly review.
One Product Bet, From Signal to Verdict: No Manual PM Loop Required
Here is what actually happens when a new signal surfaces. Follow it through all five pods. The gates are not suggestions — they block or escalate.
1. Sense: Signal Ingestor + Drift Sentinel (Pod 2)
The Signal Ingestor's scheduled pull ingests new customer tickets, analytics movements, call transcripts, and competitive changes into the Evidence Store — each record carrying a source URI, capture timestamp, and strength rating. The Drift Sentinel diffs the fresh signal against your strategy documents and raises a flag when a roadmap assumption is contradicted, citing both the standing claim and the opposing evidence. No flag can be raised without both sides present.
2. Decide: Evidence Librarian + Prioritization Analyst + Evidence Adversary (Pod 1)
The Evidence Librarian resolves each backlog candidate to its supporting evidence records. The Prioritization Analyst proposes RICE scores where every component cites real evidence IDs — a Confidence score with no linked record is auto-clamped. The Evidence Adversary then runs against the top items: it must surface a disconfirming signal or certify none exists. Any reprioritisation above the reversibility threshold stops for a human review — 20 seconds, full dissent visible.
3. De-Risk: Spec Author + Assumption Mapper + Pre-Mortem Red Team + Discovery Runner (Pod 3)
The Spec Author drafts the PRD. The Assumption Mapper extracts every load-bearing assumption and plots it on a riskiness-versus-evidence map. The Pre-Mortem Red Team assumes the feature already shipped and failed — it must produce the strongest disconfirming case and the cheapest experiment to kill the riskiest, weakest-evidence assumption first. The build epic in your tracker is gated on assumption-status. The Discovery Runner launches the test. The human is the final ship-gate approver — always — and sees the Red Team's kill-case before clicking approve.
4. Learn: Launch Tracker + Adoption Auditor + Outcome Adjudicator + Victory-Bias Auditor (Pod 4)
At ship time, the Launch Tracker pre-registers the predicted metric, target, and decision date. At the decision date, the Adoption Auditor pulls actuals from your analytics. The Outcome Adjudicator renders iterate / hold / double-down / kill with the delta cited. The Victory-Bias Auditor checks against the pre-registered target — not a post-hoc redefinition. A 'success' verdict is mechanically impossible in code if the actual metric did not clear the original target. Failed bets are written back as evidence records, which feed the next Prioritization and Drift cycles.
5. Align: Alignment Broadcaster + Escalation Router + Consensus Checker (Pod 5)
Every autonomous decision emits an audience-tailored update carrying the evidence IDs and the adversary's dissent. The Escalation Router classifies each decision by reversibility and blast radius — the rare strategic call goes to the human as a decision-ready brief with a deadline; everything reversible and confident flows without interruption. The Consensus Checker scans continuously for conflicts between pods or against standing commitments — any open conflict blocks the decision until reconciled.
Six Things the System Does That a PM Assistant Cannot
Seventeen specialist agents across five pods. Four of them — the Evidence Adversary, the Pre-Mortem Red Team, the Victory-Bias Auditor, and the Consensus Checker — exist for one purpose only: to block a confident-wrong call before it becomes a shipped feature no one uses.
Acts on your tools — with gated WRITE/ACT
Agents operate over your connector MCPs: Linear or Jira for the roadmap, Notion for knowledge, Amplitude or Pendo for metrics, Slack for broadcasts. Two hooks — evidence-gate.sh and human-escalation.sh — decide whether an action can proceed automatically or must stop for human approval. Example: 'Reprioritise my backlog from this week's signal' actually moves the Linear cards, opens the investigation ticket, and posts the Slack update — when the action is reversible and well-evidenced.
Evidence substrate — no claim without a citation
lib/evidence.py records every source-backed evidence entry. hooks/evidence-gate.sh blocks any priority score, spec verdict, or launch decision that does not cite a real evidence record. There is no soft path — missing citations cause a hard block, not a warning. Example: HiPPO dies here. Nothing gets prioritised 'because the CEO asked'; it cites a metric, ticket cluster, or transcript, or it is rejected at the gate.
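As a rough illustration of the hard-block rule, the Python sketch below assumes a hypothetical in-memory store and a hypothetical `gate` function. It is not the contents of lib/evidence.py or hooks/evidence-gate.sh, whose real interfaces are not shown on this page; the record fields simply mirror the source URI, capture timestamp, and strength rating described earlier.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvidenceRecord:
    # Hypothetical record shape: the page only says records carry these three facts.
    record_id: str
    source_uri: str
    captured_at: str  # ISO-8601 timestamp
    strength: str     # hypothetical scale, e.g. "weak", "medium", "strong"


@dataclass
class Claim:
    kind: str     # "priority_score", "spec_verdict", or "launch_decision"
    summary: str
    evidence_ids: list[str] = field(default_factory=list)


class EvidenceGateError(Exception):
    """Raised when a claim cites no evidence or cites records that do not exist."""


def gate(claim: Claim, store: dict[str, EvidenceRecord]) -> list[EvidenceRecord]:
    """Return the cited records, or raise. There is deliberately no warning-only path."""
    if not claim.evidence_ids:
        raise EvidenceGateError(f"{claim.kind} blocked: no evidence cited")
    missing = [eid for eid in claim.evidence_ids if eid not in store]
    if missing:
        raise EvidenceGateError(f"{claim.kind} blocked: unknown evidence {missing}")
    return [store[eid] for eid in claim.evidence_ids]
```

In this sketch a claim such as `Claim("priority_score", "CEO asked")` raises `EvidenceGateError`, while the same claim citing an ID present in the store passes through with its records attached.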
4 adversarial verifiers that argue against the confident call
Four verifiers — Evidence Adversary, Pre-Mortem Red Team, Victory-Bias Auditor, and Consensus Checker — are placed where confident mistakes are most expensive. lib/launch.py makes a 'validated' outcome mechanically impossible unless the actual metric clears the pre-registered target. Example: 'Spec out SSO' returns the strongest case it will flop plus the cheapest test to kill the risk first, before engineering commits a single sprint.
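One way to picture the victory-bias rule is the minimal Python sketch below. The names `PreRegistration` and `adjudicate` are hypothetical, the sketch assumes a higher-is-better metric, and it is not the actual lib/launch.py. The frozen dataclass stands in for the idea that the target cannot be edited after ship time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreRegistration:
    # Frozen: assigning a new target after creation raises an error.
    metric: str
    target: float
    decision_date: str  # ISO date fixed at ship time


def adjudicate(prereg: PreRegistration, actual: float, proposed: str) -> str:
    """Downgrade a proposed 'validated' verdict that misses the pre-registered target.

    Assumes a higher-is-better metric; any other proposed verdict passes unchanged.
    """
    cleared = actual >= prereg.target
    if proposed == "validated" and not cleared:
        return "not_validated"
    return proposed
```

With a target of 1000 weekly active users and an actual of 300, a proposed "validated" verdict comes back as "not_validated" in this sketch, however the proposing agent frames the result.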
Always-on sensing + two-sided drift detection
A Signal Ingestor runs on a schedule, not on demand. A Drift Sentinel uses lib/drift.py's two-sided rule: it cannot call a roadmap assumption contradicted without both the documented claim and the opposing signal — preventing hallucinated drift. Example: 'Which roadmap assumptions did the last two weeks of data just falsify?' is pushed proactively instead of waiting for a quarterly review.
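The two-sided rule can be sketched in a few lines of Python. The function and field names here are hypothetical and this is not the real lib/drift.py; the point is only that a flag object cannot be constructed through this path unless both the standing claim and the opposing evidence are supplied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DriftFlag:
    assumption: str
    claim_ref: str             # where the standing claim is documented
    opposing_evidence_id: str  # the evidence record that contradicts it


def raise_drift_flag(
    assumption: str,
    claim_ref: Optional[str],
    opposing_evidence_id: Optional[str],
) -> Optional[DriftFlag]:
    """Return a flag only when both sides are present; otherwise return None."""
    if not claim_ref or not opposing_evidence_id:
        return None
    return DriftFlag(assumption, claim_ref, opposing_evidence_id)
```

Under these assumptions, an agent that "feels" an assumption has drifted but cannot point to either the documented claim or a contradicting record gets `None` back and nothing is posted.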
Closed loop with memory — bets, actuals, verdicts
A Launch Tracker pre-registers the success metric and decision date at ship time. An Adoption Auditor pulls actuals at the decision date. An Outcome Adjudicator writes the next iterate ticket, hold signal, or sunset recommendation — with the actual-vs-target delta cited. Example: 'Did what we shipped last quarter get adopted, and what should we cut?' becomes a routine operating question with an evidence trail — not a quarterly retrospective you have to schedule.
Calibrated human-escalation boundary
lib/escalation.py classifies every action by reversibility, blast radius, confidence, and agent disagreement. hooks/human-escalation.sh blocks any auto-action that hits the escalation threshold. Every autonomous decision is logged with its evidence IDs and dissent for full audit. Example: You watch a dashboard and get pinged only for the rare call — 'kill this live feature?' or 'override this reprioritisation?' — not 'which competitor should I analyse?'
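A minimal sketch of a classifier over those four criteria might look like the following. The `Action` fields, the 1-to-5 blast-radius scale, and both thresholds are assumptions made for illustration and are not taken from lib/escalation.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool
    blast_radius: int  # hypothetical scale: 1 (one ticket) to 5 (all customers)
    confidence: float  # 0.0 to 1.0
    contested: bool    # True when another agent disagrees


# Hypothetical thresholds, chosen only for illustration.
MAX_AUTO_BLAST_RADIUS = 2
MIN_AUTO_CONFIDENCE = 0.8


def requires_human(action: Action) -> bool:
    """True when any one of the four escalation criteria trips."""
    return (
        not action.reversible
        or action.blast_radius > MAX_AUTO_BLAST_RADIUS
        or action.confidence < MIN_AUTO_CONFIDENCE
        or action.contested
    )
```

Note the `or` chain: in this sketch a very high confidence score cannot rescue an irreversible action, which matches the page's claim that irreversible actions are stopped regardless of confidence.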
Built on Anthropic's Official Product-Management Plugin
This plugin starts where Anthropic's official product-management plugin leaves off — it does not replace it. Anthropic ships seven well-crafted PM skills. They are the foundation. Every skill, command, and PM framework is kept intact. What this plugin adds is the autonomous operating layer the original deliberately leaves to the builder: the right to act on your systems, a team of cross-checking agents, an evidence substrate, and four adversarial verifiers that argue back. Here is exactly what Anthropic's plugin does in each area, and what the autonomous layer turns it into.
Spec Writing — /write-spec
Out of the box
You run /write-spec with a problem or idea. The skill drafts a clear PRD: requirements, scope, success metrics, and open questions. It ends by asking: 'Want me to add an edge-case section? Want me to tailor this for engineering?'
With the autonomous agents
The Spec Author still drafts the PRD. Then an Assumption Mapper extracts every load-bearing assumption and tags it by evidence strength. A Pre-Mortem Red Team then argues the feature already shipped and failed — and names the cheapest test to kill the riskiest, weakest-evidence assumption first. The build ticket in your tracker is gated until that test runs or strong evidence already exists on record.
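The ordering and gating described above can be sketched in Python as follows. The 1-to-5 risk scale, the 0-to-5 evidence scale, and the "strong" cutoff of 4 are hypothetical, as are the function names; the plugin's actual scoring is not documented on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    text: str
    risk: int               # hypothetical: 1 (minor) to 5 (feature fails if wrong)
    evidence_strength: int  # hypothetical: 0 (none) to 5 (strong)


def rank_for_testing(assumptions: list[Assumption]) -> list[Assumption]:
    """Order by highest risk first, breaking ties by weakest evidence."""
    return sorted(assumptions, key=lambda a: (-a.risk, a.evidence_strength))


def build_gate_open(a: Assumption, tested: bool, strong: int = 4) -> bool:
    """The build ticket opens once the test has run or strong evidence is on record."""
    return tested or a.evidence_strength >= strong
```

The first element of `rank_for_testing(...)` is the assumption the cheapest experiment should target; `build_gate_open` stays false for it until that experiment has run.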
Roadmap Update — /roadmap-update
Out of the box
You run /roadmap-update. The skill helps create, refresh, or reprioritise a roadmap using familiar planning formats. You supply the context; it formats the output and asks if you want to adjust timelines or add an exec summary.
With the autonomous agents
A Prioritization Analyst derives RICE scores from your actual analytics, ticket clusters, and research transcripts — every component must cite an Evidence Store record or it is auto-clamped to the floor. An Evidence Adversary then runs against the top items: it must either surface a disconfirming signal or certify none exists. When the reprioritisation moves a committed item, the human gets 20 seconds of review with the full dissent in front of them.
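For readers unfamiliar with RICE, the standard formula is Reach × Impact × Confidence ÷ Effort. The Python sketch below shows one way the clamp could work for the Confidence component; the floor value of 0.1 and the function signature are assumptions for illustration, not the plugin's actual numbers.

```python
CONFIDENCE_FLOOR = 0.1  # hypothetical floor applied when nothing is cited


def rice_score(
    reach: float,
    impact: float,
    confidence: float,
    effort: float,
    confidence_evidence_ids: list[str],
) -> float:
    """Standard RICE, with Confidence clamped to the floor when it cites no evidence."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    if not confidence_evidence_ids:
        # An asserted confidence with no linked record cannot exceed the floor.
        confidence = min(confidence, CONFIDENCE_FLOOR)
    return reach * impact * confidence / effort
```

With reach 1000, impact 2, asserted confidence 0.8, and effort 4, the same item scores about 400 when the confidence cites a record and about 50 when it does not, so an uncited item drops down the backlog instead of being silently trusted.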
Metrics Review — /metrics-review
Out of the box
You run /metrics-review. The skill reads your connected analytics, surfaces trends, flags movements, and ends with suggested follow-up actions — a one-shot scorecard you read and then decide what to do about.
With the autonomous agents
A Signal Ingestor pulls metrics on a schedule — not on demand. A Drift Sentinel diffs the live signal against your strategy documents and flags when a roadmap assumption is contradicted. An Adoption Auditor then compares every shipped feature against its pre-registered success metric at the decision date. A Victory-Bias Auditor blocks any 'success' verdict that did not clear the original target — goalpost-moving is impossible in code.
Research Synthesis — /synthesize-research
Out of the box
You run /synthesize-research with interviews, surveys, tickets, and notes. The skill turns the raw material into themes, personas, and opportunities — a structured synthesis document ready for a planning session.
With the autonomous agents
A Signal Ingestor continuously ingests customer calls, support tickets, and competitor signals into a shared Evidence Store — not when you remember to ask. A Synthesis Editor writes the weekly digest and drift log. The Drift Sentinel compares new evidence against standing persona and assumption documents, and raises a 'contradicted' flag (citing both the claim and the opposing signal) when reality has moved.
Competitive Brief — /competitive-brief
Out of the box
You run /competitive-brief. The skill produces a competitor brief with feature comparisons, positioning gaps, and strategic implications — a polished document triggered when a competitor ships something or a sales call surfaces a gap.
With the autonomous agents
Competitive signal becomes a standing monitor lane inside Signal Ingestor, writing cited competitive evidence records on a schedule. When a competitor ships a feature that directly contradicts a roadmap assumption, the Drift Sentinel flags it proactively — not on the next time someone runs the command.
Stakeholder Update — /stakeholder-update
Out of the box
You run /stakeholder-update after a decision. The skill drafts audience-tailored updates for executives, engineering, customers, or cross-functional partners — a document you review and send when you remember to.
With the autonomous agents
An Alignment Broadcaster fires automatically on every autonomous decision — a reprioritisation, a ship-gate block, a launch verdict — and attaches the evidence IDs and the adversary's dissent to each. An Escalation Router classifies actions by reversibility and blast radius and routes only the rare strategic call to the human, packaged as a decision-ready brief. A Consensus Checker blocks any decision while two pods are in conflict.
Brainstorm — /brainstorm
Out of the box
You run /brainstorm to stress-test product ideas, explore problem spaces, and generate options — a thinking-partner session that diverges and then hands the options back to you.
With the autonomous agents
Divergent brainstorming is retained and now paired with adversarial convergence: the Pre-Mortem Red Team's generative half produces the strongest counter-case to the chosen option and the cheapest test to kill the riskiest assumption. The output is a spec with a test plan attached, not a list of ideas.
Is This the Right Tool?
Best Fit
- PMs, founders, and product operators already in Claude Code or Cowork, living in Linear, Notion, Amplitude, and Intercom — who want to supervise a PM loop rather than manually run one
- Teams where HiPPO or opinion-driven prioritisation is a recurring problem and evidence-grounded backlog management is the goal
- Product organisations that ship features that go unused and want a structural closed loop — pre-registered predictions, actuals pulled at decision date, verdicts written to the tracker
- Teams who already use Anthropic's product-management plugin and want to add the autonomous operating layer on top without replacing what works
Not the Right Fit
- Teams with no connected analytics, project tracker, or knowledge base: the evidence gate and drift detection need sources of truth; without them, agents mostly propose and the autonomous value is limited
- Anyone who wants a fully automated PM function with zero human review: this system is designed for a human-watches-metrics model; irreversible decisions always require a human
- Teams looking for a standalone hosted SaaS PM tool: this runs inside Claude Code or Cowork, not as an independent product
From Install to First Autonomous PM Cycle in Four Steps
1. Install in Claude Code or Cowork with two commands (under two minutes)
2. Connect your existing stack: Linear or Jira, Notion, Amplitude or Pendo, Intercom, and Slack
3. Run /pm-cycle against your current backlog and watch the evidence gate, the adversaries, and the escalation logic run on real data in propose-only mode
4. Switch on write/act lanes one at a time as the adversary's catch rate earns your trust. The human handles the rare strategic call; the agents handle everything else
Start with One Pod. Watch the Evidence Gate Work.
Two commands to install. Every lane starts in propose-only mode — no autonomous write until you decide it is ready. Free, open source, and built on Anthropic's official product-management plugin foundation.
Step-by-Step: Install the Plugin
Add the plugin source
In Claude Code or Cowork, run: claude plugin marketplace add mlobo2012/autonomous-pm-plugin — this registers the repository as a trusted plugin source.
Install the plugin
Run: claude plugin install autonomous-pm — Claude downloads 17 specialist agent definitions, 6 pod skills, 6 driver commands, the deterministic enforcement spine (4 Python modules), and 2 enforcement hooks.
Connect your systems
Open .mcp.json and authenticate your project tracker (Linear, Jira, Asana), knowledge base (Notion), product analytics (Amplitude, Pendo), customer feedback (Intercom), meeting transcription (Fireflies), and team chat (Slack). Each category has a connector slot — bring your own if your tool is not listed.
Start in propose-only mode
Run /pm-cycle. Every pod runs in propose-only by default — agents draft outputs, open investigation tickets as comments, and flag escalations, but no autonomous writes happen yet. Watch the nine dashboard signals and confirm the evidence gate and adversaries are calibrated to your backlog.
Switch on write/act lanes as trust builds
Once adversary catch rate and escalation precision are in the healthy range, enable reversible write/act operations per pod. The enforcement hooks remain active at every stage. Each lane has a kill-switch that drops it back to propose-only without taking the system down.
Frequently Asked Questions
Anthropic's plugin is an excellent starting point — seven well-crafted PM skills for spec writing, roadmap planning, stakeholder updates, research synthesis, competitive briefs, metrics review, and brainstorming, all of which this plugin keeps as its foundation. The original is a PM advisor: every skill ends with a question, it touches your systems read-only, and a human still does every action. This plugin adds the autonomous operating layer on top: 17 named agents across 5 pods, a shared Evidence Store where every claim must cite a real record or be blocked, four adversarial verifiers with the power to halt or escalate, and a closed loop that pre-registers launch predictions and checks whether they came true. More detail in the 'Built On Anthropic's Plugin' section above.
Only within the gates, and the gates are deterministic code, not suggestions. hooks/evidence-gate.sh blocks any priority score, spec verdict, or launch decision not citing a real evidence record. hooks/human-escalation.sh blocks any agent action classified as irreversible, high blast-radius, low-confidence, or contested by another agent. Irreversible and customer-facing actions (pricing, billing, data deletion, killing a live feature) are hard-stopped in code regardless of confidence score. Every autonomous action is logged with its evidence IDs and the adversary's dissent, so 'why did the system do that?' is always answerable. The research anchor: Cemri et al. (arXiv:2503.13657) found naive multi-agent systems failing 41–86.7% of the time; the enforcement spine is the countermeasure.
The connector map is inherited from Anthropic's plugin and covers the tools PMs already live in: Linear, Jira, Asana, monday.com, or ClickUp for the roadmap; Notion for knowledge; Amplitude or Pendo for product analytics; Intercom for customer feedback; Fireflies for meeting transcription; Slack for broadcasts; Gmail for email; Google Calendar for scheduling; and Similarweb for competitive intelligence. Without connector authentication, the enforcement spine and a full demo cycle still run in propose-only mode — agents draft outputs for human review but do not write to your systems. Always-on writes activate when you connect your own tools and opt into scheduled runs.
Automated (no human needed):
- scheduled signal pulls from every connected source
- evidence record creation
- backlog scoring with citations
- reprioritising Next/Later items
- opening investigation and assumption-test tickets
- posting Signal Digest and drift flags
- drafting specs
- running pre-mortem analysis
- pre-registering launch predictions
- pulling actuals at decision dates
- writing iterate/sunset tickets
- posting decision broadcasts to Slack
Human required:
- any reprioritisation moving a committed Now item or overriding a stakeholder commitment
- the final ship-gate approval (always, no exception)
- killing or sunsetting a live customer-facing feature
- any action touching pricing, billing, data deletion, or security
- any situation where agents disagree and the Consensus Checker cannot reconcile
- any low-confidence or novel situation the Escalation Router flags
The design target is two human touch-points per product bet: a 20-second priority confirmation and a 2-minute scoped build approval.
Partly. The enforcement spine — the evidence gate, the four adversarial verifiers, the escalation classifier, the victory-bias check, the drift classifier — runs from day one without any connectors. You can run /pm-cycle against a fixture dataset and watch the full sense-decide-de-risk-learn-align loop execute, with gates blocking and escalating correctly. What is not out of the box: always-on unattended operation. Scheduled signal pulls from your real Amplitude or Linear, autonomous writes to your actual Notion, and continuous drift monitoring against your live strategy documents activate when you authenticate your own connectors and opt into scheduled runs. The honest framing: this is a working multi-agent PM system with a real enforcement spine you can run today; unattended autonomy is the mode you switch on once you have connected your own tools.
Two commands in Claude Code or Cowork: 'claude plugin marketplace add mlobo2012/autonomous-pm-plugin' followed by 'claude plugin install autonomous-pm'. This downloads 17 specialist agent definitions across 5 pods, 6 pod skills, 6 driver commands (/pm-prioritize, /pm-drift-scan, /pm-ship-gate, /pm-launch-loop, /pm-standup, /pm-cycle), the deterministic enforcement spine (lib/evidence.py, lib/escalation.py, lib/launch.py, lib/drift.py), and 2 enforcement hooks. For local development, point Claude Code at the plugin directory with 'claude --plugin-dir /path/to/autonomous-pm-plugin'. To activate live operation, connect your tools through their MCP servers — each category has a connector slot in .mcp.json.
Six documented failure modes anchor each design decision.
- (1) HiPPO prioritisation: named by Kaushik and described by Kohavi's team as 'one of the most significant impediments to greater evidence-based and data-driven decision-making'. Closed by the evidence-gate and the Evidence Adversary.
- (2) Most bets fail: Kohavi's Microsoft data shows only about one-third of well-designed experiments improve the target metric. Closed by the Pre-Mortem Ship-Gate and pre-registered launch targets.
- (3) Shipped does not equal adopted: Pendo's 2019 data found 80% of features rarely or never used; Standish CHAOS put it at 64%. Closed by the Closed-Loop pod and the Victory-Bias Auditor.
- (4) Stale signal and assumption drift: Torres' Continuous Discovery Habits prescribes weekly contact and continuous assumption testing. Closed by the always-on Signal Ingestor and Drift Sentinel.
- (5) Confident-wrong calls: a single-agent system that writes a PRD has no mechanism to argue against it. Closed by adversarial verifiers throughout.
- (6) Naive multi-agent failure: Cemri et al. (arXiv:2503.13657) found 41–86.7% failure rates across 7 frameworks, classifying failures into specification issues (41.8%), inter-agent misalignment (36.9%), and verification failures (21.3%). Directly addressed by typed hand-off contracts, the shared Evidence Store, and the four named adversarial verifiers.
Yes. The plugin is free and open source at github.com/mlobo2012/autonomous-pm-plugin. You need an active Claude Code or Cowork subscription, and the connectors to your existing tools at their own pricing — Linear, Notion, Amplitude, and so on. There is no additional charge for the plugin itself. AI Heroes offers paid implementation: connector wiring, enforcement spine calibration, dashboard design, and a crawl-walk-run rollout for teams who want a managed setup rather than self-serve. Book a call to discuss.
Sources & Research
Every design decision in this plugin is grounded in a documented, verifiable study, dataset, or authoritative framework — cited inline throughout this page. Last verified 2026-05-24.
Pendo — 2019 Feature Adoption Report (80% of features rarely or never used)
Pendo's anonymised dataset across hundreds of B2B and B2C products found that 80% of features are rarely or never used, and that roughly 12% of features drive 80% of daily usage. The build-trap pattern this documents is the primary motivation for the Closed-Loop pod's pre-registered launch tracking and the Victory-Bias Auditor. Sources: https://www.pendo.io/resources/the-2019-feature-adoption-report/ and coverage at https://wraltechwire.com/2020/01/28/pendo-study-with-80-of-features-not-used-software-execs-re-evaluating-success-metrics/
Standish Group CHAOS: 64% of features rarely or never used
The Standish CHAOS data, aggregated by Ant Murphy, found 45% of features are never used and 19% are rarely used — 64% total. This corroborates the Pendo figure from a separate methodology and is the second anchor for measuring the feature-factory failure mode the Closed-Loop pod is designed to close. Source: https://www.antmurphy.me/newsletter/why-you-should-remove-features
Kohavi et al.: Online Experimentation at Microsoft (~1/3 of experiments improve the metric)
Ronny Kohavi's Microsoft experimentation data, published in 'Online Experimentation at Microsoft,' found that of well-designed, well-executed experiments built to move a key metric, only about one-third actually improved it — roughly one-third did nothing, and one-third made things worse. This is the primary evidence that confident PRDs are more likely wrong than right, and the direct justification for the Pre-Mortem Ship-Gate and the evidence-gate requiring citations on every priority claim. Sources: https://exp-platform.com/Documents/ExP_DMCaseStudies.pdf and O'Reilly summary at https://www.oreilly.com/radar/the-sobering-truth-about-the-impact-of-your-business-ideas/
HiPPO: Kaushik / Kohavi, ExP Platform
HiPPO — the Highest Paid Person's Opinion — was named by Avinash Kaushik and popularised by Kohavi's experimentation team at Microsoft as 'one of the most significant impediments to greater evidence-based and data-driven decision-making.' The Evidence Adversary and the no-claim-without-citation gate are the direct mechanical countermeasures to this documented failure mode. Sources: https://exp-platform.com/hippo/ and https://www.kaushik.net/avinash/seven-steps-to-creating-a-data-driven-decision-making-culture/
Cemri, Pan, Yang et al.: Why Do Multi-Agent LLM Systems Fail? MAST taxonomy, arXiv:2503.13657 (41–86.7% failure)
Cemri et al. annotated 1,600+ traces across 7 multi-agent frameworks and found state-of-the-art systems failing 41%–86.7% of the time. They classified failures into the MAST taxonomy: specification and design issues (41.8%), inter-agent misalignment (36.9%), and verification failures (21.3%). The three MAST buckets map directly onto this plugin's countermeasures: typed hand-off payloads with evidence IDs (anti-misalignment), a gate sequence with explicit definitions of done (anti-spec failure), and four named adversarial verifiers (anti-verification failure). Source: https://arxiv.org/abs/2503.13657
Teresa Torres: Continuous Discovery Habits / Opportunity Solution Trees
Torres' continuous discovery framework prescribes weekly customer contact and systematic testing of the riskiest assumptions with the weakest evidence, because continuous discovery 'prevents outdated assumptions from lingering and ensures the roadmap is shaped by evidence, not guesswork.' The always-on Signal Ingestor and the Drift Sentinel's freshness alerts are the direct implementation of this prescription. Source: https://www.producttalk.org/opportunity-solution-trees/
Melissa Perri: Escaping the Build Trap / feature factory
Perri's build-trap pattern names the root cause of shipped-but-never-used features: teams measured on output (features shipped, dates hit) instead of outcomes (did it create value). The Closed-Loop pod's pre-registered metrics, Adoption Auditor, and Victory-Bias Auditor are the structural cure — making outcome measurement automatic and making a 'success' verdict impossible without clearing the pre-registered target. Source: https://www.befreed.ai/book/escaping-the-build-trap-by-melissa-perri
Want This Running for Your Team — Without the Setup Overhead?
AI Heroes configures the full stack: connector wiring, enforcement spine calibration, crawl-walk-run rollout, and operator dashboard design. Your team watches metrics from day one instead of spending weeks on integration.
