Free & Open Source

Your PM Loop, Running Itself — With Evidence

Anthropic's product-management plugin is a brilliant advisor. It drafts specs, reviews metrics, and brainstorms ideas — and then it waits for you to act. This plugin does the acting: it moves the Linear card, argues against the spec it just wrote, remembers whether last quarter's bet actually worked, and interrupts you only when a decision is genuinely irreversible.

Built directly on Anthropic's official product-management plugin. The 7 inherited PM skills work from day one. Always-on unattended operation — scheduled signal pulls, autonomous writes to your tools — activates when you connect your own Linear, Notion, or Amplitude. This is a working multi-agent PM system with a real enforcement spine you can run today; it is not zero-setup full autonomy out of the box.

Why Most AI PM Tools Ship the Wrong Things Faster

Documented AI PM failures tend to follow the same pattern: the model wrote a confident spec, nothing argued against it, nobody checked whether it shipped and got used, and the team shipped the wrong thing more efficiently than before. Pendo's data shows 80% of features are rarely or never used. Kohavi's Microsoft data shows only about one-third of well-designed experiments actually improve the target metric. Cemri et al. found multi-agent systems failing 41–86.7% of the time. The failure is rarely the model; it is the absence of a gate, an adversary, and a closed loop.

How AI PM tools break in production

  • The assistant writes a confident RICE score from numbers a stakeholder asserted. Nobody can trace why that item is at the top of the backlog.
  • A spec gets written. Nothing argues it will fail. The build ships. Six months later, adoption is 3% of target and there is no record of what was predicted.
  • Metrics are reviewed when someone remembers to run the command — not when a roadmap assumption just got falsified by last week's data.
  • An autonomous action runs on a live system. Nobody knows on what evidence, or what the dissenting view was.

How this plugin is engineered differently

  • Every priority score, spec verdict, and launch decision must cite a real Evidence Store record — or it is blocked at the hook level, not softened.
  • Four adversarial verifiers are structurally opposed to the agents that produce confident outputs. They cannot be skipped to hit a date.
  • Every shipped bet pre-registers its predicted metric and target. A Victory-Bias Auditor makes 'success' impossible in code without clearing that pre-registered bar.
  • Signal ingestion is scheduled, not on-demand. The Drift Sentinel flags falsified assumptions proactively — before the next quarterly review.

One Product Bet, From Signal to Verdict — No PM Loop Required

Here is what actually happens when a new signal surfaces. Follow it through all five pods. The gates are not suggestions — they block or escalate.

  1. Sense: Signal Ingestor + Drift Sentinel (Pod 2)

    The Signal Ingestor's scheduled pull ingests new customer tickets, analytics movements, call transcripts, and competitive changes into the Evidence Store — each record carrying a source URI, capture timestamp, and strength rating. The Drift Sentinel diffs the fresh signal against your strategy documents and raises a flag when a roadmap assumption is contradicted, citing both the standing claim and the opposing evidence. No flag can be raised without both sides present.

  2. Decide: Evidence Librarian + Prioritization Analyst + Evidence Adversary (Pod 1)

    The Evidence Librarian resolves each backlog candidate to its supporting evidence records. The Prioritization Analyst proposes RICE scores where every component cites real evidence IDs — a Confidence score with no linked record is auto-clamped. The Evidence Adversary then runs against the top items: it must surface a disconfirming signal or certify none exists. Any reprioritisation above the reversibility threshold stops for a human review — 20 seconds, full dissent visible.

  3. De-Risk: Spec Author + Assumption Mapper + Pre-Mortem Red Team + Discovery Runner (Pod 3)

    The Spec Author drafts the PRD. The Assumption Mapper extracts every load-bearing assumption and plots it on a riskiness-versus-evidence map. The Pre-Mortem Red Team assumes the feature already shipped and failed — it must produce the strongest disconfirming case and the cheapest experiment to kill the riskiest, weakest-evidence assumption first. The build epic in your tracker is gated on assumption-status. The Discovery Runner launches the test. The human is the final ship-gate approver — always — and sees the Red Team's kill-case before clicking approve.

  4. Learn: Launch Tracker + Adoption Auditor + Outcome Adjudicator + Victory-Bias Auditor (Pod 4)

    At ship time, the Launch Tracker pre-registers the predicted metric, target, and decision date. At the decision date, the Adoption Auditor pulls actuals from your analytics. The Outcome Adjudicator renders iterate / hold / double-down / kill with the delta cited. The Victory-Bias Auditor checks against the pre-registered target — not a post-hoc redefinition. A 'success' verdict is mechanically impossible in code if the actual metric did not clear the original target. Failed bets are written back as evidence records, which feed the next Prioritization and Drift cycles.

  5. Align: Alignment Broadcaster + Escalation Router + Consensus Checker (Pod 5)

    Every autonomous decision emits an audience-tailored update carrying the evidence IDs and the adversary's dissent. The Escalation Router classifies each decision by reversibility and blast radius — the rare strategic call goes to the human as a decision-ready brief with a deadline; everything reversible and confident flows without interruption. The Consensus Checker scans continuously for conflicts between pods or against standing commitments — any open conflict blocks the decision until reconciled.

Six Things the System Does That a PM Assistant Cannot

Seventeen specialist agents across five pods. Four of them — the Evidence Adversary, the Pre-Mortem Red Team, the Victory-Bias Auditor, and the Consensus Checker — exist for one purpose only: to block a confident-wrong call before it becomes a shipped feature no one uses.

📥

Acts on your tools — with gated WRITE/ACT

Agents operate over your connector MCPs: Linear or Jira for the roadmap, Notion for knowledge, Amplitude or Pendo for metrics, Slack for broadcasts. Two hooks — evidence-gate.sh and human-escalation.sh — decide whether an action can proceed automatically or must stop for human approval. Example: 'Reprioritise my backlog from this week's signal' actually moves the Linear cards, opens the investigation ticket, and posts the Slack update — when the action is reversible and well-evidenced.

🔎

Evidence substrate — no claim without a citation

lib/evidence.py records every source-backed evidence entry. hooks/evidence-gate.sh blocks any priority score, spec verdict, or launch decision that does not cite a real evidence record. There is no soft path — missing citations cause a hard block, not a warning. Example: HiPPO dies here. Nothing gets prioritised 'because the CEO asked'; it cites a metric, ticket cluster, or transcript, or it is rejected at the gate.
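The hard-block behaviour can be pictured with a minimal Python sketch. The contents of lib/evidence.py and hooks/evidence-gate.sh are not shown on this page, so `EvidenceRecord`, `EvidenceGateError`, and `check_evidence_gate` are hypothetical names, and the record fields simply mirror the source URI, capture timestamp, and strength rating mentioned earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvidenceRecord:
    """One source-backed entry. Field names are illustrative assumptions."""
    evidence_id: str
    source_uri: str
    captured_at: datetime
    strength: str  # e.g. "weak", "moderate", "strong"


class EvidenceGateError(Exception):
    """Raised when a decision does not cite a real evidence record."""


def check_evidence_gate(cited_ids, store):
    """Return the cited records, or raise if a citation is missing or unknown.

    `store` maps evidence_id -> EvidenceRecord.
    """
    if not cited_ids:
        raise EvidenceGateError("decision cites no evidence record")
    unknown = [eid for eid in cited_ids if eid not in store]
    if unknown:
        raise EvidenceGateError(f"unknown evidence IDs: {unknown}")
    return [store[eid] for eid in cited_ids]


# Hypothetical store with a single record.
store = {
    "ev-001": EvidenceRecord(
        evidence_id="ev-001",
        source_uri="https://example.com/tickets/cluster-42",
        captured_at=datetime(2026, 1, 5, tzinfo=timezone.utc),
        strength="strong",
    )
}
```

In this sketch a decision that cites nothing, or cites an ID that is not in the store, raises an exception instead of logging a warning, which is one way to get the "no soft path" property.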

🛡️

4 adversarial verifiers that argue against the confident call

Four verifiers — Evidence Adversary, Pre-Mortem Red Team, Victory-Bias Auditor, and Consensus Checker — are placed where confident mistakes are most expensive. lib/launch.py makes a 'validated' outcome mechanically impossible unless the actual metric clears the pre-registered target. Example: 'Spec out SSO' returns the strongest case it will flop plus the cheapest test to kill the risk first, before engineering commits a single sprint.
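One way to make a 'validated' outcome mechanically impossible is to freeze the prediction at ship time and compare against it later. This is a hedged sketch, not the actual lib/launch.py: `PreRegisteredBet` and `adjudicate` are hypothetical names, and it assumes a metric where higher is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreRegisteredBet:
    """Prediction recorded at ship time; frozen=True rejects later edits."""
    metric: str
    target: float
    decision_date: str  # ISO date string


def adjudicate(bet: PreRegisteredBet, actual: float) -> str:
    """Return 'validated' only when the actual clears the original target.

    Assumes higher is better; a lower-is-better metric would flip the test.
    """
    return "validated" if actual >= bet.target else "not_validated"


# Hypothetical bet for an SSO launch.
bet = PreRegisteredBet(
    metric="weekly_active_sso_users",
    target=500.0,
    decision_date="2026-03-31",
)
```

Because the dataclass is frozen, assigning a new `target` after the fact raises `dataclasses.FrozenInstanceError`, so the bar cannot be moved quietly once the bet is registered.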

🔁

Always-on sensing + two-sided drift detection

A Signal Ingestor runs on a schedule, not on demand. A Drift Sentinel uses lib/drift.py's two-sided rule: it cannot call a roadmap assumption contradicted without both the documented claim and the opposing signal — preventing hallucinated drift. Example: 'Which roadmap assumptions did the last two weeks of data just falsify?' is pushed proactively instead of waiting for a quarterly review.
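The two-sided rule is simple enough to sketch in a few lines. The real lib/drift.py is not reproduced on this page, so `DriftFlag` and `raise_drift_flag` below are hypothetical names used only to illustrate the idea that a flag needs both a documented claim and an opposing signal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DriftFlag:
    assumption_id: str
    claim_ref: str     # where the standing claim is documented
    opposing_ref: str  # evidence record that contradicts it


def raise_drift_flag(
    assumption_id: str,
    claim_ref: Optional[str],
    opposing_ref: Optional[str],
) -> Optional[DriftFlag]:
    """Return a flag only when both sides are present, otherwise None."""
    if not claim_ref or not opposing_ref:
        return None
    return DriftFlag(assumption_id, claim_ref, opposing_ref)
```

With this shape, an agent that "feels" an assumption has drifted but cannot point to the documented claim or to a contradicting record gets `None` back and has nothing to publish.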

📚

Closed loop with memory — bets, actuals, verdicts

A Launch Tracker pre-registers the success metric and decision date at ship time. An Adoption Auditor pulls actuals at the decision date. An Outcome Adjudicator writes the next iterate ticket, hold signal, or sunset recommendation — with the actual-vs-target delta cited. Example: 'Did what we shipped last quarter get adopted, and what should we cut?' becomes a routine operating question with an evidence trail — not a quarterly retrospective you have to schedule.

📊

Calibrated human-escalation boundary

lib/escalation.py classifies every action by reversibility, blast radius, confidence, and agent disagreement. hooks/human-escalation.sh blocks any auto-action that hits the escalation threshold. Every autonomous decision is logged with its evidence IDs and dissent for full audit. Example: You watch a dashboard and get pinged only for the rare call — 'kill this live feature?' or 'override this reprioritisation?' — not 'which competitor should I analyse?'
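As an illustration of classifying by reversibility, blast radius, confidence, and agent disagreement, here is a minimal sketch. It is not the actual lib/escalation.py: the `Action` fields, the 1 to 5 blast-radius scale, and both thresholds are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool
    blast_radius: int      # hypothetical scale: 1 (one ticket) to 5 (all customers)
    confidence: float      # 0.0 to 1.0
    agents_disagree: bool


# Hypothetical thresholds, for illustration only.
MAX_AUTO_BLAST_RADIUS = 2
MIN_AUTO_CONFIDENCE = 0.8


def needs_human(action: Action) -> bool:
    """True when any one of the four escalation criteria is hit."""
    return (
        not action.reversible
        or action.blast_radius > MAX_AUTO_BLAST_RADIUS
        or action.confidence < MIN_AUTO_CONFIDENCE
        or action.agents_disagree
    )


move_card = Action("move Linear card", True, 1, 0.92, False)
kill_feature = Action("kill live feature", False, 5, 0.95, False)
```

Here `needs_human(move_card)` is false and `needs_human(kill_feature)` is true: the irreversible action escalates even though its confidence is high, matching the rule that confidence alone never unlocks an irreversible call.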

Built on Anthropic's Official Product-Management Plugin

This plugin starts where Anthropic's official product-management plugin leaves off — it does not replace it. Anthropic ships seven well-crafted PM skills. They are the foundation. Every skill, command, and PM framework is kept intact. What this plugin adds is the autonomous operating layer the original deliberately leaves to the builder: the right to act on your systems, a team of cross-checking agents, an evidence substrate, and four adversarial verifiers that argue back. Here is exactly what Anthropic's plugin does in each area, and what the autonomous layer turns it into.


Spec Writing — /write-spec

Out of the box

You run /write-spec with a problem or idea. The skill drafts a clear PRD: requirements, scope, success metrics, and open questions. It ends by asking: 'Want me to add an edge-case section? Want me to tailor this for engineering?'

With the autonomous agents

The Spec Author still drafts the PRD. Then an Assumption Mapper extracts every load-bearing assumption and tags it by evidence strength. A Pre-Mortem Red Team then argues the feature already shipped and failed — and names the cheapest test to kill the riskiest, weakest-evidence assumption first. The build ticket in your tracker is gated until that test runs or strong evidence already exists on record.
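The "riskiest, weakest-evidence assumption first" ordering and the build gate can be sketched as follows. The scales, the `Assumption` type, and the choice to gate on every unresolved assumption are all assumptions of this example, not a description of the plugin's internals.

```python
from dataclasses import dataclass


@dataclass
class Assumption:
    text: str
    risk: int       # hypothetical 1-5 scale, 5 = fatal if wrong
    evidence: int   # hypothetical 0-3 scale, 0 = none, 3 = strong
    tested: bool = False


STRONG_EVIDENCE = 3


def next_to_test(assumptions):
    """Pick the riskiest open assumption, breaking ties by weakest evidence."""
    open_items = [
        a for a in assumptions
        if not a.tested and a.evidence < STRONG_EVIDENCE
    ]
    if not open_items:
        return None
    return max(open_items, key=lambda a: (a.risk, -a.evidence))


def build_unblocked(assumptions) -> bool:
    """The build opens once nothing is left that still needs a test."""
    return next_to_test(assumptions) is None
```

An assumption with strong evidence already on record never enters the queue, so the gate opens either because the tests ran or because the evidence was there from the start.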

Roadmap Update — /roadmap-update

Out of the box

You run /roadmap-update. The skill helps create, refresh, or reprioritise a roadmap using familiar planning formats. You supply the context; it formats the output and asks if you want to adjust timelines or add an exec summary.

With the autonomous agents

A Prioritization Analyst derives RICE scores from your actual analytics, ticket clusters, and research transcripts — every component must cite an Evidence Store record or it is auto-clamped to the floor. An Evidence Adversary then runs against the top items: it must either surface a disconfirming signal or certify none exists. When the reprioritisation moves a committed item, the human gets 20 seconds of review with the full dissent in front of them.
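To make the auto-clamp concrete, here is a sketch using the standard RICE formula (reach × impact × confidence ÷ effort). The floor value and the `RiceInput` type are hypothetical, and for brevity the sketch clamps only the Confidence component when no evidence IDs are attached.

```python
from dataclasses import dataclass, field
from typing import List

CONFIDENCE_FLOOR = 0.1  # hypothetical floor value


@dataclass
class RiceInput:
    reach: float        # users affected per period
    impact: float       # relative impact multiplier
    confidence: float   # 0.0 to 1.0
    effort: float       # person-months, must be positive
    evidence_ids: List[str] = field(default_factory=list)


def rice_score(item: RiceInput) -> float:
    """Standard RICE with confidence clamped when nothing is cited."""
    if item.effort <= 0:
        raise ValueError("effort must be positive")
    confidence = item.confidence if item.evidence_ids else CONFIDENCE_FLOOR
    return item.reach * item.impact * confidence / item.effort


cited = RiceInput(1000, 2, 0.8, 4, evidence_ids=["ev-001"])
asserted = RiceInput(1000, 2, 0.8, 4)  # same numbers, no citation
```

The two items carry identical numbers, but the uncited one scores at the floor, so a stakeholder's asserted confidence cannot lift an item up the backlog on its own.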

Metrics Review — /metrics-review

Out of the box

You run /metrics-review. The skill reads your connected analytics, surfaces trends, flags movements, and ends with suggested follow-up actions — a one-shot scorecard you read and then decide what to do about.

With the autonomous agents

A Signal Ingestor pulls metrics on a schedule — not on demand. A Drift Sentinel diffs the live signal against your strategy documents and flags when a roadmap assumption is contradicted. An Adoption Auditor then compares every shipped feature against its pre-registered success metric at the decision date. A Victory-Bias Auditor blocks any 'success' verdict that did not clear the original target — goalpost-moving is impossible in code.

Research Synthesis — /synthesize-research

Out of the box

You run /synthesize-research with interviews, surveys, tickets, and notes. The skill turns the raw material into themes, personas, and opportunities — a structured synthesis document ready for a planning session.

With the autonomous agents

A Signal Ingestor continuously ingests customer calls, support tickets, and competitor signals into a shared Evidence Store — not when you remember to ask. A Synthesis Editor writes the weekly digest and drift log. The Drift Sentinel compares new evidence against standing persona and assumption documents, and raises a 'contradicted' flag (citing both the claim and the opposing signal) when reality has moved.

Competitive Brief — /competitive-brief

Out of the box

You run /competitive-brief. The skill produces a competitor brief with feature comparisons, positioning gaps, and strategic implications — a polished document triggered when a competitor ships something or a sales call surfaces a gap.

With the autonomous agents

Competitive signal becomes a standing monitor lane inside Signal Ingestor, writing cited competitive evidence records on a schedule. When a competitor ships a feature that directly contradicts a roadmap assumption, the Drift Sentinel flags it proactively, not the next time someone happens to run the command.

Stakeholder Update — /stakeholder-update

Out of the box

You run /stakeholder-update after a decision. The skill drafts audience-tailored updates for executives, engineering, customers, or cross-functional partners — a document you review and send when you remember to.

With the autonomous agents

An Alignment Broadcaster fires automatically on every autonomous decision — a reprioritisation, a ship-gate block, a launch verdict — and attaches the evidence IDs and the adversary's dissent to each. An Escalation Router classifies actions by reversibility and blast radius and routes only the rare strategic call to the human, packaged as a decision-ready brief. A Consensus Checker blocks any decision while two pods are in conflict.

Brainstorm — /brainstorm

Out of the box

You run /brainstorm to stress-test product ideas, explore problem spaces, and generate options — a thinking-partner session that diverges and then hands the options back to you.

With the autonomous agents

Divergent brainstorming is retained and now paired with adversarial convergence: the Pre-Mortem Red Team's generative half produces the strongest counter-case to the chosen option and the cheapest test to kill the riskiest assumption. The output is a spec with a test plan attached, not a list of ideas.

Is This the Right Tool?

Best Fit

  • PMs, founders, and product operators already in Claude Code or Cowork, living in Linear, Notion, Amplitude, and Intercom — who want to supervise a PM loop rather than manually run one
  • Teams where HiPPO or opinion-driven prioritisation is a recurring problem and evidence-grounded backlog management is the goal
  • Product organisations that ship features that go unused and want a structural closed loop — pre-registered predictions, actuals pulled at decision date, verdicts written to the tracker
  • Teams who already use Anthropic's product-management plugin and want to add the autonomous operating layer on top without replacing what works

Not the Right Fit

  • Teams with no connected analytics, project tracker, or knowledge base — the evidence gate and drift detection need sources of truth; without them, agents mostly propose and the autonomous value is limited
  • Anyone who wants a fully automated PM function with zero human review — this system is designed for a human-watches-metrics model; irreversible decisions always require a human
  • Teams looking for a standalone hosted SaaS PM tool — this runs inside Claude Code or Cowork, not as an independent product

From Install to First Autonomous PM Cycle in Four Steps

  1. Install in Claude Code or Cowork with two commands, in under two minutes
  2. Connect your existing stack: Linear or Jira, Notion, Amplitude or Pendo, Intercom, and Slack
  3. Run /pm-cycle against your current backlog and watch the evidence gate, the adversaries, and the escalation logic run on real data in propose-only mode
  4. Switch on write/act lanes one at a time as the adversary's catch rate earns your trust. The human handles the rare strategic call; the agents handle everything else

Start with One Pod. Watch the Evidence Gate Work.

Two commands to install. Every lane starts in propose-only mode — no autonomous write until you decide it is ready. Free, open source, and built on Anthropic's official product-management plugin foundation.


Step-by-Step: Install the Plugin

1. Add the plugin source

In Claude Code or Cowork, run: claude plugin marketplace add mlobo2012/autonomous-pm-plugin — this registers the repository as a trusted plugin source.

2. Install the plugin

Run: claude plugin install autonomous-pm — Claude downloads 17 specialist agent definitions, 6 pod skills, 6 driver commands, the deterministic enforcement spine (4 Python modules), and 2 enforcement hooks.

3. Connect your systems

Open .mcp.json and authenticate your project tracker (Linear, Jira, Asana), knowledge base (Notion), product analytics (Amplitude, Pendo), customer feedback (Intercom), meeting transcription (Fireflies), and team chat (Slack). Each category has a connector slot — bring your own if your tool is not listed.

4. Start in propose-only mode

Run /pm-cycle. Every pod runs in propose-only by default — agents draft outputs, open investigation tickets as comments, and flag escalations, but no autonomous writes happen yet. Watch the nine dashboard signals and confirm the evidence gate and adversaries are calibrated to your backlog.

5. Switch on write/act lanes as trust builds

Once adversary catch rate and escalation precision are in the healthy range, enable reversible write/act operations per pod. The enforcement hooks remain active at every stage. Each lane has a kill-switch that drops it back to propose-only without taking the system down.
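The propose-only default and the per-lane kill-switch from steps 4 and 5 can be pictured with a small state holder. This is only a sketch under assumptions: `Mode`, `Lanes`, and the pod names are hypothetical, and the plugin's real configuration mechanism is not documented on this page.

```python
from enum import Enum


class Mode(Enum):
    PROPOSE_ONLY = "propose_only"
    WRITE_ACT = "write_act"


class Lanes:
    """Per-pod lane modes; every lane starts in propose-only."""

    def __init__(self, pods):
        self._modes = {pod: Mode.PROPOSE_ONLY for pod in pods}

    def enable(self, pod):
        """Switch one lane to write/act."""
        self._modes[pod] = Mode.WRITE_ACT

    def kill(self, pod):
        """Kill-switch: drop one lane back without touching the others."""
        self._modes[pod] = Mode.PROPOSE_ONLY

    def may_write(self, pod) -> bool:
        return self._modes[pod] is Mode.WRITE_ACT


# Hypothetical pod names.
lanes = Lanes(["sense", "decide", "de-risk", "learn", "align"])
lanes.enable("decide")
lanes.enable("align")
lanes.kill("decide")
```

After these calls only the "align" lane may write; killing "decide" left the other lanes exactly as they were, which is the property the kill-switch is meant to provide.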

Frequently Asked Questions

How is this different from Anthropic's product-management plugin?

Anthropic's plugin is an excellent starting point: seven well-crafted PM skills for spec writing, roadmap planning, stakeholder updates, research synthesis, competitive briefs, metrics review, and brainstorming, all of which this plugin keeps as its foundation. The original is a PM advisor: every skill ends with a question, it touches your systems read-only, and a human still does every action. This plugin adds the autonomous operating layer on top: 17 named agents across 5 pods, a shared Evidence Store where every claim must cite a real record or be blocked, four adversarial verifiers with the power to halt or escalate, and a closed loop that pre-registers launch predictions and checks whether they came true. More detail is in the 'Built on Anthropic's Official Product-Management Plugin' section above.

Can the agents act on my systems without my approval?

Only within the gates, and the gates are deterministic code, not suggestions. hooks/evidence-gate.sh blocks any priority score, spec verdict, or launch decision not citing a real evidence record. hooks/human-escalation.sh blocks any agent action classified as irreversible, high blast-radius, low-confidence, or contested by another agent. Irreversible and customer-facing actions (pricing, billing, data deletion, killing a live feature) are hard-stopped in code regardless of confidence score. Every autonomous action is logged with its evidence IDs and the adversary's dissent, so 'why did the system do that?' is always answerable. The research anchor: Cemri et al. (arXiv:2503.13657) found multi-agent systems failing 41–86.7% of the time; the enforcement spine is the countermeasure.

Sources & Research

Every design decision in this plugin is grounded in a documented, verifiable study, dataset, or authoritative framework — cited inline throughout this page. Last verified 2026-05-24.

Pendo — 2019 Feature Adoption Report (80% of features rarely or never used)

Pendo's anonymised dataset across hundreds of B2B and B2C products found that 80% of features are rarely or never used, and that roughly 12% of features drive 80% of daily usage. The build-trap pattern this documents is the primary motivation for the Closed-Loop pod's pre-registered launch tracking and the Victory-Bias Auditor. Sources: https://www.pendo.io/resources/the-2019-feature-adoption-report/ and coverage at https://wraltechwire.com/2020/01/28/pendo-study-with-80-of-features-not-used-software-execs-re-evaluating-success-metrics/


Standish Group CHAOS — 64% of features rarely or never used

The Standish CHAOS data, aggregated by Ant Murphy, found 45% of features are never used and 19% are rarely used — 64% total. This corroborates the Pendo figure from a separate methodology and is the second anchor for measuring the feature-factory failure mode the Closed-Loop pod is designed to close. Source: https://www.antmurphy.me/newsletter/why-you-should-remove-features


Kohavi et al. — Online Experimentation at Microsoft (~1/3 of experiments improve the metric)

Ronny Kohavi's Microsoft experimentation data, published in 'Online Experimentation at Microsoft,' found that of well-designed, well-executed experiments built to move a key metric, only about one-third actually improved it — roughly one-third did nothing, and one-third made things worse. This is the primary evidence that confident PRDs are more likely wrong than right, and the direct justification for the Pre-Mortem Ship-Gate and the evidence-gate requiring citations on every priority claim. Sources: https://exp-platform.com/Documents/ExP_DMCaseStudies.pdf and O'Reilly summary at https://www.oreilly.com/radar/the-sobering-truth-about-the-impact-of-your-business-ideas/


HiPPO — Kaushik / Kohavi, ExP Platform

HiPPO — the Highest Paid Person's Opinion — was named by Avinash Kaushik and popularised by Kohavi's experimentation team at Microsoft as 'one of the most significant impediments to greater evidence-based and data-driven decision-making.' The Evidence Adversary and the no-claim-without-citation gate are the direct mechanical countermeasures to this documented failure mode. Sources: https://exp-platform.com/hippo/ and https://www.kaushik.net/avinash/seven-steps-to-creating-a-data-driven-decision-making-culture/


Cemri, Pan, Yang et al. — Why Do Multi-Agent LLM Systems Fail? MAST taxonomy, arXiv:2503.13657 (41–86.7% failure)

Cemri et al. annotated 1,600+ traces across 7 multi-agent frameworks and found state-of-the-art systems failing 41%–86.7% of the time. They classified failures into the MAST taxonomy: specification and design issues (41.8%), inter-agent misalignment (36.9%), and verification failures (21.3%). The three MAST buckets map directly onto this plugin's countermeasures: typed hand-off payloads with evidence IDs (anti-misalignment), a gate sequence with explicit definitions of done (anti-spec failure), and four named adversarial verifiers (anti-verification failure). Source: https://arxiv.org/abs/2503.13657


Teresa Torres — Continuous Discovery Habits / Opportunity Solution Trees

Torres' continuous discovery framework prescribes weekly customer contact and systematic testing of the riskiest assumptions with the weakest evidence, because continuous synthesis 'prevents outdated assumptions from lingering and ensures the roadmap is shaped by evidence, not guesswork.' The always-on Signal Ingestor and the Drift Sentinel's freshness alerts are the direct implementation of this prescription. Source: https://www.producttalk.org/opportunity-solution-trees/


Melissa Perri — Escaping the Build Trap / feature factory

Perri's build-trap pattern names the root cause of shipped-but-never-used features: teams measured on output (features shipped, dates hit) instead of outcomes (did it create value). The Closed-Loop pod's pre-registered metrics, Adoption Auditor, and Victory-Bias Auditor are the structural cure — making outcome measurement automatic and making a 'success' verdict impossible without clearing the pre-registered target. Source: https://www.befreed.ai/book/escaping-the-build-trap-by-melissa-perri


Want This Running for Your Team — Without the Setup Overhead?

AI Heroes configures the full stack: connector wiring, enforcement spine calibration, crawl-walk-run rollout, and operator dashboard design. Your team watches metrics from day one instead of spending weeks on integration.