
How Claude Managed Agents Actually Work: Dreaming, Outcomes, Multiagent Orchestration, and Webhooks (2026)

Marco Lobo
9 min read

TL;DR

  • Claude Managed Agents is Anthropic's hosted layer for running agents that improve over time. At Code with Claude in San Francisco (May 2026) it gained four mechanics: Dreaming, Outcomes, Multiagent Orchestration, and Webhooks.
  • The one that changes how you build is Outcomes: a separate grader evaluates the agent's work in its own context window, isolated from the agent's reasoning, and sends it back to revise until a rubric is met. Anthropic reports up to a 10-point task-success lift over a plain prompting loop.
  • The features are the easy part. Writing a rubric that captures "good," drawing clean orchestration boundaries, and wiring webhooks, sandboxes, and tunnels into systems you already run is the work a launch leaves to you.

Anthropic spent its San Francisco developer conference making one argument: the distance between an idea and working software is narrowing, and the teams getting leverage are the ones designing for that rather than reacting to it. The most concrete evidence was a set of new capabilities for Claude Managed Agents — the platform layer that runs agents on Anthropic's infrastructure rather than on your laptop.

Four mechanics landed together. Each is useful on its own. Read together, they describe a single shift: agents that have their work graded independently, split it across specialists, remember what worked, and tell you when they are done.

What are Claude Managed Agents?

Claude Managed Agents is Anthropic's hosted platform for running self-improving agents, where Anthropic operates the agent loop — orchestration, context management, and error recovery — while you define the task, the tools, and the criteria for success. It is distinct from Claude Code, the developer CLI: Managed Agents is the API-and-Console surface for agents that run server-side, on a schedule or on demand, without a human watching every step.

The four features announced on 6 May 2026 break down like this:

Feature | What it does | Stage (May 2026) | Reach for it when
Dreaming | A scheduled job reviews past sessions and memory, extracts patterns, and curates what the agent remembers | Research preview | Many similar runs repeat the same mistakes or rediscover the same workflow
Outcomes | A separate grader scores output against your rubric in its own context window and loops the agent until it passes | Public beta | Quality is hard to hit in one pass but easy to describe in a rubric
Multiagent Orchestration | A lead agent splits work and delegates to specialists, each with its own model, prompt, and tools, in parallel on a shared filesystem | Public beta | The job is too big or too varied for one context window
Webhooks | Define an outcome, let the agent run, and get notified when it finishes | Public beta | Tasks run long enough that babysitting them is the real cost

How does Dreaming work, and when should you use it?

Dreaming is a scheduled process that reviews an agent's past sessions and memory stores, extracts patterns, and curates memory so the agent improves between runs. Anthropic describes it as surfacing what a single session cannot see on its own: recurring mistakes, workflows that agents converge on, and preferences shared across a team. It can update memory automatically or hold changes for review before they land.

Dreaming is in research preview. Anthropic cites the legal-AI company Harvey, whose completion rates "went up ~6x in their tests" during the pilot. Treat that the way you would any vendor pilot number reported through the vendor's own blog: a sourced anecdote from one team's tests, not a published benchmark with a baseline and a methodology. The durable part is the mechanism — periodic, structured memory curation instead of an ever-growing, ever-noisier memory file.
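To make the curation idea concrete, here is a minimal Python sketch. It is not Anthropic's implementation, only an illustration of the mechanism. It assumes each past session has already been reduced to a list of short lesson strings (a hypothetical format), and it keeps only the lessons that recur across sessions, so memory converges on patterns instead of growing with every run. The names `curate_memory` and `min_sessions` are illustrative.

```python
from collections import Counter


def curate_memory(sessions: list[list[str]], min_sessions: int = 2) -> list[str]:
    """Keep lessons that recur in at least `min_sessions` distinct sessions.

    `sessions` is a hypothetical format: one list of lesson strings per past run.
    """
    counts = Counter()
    for lessons in sessions:
        # Normalise and de-duplicate within a session so a single noisy run
        # cannot promote a lesson on its own.
        for lesson in {text.strip().lower() for text in lessons}:
            counts[lesson] += 1
    recurring = [(lesson, n) for lesson, n in counts.items() if n >= min_sessions]
    # Most frequent first; ties broken alphabetically for stable output.
    recurring.sort(key=lambda item: (-item[1], item[0]))
    return [lesson for lesson, _ in recurring]


past_sessions = [
    ["Run the linter before committing", "Staging DB is read-only"],
    ["staging db is read-only", "Retry flaky upload once"],
    ["Run the linter before committing", "Staging DB is read-only"],
]
curated = curate_memory(past_sessions)
# The one-off "retry flaky upload once" lesson is dropped; the two recurring ones stay.
```

A real curation pass would presumably use a model to cluster near-duplicate lessons rather than exact string matching; the frequency threshold here only stands in for that judgement.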

Reach for Dreaming when you run many similar agent sessions and the bottleneck is drift: the agent keeps relearning the same lessons, or its memory has thickened into noise. Leave it off when sessions are one-offs with no pattern worth distilling.

How do Outcomes work, and why does a separate grader matter?

Outcomes lets you write a rubric describing what success looks like, then runs a separate grader that evaluates the agent's output against that rubric in its own context window — "so it isn't influenced by the agent's reasoning," in Anthropic's words. When the work falls short, the grader pinpoints what to change and the agent takes another pass, repeating until it clears the bar.

The detail that matters is the separation. An agent grading its own work is the same model that just produced it, primed by its own reasoning to believe it is finished. A grader in a clean context window carries no such prior. Splitting generation from evaluation is one of the most reliable patterns for making agents trustworthy, and Outcomes turns it into a platform primitive instead of something every serious team rebuilds by hand.
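The separation can be sketched as a small loop. This is a sketch under stated assumptions, not the Outcomes API: `generate` and `grade` are hypothetical callables standing in for two separate model calls, and the toy versions below exist only so the example runs. The point to notice is what the grader receives: the output and the rubric, and nothing from the generator's reasoning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    feedback: str  # what the grader wants changed; empty when passed


def run_until_passing(
    task: str,
    rubric: list[str],
    generate: Callable[[str, str], str],
    grade: Callable[[str, list[str]], Verdict],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Revise until the grader passes the output or the rounds run out.

    The grader is handed only the output and the rubric, never the
    generator's reasoning. Returns (last output, rounds used).
    """
    feedback = ""
    output = ""
    for round_number in range(1, max_rounds + 1):
        output = generate(task, feedback)
        verdict = grade(output, rubric)
        if verdict.passed:
            return output, round_number
        feedback = verdict.feedback
    return output, max_rounds


# Toy stand-ins (hypothetical) so the sketch runs without any model call.
def toy_generate(task: str, feedback: str) -> str:
    draft = f"Report on {task}."
    if "summary" in feedback:
        draft += " Summary: all checks green."
    return draft


def toy_grade(output: str, rubric: list[str]) -> Verdict:
    missing = [item for item in rubric if item not in output.lower()]
    if missing:
        return Verdict(False, "Add: " + ", ".join(missing))
    return Verdict(True, "")


final, rounds = run_until_passing(
    "deploy 42", ["report", "summary"], toy_generate, toy_grade
)
```

The `max_rounds` cap is a design choice of this sketch: without one, a rubric the agent cannot satisfy would loop indefinitely.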

Anthropic reports the lift from internal testing: task success improved by up to 10 points over a standard prompting loop, with the largest gains on the hardest problems, and specific results of +8.4% on docx and +10.1% on pptx tasks. Those are Anthropic's own numbers, so weight them as directional — but the direction matches what eval-driven teams already see when they stop trusting a single pass.

Outcomes is in public beta. Reach for it when "good" is hard to produce in one shot but easy to describe. Skip it when you cannot write the rubric, because a vague rubric produces a grader that waves everything through.

How does Multiagent Orchestration work?

Multiagent Orchestration lets a lead agent break a job into pieces and delegate each one to a specialist with its own model, prompt, and tools, with the specialists working in parallel on a shared filesystem and contributing back to the lead agent's context. Persistent events let the lead agent check in mid-workflow, and the whole flow is traceable in the Claude Console — which agent did what, and why.

Anthropic's worked example is an investigation: a lead agent runs the case while subagents fan out across deploy history, error logs, metrics, and support tickets at the same time. It names Netflix's platform team, which built a log-analysis agent processing hundreds of builds, and Spiral, which pairs orchestration with Outcomes to enforce writing quality.
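The fan-out shape of that investigation can be sketched in plain Python. This is an illustration of the pattern only, not how Managed Agents implements it: threads stand in for sub-agents, a temporary directory stands in for the shared filesystem, and `specialist` and `lead_agent` are hypothetical names.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def specialist(source: str, workdir: Path) -> str:
    """Stand-in for a sub-agent: inspect one source, write findings to the shared dir."""
    finding = f"{source}: nothing unusual"  # a real specialist would call a model here
    (workdir / f"{source}.txt").write_text(finding)
    return source


def lead_agent(sources: list[str]) -> dict[str, str]:
    """Fan out one specialist per source in parallel, then gather their files."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
            # list() waits for every specialist and re-raises any error one hit.
            done = list(pool.map(lambda s: specialist(s, workdir), sources))
        # Read the findings back before the temporary directory is cleaned up.
        return {name: (workdir / f"{name}.txt").read_text() for name in done}


report = lead_agent(["deploy_history", "error_logs", "metrics", "support_tickets"])
```

Each specialist writes to its own file, which avoids write conflicts on the shared directory; how the real platform arbitrates concurrent writes is not covered by the source.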

Multiagent Orchestration is in public beta. Reach for it when the work is genuinely too large or too varied for one context window: parallel research, multi-source investigation, jobs where different sub-tasks want different models. The trap is using it when one well-scoped agent would do: every boundary you draw between agents is a place where context can drop, so orchestration pays off only when the parallelism is real.

What do Webhooks add?

Webhooks let you define an outcome, start the agent, and receive a notification when the run finishes, so long-running work does not need a human watching it. Paired with Outcomes, they close a loop: define the target result, let the agent revise against the grader until it clears the bar, and get pinged when it does. Webhooks are in public beta. The pattern they unlock is fire-and-verify: agents that own a task end to end and report back, rather than ones you sit beside.
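On the receiving side, a handler might look like the sketch below. Everything about the payload is an assumption: the `run_id` and `status` fields are hypothetical, and the HMAC-SHA256 signature check reflects common webhook practice, not a documented Managed Agents scheme. Check the official documentation for the real event shape and verification method before relying on any of it.

```python
import hashlib
import hmac
import json


def handle_webhook(body: bytes, signature: str, secret: bytes) -> tuple[int, str]:
    """Return (HTTP status, message) for one incoming notification.

    Assumptions: the payload is JSON with hypothetical `run_id` and `status`
    fields, and the sender signs the raw body with HMAC-SHA256 (hex digest).
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected, signature):
        return 401, "bad signature"
    try:
        event = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, "invalid JSON"
    if not isinstance(event, dict) or "run_id" not in event:
        return 400, "missing run_id"
    if event.get("status") == "completed":
        return 200, f"run {event['run_id']} finished; fetch the result"
    return 200, f"run {event['run_id']} reported {event.get('status')}"


secret = b"demo-secret"
payload = json.dumps({"run_id": "r-1", "status": "completed"}).encode()
sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
status, message = handle_webhook(payload, sig, secret)
```

Keeping the handler a pure function of body, signature, and secret makes it easy to mount behind whatever HTTP framework you already run.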

Where do self-hosted sandboxes and MCP tunnels fit?

Two weeks later, at the London event on 19 May 2026, Anthropic shipped the features that make Managed Agents viable for teams that cannot send data out: self-hosted sandboxes and MCP tunnels. They share one design idea — the managed split. The agent loop that handles orchestration, context management, and error recovery stays on Anthropic's infrastructure, while the parts that touch your data move into your environment.

With self-hosted sandboxes (public beta), tool execution runs on infrastructure you control — your own, or a managed provider like Cloudflare, Daytona, Modal, or Vercel — so files and repositories never leave your perimeter, and you set the compute and the runtime image. With MCP tunnels (research preview), agents reach MCP servers inside your private network without exposing them to the internet: a lightweight gateway you deploy makes a single outbound connection, with no inbound firewall rules, no public endpoints, and traffic encrypted end to end. Tunnels work in Managed Agents and the Messages API, and organisation admins manage them from the Console.

Capability | What runs where | Stage | The point
Self-hosted sandboxes | Tool execution in your environment or a managed provider; the agent loop stays with Anthropic | Public beta | Files and repositories don't leave your perimeter
MCP tunnels | Private MCP servers reached over one outbound connection; no inbound rules, encrypted end to end | Research preview | Internal databases and APIs become agent tools without public exposure

For a regulated team, this is the difference between an interesting demo and something you can actually deploy.

What changed with the doubled Claude Code rate limits?

Anthropic doubled Claude Code's five-hour rate limits for Pro, Max, Team, and seat-based Enterprise plans, removed the peak-hours reduction for Pro and Max, and raised API limits for Claude Opus; the free plan was unchanged. Reporting put the Tier 1 API increase at roughly fifteen times the maximum input tokens per minute and nine times the output. For agent builders, the headroom is the story: orchestrated, graded, long-running agents burn far more tokens than a chat session, and the previous limits made serious server-side agents hit a wall.

When should you reach for each feature?

Match the feature to the bottleneck, not to the announcement. Outcomes when quality is the problem and you can describe what good looks like. Multiagent Orchestration when scope is the problem and the work truly parallelises. Dreaming when repetition is the problem and the agent keeps relearning. Webhooks when run length is the problem and nobody should be watching the run. Sandboxes and tunnels when trust boundaries are the problem and data cannot leave.
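The matching rule is simple enough to write down as a lookup. This is only a restatement of the paragraph above in Python; the key names and the `pick_features` helper are illustrative, not part of any Anthropic tooling.

```python
# Bottleneck-to-feature pairs from the text; the key names are illustrative.
FEATURE_FOR_BOTTLENECK = {
    "quality": "Outcomes",
    "scope": "Multiagent Orchestration",
    "repetition": "Dreaming",
    "long unattended runs": "Webhooks",
    "trust boundaries": "Self-hosted sandboxes and MCP tunnels",
}


def pick_features(bottlenecks: list[str]) -> list[str]:
    """Return the features for the named bottlenecks, skipping unknown ones."""
    picked = []
    for name in bottlenecks:
        feature = FEATURE_FOR_BOTTLENECK.get(name.strip().lower())
        if feature and feature not in picked:
            picked.append(feature)
    return picked


stack = pick_features(["Quality", "scope", "budget"])
# "budget" maps to nothing here, so it is skipped rather than guessed at.
```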

Most production agents end up using two or three together — an orchestrated job whose specialists are graded by Outcomes, running in your own sandbox, pinging a webhook when done. Wiring those together against systems you already run, and writing rubrics a grader can actually act on, is the work a launch leaves to you.
