How Claude Managed Agents Actually Work: Dreaming, Outcomes, Multiagent Orchestration, and Webhooks (2026)
TL;DR
- Claude Managed Agents is Anthropic's hosted layer for running agents that improve over time. At Code w/ Claude in San Francisco (May 2026) it gained four mechanics: Dreaming, Outcomes, Multiagent Orchestration, and Webhooks.
- The one that changes how you build is Outcomes: a separate grader evaluates the agent's work in its own context window, isolated from the agent's reasoning, and sends it back to revise until a rubric is met. Anthropic reports up to a 10-point task-success lift over a plain prompting loop.
- The features are the easy part. Writing a rubric that captures "good," drawing clean orchestration boundaries, and wiring webhooks, sandboxes, and tunnels into systems you already run is the work a launch leaves to you.
Anthropic spent its San Francisco developer conference making one argument: the distance between an idea and working software is narrowing, and the teams getting leverage are the ones designing for that rather than reacting to it. The most concrete evidence was a set of new capabilities for Claude Managed Agents — the platform layer that runs agents on Anthropic's infrastructure rather than on your laptop.
Four mechanics landed together. Each is useful on its own. Read together, they describe a single shift: agents that have their work graded against a rubric, split it across specialists, remember what worked, and tell you when they are done.
What are Claude Managed Agents?
Claude Managed Agents is Anthropic's hosted platform for running self-improving agents, where Anthropic operates the agent loop — orchestration, context management, and error recovery — while you define the task, the tools, and the criteria for success. It is distinct from Claude Code, the developer CLI: Managed Agents is the API-and-Console surface for agents that run server-side, on a schedule or on demand, without a human watching every step.
The four features announced on 6 May 2026 break down like this:
| Feature | What it does | Stage (May 2026) | Reach for it when |
|---|---|---|---|
| Dreaming | A scheduled job reviews past sessions and memory, extracts patterns, and curates what the agent remembers | Research preview | Many similar runs repeat the same mistakes or rediscover the same workflow |
| Outcomes | A separate grader scores output against your rubric in its own context window and loops the agent until it passes | Public beta | Quality is hard to hit in one pass but easy to describe in a rubric |
| Multiagent Orchestration | A lead agent splits work and delegates to specialists, each with its own model, prompt, and tools, in parallel on a shared filesystem | Public beta | The job is too big or too varied for one context window |
| Webhooks | Define an outcome, let the agent run, and get notified when it finishes | Public beta | Tasks run long enough that babysitting them is the real cost |
How does Dreaming work, and when should you use it?
Dreaming is a scheduled process that reviews an agent's past sessions and memory stores, extracts patterns, and curates memory so the agent improves between runs. Anthropic describes it as surfacing what a single session cannot see on its own: recurring mistakes, workflows that agents converge on, and preferences shared across a team. It can update memory automatically or hold changes for review before they land.
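To make the curation idea concrete, here is a minimal sketch. It is not Anthropic's implementation, and it calls no real API. It assumes each past session left behind a list of short lesson strings, and it promotes to memory only the lessons that recur across sessions. The names `curate_memory`, `session_notes`, `min_sessions`, and `max_items` are all hypothetical.

```python
from collections import Counter


def curate_memory(session_notes, min_sessions=2, max_items=5):
    """Keep lessons that recur across sessions; drop one-off noise.

    session_notes: a list of sessions, each a list of lesson strings.
    Every name here is hypothetical and for illustration only.
    """
    counts = Counter()
    for notes in session_notes:
        # Count each lesson once per session, so a single chatty
        # session cannot promote a lesson on its own.
        normalized = {note.strip().lower() for note in notes}
        counts.update(normalized)

    recurring = [(lesson, n) for lesson, n in counts.items() if n >= min_sessions]
    # Most frequent first; ties broken alphabetically for a stable result.
    recurring.sort(key=lambda item: (-item[1], item[0]))
    return [lesson for lesson, _ in recurring[:max_items]]


sessions = [
    ["Run the linter before committing", "Use the staging database"],
    ["run the linter before committing", "Ask before deleting files"],
    ["Use the staging database", "Run the linter before committing "],
]
curated = curate_memory(sessions)
print(curated)
```

The cap on `max_items` is the part that matters: memory stays a short, ranked list instead of growing with every run.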
Dreaming is in research preview. Anthropic cites the legal-AI company Harvey, whose completion rates "went up ~6x in their tests" during the pilot. Treat that the way you would any vendor pilot number reported through the vendor's own blog: a sourced anecdote from one team's tests, not a published benchmark with a baseline and a methodology. The durable part is the mechanism — periodic, structured memory curation instead of an ever-growing, ever-noisier memory file.
Reach for Dreaming when you run many similar agent sessions and the bottleneck is drift: the agent keeps relearning the same lessons, or its memory has thickened into noise. Leave it off when sessions are one-offs with no pattern worth distilling.
How do Outcomes work, and why does a separate grader matter?
Outcomes lets you write a rubric describing what success looks like, then runs a separate grader that evaluates the agent's output against that rubric in its own context window — "so it isn't influenced by the agent's reasoning," in Anthropic's words. When the work falls short, the grader pinpoints what to change and the agent takes another pass, repeating until it clears the bar.
The detail that matters is the separation. An agent grading its own work is the same model that just produced it, primed by its own reasoning to believe it is finished. A grader in a clean context window carries no such prior. Splitting generation from evaluation is one of the most reliable patterns for making agents trustworthy, and Outcomes turns it into a platform primitive instead of something every serious team rebuilds by hand.
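The grade-and-revise loop can be sketched in a few lines. This is an illustration of the pattern only, under loud assumptions: it does not use the Managed Agents API, the names `Criterion`, `grade`, `run_until_pass`, and `stub_agent` are hypothetical, and deterministic checks stand in for what would really be a model-based grader. The point it shows is the isolation: the grader receives only the output and the rubric, never the agent's reasoning.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # a concrete, checkable test of the output


def grade(output: str, rubric: List[Criterion]) -> List[str]:
    """Grader: sees only the output and the rubric, not how it was produced."""
    return [c.name for c in rubric if not c.check(output)]


def run_until_pass(
    generate: Callable[[str, List[str]], str],
    task: str,
    rubric: List[Criterion],
    max_rounds: int = 3,
) -> Tuple[str, List[str], int]:
    """Generate, grade in isolation, and revise until the rubric passes."""
    feedback: List[str] = []
    output = ""
    for round_no in range(1, max_rounds + 1):
        output = generate(task, feedback)
        feedback = grade(output, rubric)  # names of the criteria that failed
        if not feedback:
            return output, feedback, round_no
    return output, feedback, max_rounds


def stub_agent(task: str, feedback: List[str]) -> str:
    """Hypothetical agent: fixes only what the grader flagged."""
    draft = "Summary: deploy failed."
    if "mentions root cause" in feedback:
        draft += " Root cause: expired certificate."
    return draft


rubric = [
    Criterion("starts with a summary", lambda s: s.startswith("Summary:")),
    Criterion("mentions root cause", lambda s: "root cause" in s.lower()),
]
final, remaining, rounds = run_until_pass(stub_agent, "Write an incident note", rubric)
print(rounds, final)
```

Each criterion here is something a grader can act on. A rubric entry like "is good" would give `grade` nothing to fail, which is the vague-rubric failure mode in miniature.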
Anthropic reports the lift from internal testing: task success improved by up to 10 points over a standard prompting loop, with the largest gains on the hardest problems, and specific results of +8.4% on docx and +10.1% on pptx tasks. Those are Anthropic's own numbers, so weight them as directional — but the direction matches what eval-driven teams already see when they stop trusting a single pass.
Outcomes is in public beta. Reach for it when "good" is hard to produce in one shot but easy to describe. Skip it when you cannot write the rubric, because a vague rubric produces a grader that waves everything through.
How does Multiagent Orchestration work?
Multiagent Orchestration lets a lead agent break a job into pieces and delegate each one to a specialist with its own model, prompt, and tools, with the specialists working in parallel on a shared filesystem and contributing back to the lead agent's context. Persistent events let the lead agent check in mid-workflow, and the whole flow is traceable in the Claude Console — which agent did what, and why.
Anthropic's worked example is an investigation: a lead agent runs the case while subagents fan out across deploy history, error logs, metrics, and support tickets at the same time. It names Netflix's platform team, which built a log-analysis agent processing hundreds of builds, and Spiral, which pairs orchestration with Outcomes to enforce writing quality.
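The fan-out in that investigation example can be sketched with the standard library. This is a toy under stated assumptions, not the Managed Agents API: `run_specialist` and `investigate` are hypothetical names, each "specialist" is a plain function that writes a placeholder finding, and a temporary directory stands in for the shared filesystem.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_specialist(name: str, workspace: Path) -> str:
    """Stand-in for a specialist agent: writes its findings to the shared workspace."""
    finding = f"{name}: no anomalies found"  # placeholder, not real analysis
    (workspace / f"{name}.txt").write_text(finding)
    return name


def investigate(sources, workspace: Path) -> str:
    """Lead agent: fan out one specialist per source, then merge their reports."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        finished = list(pool.map(lambda s: run_specialist(s, workspace), sources))
    # The lead reads only what the specialists wrote to the shared filesystem.
    reports = [(workspace / f"{name}.txt").read_text() for name in sorted(finished)]
    return "\n".join(reports)


sources = ["deploy_history", "error_logs", "metrics", "support_tickets"]
with tempfile.TemporaryDirectory() as tmp:
    summary = investigate(sources, Path(tmp))
print(summary)
```

Even in the toy, the handoff is visible: the lead sees only what each specialist chose to write down, which is where context can drop in a real system.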
It is in public beta. Reach for it when the work is genuinely too large or too varied for one context window — parallel research, multi-source investigation, jobs where different sub-tasks want different models. The trap is using it when one well-scoped agent would do: every boundary you draw between agents is a place where context can drop, so orchestration pays off only when the parallelism is real.
What do Webhooks add?
Webhooks let you define an outcome, start the agent, and receive a notification when the run finishes, so long-running work does not need a human watching it. Paired with Outcomes, they close a loop: define the target result, let the agent grade and revise its way there, and get pinged when it clears the bar. Webhooks are in public beta. The pattern they unlock is fire-and-verify: agents that own a task end to end and report back, rather than ones you sit beside.
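On the receiving side, fire-and-verify might look like the sketch below. Everything specific in it is an assumption, because the source does not describe the payload or the signing scheme: the field names `run_id` and `status`, the HMAC-SHA256 hex signature, and the functions `sign` and `handle_webhook` are hypothetical. Check the actual webhook documentation before relying on any of it.

```python
import hashlib
import hmac
import json


def sign(body: bytes, secret: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (an assumed scheme, not a documented one)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def handle_webhook(body: bytes, signature: str, secret: bytes) -> str:
    """Verify a notification, then route the finished run."""
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(sign(body, secret), signature):
        return "rejected"
    event = json.loads(body)
    # "status" and "run_id" are hypothetical field names.
    if event.get("status") == "succeeded":
        return f"verify:{event['run_id']}"
    return f"escalate:{event.get('run_id', 'unknown')}"


secret = b"demo-secret"
body = json.dumps({"run_id": "run_123", "status": "succeeded"}).encode()
result = handle_webhook(body, sign(body, secret), secret)
tampered = handle_webhook(body, "0" * 64, secret)
print(result, tampered)
```

The handler returns a routing decision rather than acting on it, so a successful run still passes through your own verification step before anything ships.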
Where do self-hosted sandboxes and MCP tunnels fit?
Two weeks later, at the London event on 19 May 2026, Anthropic shipped the features that make Managed Agents viable for teams that cannot send data out: self-hosted sandboxes and MCP tunnels. They share one design idea — the managed split. The agent loop that handles orchestration, context management, and error recovery stays on Anthropic's infrastructure, while the parts that touch your data move into your environment.
With self-hosted sandboxes (public beta), tool execution runs on infrastructure you control — your own, or a managed provider like Cloudflare, Daytona, Modal, or Vercel — so files and repositories never leave your perimeter, and you set the compute and the runtime image. With MCP tunnels (research preview), agents reach MCP servers inside your private network without exposing them to the internet: a lightweight gateway you deploy makes a single outbound connection, with no inbound firewall rules, no public endpoints, and traffic encrypted end to end. Tunnels work in Managed Agents and the Messages API, and organisation admins manage them from the Console.
| Capability | What runs where | Stage | The point |
|---|---|---|---|
| Self-hosted sandboxes | Tool execution in your environment or a managed provider; the agent loop stays with Anthropic | Public beta | Files and repositories don't leave your perimeter |
| MCP tunnels | Private MCP servers reached over one outbound connection; no inbound rules, encrypted end to end | Research preview | Internal databases and APIs become agent tools without public exposure |
For a regulated team, this is the difference between an interesting demo and something you can actually deploy.
What changed with the doubled Claude Code rate limits?
Anthropic doubled Claude Code's five-hour rate limits for Pro, Max, Team, and seat-based Enterprise plans, removed the peak-hours reduction for Pro and Max, and raised API limits for Claude Opus; the free plan was unchanged. Reporting put the Tier 1 API increase at roughly fifteen times the previous maximum input tokens per minute and nine times the previous output tokens per minute. For agent builders, the headroom is the story: orchestrated, graded, long-running agents burn far more tokens than a chat session, and under the previous limits serious agent workloads hit a wall quickly.
When should you reach for each feature?
Match the feature to the bottleneck, not to the announcement. Outcomes when quality is the problem and you can describe what good looks like. Multiagent Orchestration when scope is the problem and the work truly parallelises. Dreaming when repetition is the problem and the agent keeps relearning. Webhooks when run time is the problem and nobody should have to watch the run. Sandboxes and tunnels when trust boundaries are the problem and data cannot leave.
Most production agents end up using two or three together — an orchestrated job whose specialists are graded by Outcomes, running in your own sandbox, pinging a webhook when done. Wiring those together against systems you already run, and writing rubrics a grader can actually act on, is the work a launch leaves to you.
Related reading
If you are turning these capabilities into something your team actually runs, these AI Heroes pieces are the natural companion set:
- Claude skills: why your best prompts keep failing - the architecture layer that turns judgment into reusable agent execution.
- AI agent workflow automation - how recurring work becomes an agent workflow instead of an ad hoc chat.
- The long-running agent harness on the Claude Agent SDK - the evaluator-gated loop behind agents you can leave running.
- AI institutional knowledge - why durable memory matters once more than one person relies on an agent.
- Inside Anthropic's finance team - what a managed, human-reviewed agent workflow looks like in practice.
- Where to start with Claude Code in a large repo - the decision layer that runs before the build.