Hand-drawn editorial illustration on cream paper: a night-time study desk where a memory is being filed — a large open ledger labelled 'company memory' glows under a brass desk lamp, beside an open card-catalogue drawer and a hand filing an index card, with small kraft tags reading 'ingestion', 'the graph', 'the filesystem' and 'sleep-time', and a crescent moon in a corner window.

The Company That Remembers

Marco Lobo
··12 Min. Lesezeit
Share

TL;DR

  • The hard part of an "AI brain" was never the model — it is memory: how a system made of forgetful parts holds, updates, and occasionally discards what a company knows.
  • Three schools have hardened around a single question — when do you pay to resolve a stale fact into a new truth? At ingestion (compile clean truth on the way in), at graph-update time (a temporal graph that invalidates rather than deletes), or at query time (give the agent files and search, resolve nothing in advance).
  • The engineering is real and shipping — Cloudflare, Anthropic, and Letta all moved in 2026 — but the public accuracy benchmarks are broken, so no vendor can presently prove theirs remembers best. Pick the function with the worst memory loss, choose a school deliberately, and instrument your own return.

There is a class of software agent, in mid-2026, that sleeps.

Not literally — there is no cot, no dimmed screen. But when the humans stop typing and the Slack channel goes quiet, a second program wakes up and begins tidying. It re-reads the day's conversations. It opens the document somebody dropped in at 4 p.m. and nobody finished reading. It decides which of the things it learned today are worth keeping, rewrites them into cleaner sentences, and files them where the daytime agent will find them tomorrow. The company that builds this calls it sleep-time compute, and the architecture has a detail that is more revealing than it first appears: the daytime agent — the one that actually talks to people — is given no tools to edit its own long-term memory. Only the sleeping one can. The worker is not allowed to rewrite its own mind. That job belongs to the version of itself that comes out at night.

The company is Letta, the spinout that grew from the Berkeley research project once called MemGPT, and the sleep-time design has been public since the spring of 2025. I start here, with the sleeping agent, because it punctures the thing executives believe about this moment. The belief is that the race is about models — that whoever has the smartest LLM wins. The people actually building corporate AI have moved on from that. The model is a commodity that gets better on someone else's schedule. The thing they stay up at night solving — sometimes literally delegating it to a nocturnal copy of the software — is memory. How a system made of forgetful parts comes to hold, and update, and occasionally discard, what a company knows.

This is a report on how that is going. The short version: the engineering is real and shipping, the marketing is ahead of the receipts, and the most interesting fights in the field are not about intelligence at all. They are about time.


I. When do you pay for the truth?

Start with a deceptively boring question. A customer changes their billing address. Six months later your AI agent is asked where to send the renewal contract. Somewhere between those two moments, a piece of the company's knowledge went stale and a new truth replaced it. The question every company-brain system has to answer is: when did you do the work of resolving the old fact into the new one? You can pay that cost at three different times, and the three answers have hardened into three schools of design.

The first school pays at ingestion. The moment information arrives, the system resolves it — deduplicates it, links it to what's already known, and writes a single clean, canonical version to disk. The reader, later, just reads. This is the bet behind gbrain, the memory layer that runs across my own operation: raw signals come in continuously and are compiled into distilled "truth" pages, each one already reconciled, each one carrying an effective date so the system can later ask not just what is true but what was true then. Retrieval, when it happens, runs a hybrid of vector search and keyword search — two different ways of finding the same idea — and fuses the results, then tilts them along two independent dials: salience (does this matter?) and recency (is this fresh?). The philosophical commitment is that truth is something you manufacture on the way in, not something you reassemble on the way out.

The second school pays at graph-update time, and its most rigorous expression is Zep, built on an engine called Graphiti. Zep does not store clean pages; it stores a temporally-aware knowledge graph that, per its published design, "dynamically synthesizes both unstructured conversational data and structured business data while maintaining historical relationships." The elegant part is how it handles the stale billing address. It does not delete the old fact. It invalidates it. Every relationship in the graph carries four timestamps — two for when the system learned things, two for when they were actually true in the world — and when a new fact contradicts an old one, Zep reaches back and stamps the old edge with the exact moment the new one became valid. The past is never erased; it is closed, like a ledger entry, and the graph can always be replayed to any point in its own history. Neo4j, describing the approach, put it precisely: the system "updates or invalidates, but not discards, outdated information."

The third school pays at query time, and its champion is, again, Letta — but in its more recent and more deflationary mood. The argument, which I'll come back to because it is the most important thing in this entire field, is roughly: stop engineering elaborate memory structures, give the agent a pile of files and the ability to search them, and let it figure out what it needs at the moment it needs it. Resolve nothing in advance. Trust the search.

These are not interchangeable. Pay at ingestion and your reads are fast and your writes are expensive and opinionated — you are constantly deciding, up front, what the truth is. Pay at graph-update and you get a perfect audit trail and the ability to ask "what did we believe in March?" — at the cost of running a knowledge-graph engine that has to reason about contradictions in real time. Pay at query and you build almost nothing and lean entirely on the agent's judgment in the moment, which is cheap to build and, as we will see, unnervingly hard to trust. A founder choosing a company brain is not choosing a vendor. They are choosing when their organization wants to argue about what is true.


II. The newest machinery

If you want to know what changed in the last two months specifically — not the foundational ideas, which mostly settled in 2025, but the genuinely fresh engineering — there are two things worth your attention, and I want to be honest that the rest is consolidation.

In April 2026, Cloudflare shipped a system it calls, plainly, Agent Memory. Its retrieval design is the current high-water mark for the "throw everything at the wall" philosophy, and it is worth seeing in detail because it tells you where the consensus has landed. When the agent asks its memory a question, Cloudflare runs five searches in parallel. Two are lexical: a full-text search for keyword precision, and an exact lookup against known fact-keys. Two are semantic: a direct vector search on the embedded question, and a cleverer one called HyDE, where the system first imagines what a good answer would look like and then searches for memories resembling that rather than the question. The fifth searches the raw message history. Then it fuses all five rankings with a technique called Reciprocal Rank Fusion — a way of letting five disagreeing search methods vote. The lesson buried in that design is that nobody, in 2026, believes vector search alone is enough. The frontier is hybrid, and the argument is only about how many opinions to blend and how to weigh the votes.

Cloudflare also took a clear position on forgetting, and it is not Zep's. When a new memory arrives bearing the same key as an old one, the old one is "superseded rather than deleted," creating a version chain with a forward pointer from the stale fact to the fresh one. It is a humbler cousin of Zep's bi-temporal ledger — a linked list of "this replaced that" rather than a full four-dimensional history — but it shares the field's emerging article of faith: you do not throw memories away. You mark them dead and leave a trail to the living version.

The second recent development is quieter and, for a CEO, more suggestive. In January 2026, Letta shipped something called the Conversations API, and the framing in its own launch post is the cleanest articulation I've found of what a "company brain" might actually be in practice. A single agent, with one shared memory, can run hundreds of conversations at once — each Slack thread its own conversation, all of them feeding the same mind. Update a memory in one thread and it is visible in all of them. Letta's phrase for the result is "an institutional knowledge manager." Picture it: one entity that has sat in every channel, remembers every decision, and can be asked, in thread 341, about something it learned in thread 9. That is the dream the word "company brain" is selling.

I have to flag, because honesty demands it, that the institutional-knowledge-manager is described by Letta as a capability of the product, not a documented customer deployment with numbers attached. It is what the system can do, asserted by the people who built it. Hold that thought. It turns out to be the shape of nearly everything in this market.


III. The unglamorous part is most of the game

Beneath the philosophy sits plumbing, and the company that has documented its plumbing most usefully is Anthropic, whose primitives are what a lot of these company-brain agents are quietly built on top of. There are three of them, and they map onto a simple idea: an agent's working memory — the context window it's actively thinking in — is a small, expensive room, and most of the engineering is about keeping that room from overflowing.

The first primitive is compaction: when the conversation gets long, summarize the whole transcript and continue from the summary. The second is tool-result clearing, and it contains a genuinely elegant move — when an agent has called a tool and gotten back a big result it could always fetch again, the system drops the bulky result but keeps the record that the call happened. The agent remembers that it looked something up; it just doesn't carry the lookup around. The third is a memory tool: a file system the model can write to and read from across sessions, with an instruction injected into every session telling the model to check its memory directory before it does anything else. Crucially, the files live on the customer's infrastructure, not Anthropic's — the company keeps its own brain.

What makes Anthropic's account valuable rather than just tidy is that they put numbers on it. On a single long-running research task, an agent with no memory management peaked at 335,279 tokens of context. Turn on tool-result clearing and the peak dropped to about 173,137. Turn on compaction instead and it held at roughly 169,164. Those are illustrative figures from one demonstration, not a formal benchmark, and I'd cite them as such — but they make the stakes legible. Memory management is not a nicety. It is the difference between an agent that can run all afternoon and one that suffocates on its own transcript by lunch. The reason your company brain is hard to build is not that the model is dumb. It is that the model's attention is a tiny, costly room, and almost all the cleverness goes into deciding what to let into it.


IV. The feud, and the emperor's wardrobe

Now the part that should make an executive cautious, because it is where the marketing and the truth come apart.

For about two years the field has ranked itself on a benchmark called LoCoMo — a set of long conversations with questions about what was said. Vendors quote their LoCoMo scores the way restaurants quote stars. In May 2025 the truce broke. The CTO of Mem0, a memory startup, opened a public issue against a competitor, Zep, accusing it of inflating its headline result. Zep had reported 84% accuracy. Mem0's co-founder claimed that when you corrected the math, the real figure was 58.44% — a gap of more than twenty-five points. The alleged error was almost comically specific: Zep had supposedly excluded a category of hard adversarial questions from the denominator of its accuracy calculation while still counting the ones it got right from that category in the numerator. Zep published a rebuttal, recalculated, and landed on 75.14%. No independent referee has ever settled it.

You could read that as ordinary startup mudslinging. The deeper finding is worse for everyone. In August 2025, Letta ran an experiment that ought to be taped to the wall of every team about to spend six figures on a memory platform. They took a plain agent, gave it conversation histories as ordinary files, handed it nothing fancier than grep and the ability to open and search those files, and ran it on LoCoMo. It scored 74.0% — beating Mem0's reported 68.5% for its top-performing graph-based memory. A filesystem and a search command beat the specialized graph. Letta's own conclusion was admirably plain: memory is "more about how agents manage context than the exact retrieval mechanism used."

And then, in 2026, the floor gave way under the benchmark itself. An independent audit found that LoCoMo's answer key — the supposed ground truth — was wrong on 6.4% of its questions, ninety-nine outright errors. Worse, the automated judge used to grade answers would accept, in some configurations, up to 63% of deliberately wrong answers as correct. The yardstick everyone had been measuring with was bent, and a new line of research now argues the whole practice of equating memory with trivia retrieval was a category error.

Put the three things together — an unrefereed accuracy feud, a filesystem beating the fancy graph, a broken ruler — and you arrive at the single most useful sentence I can offer a buyer in 2026: nobody can presently prove, with independent evidence, that any one company-brain memory system is more accurate than the others. The architectures are real. The performance leaderboard is, for now, theater. To Letta's credit, they were careful not to over-claim from their own result: they scoped it to retrieval benchmarks, noted you can plug richer memory tools in when a task needs them, and argued for testing agents on dynamic work rather than recall quizzes. That restraint is the tell of a serious practitioner. Be wary of anyone selling a memory system who is less careful than the people who build them.


V. gbrain, and the cost of an opinion

Which brings me back to the system I know from the inside, because it sits at a specific and arguable point in all this. gbrain pays at ingestion: it compiles raw signal into reconciled "truth" the moment information lands, rather than reassembling it at read time the way Letta's filesystem agent does, or maintaining a live contradiction-resolving graph the way Zep does. The trade is honest and worth naming. The ingestion bet buys fast, clean reads and a memory that has already decided what it thinks — which is exactly what you want when an agent has to answer in a hurry and you'd rather it not re-derive the company's worldview from scratch every time. The price is that the deciding happens up front, under automation, and a wrong compilation is a wrong belief written down. Where Zep keeps an immutable ledger you can replay, the compiled-truth model keeps a current best answer and a timeline of how it got there. Both are trying to solve staleness; they just disagree about whether truth is a log or a latest value.

I include gbrain not because it wins — I've just spent a section arguing nobody can presently prove a winner — but because it makes the field's central choice unavoidable. Every company brain is an opinion about when to pay for the truth. There is no free version. There is only the bill, and the moment you choose to settle it.


VI. So where is the value, actually?

You asked the sharpest possible question — where is the value? — and the honest answer is the one most vendors won't give you: the capability is real; the receipts are thin. The recurring, defensible use case across every serious system is the same — institutional memory that doesn't walk out the door when a person does. The agent that has read every thread, remembers every decision, and can be asked about any of it. Letta frames it as the institutional knowledge manager; Anthropic's persistent file-memory points the same way; gbrain exists precisely so that work done in one channel survives into another. The function where this lands first and hardest is the one drowning in its own history: support and internal knowledge, then engineering and operations, then the executive's quiet wish for a memory of why a decision was made, not just that it was.

But I went looking, hard, for named companies with documented production numbers — this function, this team, this measured result — and the verifiable evidence is not yet there. What exists is a dense layer of capability claims from the people who sell the capability, and that is not the same thing as proof. For an executive deciding where to spend in 2026, that distinction is the whole game. The technology to give your company a durable, queryable memory is shipping and genuinely good. The independent evidence about which one, and how much it returns by function, is the part the market hasn't built yet. Anyone who tells you otherwise is selling you their benchmark.

If I were advising a founder this quarter, I'd say: the bet is sound, so make it — but make it small and instrumented. Pick the function with the worst memory loss (usually support or internal knowledge). Choose your school deliberately — ingestion if you need fast confident reads, graph if you need an auditable history, filesystem-and-search if you want to build almost nothing and trust the agent. Then measure your own return, because the public leaderboard can't do it for you. The companies that win the next two years won't be the ones that bought the highest benchmark score. They'll be the ones that figured out, on their own data, what their memory was actually worth.

The sleeping agent had it right all along. The hard part was never being clever in the moment. It was deciding, quietly and every night, what was worth keeping.

Häufig gestellte Fragen

Marco Lobo

Founder, AI Heroes

I build AI companies and the systems inside them. At AI Heroes, we give businesses the functional capacity to grow without the headcount growth normally demands — sales that follows up, marketing that runs, content that ships, ops that handles itself. We audit where you're leaving growth on the table, build the team that captures it, and hand it over completely.

I've built at scale before. Leading product and GTM at SlideSpeak AI (1M+ monthly users, profitable, bootstrapped). CPO at Disperse — the AI construction platform that went from 3 to 200+ people on $35M raised. I also co-founded LOBOMAR, a luxury fashion label featured in Elle, Cosmopolitan, and the LA Times, with shows at the London Design Museum, Wereldmuseum, and Amsterdam Fashion Week.

Ähnliche Artikel

Handgezeichneter redaktioneller Vergleich: Microsoft Scout als dauerhaft aktiver Autopilot gegenüber Claude Cowork als Delegation auf Abruf, 2026
KI-ToolsMicrosoft ScoutClaude Cowork

Microsoft Scout vs. Claude Cowork: Autopilot oder Delegation?

Zwei der größten Agenten-Launches 2026 setzen auf gegensätzliche Wetten. Microsoft Scout ist ein Desktop-Autopilot, der im Hintergrund läuft und in Ihrem Auftrag handelt; Claude Cowork wartet, bis Sie ihm eine Aufgabe übergeben, und liefert dann. Eines ist Push, das andere Pull — hier erfahren Sie, was zu Ihrem Team passt.

Marco Lobo
Marco Lobo·5. Juni 2026·10 Min. Lesezeit
HubSpot's agent-first GTM flywheel with Attract, Engage and Delight segments and reported results: 345,000 accounts added, 82% inbound chats handled by AI, 1,850% growth in leads from AI answers, 3x conversion, 13% higher win rate, 10,000+ meetings per quarter, 60% support tickets resolved by AI and a 7-point higher save rate.
Go-to-MarketAgent-First GTMAI Agents

Agent-First GTM in 2026: The Real Landscape, the Pricing Tell, and the Stage Nobody Owns

Agent-first GTM is no longer a slide. HubSpot, Salesforce and Microsoft are shipping AI agents across the funnel, but the field divides on autonomy, data ownership and the real tell: pricing.

Marco Lobo
Marco Lobo·1. Juni 2026·10 Min. Lesezeit
Handdrawn editorial system diagram contrasting a fragile screenshot click loop with Microsoft Webwright's terminal-native browser automation workflow: Microsoft Research and Webwright logo card, Playwright script, disposable browser sessions, logs, screenshots, and reusable tool library on cream paper
AI EngineeringWebwrightMicrosoft Research

What Are Terminal-Native Web Agents? Microsoft Webwright and the End of Click-by-Click Computer Use (2026)

The next reliable web agent will not just click better. Microsoft Webwright points at the real shift: terminal-native agents that turn repeated browser work into Playwright code, logs, screenshots, fresh reruns, and reusable tools.

Marco Lobo
Marco Lobo·27. Mai 2026·13 Min. Lesezeit