It’s time to talk about one of the most critical drivers for successful AI agents: in-session context engineering.
I can hear about half of you screaming in excitement and half of you groaning as I write these words lol
The truth is that understanding how in-session memory management works is increasingly critical to getting real work done, even for non-technical people. We all need to understand how it works to work well in 2026.
A few days ago I wrote about domain memory—why agents fail on work that spans multiple sessions, and how structured external records fix it. The response told me something: people are hitting these walls everywhere.
Think of domain memory as the library: an agent can have a GREAT library and still have a really overloaded desk. And then you’re not getting anywhere.
That overloaded desk is a companion failure mode that’s just as common and just as poorly understood as the domain memory problem. It’s not about what happens between sessions. It’s about what happens within a single session as the agent runs longer.
And what’s interesting is that this within-session problem set seems more widespread than the domain memory piece.
Think about it: it’s relatively easy to come up with a big pile of external memory your agents can access and use. It may take executive blessing (which is why I aimed that first piece at leaders via executive circle), but it gets done.
For engineers and builders, this second piece around in-session state management is the hard part, because getting to a ‘clean desk’ for an agent is a really hard technical problem. And that’s exactly what we’re going to demystify and tackle here.
Watch any agent work on a complex task. For the first ten minutes, it’s sharp. Clear reasoning, appropriate tool use, steady progress. Then something shifts. Around minute twenty or thirty—or after a few dozen tool calls—the agent starts repeating itself. It forgets constraints it acknowledged earlier. It tries approaches it already tried. The reasoning that looked crisp at minute five turns muddy and unreliable.
This isn’t the domain memory problem. You could have perfect external records of project state, and this would still happen. The agent knows what it’s supposed to do. It just... loses the thread. And that’s not an LLM intelligence problem. A smarter model will run into the SAME issue.
My naive assumption was that this was a model intelligence problem. I kept seeing the pattern and chalking it up to a limitation that bigger context windows or smarter models would fix. That assumption was wrong.
The research now shows that longer context windows often make things worse, not better. And the organizations running agents at production scale—Google, Anthropic, the Manus team—have converged on an explanation that changes how I think about building these systems.
The problem isn’t that agents can’t hold enough information. The problem is that every token you add to the context window competes for the model’s attention. Stuff a hundred thousand tokens of history into the window and the model’s ability to reason about what actually matters degrades. The critical constraint from step three gets buried under the noise from steps four through forty. The agent doesn’t forget because it ran out of space—it forgets because signal got drowned by accumulation.
This is the context engineering problem. And it turns out there’s a coherent framework for solving it.
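To make the shift concrete, here is a minimal sketch of the difference between "append everything" and "compute what's relevant." All the names here (the event dicts, the `tags` field, the `budget` cap) are my own illustrative assumptions, not any specific framework's API:

```python
# A toy contrast between the two context strategies. Illustrative only.

def append_everything(history, new_event):
    """The default: every tool call and observation piles onto the desk."""
    history.append(new_event)
    return history  # grows without bound; old signal gets buried

def compile_view(history, current_goal, budget=5):
    """The alternative: build a fresh, small context for each step."""
    # Keep only events tagged as relevant to the current goal, newest first,
    # and cap the total so key constraints stay inside the attention budget.
    relevant = [e for e in reversed(history) if current_goal in e["tags"]]
    return relevant[:budget]

history = []
for i in range(40):
    tags = ["billing"] if i % 10 == 0 else ["misc"]
    append_everything(history, {"step": i, "tags": tags})

view = compile_view(history, "billing")
print(len(history))  # 40 events accumulated over the run
print(len(view))     # 4 events actually shown to the model this step
```

The point of the sketch: the history keeps growing, but what the model sees each step stays small and goal-relevant, so the critical constraint from step three can't get drowned by steps four through forty.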
Three papers from late 2025 lay out the architecture. Google’s Agent Development Kit inverts the default: instead of letting papers pile higher with every task, the agent clears the desk and pulls only what’s relevant for the current step. Stanford and SambaNova’s ACE research shows agents can learn from their own mistakes mid-task—noticing when they grabbed the wrong file and adjusting, without needing to be rebuilt from scratch. And Manus, one of the most widely used consumer agents, published hard-won lessons after four complete redesigns, explaining how they finally learned to keep their agent focused even when a single task touches fifty different tools.
If domain memory is about what the agent reads at the start of a session, context engineering is about what the agent sees at every step within the session. The two patterns work together. You need both.
Here’s what’s inside:
The non-technical TL;DR: If you’re not technical, this is a clear summary of why the heck you should care and what we’re talking about
Why accumulation fails: The research on context rot, attention budgets, and why million-token windows made the problem worse
Context as compiled view: The architectural shift from “append everything” to “compute what’s relevant”—and why it determines whether agents can run for minutes or hours
The four-layer memory model: Working context, sessions, memory, and artifacts—what each layer stores, how they interact, and why the separation matters
Nine scaling principles: The specific patterns that make long-running agents work, drawn from all three papers, with tradeoffs and implementation details
Nine failure modes: How agents break when these principles are ignored—the patterns I see repeatedly in broken implementations
What becomes possible: The capabilities that only exist with correct memory architecture—not incremental improvements, but qualitatively different work
Where to build: A note on where practitioners are actually building these agents
Twelve design prompts to build your own context architecture:
State Persistence Analysis — Classify what your agent must remember vs. discard
View Compilation Design — Define the minimal context needed for each decision
Retrieval Trigger Design — Solve the problem of memory that never gets used
Attention Budget Allocation — Justify every token in your context window
Summarization Schema Design — Specify what must survive compression
External Memory Architecture — Draw the line between context and storage
Multi-Agent Scope Design — Test whether agent splits add clarity or just complexity
Cache Stability Optimization — Audit for cost and latency at scale
Failure Reflection System — Design how agents learn from mistakes
Architecture Ceiling Test — Find where your harness limits model capability
Context Observability Audit — Build the tracing layer for production debugging
The Non-Tech Prompt — Make sense of all this if you’re NOT an engineer
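Before we dive in, here’s a rough sketch of the four-layer separation named in the outline above. The layer names (working context, sessions, memory, artifacts) come from the piece itself; the fields and lifetimes are my own assumptions about how such a split typically looks:

```python
from dataclasses import dataclass, field

# Illustrative only: four layers with different sizes and lifetimes.

@dataclass
class WorkingContext:
    """What the model actually sees this step: small, recompiled each turn."""
    goal: str
    relevant_facts: list = field(default_factory=list)

@dataclass
class Session:
    """The full event log for one run: append-only, mostly NOT in context."""
    events: list = field(default_factory=list)

@dataclass
class Memory:
    """Durable lessons and preferences that survive across sessions."""
    notes: dict = field(default_factory=dict)

@dataclass
class Artifacts:
    """Large outputs (files, reports) stored by reference, never inlined."""
    paths: list = field(default_factory=list)

# Why the separation matters: the session can grow large while the
# working context stays tiny, because it's recompiled from the session.
session = Session(events=[{"step": i} for i in range(200)])
ctx = WorkingContext(goal="draft summary", relevant_facts=session.events[-3:])
print(len(session.events), len(ctx.relevant_facts))  # 200 3
```

The design choice the sketch encodes: each layer has a different lifetime (one step, one run, forever, external storage), which is why collapsing them into a single growing transcript is exactly the overloaded desk.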
Clearing the desk for the agent is one of the biggest blockers in the way of long-running agentic workflows.
And those matter because we can get to REAL value real fast if we can give our agents long-running tasks and trust them to get it done.
Think about it: how many workflows get unlocked if you have an AI that can dependably focus for hours instead of minutes? What if that agent could LEARN and record its evolving strategy as it went, improving future runs? That’s what we’re talking about here.
Let’s dive in and learn how to clear the desk for our agents!