To Nha Notes | Feb. 20, 2026, 11:43 a.m.
A reference to Ben Lorica's "Your Agents Need Runbooks, Not Bigger Context Windows" — Gradient Flow, February 2026
There's a quiet frustration building among teams deploying AI agents in production. The demos look great. The prototypes work. Then you hit real workloads — data pipelines breaking at 2 AM, API failures that need fifteen-step remediation sequences, hundreds of internal tools to navigate — and suddenly the agent that seemed so capable starts burning tokens, slowing down, and re-inventing the wheel on every single run.
Ben Lorica at Gradient Flow puts a name to this in his latest piece: the Context Tax.
Today's AI agents treat every task like it's their first day on the job. Each time an agent runs, it re-reads documentation, re-plans its steps, re-discovers your infrastructure. If your agent fixes a broken ETL pipeline today, it pays the full "thinking cost" — parsing logs, identifying the failed step, executing a remediation sequence. Tomorrow, when the same pipeline breaks for the same reason, it pays that full cost again. And the day after that. The 1,000th execution is just as expensive as the first.
Lorica frames the root issue clearly: we are treating the agent's context window like volatile RAM — a workspace that wipes clean the moment a task is finished. Traditional computing stacks don't work this way. You don't reload your operating system every time you open a file. But in the current agent paradigm, we force models to "re-boot" their understanding of our infrastructure on every request.
This is compounded by two well-documented architectural limits of large language models. First, computational cost grows quadratically with context length: a task that should take one second can balloon to thirty seconds simply because the context is too large. Second, the "lost-in-the-middle" effect: models attend reliably to the beginning and end of a long prompt but lose track of information buried in the middle. The research Lorica cites suggests that small, focused snippets can be roughly twice as accurate as making an agent parse a giant document.
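The quadratic claim is easy to see with back-of-the-envelope arithmetic. The sketch below is an illustrative cost model, not a benchmark: self-attention compares every token with every other, so work scales with the square of context length.

```python
def attention_cost(context_tokens: int, unit_cost: float = 1.0) -> float:
    """Illustrative self-attention cost model: O(n^2) pairwise comparisons."""
    return unit_cost * context_tokens ** 2

# Doubling the context quadruples the attention work;
# a 10x larger context costs 100x as much.
lean = attention_cost(2_000)      # focused, task-specific context
bloated = attention_cost(20_000)  # "just-in-case" context
print(bloated / lean)  # → 100.0
```

The token counts are assumptions chosen only to show the scaling; real inference cost also includes a linear term, but the quadratic component dominates at long contexts.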
The instinctive answer is Retrieval-Augmented Generation. RAG is excellent at fetching facts — finding a runbook document, surfacing a configuration guide. But as Lorica points out, RAG is built to find documents, not to remember how work was done.
RAG can locate a technical runbook for a database issue. It cannot remember that a specific five-step sequence of API calls resolved that exact issue last Tuesday. Every time a workflow starts, the agent still has to read the documentation and plan its steps from scratch. The organization never accumulates institutional knowledge from its agents' successes.
Other workarounds — dynamic tool loading, stateful continuity for personalization — address parts of the problem but leave the core gap open: there is no mechanism to "crystallize" a successful solution into a reusable, executable procedure.
Lorica's article introduces a compelling architectural response: the Context File System (CFS), also called an Operational Skill Store.
The concept mirrors how mature engineering teams operate. Senior staff solve a novel problem once, document the solution as a runbook, and from that point on the task becomes routine. A CFS does the same for agents — capturing successful multi-step workflows as versioned, executable procedures that can be replayed rather than re-planned.
In practice, rather than stuffing a prompt with "just-in-case" context, an agent using a CFS mounts and unmounts specific operational volumes as needed. It mounts a codebase volume to diagnose a bug, swaps to a technical runbook volume to execute the fix, then unmounts when done. Context becomes a managed resource — high-density, low-noise, and purpose-built for the task at hand.
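One way to picture the mount/unmount model is as a context manager, where only mounted volumes count against the token budget. This is an illustrative sketch under assumed names, not Lorica's specification, and the token count is a crude whitespace proxy.

```python
from contextlib import contextmanager

class AgentContext:
    """Illustrative context budget: only mounted volumes consume tokens."""
    def __init__(self) -> None:
        self.mounted: dict[str, str] = {}

    @contextmanager
    def mount(self, name: str, content: str):
        self.mounted[name] = content
        try:
            yield self
        finally:
            del self.mounted[name]  # unmount: context is reclaimed immediately

    def token_load(self) -> int:
        # Crude proxy: whitespace-delimited tokens across mounted volumes.
        return sum(len(v.split()) for v in self.mounted.values())

ctx = AgentContext()
with ctx.mount("codebase", "def load(): ..."):
    diagnose = ctx.token_load()  # only the code volume is loaded
with ctx.mount("runbook", "1. restart worker 2. replay queue"):
    fix = ctx.token_load()       # swapped: runbook in, codebase out
assert ctx.token_load() == 0     # everything unmounted when done
```

The point of the sketch is the lifecycle: context is scoped to a task phase and released afterward, rather than accumulating for the whole session.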
The economics shift dramatically. The first execution of a workflow pays an "exploration cost" in tokens. Every run after that replays the proven procedure. According to the article, this can reduce token consumption by over 90 percent for repeated workflows.
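Rough arithmetic shows how a figure in that range can arise. The token counts below are assumptions for illustration only; the article's 90-percent claim is its own, not derived from these numbers.

```python
exploration_cost = 50_000  # assumed: first run reads docs, plans, explores
replay_cost = 4_000        # assumed: later runs load and execute the procedure
runs = 1_000

without_cfs = exploration_cost * runs
with_cfs = exploration_cost + replay_cost * (runs - 1)
savings = 1 - with_cfs / without_cfs
print(f"{savings:.0%}")  # → 92% at these assumed costs
```

The shape of the result matters more than the exact numbers: the exploration cost is paid once and amortized, so savings approach the ratio of replay cost to exploration cost as runs accumulate.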
A well-designed CFS comes down to a few structural properties that Lorica outlines:
Persistent Procedural Memory captures successful multi-step workflows as versioned procedures, eliminating repeated planning overhead for known task types.
Indexed Tool Discovery maintains an external index of tool capabilities and API schemas, loading only what's relevant for the current task — solving the "tool sprawl" problem where hundreds of API definitions eat your token budget before work even begins.
Separation of Reasoning and Execution reserves expensive model reasoning for genuinely novel problems. Routine work is handled by the memory layer. This turns what was a variable, scaling cost into a fixed, reusable asset.
Self-Healing Infrastructure monitors procedure success rates. When an underlying API changes and a workflow starts failing, the system automatically pulls the procedure and triggers a re-learning phase — agents don't silently accumulate stale knowledge.
Governance and Auditability records every action in an episodic memory layer, producing the execution traces and version history essential for debugging in regulated environments.
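The self-healing property above can be sketched in a few lines. The thresholds and class below are hypothetical, not from the article: the store tracks per-procedure outcomes and retires a procedure whose recent success rate degrades, forcing a re-learning run instead of replaying stale steps.

```python
class MonitoredProcedure:
    """Tracks outcomes; flags itself unhealthy when the success rate drops."""
    def __init__(self, steps: list[str],
                 min_success_rate: float = 0.8, min_samples: int = 5) -> None:
        self.steps = steps
        self.version = 1
        self.outcomes: list[bool] = []      # doubles as an audit trail
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    @property
    def healthy(self) -> bool:
        if len(self.outcomes) < self.min_samples:
            return True  # not enough evidence yet to retire the procedure
        recent = self.outcomes[-self.min_samples:]
        return sum(recent) / len(recent) >= self.min_success_rate

proc = MonitoredProcedure(steps=["restart_worker", "verify_queue"])
for ok in [True, True, False, False, False, False]:  # upstream API changed
    proc.record(ok)
assert not proc.healthy  # trigger re-learning rather than replaying stale steps
```

The outcome log also gestures at the auditability property: the same record that drives retirement decisions is the execution trace a regulated team would inspect.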
For teams running operational workloads — data pipelines, DevOps automation, infrastructure management — this architecture is directly relevant. The distinction Lorica draws is worth keeping in mind:
Stateful memory systems are the right choice when the goal is conversation and personalization — research copilots, personal assistants. A CFS architecture is the better choice for high-repetition operational work like DevOps automation or data engineering, where the goals are cost predictability and the reuse of proven procedures.
In a data engineering context, this maps clearly: an agent that monitors and repairs ETL pipelines should not be rebuilt as a conversational model. It should be backed by an operational skill store that accumulates proven remediation patterns over time, reducing cost per incident while improving reliability.
The relationship with the Model Context Protocol (MCP) is also worth noting. MCP standardizes how agents connect to internal systems. A CFS makes that connection economically viable at scale. Without operational memory, a large-scale MCP deployment risks collapsing under the weight of its own tool catalog — every new connection adds to the context burden rather than distributing the knowledge load.
What Lorica is pointing toward is an architecture where agents compound organizational value over time. Knowledge gained by one agent execution becomes an immediate asset for the entire enterprise. Rather than isolated experiments, AI infrastructure becomes a library of expertise — cheaper and more reliable the more it is used.
This is the missing layer: not a smarter model, not a bigger context window, but a persistent store of how work actually gets done in your organization.
The full article is worth reading, particularly for teams at the stage of moving AI agents from prototype into production.