Building Agents That Remember: The Rise of Procedural Memory

Author: Denis Avetisyan


Researchers are moving beyond prompting large language models to use tools and are instead focusing on architectures that allow agents to build and reuse reliable, deterministic code.

The system refines its operational directives through iterative self-instruction: it analyzes conversational context and existing protocols to generate nuanced constraints, such as excluding specific textual elements, and thereby evolves its behavior without altering its underlying code. This process is exemplified in applications like the LangGraph Tweet generator, where meta-prompting drives adaptation.

CodeMem is a novel framework that leverages dynamic MCP and procedural memory to build reproducible agents with enhanced tool-use capabilities.

While large language model (LLM) agents demonstrate promise in tool use, their inherent probabilistic nature limits reliability and hinders consistent performance on repetitive tasks. This paper introduces CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory, an architecture that enables LLM agents to build and utilize procedural memory encoded as executable code. By storing successful logic as reusable workflows, CodeMem shifts agents from improvisational tool callers to deterministic architects of automated processes. Could this approach unlock a new era of dependable and scalable agentic systems for complex, real-world applications?


The Illusion of Agency: Why Current Systems Are Foundering

Contemporary agentic systems, frequently built upon frameworks like ReAct, often falter when faced with reasoning tasks demanding multiple sequential steps. This limitation stems primarily from the finite context window of large language models – the amount of text they can process at once – which restricts the agent’s ability to maintain a coherent train of thought over extended interactions. Furthermore, these systems commonly rely on structured data formats, such as JSON, to facilitate communication with tools; however, this approach proves surprisingly fragile, as minor variations in the expected format can disrupt the entire process. The result is a tendency towards brittle behavior, where even slight deviations from anticipated scenarios lead to errors or failures, hindering reliable performance in complex, real-world applications.

As agentic systems grow in complexity, a critical bottleneck emerges: scalability. While increasing model size and context windows offer temporary improvements, the computational cost rises exponentially with each additional reasoning step. This inefficiency fundamentally limits the ability of these systems to reliably perform long-horizon planning – tasks requiring numerous sequential decisions and anticipating outcomes far into the future. Furthermore, traditional architectures struggle with novel situations because their reliance on pre-defined tools and patterns becomes brittle when confronted with unforeseen circumstances. The system’s performance degrades as it attempts to extrapolate beyond its training data, hindering its capacity to generalize and adapt – a key characteristic of true intelligence.

Current artificial intelligence agents often function as sophisticated token processors, receiving inputs and generating outputs based on patterns learned from vast datasets. However, this approach reaches inherent limitations when confronted with problems demanding genuine reasoning and long-term planning. A paradigm shift is therefore necessary, moving beyond mere pattern recognition to systems capable of executing logic – not just identifying it. This entails building agents that can internally represent knowledge, formulate plans based on symbolic reasoning, and interact with tools in a structured manner, independent of the constraints imposed by fixed context windows and brittle data formats. Such a move would enable agents to reliably navigate complex tasks, adapt to unforeseen circumstances, and ultimately exhibit a more robust and generalizable intelligence, mirroring the flexibility of human cognition.

Code as the Seed: Architecting for Adaptability

The CodeMem architecture fundamentally alters the role of the Large Language Model (LLM), moving away from text completion and towards the orchestration of executable processes. Instead of generating textual outputs, the LLM designs and manages workflows composed of distinct functional units. This is achieved through a secure Sandbox environment which isolates code execution, preventing unintended system access or modification. The LLM’s output is therefore not text per se, but a set of instructions defining the sequence and parameters of operations to be performed within the Sandbox, effectively turning the LLM into a dynamic workflow architect rather than a static text generator.
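
A minimal sketch of this pattern, assuming a hypothetical Sandbox wrapper around a restricted exec environment; the class, tool functions, and call_llm helper below are illustrative stand-ins, not CodeMem's actual interfaces:

```python
# Illustrative sketch: the LLM emits Python source and a restricted
# sandbox executes it with only whitelisted tools in scope.
# call_llm and both tool functions are hypothetical stand-ins.

def search_web(query: str) -> list[str]:
    """Placeholder tool: return search snippets for a query."""
    return [f"result for {query!r}"]

def summarize(texts: list[str]) -> str:
    """Placeholder tool: collapse snippets into a short summary."""
    return " / ".join(texts)[:200]

ALLOWED_TOOLS = {"search_web": search_web, "summarize": summarize}

class Sandbox:
    """Executes model-generated code in a namespace with no builtins."""

    def run(self, code: str) -> dict:
        namespace = {"__builtins__": {}, **ALLOWED_TOOLS}
        exec(code, namespace)  # no imports, file, or network access available
        return {k: v for k, v in namespace.items()
                if k not in ALLOWED_TOOLS and k != "__builtins__"}

def call_llm(task: str) -> str:
    """Hypothetical LLM call that returns Python source for the task."""
    return ("results = search_web('procedural memory for agents')\n"
            "answer = summarize(results)\n")

state = Sandbox().run(call_llm("Summarize recent work on procedural memory."))
print(state["answer"])
```

The key shift is that the model's output is consumed by an interpreter rather than a reader, so success is judged by whether the program runs and leaves the right state behind, not by how the text reads.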

Traditional Large Language Model (LLM) agents commonly operate by selecting from a predefined set of tools via JSON-based API calls, limiting flexibility and reasoning complexity. The CodeMem architecture instead defines the agent’s action space as executable Python code. This allows for dynamic tool chaining, where the output of one tool directly informs the input of the next, creating complex workflows not achievable with static JSON calls. By generating and executing code, the agent gains programmatic control over tool interaction, enabling conditional logic, iterative refinement, and the creation of novel tool combinations – substantially improving reasoning capabilities and expanding the scope of solvable problems beyond the constraints of fixed API structures.
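
To make the contrast concrete, here is a hedged sketch, with invented stub tools rather than the paper's actual tool set, of a JSON-style action versus a code action that chains and branches in a single step (in a CodeMem-style system such tools would presumably be exposed via MCP):

```python
# Hypothetical stub tools standing in for real external services.
def get_stock_price(ticker: str) -> float:
    return 132.5  # stub value for illustration

def draft_alert(ticker: str, price: float) -> str:
    return f"ALERT: {ticker} trading at {price:.2f}"

# JSON-style tool calling: one tool per model turn, no control flow.
json_action = {"tool": "get_stock_price", "args": {"ticker": "NVDA"}}

# Code-as-action: one generated program chains tools and branches locally.
code_action = """
price = get_stock_price("NVDA")
report = draft_alert("NVDA", price) if price > 100 else "no alert needed"
"""

scope = {"get_stock_price": get_stock_price, "draft_alert": draft_alert}
exec(code_action, scope)
print(scope["report"])  # -> ALERT: NVDA trading at 132.50
```

The JSON variant would need another model turn to decide what to do with the returned price; the code variant resolves that branch inside the interpreter.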

The CodeMem architecture enables the integration of new functionalities without model retraining by separating tool definition from tool execution. Traditionally, Large Language Models (LLMs) require retraining to incorporate new tools or capabilities; however, CodeMem allows for the dynamic introduction of tools defined as Python code within the secure Sandbox environment. This decoupling means the LLM’s core weights remain unchanged while new tools are simply made available for programmatic chaining via techniques like Dynamic ReAct. The LLM then leverages these tools based on the current task, effectively expanding its capabilities on-the-fly without necessitating a computationally expensive retraining process.
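
The decoupling can be illustrated with a small registry sketch; the decorator and registry names here are invented for illustration and are not CodeMem's API:

```python
from typing import Callable

# Sketch of runtime tool registration (names are illustrative): new tools
# are added to a registry the sandbox exposes, so the model's weights never
# change; only the catalogue of callable functions does.
TOOL_REGISTRY: dict[str, Callable] = {}

def register_tool(fn: Callable) -> Callable:
    """Decorator that makes a function available to generated code."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn

@register_tool
def convert_currency(amount: float, rate: float) -> float:
    return amount * rate

# A tool added later, mid-session, with no retraining or redeployment.
@register_tool
def format_report(value: float) -> str:
    return f"Converted total: {value:.2f}"

# The sandbox injects the current registry before executing generated code,
# so newly registered tools are immediately usable.
generated = "total = format_report(convert_currency(100.0, 0.92))"
scope = dict(TOOL_REGISTRY)
exec(generated, scope)
print(scope["total"])  # -> Converted total: 92.00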

Persistent Memory: The Echo of Experience

Procedural memory, a core component of the CodeMem architecture, represents an agent’s implicit knowledge of task execution. This differs from declarative memory, which stores facts, by focusing on the ‘how’ rather than the ‘what’ of problem-solving. It is implemented as a store of executable code snippets or functions, enabling the agent to retain and reuse previously learned procedures. This capability is critical for building robust agents capable of consistently performing multi-step tasks and adapting to novel situations without requiring explicit re-instruction, as the agent can leverage stored procedural knowledge to guide its actions.
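
A toy illustration of the distinction, assuming nothing about CodeMem's actual storage format; the dictionaries and helper functions below are purely illustrative:

```python
# Toy illustration (not CodeMem's actual data model): declarative memory
# stores facts ("what"); procedural memory stores how-to knowledge as
# executable code that can be replayed on new inputs.

declarative_memory = {"vat_rate_de": 0.19}   # a fact

procedural_memory: dict[str, str] = {}       # name -> stored source code

def remember_procedure(name: str, source: str) -> None:
    """Store a learned procedure as source code."""
    procedural_memory[name] = source

def recall_procedure(name: str, **inputs):
    """Re-execute a stored procedure against fresh inputs."""
    scope = dict(inputs)
    exec(procedural_memory[name], scope)
    return scope["result"]

# A procedure captured once, e.g. after a successful task...
remember_procedure("gross_price", "result = round(net * (1 + vat_rate), 2)")

# ...and reused later without re-deriving the steps.
print(recall_procedure("gross_price", net=100.0,
                       vat_rate=declarative_memory["vat_rate_de"]))  # 119.0
```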

LangGraph and the CoALA Framework integrate procedural memory to enhance agentic capabilities by allowing agents to store and retrieve knowledge gained during interactions. These frameworks provide mechanisms for capturing task execution steps, identified patterns, and successful strategies as reusable components. This contrasts with stateless approaches where each interaction is treated independently. By persisting and recalling this procedural knowledge, agents can avoid redundant calculations, improve performance on repeated tasks, and generalize learned behaviors to novel situations, effectively transitioning from reactive responses to proactive problem-solving.

The CodeMem architecture utilizes specific tools, namely register_skill and write_todos, to enable agents to move beyond purely reactive behavior. register_skill allows the agent to define and store reusable code blocks representing specific task executions, effectively creating a library of procedural knowledge. Subsequently, write_todos facilitates the agent’s ability to plan and persist intermediate goals – or “todos” – that represent steps towards completing a larger task. This combination allows the agent to not simply respond to immediate prompts, but to proactively manage a sequence of actions, recalling and applying previously learned skills to new situations and improving task completion rates.
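
The paper names these tools, but their interfaces are not spelled out here, so the following is a hedged sketch with invented signatures and in-memory stores standing in for whatever persistence CodeMem actually uses:

```python
# Hedged sketch: register_skill and write_todos are named in the paper,
# but the signatures and storage below are invented for illustration.

SKILLS: dict[str, str] = {}      # skill name -> stored source code
TODOS: list[str] = []            # persisted intermediate goals

def register_skill(name: str, code: str) -> None:
    """Persist a reusable block of task logic under a name."""
    SKILLS[name] = code

def write_todos(items: list[str]) -> None:
    """Persist the agent's plan as an ordered list of pending steps."""
    TODOS.extend(items)

# The agent plans first, then captures the logic that worked.
write_todos(["fetch quarterly data", "compute growth", "draft summary"])
register_skill("compute_growth", "result = (current - previous) / previous")

# On a later, similar task the skill is replayed instead of re-derived.
scope = {"current": 120.0, "previous": 100.0}
exec(SKILLS["compute_growth"], scope)
print(TODOS[0], "->", f"growth = {scope['result']:.0%}")  # growth = 20%
```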

The CodeMem architecture improves agent performance by replacing the ReAct prompting strategy with the execution of actual code, coupled with the implementation of procedural memory. Testing on a dataset of 25 multi-step tasks demonstrated a minimum correctness rate of 96% when integrated with the Gemini 3 Full large language model. This signifies a substantial increase in both the reliability and stability of agents performing complex tasks, as the agent leverages stored procedural knowledge to consistently achieve accurate results across multiple steps.

Beyond Metrics: Judging the Quality of Thought

Conventional methods for gauging the intelligence of autonomous agents frequently struggle to capture the subtleties of their performance. Simple metrics, such as task completion rates or reward accumulation, often fail to differentiate between an agent that achieves a correct outcome through inefficient or convoluted steps and one that demonstrates elegant, reasoned behavior. This limitation becomes particularly pronounced when evaluating agents designed for complex, multi-step tasks requiring planning, adaptation, and iterative refinement. Consequently, relying solely on these traditional assessments can obscure crucial insights into an agent’s true capabilities, hindering meaningful progress in artificial intelligence and the development of genuinely intelligent systems. A more holistic evaluation is needed to properly assess the quality of an agent’s decision-making process, not merely the final result.

Traditional automated evaluation of artificial intelligence agents frequently relies on simplistic metrics that fail to capture the complexities of problem-solving processes. A novel approach utilizes large language models, such as Gemini 3 Full, as evaluators – effectively, LLM-as-a-Judge. This methodology transcends mere task completion assessment; it analyzes the entire execution trajectory of an agent, considering not just the final outcome but also the reasoning steps and iterative refinement employed. By scrutinizing this process, researchers gain a nuanced understanding of an agent’s strengths and weaknesses, identifying areas for improvement beyond simply whether an answer is correct or incorrect. This granular level of evaluation facilitates more targeted architectural refinements and a deeper comprehension of agent intelligence, ultimately leading to more robust and capable AI systems.
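
A hedged sketch of such a harness is shown below; build_judge_prompt and call_judge_model are hypothetical helpers, and the hard-coded verdict stands in for a real model response:

```python
# Hedged sketch of an LLM-as-a-Judge harness: the judge scores the full
# execution trajectory, not just the final answer. call_judge_model is a
# hypothetical stand-in for a real model client (e.g. a Gemini API call).
import json

def build_judge_prompt(task: str, trajectory: list[dict], final_answer: str) -> str:
    steps = "\n".join(
        f"Step {i}: action={s['action']!r}, observation={s['observation']!r}"
        for i, s in enumerate(trajectory, start=1)
    )
    return (
        "You are grading an agent's work.\n"
        f"Task: {task}\n"
        f"Trajectory:\n{steps}\n"
        f"Final answer: {final_answer}\n"
        "Return JSON with fields: correctness (0-1), efficiency (0-1), rationale."
    )

def call_judge_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real client in practice."""
    return json.dumps({"correctness": 1.0, "efficiency": 0.8,
                       "rationale": "Correct result with one redundant call."})

trajectory = [
    {"action": "search('M3ToolEval benchmark')", "observation": "3 hits"},
    {"action": "summarize(hits)", "observation": "benchmark description"},
]
verdict = json.loads(call_judge_model(build_judge_prompt(
    "Describe the M3ToolEval benchmark.", trajectory, "A tool-use benchmark...")))
print(verdict["correctness"], verdict["rationale"])
```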

Recent advancements in agent intelligence have yielded significant performance gains, notably demonstrated by the CodeMem architecture. In evaluations using the challenging M3ToolEval benchmark, CodeMem achieved a 20% improvement in success rate compared to established ReAct agents. This leap in performance stems from CodeMem’s innovative approach to memory management, allowing the agent to more effectively retain and utilize past experiences during problem-solving. The benchmark, designed to rigorously test an agent’s ability to utilize tools and reason through complex tasks, provides a robust platform for quantifying these improvements and highlights CodeMem as a promising architecture for building more capable and reliable autonomous agents.

Recent evaluations utilizing the Gemini 3 Full language model demonstrate a significant capacity for efficient problem-solving in complex tasks. The average number of assistant calls – essentially, the iterative steps required to reach a successful outcome – was measured at 7.00 across a diverse benchmark. This relatively low call count suggests the model doesn’t merely stumble upon correct solutions, but instead exhibits robust reasoning capabilities and a capacity for strategic iteration. Such performance highlights an ability to effectively plan, execute, and refine approaches, indicative of a more sophisticated cognitive process than simple trial-and-error and providing a compelling metric for evaluating agent intelligence beyond basic task completion.

A detailed analysis of agent performance, facilitated by LLM-as-a-Judge evaluations, moves beyond simple pass/fail metrics to reveal specific areas where an agent excels or falters. This granular understanding isn’t merely diagnostic; it directly informs architectural refinements. By pinpointing weaknesses in reasoning, tool use, or iterative problem-solving – as demonstrated by improvements in benchmarks like M3ToolEval and reductions in assistant calls needed for task completion – developers can strategically focus improvements. This targeted approach contrasts with broad architectural changes, allowing for more efficient development cycles and ultimately, more capable agents. The methodology provides a feedback loop where evaluation isn’t an end in itself, but a crucial step in iterative design and optimization, pushing the boundaries of agent intelligence.

Towards Symbiosis: The Future of Intelligent Systems

The CodeMem architecture achieves improved numerical reasoning capabilities by integrating techniques such as Program of Thoughts (PoT), which strategically offloads complex logical operations to an external Python interpreter. This approach allows the agent to leverage the full power of a programming language for tasks involving arithmetic, symbolic manipulation, and algorithmic problem-solving, rather than relying solely on the language model’s inherent, and often limited, numerical abilities. By effectively shifting computational burden, CodeMem with PoT demonstrates a significant increase in accuracy when tackling quantitative challenges, exceeding the performance of traditional Chain of Thought prompting by approximately 10%. This highlights the potential of hybrid architectures that combine the strengths of large language models with the precision and expressiveness of executable code.

Recent advancements in prompting techniques for large language models demonstrate a notable performance increase through the implementation of Program of Thoughts (PoT). This approach, building upon the established Chain of Thought (CoT) methodology, allows the model to not simply verbalize its reasoning, but to actually execute numerical computations using a Python interpreter. Empirical results indicate that this shift from declarative reasoning to programmatic execution yields a significant improvement in accuracy – specifically, a 10% gain in correctly solved numerical reasoning problems compared to standard CoT prompting. This suggests that offloading calculation to an external tool not only verifies the model’s logic but also minimizes errors inherent in solely relying on the language model’s internal arithmetic capabilities, paving the way for more reliable and intelligent agents.
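
A minimal sketch of the idea, with the generated program hard-coded as a stand-in for actual model output:

```python
# Hedged sketch of Program of Thoughts: instead of doing arithmetic in
# prose (Chain of Thought), the model emits a small program and the
# Python interpreter performs the computation.

question = ("A shop sells 37 items at 24.99 each and refunds 5 of them. "
            "What is the net revenue?")

# What a PoT-style model might generate for this question:
generated_program = """
gross = 37 * 24.99
refunds = 5 * 24.99
answer = round(gross - refunds, 2)
"""

scope: dict = {}
exec(generated_program, scope)  # the interpreter, not the LLM, does the math
print(f"{question}\n-> {scope['answer']}")  # -> 799.68
```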

The CodeMem architecture exhibits a notable improvement in task completion efficiency when contrasted with the ReAct framework. Studies reveal that code agents leveraging CodeMem require, on average, 30% fewer interaction turns to achieve the same results. This heightened action efficiency stems from the architecture’s capacity to more effectively manage and utilize code execution as an integral part of the reasoning process. By directly interfacing with a Python interpreter, CodeMem agents minimize the need for iterative prompting and refinement characteristic of ReAct, streamlining the problem-solving pathway and accelerating task completion. This reduction in turns not only optimizes performance but also suggests a more focused and direct approach to utilizing computational resources, paving the way for more responsive and practical intelligent agents.

The progression towards increasingly sophisticated intelligent agents hinges significantly on advancements in how these systems store and utilize information. Current architectures, while demonstrating impressive capabilities, face limitations when confronted with truly complex challenges that demand vast knowledge bases and intricate reasoning processes. Consequently, dedicated research into efficient memory management – exploring techniques to compress, prioritize, and rapidly retrieve relevant data – is paramount. Simultaneously, innovations in knowledge representation, moving beyond simple data storage to nuanced contextual understanding and relational mapping, will be crucial. These combined efforts promise not only to enhance the scalability of existing agents but also to unlock the potential for tackling problems currently beyond their reach, fostering a future where artificial intelligence can effectively navigate and contribute to complex real-world scenarios.

The trajectory of artificial intelligence research aims beyond the development of sophisticated tools; it envisions agents functioning as genuine collaborators. These future systems will not merely execute pre-programmed instructions, but demonstrate independent thought processes, enabling them to approach challenges with creativity and formulate novel solutions. Crucially, continuous learning is paramount; these agents will adapt and refine their understanding through experience, autonomously expanding their knowledge base and improving their problem-solving capabilities. This shift from tool to partner represents a fundamental reimagining of the human-machine relationship, potentially unlocking new levels of innovation and addressing complex challenges with a synergy previously unattainable.

The pursuit of deterministic execution, as presented in this work with CodeMem, echoes a sentiment long held by those who grapple with complex systems. It is a striving for control within a universe fundamentally resistant to it. As Isaac Newton observed, "If I have seen further it is by standing on the shoulders of giants." This architecture, by storing successful logic as procedural memory, doesn't create intelligence; rather, it carefully curates the accumulated wisdom of prior interactions. The system isn't built; it grows from these stored experiences, a frozen compromise of past successes. Dependencies inevitably remain, but the attempt to solidify them into reusable code is a quiet acknowledgement of the inherent fragility of improvisation.

What Lies Ahead?

The pursuit of reliable agents, as exemplified by architectures like CodeMem, consistently reveals a fundamental truth: a system isn’t a machine to be built, but a garden to be grown. Storing successful logic as procedural memory addresses the immediate fragility of improvisational agents, yet merely shifts the problem. The garden will still require tending. What constitutes ‘successful’ logic, and how does one prune the inevitable overgrowth of brittle, context-specific solutions? The architecture’s reliance on deterministic execution within a sandbox is a commendable step, but resilience lies not in isolation, but in forgiveness between components; the ability to gracefully degrade, to reroute around failures, remains largely unexplored.

Future work will likely concentrate on the meta-cognitive aspects of this ‘memory’. An agent that simply stores solutions is not intelligent; one that understands why a solution works, and can adapt it to novel situations, is a different order of complexity. The challenge isn’t simply to increase the volume of procedural memory, but to cultivate a system capable of abstraction, generalization, and – crucially – the identification of patterns that transcend specific tool calls.

Ultimately, the field must acknowledge that every architectural choice is a prophecy of future failure. CodeMem offers a valuable approach to managing that failure, but the real innovation will lie in designing systems that anticipate it, and evolve accordingly. The garden, after all, always finds a way.


Original article: https://arxiv.org/pdf/2512.15813.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
