Giving Agents a Memory: Speeding Up AI with Reused Plans

Author: Denis Avetisyan


A new mechanism allows AI agents to drastically reduce response times by intelligently leveraging and reusing previously generated plans.

Large language models are being integrated into agent workflows, establishing a new paradigm where sophisticated reasoning and natural language understanding drive autonomous action and decision-making processes.

AgentReuse utilizes semantic similarity analysis and structured caching to achieve up to 93.12% latency reduction in LLM-driven agents for AIoT and serverless applications.

While large language models (LLMs) significantly enhance the capabilities of AI assistants, their plan generation latency can degrade user experience. This paper, ‘A Plan Reuse Mechanism for LLM-Driven Agent’, introduces AgentReuse, which addresses this challenge by intelligently caching and reusing previously generated plans. Leveraging semantic similarity analysis and intent classification, AgentReuse achieves a 93% effective plan reuse rate, reducing latency by over 93% compared to baseline approaches. Could this mechanism unlock a new level of responsiveness and efficiency for the next generation of LLM-powered personal assistants and AIoT devices?


The Illusion of Autonomy: Why LLMs Aren’t Enough

The pursuit of genuinely autonomous agents reveals a critical limitation of current approaches centered solely on Large Language Models. While these models excel at pattern recognition and generating human-like text, true autonomy necessitates the ability to not only understand a situation but to proactively plan and reason about future actions. Simply predicting the next most probable response isn’t sufficient for navigating complex, dynamic environments; agents must be capable of setting goals, formulating strategies to achieve them, and adapting those strategies when faced with unforeseen circumstances. This requires integrating LLMs with symbolic reasoning systems and planning algorithms, effectively moving beyond reactive intelligence towards a more deliberate and foresightful form of artificial intelligence capable of independent problem-solving.

Conventional methods in autonomous agent design frequently falter when confronted with intricate, real-world challenges. These systems, often reliant on pre-programmed responses or limited datasets, exhibit significant latency – a delay between perceiving a situation and formulating an appropriate action – hindering performance in dynamic environments. More critically, a lack of adaptability plagues these agents; unforeseen circumstances or novel situations frequently trigger failures as they struggle to generalize beyond their training parameters. This inflexibility stems from an inability to effectively reason about the environment, plan for future contingencies, and adjust strategies on the fly, ultimately limiting their utility in complex and unpredictable scenarios.

The total latency for processing each request is composed of contributions from different methods, as illustrated by the breakdown in the figure.

Efficient Agency: The Promise of Reused Plans

AgentReuse demonstrably improves efficiency by utilizing previously generated plans, resulting in a substantial decrease in processing time. Testing indicates a 93.12% reduction in latency when employing AgentReuse compared to scenarios where no plan reuse mechanism is implemented. This performance gain stems from avoiding redundant computation by leveraging existing solutions, effectively minimizing the operational overhead associated with generating plans from scratch for each new request. The system’s capacity to rapidly access and apply prior plans directly translates to faster response times and increased throughput.
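
To make the cold-versus-warm gap concrete, here is a toy sketch that caches plans under an exact request key and simulates LLM latency with a two-second sleep; both the delay and the exact-key lookup are stand-ins for illustration only, since the actual mechanism matches requests semantically, as the following paragraphs describe.

```python
import time

PLAN_STORE: dict[str, str] = {}  # request -> previously generated plan

def llm_generate_plan(request: str) -> str:
    """Stand-in for LLM plan generation; the sleep simulates model latency."""
    time.sleep(2.0)
    return f"plan for: {request}"

def get_plan(request: str) -> str:
    """Reuse a stored plan when one exists; otherwise generate and cache it."""
    if request not in PLAN_STORE:
        PLAN_STORE[request] = llm_generate_plan(request)
    return PLAN_STORE[request]

t0 = time.time(); get_plan("turn on the lights at 7 pm"); cold = time.time() - t0
t0 = time.time(); get_plan("turn on the lights at 7 pm"); warm = time.time() - t0
print(f"first request: {cold:.2f}s, reused plan: {warm:.6f}s")
```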

The AgentReuse system identifies relevant, previously generated plans by performing a semantic analysis of both the current user request and the plans stored in its repository. This analysis moves beyond simple keyword matching; the system understands the meaning of the request and plans, enabling it to accurately determine the degree of similarity even when phrasing differs. This semantic understanding is achieved through vectorization, converting textual data into numerical representations that facilitate efficient comparison and retrieval of plans applicable to the present task, allowing for rapid adaptation rather than complete plan regeneration.
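
A minimal sketch of this retrieval step is shown below, assuming an off-the-shelf sentence-embedding model (all-MiniLM-L6-v2) and a 0.7 reuse threshold; the paper prescribes neither choice, and the stored requests and plans are invented for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

model = SentenceTransformer("all-MiniLM-L6-v2")

# Previously generated plans, keyed by the request that produced them.
plan_repository = {
    "turn on the living room lights at 7 pm": "plan: schedule(lights_on, 19:00)",
    "play some relaxing music":               "plan: invoke(music_player, mood=relax)",
}
stored_requests = list(plan_repository)
stored_vecs = model.encode(stored_requests)  # one vector per stored request

def retrieve_plan(request: str, threshold: float = 0.7):
    """Return the cached plan whose originating request is semantically closest,
    but only if the cosine similarity clears the reuse threshold."""
    q = model.encode(request)
    sims = stored_vecs @ q / (np.linalg.norm(stored_vecs, axis=1) * np.linalg.norm(q))
    best = int(np.argmax(sims))
    return plan_repository[stored_requests[best]] if sims[best] >= threshold else None

# A differently worded request is matched numerically rather than by exact string.
print(retrieve_plan("turn the living room lights on at 7 in the evening"))
```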

The AgentReuse system utilizes vectorization to convert textual data, both incoming requests and previously generated plans, into numerical vectors. This conversion enables efficient similarity comparisons using mathematical operations on these vectors, identifying relevant plans without requiring exact string matches. Benchmarking demonstrates a 60.61% reduction in latency compared to traditional Large Language Model (LLM) response caching techniques, such as GPTCache, which rely on exact string matching or keyword-based retrieval. This improvement stems from the ability to rapidly assess semantic similarity, significantly decreasing the time required to locate and adapt reusable plans.
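
The contrast with exact-match caching reduces to arithmetic on vectors. In the sketch below, the two requests are different strings, so a key lookup misses, while their vectors (made up here purely to show the computation) are close enough to justify reuse.

```python
import numpy as np

cached_request = "dim the lights at 7 pm"
new_request    = "turn the lights down this evening"

exact_hit = (new_request == cached_request)  # string equality: a cache miss

u = np.array([0.8, 0.1, 0.6])  # vector for the cached request (illustrative values)
v = np.array([0.7, 0.2, 0.7])  # vector for the new request (illustrative values)
semantic_score = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(f"exact match: {exact_hit}, semantic similarity: {semantic_score:.2f}")
# exact match: False, semantic similarity: ~0.99 -> candidate for plan reuse
```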

AgentReuse consistently achieves higher similarity scores than WithArgs, indicating superior performance in identifying reusable components.

Beyond Single Requests: Decoding True Intent

Effective operation of intelligent agents necessitates accurate intent classification, and increasingly, this requires the capability of multi-intent classification. Traditional single-intent systems process requests sequentially, limiting efficiency when users articulate multiple, related needs within a single interaction. Multi-intent classification enables agents to simultaneously identify and address several distinct requests contained within a single user input, streamlining the interaction and reducing overall response time. This parallel processing capability is crucial for complex tasks and conversational interfaces where users often express multiple goals without explicitly delineating them, allowing for a more natural and efficient user experience.
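
The sketch below illustrates the idea with a deliberately simple multi-label classifier; the intent names and trigger phrases are hypothetical, and a production system would use a learned classifier rather than keyword triggers.

```python
# Hypothetical intents and trigger phrases, chosen only for illustration.
INTENT_TRIGGERS = {
    "set_alarm":     ["wake me", "alarm"],
    "play_music":    ["play", "music", "song"],
    "weather_query": ["weather", "forecast", "rain"],
}

def classify_intents(utterance: str) -> list[str]:
    """Return every intent whose trigger phrase appears in the utterance,
    so one input can yield several intents at once."""
    text = utterance.lower()
    return [intent for intent, triggers in INTENT_TRIGGERS.items()
            if any(t in text for t in triggers)]

print(classify_intents("Wake me at 7 and tell me the weather"))
# -> ['set_alarm', 'weather_query']
```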

Parameter Extraction is a critical component of complex intent decoding, functioning by identifying and isolating key pieces of information – such as dates, times, locations, entities, and quantities – embedded within a user’s request. This process moves beyond simply recognizing what the user wants to understanding the details necessary to fulfill that request. The extracted parameters are then utilized as inputs for plan selection algorithms, ensuring the most appropriate course of action is chosen. Effective parameter extraction directly improves the accuracy and efficiency of task execution, enabling agents to move from broad intent recognition to precise action implementation without requiring further clarification from the user.
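
A toy slot-extraction function is shown below; the regular expressions and slot names are illustrative stand-ins, not the schema used in the paper.

```python
import re

def extract_parameters(utterance: str) -> dict:
    """Toy slot extraction: pull a time and a location out of free text."""
    params = {}
    time_match = re.search(r"\b(\d{1,2})(?::(\d{2}))?\s*(am|pm)?\b", utterance, re.I)
    if time_match:
        params["time"] = time_match.group(0)
    loc_match = re.search(r"\bin ([A-Z][a-zA-Z]+)", utterance)
    if loc_match:
        params["location"] = loc_match.group(1)
    return params

print(extract_parameters("Wake me at 7:30 am and check the weather in Boston"))
# -> {'time': '7:30 am', 'location': 'Boston'}
```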

Similarity Evaluation improves intent decoding by quantifying the semantic relationship between user requests and pre-defined plans stored in a Semantic Cache. This process utilizes algorithms that yield a high degree of accuracy in matching requests to appropriate actions. Evaluations demonstrate an F1 Score of 0.9718 and an Accuracy of 0.9459, indicating robust performance. These metrics represent a significant improvement over currently available methods for complex intent classification and demonstrate the effectiveness of the Semantic Cache approach.
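
For reference, the reported figures follow the standard metric definitions. The function below computes F1 and accuracy from confusion-matrix counts; the paper does not publish its underlying counts, so the example numbers are arbitrary and serve only to show the computation.

```python
def f1_and_accuracy(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Standard F1 and accuracy definitions over confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, accuracy

# Arbitrary counts, purely to demonstrate the formulas:
print(f1_and_accuracy(tp=40, fp=5, fn=3, tn=12))
```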

This execution graph visualizes the sequence of actions determined by the planning process to fulfill the user request.

Scaling the Illusion: Deployment and Future Limitations

AutoGen simplifies the creation of sophisticated, large language model (LLM)-powered agents through a carefully designed framework that prioritizes modularity and ease of use. This architecture abstracts away much of the complexity traditionally associated with building multi-agent systems, allowing developers to focus on defining agent roles and interactions rather than managing low-level infrastructure. By providing pre-built components and a clear API, AutoGen significantly reduces development time and accelerates the prototyping process. The framework supports various conversational patterns, including both human-in-the-loop and fully autonomous agent interactions, and facilitates seamless integration with existing tools and APIs. This streamlined approach not only lowers the barrier to entry for developers but also enables rapid iteration and experimentation, ultimately fostering innovation in the field of LLM-driven automation.
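
A minimal two-agent setup in AutoGen’s documented style looks roughly like the sketch below; the model name, API key placeholder, and system message are assumptions for illustration, not the configuration used in the paper.

```python
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_KEY"}]}

planner = AssistantAgent(
    name="planner",
    system_message="Produce a step-by-step plan for the user's request.",
    llm_config=llm_config,
)
user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",         # fully autonomous: no human in the loop
    code_execution_config=False,      # this proxy does not execute code
    max_consecutive_auto_reply=0,     # stop after the planner's reply
)

# Kick off the exchange; the planner responds with a plan for the request.
user.initiate_chat(planner, message="Dim the lights and play music at 7 pm.")
```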

The architecture underpinning AutoGen benefits significantly from integration with serverless computing platforms, enabling a deployment model characterized by both scalability and cost-efficiency. This approach allows for dynamic resource allocation, meaning agents are only instantiated and charged for when actively processing requests – eliminating the expenses associated with maintaining idle server capacity. As demand fluctuates, the system automatically scales to accommodate increased workloads, ensuring consistent performance without manual intervention. This on-demand provisioning, coupled with the pay-per-use billing of serverless functions, drastically reduces operational costs and facilitates the deployment of agent systems at a scale previously unattainable, making sophisticated LLM-driven automation accessible to a wider range of applications and budgets.
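
A hypothetical serverless entry point for such an agent might look like the sketch below; an AWS Lambda-style handler signature is used only for illustration, since the paper does not prescribe a provider, and a real deployment would back the plan store with an external cache rather than process memory.

```python
PLAN_STORE: dict[str, str] = {}  # in practice: an external cache, not in-memory

def generate_plan_with_llm(request: str) -> str:
    """Stand-in for the slow LLM planning call."""
    return f"plan for: {request}"

def handler(event: dict, context: object) -> dict:
    """Billed only while a request is in flight; reuses a stored plan when possible."""
    request = event["request"]
    plan = PLAN_STORE.get(request)                # fast path: reuse a stored plan
    if plan is None:
        plan = generate_plan_with_llm(request)    # slow path: generate from scratch
        PLAN_STORE[request] = plan
    return {"statusCode": 200, "body": plan}

print(handler({"request": "set an alarm for 6 am"}, None))
```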

The incorporation of tool use dramatically enhances the functionality of large language model agents, enabling interactions with external systems and broadening access to diverse information sources. This capability isn’t simply about accessing more data; it facilitates a remarkable degree of planning efficiency. Recent studies demonstrate an Effective Plan Reuse Rate of 93%, indicating that agents can reliably adapt and repurpose previously successful strategies when encountering new, yet related, tasks. This high rate of reuse signifies a substantial reduction in computational cost and development time, as agents spend less effort reinventing solutions and more time executing them. The ability to leverage tools transforms agents from purely generative systems into proactive problem-solvers capable of complex, real-world applications.
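
A minimal tool-registration sketch is shown below; the tool names and plan-step format are hypothetical, intended only to show how a plan step can be routed to an external callable instead of being answered with generated text.

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Decorator that registers a callable as an agent-invocable tool."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("get_weather")
def get_weather(city: str) -> str:
    return f"Sunny in {city}"   # stand-in for a real weather API call

def execute_step(step: dict) -> str:
    """Execute one plan step of the form {'tool': ..., 'args': {...}}."""
    return TOOLS[step["tool"]](**step["args"])

print(execute_step({"tool": "get_weather", "args": {"city": "Boston"}}))
```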

AgentReuse streamlines the process of reusing existing agents by incorporating a retrieval mechanism, a verification step, and a planning module to adapt them to new tasks.

The pursuit of efficiency, as illustrated by AgentReuse’s plan caching mechanism, inevitably introduces a new class of problems. Reducing latency through semantic similarity, a noble goal, merely shifts the battlefield. The bug tracker will fill with edge cases where similar intent doesn’t equate to identical execution. It’s a familiar pattern. The system optimizes for the average, and production reveals the long tail of exceptions. As Linus Torvalds once said, “Most programmers think that if their code works, it is finished. But I think it is never finished.” This framework, designed to reuse plans, will ultimately require constant refinement and adaptation. The elegance of the initial design will be eroded by the relentless demands of real-world application. They don’t deploy; they let go.

What’s Next?

AgentReuse, as presented, addresses a predictable bottleneck: the cost of repeated ideation. It’s a practical solution, and those tend to survive. However, the semantic similarity metric, while effective in controlled conditions, will inevitably encounter the ambiguities of production. Every edge case, every unforeseen interaction, will subtly shift the meaning of ‘reusable’, forcing a constant recalibration of the acceptable margin for error. The 93.12% latency reduction is a snapshot, a momentary reprieve before the entropy of real-world data begins to erode its efficiency.

The architecture isn’t a diagram of elegant reuse, but a compromise that survived deployment. Future iterations will likely focus not on finding identical plans, but on gracefully degrading existing ones. A mechanism for plan repair, perhaps, or for intelligently merging near-misses. The serverless and AIoT contexts further complicate matters. Scale isn’t merely a matter of replication; it’s a distributed erosion of consistency.

Everything optimized will one day be optimized back. The real challenge isn’t building a perfect cache, but building a system that accepts its imperfections. It’s not about eliminating the cost of planning, but about amortizing the cost of failure, and occasionally resuscitating hope from the logs.


Original article: https://arxiv.org/pdf/2512.21309.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
