Author: Denis Avetisyan
Researchers are leveraging the power of large language models to enable more effective decision-making for artificial intelligence agents operating in dynamic, real-world environments.

This work introduces a novel framework, SGA-ACR, that augments LLM-based planning with environmental knowledge through subgoal graphs and a multi-LLM actor-critic-refiner architecture.
While large language models show promise in high-level planning for reinforcement learning, a critical gap often exists between abstract intentions and successful environmental interaction. This paper, ‘Subgoal Graph-Augmented Planning for LLM-Guided Open-World Reinforcement Learning’, addresses this misalignment by introducing a framework that integrates environment-specific knowledge via subgoal graphs with a multi-LLM actor-critic-refiner architecture. Our approach generates more executable and verifiable subgoals, demonstrably improving agent performance in complex, open-world scenarios. Could this represent a crucial step towards more robust and adaptable LLM-guided agents capable of thriving in truly dynamic environments?
Navigating Complexity: The Foundations of Intelligent Action
Conventional artificial intelligence systems encounter significant obstacles when operating within open-world environments due to their inherent ambiguity and expansive scale. These systems, often reliant on predefined parameters and limited datasets, struggle to process the infinite possibilities and unpredictable nature of such spaces. Robust planning capabilities are therefore crucial, demanding AI that can not only chart a course but also anticipate potential challenges, adapt to dynamic changes, and reason about incomplete information. The sheer volume of data and the constant need for real-time decision-making place substantial computational burdens on traditional algorithms, often leading to brittle performance and an inability to generalize beyond narrowly defined scenarios. Overcoming these limitations necessitates a shift towards more flexible, adaptive, and knowledge-driven approaches to AI planning.
Successful navigation within complex open-world environments extends far beyond simply charting a course from point A to point B. Truly effective agents require a nuanced comprehension of how entities – objects, characters, and even abstract concepts – relate to one another, and how these relationships influence task completion. An agent must discern not only where to go, but why, understanding that a path blocked by a moving object necessitates a revised plan, or that completing a prerequisite task unlocks access to a previously unreachable area. This demands a system capable of modeling dependencies – recognizing, for example, that acquiring a key is logically prior to opening a locked door – and dynamically adjusting strategies based on evolving circumstances and the actions of other entities within the world. Without this deep understanding of interconnectedness, an agent risks becoming lost in a sea of possibilities, unable to prioritize actions or adapt to the unpredictable nature of a genuinely open-world setting.
Existing AI navigation systems frequently stumble when confronted with the unpredictable nature of real-world environments, largely due to a reliance on pre-programmed responses and limited contextual awareness. These systems often struggle to dynamically adjust to novel obstacles, shifting priorities, or unexpected changes in the landscape, resulting in inefficient or failed task completion. A key limitation lies in their inability to effectively integrate and utilize readily available environmental cues – such as recognizing potential shortcuts, anticipating the behavior of other agents, or understanding the implications of weather conditions – which a human effortlessly incorporates into their planning. This lack of adaptability and environmental reasoning ultimately restricts the long-term viability of these agents in complex, open-world scenarios, hindering their capacity for sustained autonomous operation and robust performance.
LLM-Based Planning: A Blueprint for Coherent Action
LLM-based Planning within this framework utilizes Large Language Models to generate potential action sequences, termed candidate plans. These plans are not created in isolation; they are constructed by referencing a ‘Subgoal Graph’. This graph serves as a structured representation of both environmental knowledge – detailing static elements and dynamic relationships – and task hierarchies, breaking down complex objectives into manageable subgoals. The Subgoal Graph enables the LLM to reason about preconditions, effects, and dependencies between actions, facilitating the creation of plans that are logically coherent and contextually relevant. The LLM accesses and utilizes this graph to inform its planning process, ensuring proposed actions align with the understood environment and contribute to the overall task completion.
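To make the idea concrete, the following minimal sketch shows how such a Subgoal Graph might be represented in code, assuming a Crafter-style crafting chain; the node structure and subgoal names are illustrative rather than the paper’s exact schema.
```python
# Minimal sketch of a subgoal graph for a Crafter-like task hierarchy.
# The SubgoalNode structure and node names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SubgoalNode:
    name: str                                               # e.g. "collect_wood"
    prerequisites: list = field(default_factory=list)       # subgoals that must be completed first
    required_entities: list = field(default_factory=list)   # environment entities the subgoal depends on

# A tiny fragment of a graph encoding task dependencies for a crafting chain.
SUBGOAL_GRAPH = {
    "collect_wood":      SubgoalNode("collect_wood", [], ["tree"]),
    "place_table":       SubgoalNode("place_table", ["collect_wood"], ["wood"]),
    "make_wood_pickaxe": SubgoalNode("make_wood_pickaxe", ["place_table"], ["table", "wood"]),
    "collect_stone":     SubgoalNode("collect_stone", ["make_wood_pickaxe"], ["stone"]),
}

def executable_subgoals(completed: set) -> list:
    """Subgoals whose prerequisites are all satisfied -- candidates for the next plan step."""
    return [
        name for name, node in SUBGOAL_GRAPH.items()
        if name not in completed and all(p in completed for p in node.prerequisites)
    ]

print(executable_subgoals({"collect_wood"}))   # ['place_table']
```
A graph of this shape is what lets the planner reason about preconditions and dependencies before any plan is proposed.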
The planning process utilizes two Large Language Models (LLMs) operating in a collaborative loop. The Actor LLM is responsible for generating candidate plans, drawing upon information stored in the Entity Knowledge Base, which contains details about objects, locations, and relationships within the environment. These proposed plans are then submitted to the Critic LLM, which assesses their viability based on pre-defined goals and constraints. The Critic LLM evaluates factors such as resource requirements, potential obstacles, and adherence to safety protocols, providing feedback to refine or reject the proposed plan. This iterative Actor-Critic dynamic ensures that generated plans are not only strategically sound but also practically feasible and aligned with the overarching objectives of the system.
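The loop below is a simplified sketch of that actor-critic exchange; `actor_llm` and `critic_llm` stand in for caller-supplied chat-completion wrappers, and the prompts and score-parsing heuristic are assumptions for illustration, not the paper’s templates.
```python
# Sketch of the actor-critic planning loop. actor_llm and critic_llm are assumed to be
# caller-supplied callables that wrap chat-completion requests; prompts are illustrative.
import re

def parse_score(critique: str) -> float:
    """Pull the first number out of the critic's free-text rating (illustrative heuristic)."""
    match = re.search(r"\d+(\.\d+)?", critique)
    return float(match.group()) if match else 0.0

def propose_and_rank(task, knowledge_base, graph, actor_llm, critic_llm, n_candidates=3):
    # Actor: generate several candidate plans grounded in the subgoal graph and entity knowledge.
    prompt = (f"Task: {task}\nKnown entities: {knowledge_base}\n"
              f"Subgoal graph: {list(graph)}\nPropose a step-by-step plan.")
    candidates = [actor_llm(prompt) for _ in range(n_candidates)]

    # Critic: rate each plan for feasibility and goal alignment; keep the feedback for later refinement.
    scored = []
    for plan in candidates:
        critique = critic_llm(f"Task: {task}\nPlan: {plan}\n"
                              "Rate feasibility and goal alignment from 0 to 10 and explain.")
        scored.append((parse_score(critique), plan, critique))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored  # highest-rated plan first, with critic feedback attached
```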
The collaborative planning process integrates logical reasoning with environmental awareness through a dual-LLM system. The Actor LLM generates plans based on task decomposition represented in the Subgoal Graph and informed by the Entity Knowledge Base, while the Critic LLM assesses these plans for feasibility and goal alignment. This evaluation isn’t merely syntactic; the Critic LLM utilizes the Entity Knowledge Base to verify that proposed actions are physically possible and consistent with known environmental constraints. Consequently, plans are not only logically coherent but also grounded in a detailed, structured understanding of the operating environment, minimizing the risk of proposing actions that are theoretically sound but practically infeasible.
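A feasibility check of this kind might look like the sketch below, where the entity knowledge base entries and the plan-step format are assumptions for illustration.
```python
# Sketch of a critic-side feasibility check that grounds a proposed plan step in the
# entity knowledge base. KB entries and the step format are illustrative assumptions.

ENTITY_KB = {
    "tree":  {"yields": "wood",  "requires_tool": None},
    "stone": {"yields": "stone", "requires_tool": "wood_pickaxe"},
}

def step_is_feasible(step: dict, inventory: set) -> bool:
    """A step is feasible if its target entity is known and any required tool is already held."""
    entity = ENTITY_KB.get(step["target"])
    if entity is None:                       # unknown entity: reject the step as ungrounded
        return False
    tool = entity["requires_tool"]
    return tool is None or tool in inventory

print(step_is_feasible({"target": "stone"}, inventory={"wood_pickaxe"}))  # True
print(step_is_feasible({"target": "stone"}, inventory=set()))             # False
```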
The reinforcement learning (RL) agent benefits from a diverse action space generated by the LLM-based planning framework. Instead of relying on a single proposed plan, the system provides multiple candidate plans for the RL agent to evaluate. This expanded set of options allows for more robust learning, as the agent can differentiate between effective and ineffective strategies across a wider range of potential actions. The resulting data increases sample efficiency and facilitates exploration of the environment, enabling the RL agent to learn optimal policies more quickly and reliably than with a limited action space. Furthermore, the variety of plans offers opportunities for the RL agent to discover unforeseen solutions and generalize its learning to novel situations.
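One simple way to expose that diversity to the RL agent is sketched below, under the assumption that each candidate plan has already been parsed into an ordered list of subgoal names and that ranked plans are (score, plan, critique) tuples as in the earlier sketch; the agent is offered the first unmet subgoal of every plan as an alternative option.
```python
# Sketch: exposing several candidate plans to the RL agent as alternative next subgoals
# instead of committing to a single plan. Plan format and tuple layout are assumptions.
import random

def next_subgoal_options(ranked_plans, completed: set) -> list:
    """Collect the first not-yet-completed subgoal from each candidate plan."""
    options = []
    for _, plan, _ in ranked_plans:
        remaining = [goal for goal in plan if goal not in completed]
        if remaining and remaining[0] not in options:
            options.append(remaining[0])
    return options

def sample_subgoal(options, epsilon=0.2):
    """Mostly follow the top-ranked plan's next step, occasionally explore an alternative."""
    if not options:
        return None
    return options[0] if random.random() > epsilon else random.choice(options)

plans = [(9.0, ["collect_wood", "place_table"], ""),
         (7.5, ["collect_wood", "collect_sapling"], "")]
print(next_subgoal_options(plans, completed={"collect_wood"}))  # ['place_table', 'collect_sapling']
```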

Refinement and Execution: Bridging the Conceptual and the Practical
The Refiner LLM operates on the highest-ranked plan identified by the Critic LLM, performing iterative refinement to enhance its quality and feasibility. This process involves detailed examination of the plan’s steps and potential adjustments based on the LLM’s internal knowledge and reasoning capabilities. Crucially, the Refiner LLM is not limited to solely modifying the top-ranked plan; it can also integrate valuable components or strategies identified within alternative, lower-ranked plans. This selective incorporation of insights from diverse options aims to create a more robust and effective final plan, leveraging the collective potential of all generated proposals.
The refinement of the initial plan is directly informed by feedback generated by the Critic LLM. This feedback encompasses assessments of plan feasibility – evaluating whether proposed actions are realistically executable within the constraints of the environment – and goal alignment, which confirms that each step contributes to the overarching objective. The Refiner LLM utilizes this structured critique to modify the plan, addressing identified weaknesses and optimizing for both practicality and effectiveness. This iterative process ensures the final, executed plan is not only theoretically sound but also demonstrably capable of achieving the desired outcome, as validated by the Critic’s criteria.
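A compressed sketch of that refinement step follows; `refiner_llm` is an assumed callable wrapping a chat-completion request, and the prompt wording is illustrative rather than the paper’s template.
```python
# Illustrative refiner step: revise the top-ranked plan using the critic's feedback and,
# where useful, fragments of lower-ranked plans. refiner_llm is an assumed callable.

def refine_plan(task, scored_plans, refiner_llm):
    best_score, best_plan, best_critique = scored_plans[0]
    alternatives = [plan for _, plan, _ in scored_plans[1:]]
    prompt = (
        f"Task: {task}\n"
        f"Current best plan: {best_plan}\n"
        f"Critic feedback: {best_critique}\n"
        f"Alternative plans: {alternatives}\n"
        "Revise the best plan so every step is executable and aligned with the task. "
        "Borrow steps from the alternatives only where they fix a flagged weakness."
    )
    return refiner_llm(prompt)
```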
The Reinforcement Learning (RL) Agent operates within the Crafter Environment to enact the refined plan generated by the LLM chain. This execution isn’t a passive process; the Agent actively interacts with the environment, performing actions as defined by the plan and receiving corresponding reward signals. These rewards, whether positive or negative, serve as feedback, enabling the Agent to learn and adjust its behavior over time. This learning process is iterative, allowing the Agent to optimize its execution strategy based on the consequences of its actions within the Crafter Environment, ultimately maximizing cumulative reward.
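The execution loop below sketches this interaction, assuming Crafter’s gym-style interface (`reset()` returning an observation and `step(action)` returning observation, reward, done flag, and info); the random action is a stand-in for the subgoal-conditioned RL policy.
```python
# Minimal execution loop in the Crafter environment, assuming its gym-style interface.
# The random action is a placeholder for the subgoal-conditioned RL policy.
import random
import crafter

env = crafter.Env(seed=0)               # open-world survival-and-crafting benchmark
obs = env.reset()
done, episode_return = False, 0.0

while not done:
    action = random.randrange(env.action_space.n)   # RL policy would choose this in the full system
    obs, reward, done, info = env.step(action)
    episode_return += reward            # environment reward; shaping terms would be added here

print("episode return:", episode_return)
```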
Evaluation at the 5 million step mark demonstrates a substantial performance advantage for our approach over baseline methods. Specifically, the implemented system achieves significantly higher cumulative rewards and overall scores, indicating improved learning efficiency within the Crafter Environment. This performance difference is visually represented in Figure 8, which quantifies the gains observed during the RL Agent’s training process. These results suggest that the combined refinement and execution pipeline effectively optimizes plan generation and subsequent learning, leading to demonstrably superior outcomes compared to standard techniques.

Mitigating Over-Refinement: Maintaining Efficiency and Robustness
The tendency for planning algorithms to endlessly refine already-adequate plans – known as the ‘Over-Refinement Problem’ – is addressed through a novel ‘Subgoal Tracker’. This system actively monitors the execution of a generated plan, assessing whether individual subgoals are being successfully achieved. Crucially, the Subgoal Tracker doesn’t simply evaluate completion; it dynamically updates the weights associated with different actions within the planning graph. If a subgoal is consistently met with minimal effort, the associated actions receive reduced weight, discouraging further, unnecessary modifications to that part of the plan. This adaptive weighting system prioritizes refinement only where it’s genuinely needed, leading to more efficient resource allocation and preventing the algorithm from getting stuck in cycles of unproductive adjustments.
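The sketch below illustrates one way such a tracker could maintain per-subgoal refinement weights; the decay rule and thresholds are illustrative assumptions, not values from the paper.
```python
# Sketch of a subgoal tracker that down-weights subgoals the agent already completes
# reliably, discouraging further refinement where none is needed. Update rule is illustrative.

class SubgoalTracker:
    def __init__(self, subgoals, decay=0.5, floor=0.05):
        self.weights = {g: 1.0 for g in subgoals}   # refinement priority per subgoal
        self.decay = decay
        self.floor = floor

    def record(self, subgoal, achieved: bool):
        """Shrink the weight of consistently achieved subgoals; restore weight on failure."""
        if achieved:
            self.weights[subgoal] = max(self.floor, self.weights[subgoal] * self.decay)
        else:
            self.weights[subgoal] = 1.0

    def needs_refinement(self, subgoal, threshold=0.2) -> bool:
        return self.weights[subgoal] > threshold

tracker = SubgoalTracker(["collect_wood", "place_table"])
for _ in range(3):
    tracker.record("collect_wood", achieved=True)
print(tracker.needs_refinement("collect_wood"))   # False: reliably achieved, skip further refinement
```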
The successful execution of complex plans often hinges on an agent’s consistent progress through intermediate steps, and reward shaping serves as a crucial mechanism to guarantee adherence to those steps. This technique involves providing the agent with carefully designed rewards not just for achieving the ultimate goal, but also for successfully completing individual subgoals within the plan. By reinforcing these incremental achievements, the agent is incentivized to stay on course, preventing deviations that could lead to failure in dynamic or unpredictable environments. This proactive approach ensures the agent doesn’t simply optimize for the final outcome, but actively pursues the planned sequence of actions, resulting in more reliable and robust performance even when faced with unexpected challenges or disturbances.
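A minimal shaping function in this spirit is sketched below; the bonus coefficients and plan format are illustrative assumptions rather than tuned values from the paper.
```python
# Sketch of plan-adherence reward shaping: a bonus is added the first time the agent completes
# the next subgoal in the refined plan, on top of the environment's own reward.

def shaped_reward(env_reward: float, plan, completed: set, achieved_now: set, bonus: float = 1.0):
    """Add a bonus for newly achieved subgoals, weighted toward the plan's next step."""
    shaping = 0.0
    remaining = [goal for goal in plan if goal not in completed]
    for goal in achieved_now - completed:
        # Full bonus for following the plan's next step, a smaller one for other useful progress.
        shaping += bonus if remaining and goal == remaining[0] else 0.25 * bonus
    return env_reward + shaping

# Example: the plan says to place a table next, and the agent just did.
print(shaped_reward(0.0, ["collect_wood", "place_table"], {"collect_wood"}, {"place_table"}))  # 1.0
```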
The system’s architecture fosters markedly efficient planning and execution, ultimately enhancing performance within challenging environments. By dynamically adjusting to the demands of a task, the framework avoids the pitfalls of rigid, pre-defined strategies. This adaptability stems from a continuous feedback loop where completed subgoals reinforce effective pathways and discourage superfluous modifications to the overall plan. Consequently, the agent conserves computational resources, navigates complex scenarios with greater speed, and achieves higher success rates compared to methods reliant on static planning. This streamlined process translates to a demonstrable advantage in open-world settings, where unpredictable elements frequently necessitate real-time adjustments and robust problem-solving capabilities.
The developed framework consistently demonstrates exceptional performance across a diverse range of challenges, achieving the highest success rates – in 20 out of 22 tested achievements – within complex, open-world environments. This notable result highlights the system’s robust adaptability and ability to generalize learned strategies to previously unseen scenarios. Unlike many planning algorithms that struggle with the inherent variability of real-world simulations, this framework maintains a high degree of reliability, consistently completing objectives even when faced with unpredictable conditions and dynamic obstacles. The consistently high success rate indicates the effectiveness of the underlying mechanisms for mitigating over-refinement and ensuring faithful execution of generated plans, ultimately leading to a more dependable and versatile agent.

The pursuit of robust agency in open-world reinforcement learning necessitates a holistic understanding of environmental structure, a principle echoed in Donald Knuth’s observation: “Premature optimization is the root of all evil.” The SGA-ACR framework detailed in this work embodies this sentiment; by prioritizing the construction of a comprehensive subgoal graph, the system avoids optimizing for short-sighted gains. Instead, it builds a foundational knowledge base (a clear representation of the environment) that allows the agent to navigate complexity. This deliberate emphasis on structure, rather than immediate performance, enables more resilient and adaptable decision-making, which is especially crucial given the unpredictable nature of open-world scenarios. The architecture’s focus on interconnectedness, realized through the actor-critic-refiner loop, mirrors the need to consider the whole system when addressing potential weaknesses.
Beyond the Horizon
The integration of large language models with reinforcement learning, as demonstrated by this work, feels less like a solution and more like a shifting of the problem. The elegance of planning through subgoal graphs lies in its attempt to impose structure on inherently chaotic environments, but the true challenge remains: representing the environment itself. The Actor-Critic-Refiner architecture, while promising, inherits the limitations of its linguistic foundation: knowledge is still filtered through the lens of textual data, a necessarily incomplete representation of physical reality. Future work must address this fundamental disconnect.
A critical path forward involves moving beyond solely linguistic knowledge. Incorporating multi-modal inputs (vision, sound, even tactile feedback) offers a richer, more nuanced understanding of the world. Furthermore, the framework’s reliance on pre-defined subgoals presents a rigidity that limits true open-world adaptation. A system capable of dynamically generating and refining subgoals, based on real-time environmental feedback, would more closely resemble the flexible intelligence observed in biological systems.
Ultimately, the pursuit of truly intelligent agents demands a holistic approach. It is not enough to simply improve the planning algorithm; one must also grapple with the complexities of perception, representation, and embodied interaction. The illusion of intelligence arises from elegant structure, but sustained intelligence requires a system capable of evolving with, and within, the world it inhabits.
Original article: https://arxiv.org/pdf/2511.20993.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/