Author: Denis Avetisyan
A new algorithm allows AI agents to learn more effectively from natural language feedback, boosting performance on complex tasks.

This paper introduces Natural Language Actor-Critic (NLAC), a scalable off-policy reinforcement learning approach utilizing language-based critiques and a language Bellman backup for improved sample efficiency in language model agents.
Training large language model (LLM) agents, systems capable of complex, multi-step reasoning and interaction, remains challenging due to the noisy and sparse rewards often encountered in long-horizon tasks. This paper introduces Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space, a novel actor-critic algorithm that leverages a generative LLM critic to provide richer, natural language feedback instead of scalar rewards. By offering actionable explanations for suboptimal actions, NLAC enhances learning efficiency and stability without relying on traditional policy gradients. Could this approach unlock a new paradigm for training more robust and scalable LLM agents capable of tackling increasingly complex real-world problems?
The Limits of Pattern Recognition: Why LLMs Struggle with True Agency
Large language models demonstrate remarkable proficiency in various tasks, from generating human-quality text to translating languages and summarizing information. However, these models often falter when confronted with problems demanding extended, multi-step reasoning – a critical component of true agency. While adept at recognizing patterns and recalling information, LLMs struggle to consistently plan, execute, and revise complex sequences of actions to achieve a distant goal. This limitation stems from their foundational architecture, primarily focused on predicting the next token in a sequence, rather than simulating a world model or maintaining a persistent internal state to track the consequences of actions over time. Consequently, scenarios requiring foresight, strategic adaptation, and the ability to recover from errors present significant challenges, hindering their capacity to operate effectively in dynamic, real-world environments that necessitate more than just immediate pattern recognition.
Traditional reinforcement learning methodologies frequently demand an immense number of interactions with an environment to achieve proficiency, a process often proving prohibitively expensive or even impossible in real-world applications. Consider scenarios like robotics or complex game playing; physically executing and iterating through millions of actions is impractical due to time constraints, resource limitations, or safety concerns. This requirement for extensive environmental interaction stands in stark contrast to human learning, which often leverages prior knowledge and abstract reasoning to generalize from limited experience. Consequently, the sample inefficiency of conventional reinforcement learning presents a significant obstacle to deploying intelligent agents in dynamic and costly environments, driving research towards more data-efficient algorithms and techniques like imitation learning and model-based reinforcement learning.
The efficacy of Supervised Fine-Tuning (SFT) in enhancing Large Language Model (LLM) performance is often constrained by a critical bottleneck: the scarcity of meticulously labeled data, particularly when addressing tasks demanding intricate, multi-step reasoning. While SFT can significantly improve an LLM’s ability to follow instructions and generate desired outputs, it fundamentally relies on providing the model with numerous examples of correct input-output pairings. Constructing these datasets for complex challenges, such as strategic planning, scientific discovery, or creative problem-solving, proves exceptionally difficult and expensive. The labeling process requires not only significant human effort but also specialized expertise to ensure accuracy and consistency. Consequently, the performance of SFT-tuned LLMs often plateaus due to insufficient training examples, hindering their ability to generalize effectively to novel situations and truly exhibit agency beyond memorization.

From Discrete States to Continuous Language: A Shift in Reinforcement Learning
Natural Language Reinforcement Learning (NLRL) fundamentally shifts the operational space of reinforcement learning agents from discrete states and actions to continuous language space. This is achieved by representing states, actions, and rewards as natural language tokens or embeddings, allowing the agent to directly process and generate language. By operating within language space, the agent’s decision-making process becomes inherently more interpretable as its actions and reasoning are expressed in human-readable terms. Furthermore, this approach offers increased flexibility; the same agent can be adapted to new tasks or environments simply by modifying the language used to define the reward function and task instructions, without requiring retraining of the underlying model or significant alterations to its architecture.
Traditional reinforcement learning often requires agents to learn through trial and error within a defined environment, necessitating a large number of interactions to map states to optimal actions and associated rewards. Natural Language Reinforcement Learning (NLRL) circumvents this limitation by directly representing these core elements – rewards, states, and actions – using natural language. Instead of exploring a physical or simulated environment, the agent learns from linguistic descriptions. For example, a reward might be defined as “achieve a high score,” a state as “the player is near the goal,” and an action as “move forward.” This linguistic framing allows the agent to generalize more effectively and reduces the sample complexity required for learning, as the agent can leverage pre-existing knowledge encoded in language models and reason about the task description rather than relying solely on environmental feedback.
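As a concrete illustration of that linguistic framing, the sketch below stores a single transition entirely as text; the `LanguageTransition` class and its field contents are illustrative assumptions rather than structures defined in the paper.

```python
from dataclasses import dataclass

# A minimal sketch of a transition kept entirely in language space.
# The class name and example strings are illustrative, not from the paper.
@dataclass
class LanguageTransition:
    state: str       # e.g. "the player is near the goal"
    action: str      # e.g. "move forward"
    next_state: str  # what the environment reports after the action
    reward: str      # reward expressed as a description, not a scalar

# A replay buffer in language space is simply a list of such records,
# which an LLM-based agent can condition on directly as text.
replay_buffer = [
    LanguageTransition(
        state="the player is near the goal",
        action="move forward",
        next_state="the player has reached the goal",
        reward="goal reached; the episode succeeded",
    )
]
print(replay_buffer[0].action)
```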
Framing reinforcement learning within natural language facilitates the development of AI agents more readily aligned with human intent and capable of adapting to novel situations. Traditional reinforcement learning relies on explicitly defined reward functions and state spaces, often requiring extensive engineering and limiting generalization. Utilizing natural language allows for the specification of goals and constraints through human-understandable instructions, enabling agents to interpret and respond to complex, nuanced requests. This linguistic interface reduces the need for exhaustive environmental interaction for learning, as agents can leverage pre-existing knowledge embedded in language models. Furthermore, the inherent ambiguity and expressiveness of natural language permits agents to handle unforeseen circumstances and adjust their behavior based on contextual understanding, leading to increased adaptability and robustness in dynamic environments.

NLAC: Learning with a Linguistic Feedback Loop
The Language Critic component within the NLAC framework functions as an evaluator of agent actions, generating textual feedback intended to facilitate iterative improvement. This component receives as input the current state, the agent’s action, and the resulting next state. Based on this information, it constructs a natural language critique detailing the action’s efficacy and potential areas for adjustment. This critique is not simply descriptive; it’s designed to be informative enough to guide the agent towards more successful strategies in subsequent interactions, effectively serving as a readily interpretable reward signal beyond traditional scalar rewards. The output is a sequence of tokens representing the critique, which is then utilized by the Refinement Policy.
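A minimal sketch of such a critic call is shown below, assuming access to a generic `llm(prompt: str) -> str` completion function; the prompt wording and function names are assumptions for illustration, not the paper’s interface.

```python
# Hedged sketch: a language critic that turns (state, action, next_state)
# into a textual critique. `llm` is any prompt-to-text completion callable.
CRITIC_TEMPLATE = """You are evaluating an agent's action.
State: {state}
Action taken: {action}
Resulting state: {next_state}

Explain whether the action was effective and, if not, what the agent
should adjust next time."""

def language_critic(llm, state: str, action: str, next_state: str) -> str:
    """Return a natural language critique of a single transition."""
    prompt = CRITIC_TEMPLATE.format(state=state, action=action, next_state=next_state)
    return llm(prompt)

# Stubbed model so the sketch runs on its own.
fake_llm = lambda prompt: "Reasonable move, but too slow; prefer the direct route."
print(language_critic(fake_llm, "near the goal", "move forward", "at the goal"))
```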
The Refinement Policy within the NLAC framework functions as a policy gradient algorithm that directly optimizes agent behavior based on the natural language critiques received from the Language Critic. This policy, parameterized by $\theta$, is updated to maximize expected cumulative reward, utilizing the critique as a reward signal. Specifically, the policy gradient is estimated through Monte Carlo rollouts, where the agent executes actions, receives critiques, and the policy is adjusted to increase the probability of actions leading to positive feedback. This creates a closed-loop system: the agent acts, the critic evaluates, and the refinement policy modifies the agent’s behavior, iteratively improving performance based on linguistic guidance.
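The closed loop described above can be sketched as follows; the helper names (`propose_action`, `refine_action`, `refinement_step`) and prompt wording are hypothetical, and the critic is passed in as a callable so the sketch stands on its own.

```python
# Hedged sketch of one act -> observe -> critique -> refine iteration.
# `llm`, `critic`, and `env_step` are caller-supplied callables; none of the
# names or prompts here are taken from the paper.
def propose_action(llm, state: str) -> str:
    return llm(f"State: {state}\nPropose the next action.")

def refine_action(llm, state: str, action: str, critique: str) -> str:
    prompt = (
        f"State: {state}\n"
        f"Previous action: {action}\n"
        f"Critique: {critique}\n"
        "Propose an improved action that addresses the critique."
    )
    return llm(prompt)

def refinement_step(llm, critic, env_step, state: str) -> str:
    """One closed-loop iteration: act, observe, critique, refine."""
    action = propose_action(llm, state)
    next_state = env_step(state, action)                # environment transition
    critique = critic(llm, state, action, next_state)   # textual evaluation
    return refine_action(llm, state, action, critique)
```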
The Language Successor Model (LSM) is a core component enabling prediction of future states and outcomes within the NLAC framework. The LSM operates by representing states as embeddings in a language model, allowing for generalization across similar situations. Training the LSM efficiently is achieved through the Language Bellman Backup, an iterative process that updates the language successor function by leveraging observed transitions and critiques. Specifically, the backup utilizes the Bellman equation, $V(s) = \mathbb{E}_{\pi}[R(s,a) + \gamma V(s')]$, adapted for language embeddings, to propagate value estimates and refine the model’s understanding of state transitions and their associated rewards, thereby improving prediction accuracy and enabling more effective refinement of agent policies.
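One way to picture a language-space Bellman backup is sketched below: the “value” of a state is itself a short text, updated by asking the model to merge the immediate outcome with the textual value of the successor state. The dictionary-based value table and the prompt are illustrative assumptions, not the paper’s implementation.

```python
# Rough sketch of a language-space Bellman backup, loosely analogous to
# V(s) <- E[R(s,a) + gamma * V(s')], but operating on text instead of scalars.
def language_bellman_backup(llm, value_table: dict, state: str,
                            reward_text: str, next_state: str) -> str:
    successor_value = value_table.get(next_state, "prospects unknown")
    prompt = (
        f"Immediate outcome: {reward_text}\n"
        f"Estimated prospects from the next state: {successor_value}\n"
        "Summarize the overall prospects from the current state in one sentence."
    )
    value_table[state] = llm(prompt)  # textual analogue of the backup target
    return value_table[state]
```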
The NLAC framework models agent interactions as a Markov Decision Process (MDP) defined by a tuple of states $S$, actions $A$, transition probabilities $P(s'|s,a)$, reward function $R(s,a)$, and discount factor $\gamma$. In this formulation, the agent observes a state $s \in S$, takes an action $a \in A$, transitions to a new state $s'$ with probability $P(s'|s,a)$, and receives a reward $R(s,a)$. The MDP provides a formal structure for defining the agent’s learning problem, allowing for the application of reinforcement learning algorithms to optimize the agent’s policy for maximizing cumulative rewards. Critiques generated by the Language Critic component are incorporated as components of the reward signal or as features influencing the transition probabilities, effectively shaping the MDP and guiding the agent’s learning process.
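The sketch below spells out that augmented transition: the scalar reward $R(s,a)$ is kept, and the critique travels alongside it as an extra text channel. The names, types, and discount value are illustrative choices, not the paper’s definitions.

```python
from dataclasses import dataclass

# Illustrative MDP transition with the critique carried alongside the scalar
# reward; field names and the discount factor are assumptions for this sketch.
@dataclass
class Transition:
    state: str
    action: str
    next_state: str
    reward: float   # scalar environment reward R(s, a)
    critique: str   # language critic output, shaping the learning signal

GAMMA = 0.99  # discount factor (illustrative)

def discounted_return(rewards: list[float], gamma: float = GAMMA) -> float:
    """Standard discounted return: sum_t gamma^t * r_t over one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0]))  # 0.9801
```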

Beyond Brittle Memories: Mitigating Catastrophic Forgetting
Continual learning agents, designed to acquire knowledge over time, often struggle with a phenomenon known as catastrophic forgetting. This occurs when the assimilation of new information drastically overwrites previously learned skills and data, effectively causing the agent to ‘forget’ how to perform earlier tasks. Unlike human learning, which naturally incorporates new knowledge while retaining past experiences, standard machine learning models exhibit a tendency towards abrupt knowledge replacement. This presents a major obstacle in real-world applications where agents must adapt to evolving environments without losing proficiency in previously mastered skills; the inability to retain past knowledge limits the development of truly autonomous and versatile artificial intelligence systems.
The Natural Language Actor-Critic (NLAC) framework addresses a core challenge in artificial intelligence: retaining previously learned skills while acquiring new ones. Unlike traditional reinforcement learning algorithms prone to catastrophic forgetting – the abrupt loss of prior knowledge upon learning new tasks – NLAC incorporates techniques designed to preserve past experiences. This is achieved through a sophisticated system that doesn’t simply overwrite existing neural pathways when faced with novel environments. Instead, the agent strategically integrates new information with its established knowledge base, ensuring that performance on older tasks doesn’t degrade as it masters new ones. This ability to learn continuously and adapt without forgetting is crucial for deploying agents in real-world scenarios where environments are constantly changing and long-term performance is paramount.
The algorithm constructs a framework for robust learning by translating environmental observations into meaningful language-based representations. This linguistic encoding allows the system to generate what are known as Successor Features – abstract, high-level characteristics of states that capture essential information for predicting future rewards. Crucially, a Language Critic evaluates these representations, ensuring they are both accurate and relevant to the task at hand. By focusing on these learned features rather than raw sensory inputs, the agent develops a capacity for generalization; it can effectively apply previously acquired knowledge to novel situations and environments, exhibiting a level of adaptability that surpasses conventional reinforcement learning approaches. This method fosters behavior that is less susceptible to the nuances of specific environments and more aligned with underlying principles, resulting in consistently improved performance and resilience.
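For readers unfamiliar with successor features, the toy sketch below shows the underlying idea: a discounted sum of per-state feature vectors, so that any reward linear in those features yields its value by a dot product. The feature vectors here are toy numbers, not the language-based representations described above.

```python
# Toy sketch of successor features: psi = sum_t gamma^t * phi(s_t).
# If r_t = w . phi(s_t), then the trajectory's value is w . psi, so the same
# psi can be reused under new reward weights. Numbers are illustrative only.
def successor_features(features: list[list[float]], gamma: float = 0.99) -> list[float]:
    psi = [0.0] * len(features[0])
    for t, phi in enumerate(features):
        for i, value in enumerate(phi):
            psi[i] += (gamma ** t) * value
    return psi

phi_trajectory = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # phi(s_0), phi(s_1), phi(s_2)
reward_weights = [0.2, 0.8]
psi = successor_features(phi_trajectory)
value = sum(w * p for w, p in zip(reward_weights, psi))
print(round(value, 3))
```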
The implementation of Policy Gradient methods within NLAC demonstrably bolsters both the stability and overall performance of continual learning. This approach moves beyond simple fine-tuning by optimizing the agent’s behavior policy directly, resulting in more consistent adaptation to new tasks without sacrificing previously learned skills. Empirical evaluation reveals a substantial advantage over standard Reinforcement Learning fine-tuning; NLAC achieves up to a 30% improvement on benchmark tasks like 20Q and τ-bench. This significant gain underscores the effectiveness of Policy Gradient integration in mitigating catastrophic forgetting and fostering a more robust, generalizable learning agent capable of sustained performance across evolving environments.
The pursuit of scalable off-policy learning, as demonstrated by NLAC, feels predictably optimistic. It attempts to tame the chaos of complex tasks with language-based critiques and Bellman backups – elegant structures destined to encounter the brutal realities of production. One anticipates the inevitable edge cases, the unforeseen interactions that will expose the limitations of even the most sophisticated reward shaping. As Edsger W. Dijkstra observed, “Simplicity is prerequisite for reliability.” While NLAC introduces a compelling level of abstraction, it remains to be seen how gracefully it will degrade when faced with the sheer unpredictability of real-world deployment. Every abstraction dies in production, and this one will likely die beautifully, but crash nonetheless.
The Road Ahead
NLAC, like all elegantly constructed frameworks, now enters the proving ground. The language Bellman backup is a neat trick, certainly, but production environments rarely respect theoretical sample efficiency. The true cost will emerge when scaling beyond contrived tasks; reward shaping, even in natural language, remains an art, not a science. Expect a proliferation of hand-tuned prompts masquerading as generalizable solutions.
The current focus on multi-step reasoning is laudable, yet obscures a more fundamental issue. Language models, even those guided by reinforcement learning, are still fundamentally predictive. They excel at mimicking competence, not demonstrating it. The next iteration will likely involve attempts to ground these agents in something resembling real-world state, a task that will inevitably reveal the limits of purely linguistic understanding.
Ultimately, NLAC and its successors will likely become another layer of existing tech debt. The goalposts will shift. The problems will grow more complex. But that, of course, is simply proof of life. The real question isn’t whether these agents will succeed, but what their new failures will teach.
Original article: https://arxiv.org/pdf/2512.04601.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/