Author: Denis Avetisyan
New research suggests that intelligent agents, from humans to AI, aren’t simply maximizing rewards, but actively seeking out and capitalizing on prediction errors.
This review proposes a unifying framework based on ‘subjective functions’ that prioritize maximizing expected prediction error as a core principle of goal selection and rational behavior within a Markov Decision Process.
It remains a challenge to explain how intelligent agents, including humans, dynamically formulate goals. The paper ‘Subjective functions’ proposes a framework addressing this by positing that agents are fundamentally driven to maximize expected prediction error, embodied in a higher-order ‘subjective function’ defined by the agent’s internal features. This suggests goal selection isn’t externally imposed, but emerges from an intrinsic drive to better understand the environment. Could replicating this ability to synthesize objectives be a crucial step toward truly intelligent artificial systems?
The Fragility of Externally Driven Systems
Conventional reinforcement learning systems frequently falter when faced with real-world complexity due to their dependence on externally specified reward functions. These systems excel only within the narrow confines of their programmed objectives; any deviation from precisely defined parameters results in suboptimal, or even nonsensical, behavior. This brittleness arises because agents optimize for the reward signal, not necessarily towards the intended goal; a cleverly exploited reward structure can lead to unintended consequences and “reward hacking,” where the agent achieves high scores through trivial or undesirable actions. Consequently, such systems struggle to generalize beyond their training environment and often require extensive, hand-crafted reward engineering, limiting their adaptability and hindering the development of truly autonomous agents capable of navigating unpredictable scenarios.
The capacity for agents to thrive in genuinely complex environments hinges not on externally dictated rewards, but on intrinsic motivation – an internally generated drive to explore, learn, and master new skills. Traditional reinforcement learning, while successful in constrained settings, often falters when confronted with unstructured scenarios because it relies on pre-defined goals that may be sparse or misleading. An agent driven by curiosity, however, actively seeks out novelty and challenges, allowing it to build a rich internal model of the world without needing constant external feedback. This self-directed learning fosters resilience and adaptability, enabling the agent to discover unforeseen solutions and generalize its knowledge to previously unseen situations – a crucial advantage when operating in dynamic, unpredictable environments where pre-programmed responses are insufficient. Such internally motivated agents are not simply reacting to stimuli; they are actively constructing understanding, paving the way for more robust and intelligent artificial systems.
A significant hurdle in developing truly autonomous agents lies in sustaining their drive to learn without constant external prompting. Current intrinsic motivation systems, while capable of initiating exploration, often falter when faced with complex tasks or environments. Agents can become fixated on easily exploitable loopholes – a phenomenon known as reward hacking – where they maximize their internal reward signal in ways that are counterproductive or meaningless to the intended goal. This isn’t merely a programming bug; it highlights a fundamental challenge in aligning an agent’s internal incentives with desired behavior. Researchers are actively investigating methods to create more robust and nuanced intrinsic rewards, focusing on curiosity-driven learning and goal-conditioned exploration, but maintaining persistent, goal-directed motivation remains a key area for advancement in artificial intelligence.
Expected Prediction Error: The Foundation of Intrinsic Drive
Expected Prediction Error (EPE) posits that agents are intrinsically driven by the discrepancy between their internal predictions and observed reality. Rather than being pushed solely by external rewards, the agent is drawn toward situations where its predictions are expected to fail, because engaging with and then resolving those failures is precisely what improves the accuracy of its internal model. The magnitude of the prediction error, the expected difference between predicted and actual outcomes, serves as the basis for this intrinsic drive. This pull toward high-$EPE$ situations, together with the learning that subsequently reduces the error, encourages continuous adaptation even in the absence of explicit external reinforcement.
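One plausible shorthand for this quantity, written in the notation used below rather than taken verbatim from the paper, is the expected magnitude of the one-step prediction error at a state:

$$\mathrm{EPE}(s) \;=\; \mathbb{E}\big[\,\lvert r + \gamma V(s') - V(s)\rvert \;\big|\; s\,\big]$$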
Expected Prediction Error (EPE) functions as an intrinsic reward signal, driving agent behavior without reliance on external reinforcement. This internal reward is based on the magnitude of the difference between predicted and actual outcomes; specifically, using the absolute value of the Temporal Difference (TD) error, $\delta = r + \gamma V(s') - V(s)$, as a reward promotes exploration and learning. Prior research has established that maximizing this absolute TD error encourages agents to visit novel states and reduce model uncertainty, effectively improving learning efficiency and competence in complex environments. By rewarding encounters with prediction error, and thereby driving its eventual reduction, EPE facilitates continuous adaptation and skill acquisition, independent of externally defined goals.
Expected Prediction Error (EPE) is tightly coupled to the accuracy of an agent’s internal model, specifically as represented by its Value Function, $V(s)$: a more accurate internal model yields lower prediction errors and, consequently, a lower EPE. This principle builds upon earlier research demonstrating that utilizing unsigned Temporal Difference (TD) errors, which measure the discrepancy between predicted and actual returns, as intrinsic rewards can effectively improve an agent’s overall competence and learning efficiency. The Value Function serves as the agent’s estimate of future rewards, and reducing the gap between predicted and actual values, as captured by the unsigned TD error, pushes the agent to refine its internal model and improve its predictive capabilities, thereby driving down EPE.
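To make the link between the TD error, the value function, and the EPE signal concrete, here is a minimal sketch in Python (a toy of my own, not code from the paper): a tabular learner on a small cyclic chain uses the unsigned TD error as its intrinsic “surprise” signal, and that signal fades as the value estimates $V(s)$ become accurate.

```python
import numpy as np

# Toy 5-state chain, deterministic "always move right" behaviour:
# reaching the last state pays 1 and wraps back to the start.
N_STATES, GAMMA, ALPHA = 5, 0.9, 0.1
V = np.zeros(N_STATES)          # the agent's internal model: value estimates V(s)

def step(s):
    if s == N_STATES - 1:
        return 0, 1.0           # extrinsic reward, then restart at the left end
    return s + 1, 0.0

s, surprise = 0, []
for t in range(1, 3001):
    s_next, r = step(s)
    delta = r + GAMMA * V[s_next] - V[s]   # TD error: delta = r + gamma V(s') - V(s)
    surprise.append(abs(delta))            # |delta| is the EPE-style intrinsic reward
    V[s] += ALPHA * delta                  # the update that shrinks future surprise
    s = s_next
    if t % 500 == 0:
        print(f"t={t:4d}  mean |delta| over last 500 steps = {np.mean(surprise[-500:]):.4f}")
```

Running it shows the windowed mean of $|\delta|$ shrinking toward zero as $V(s)$ converges, which is exactly the coupling between model accuracy and EPE described above.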
Goal Selection Driven by Predictive Error
EPE-Driven Goal Selection operates on the principle that an agent’s intrinsic motivation can be generated by maximizing its own prediction error. This is achieved by having the agent actively select goals for which its internal predictive model performs poorly. The magnitude of the expected prediction error, $EPE$, serves as an intrinsic reward signal; higher $EPE$ indicates greater novelty or surprise, incentivizing the agent to explore those states. This contrasts with traditional reward-based learning where goals are externally defined. By prioritizing states where predictions fail, the agent effectively learns a more comprehensive model of its environment, enhancing its ability to generalize and adapt to new situations. The selection process is therefore a key component in driving curiosity-based exploration without requiring explicitly defined rewards.
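As a hedged illustration of what “select the goal the model predicts worst” might look like mechanically, the sketch below scores candidate goals by the disagreement of an ensemble of learned forward models and picks the highest-scoring one. The ensemble-disagreement proxy for expected prediction error, and all the names and shapes here, are assumptions of this example, not the paper’s estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

def epe_score(goal, predictors):
    """Proxy for expected prediction error at a goal: disagreement of an
    ensemble of forward models (an illustrative stand-in, not the paper's method)."""
    preds = np.array([p(goal) for p in predictors])
    return preds.std(axis=0).mean()

# Hypothetical ensemble: linear forward models with different weights, standing in
# for models trained on different subsets of the agent's experience.
dim = 4
weights = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(5)]
predictors = [lambda g, W=W: W @ g for W in weights]

candidate_goals = [rng.normal(size=dim) for _ in range(10)]
scores = [epe_score(g, predictors) for g in candidate_goals]

chosen = int(np.argmax(scores))        # pick the goal the agent predicts worst
print("EPE proxy per goal:", np.round(scores, 2))
print("selected goal index:", chosen)
```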
Generalized Advantage Estimation (GAE) is a policy optimization technique used to reduce the variance of advantage estimates while keeping bias manageable. It achieves this by computing an exponentially weighted average of $n$-step advantage estimates, with the weighting determined by a parameter $\lambda$. A $\lambda$ of 0 reduces to the one-step TD advantage, which leans entirely on the learned value function (low variance, high bias), while a $\lambda$ of 1 recovers the Monte Carlo return relative to the value baseline (high variance, low bias). Intermediate values of $\lambda$ provide a trade-off, allowing the algorithm to efficiently estimate the long-term impact of actions by combining the benefits of both extremes. The choice of $\lambda$ directly impacts the stability and sample efficiency of the learning process, requiring careful tuning for optimal performance.
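A minimal sketch of that recursion, assuming a recorded trajectory of rewards and value estimates (array names and the toy numbers are illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.
    `values` has length len(rewards) + 1 (it includes the bootstrap value of the final state).
    lam=0 -> one-step TD advantage (low variance, more bias);
    lam=1 -> Monte Carlo style return minus the value baseline (less bias, high variance)."""
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error at step t
        running = delta + gamma * lam * running                  # exponentially weighted sum
        advantages[t] = running
    return advantages

# Toy trajectory: five steps, reward only at the end, rough value guesses.
rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
values = np.array([0.1, 0.2, 0.3, 0.5, 0.8, 0.0])   # last entry: value of the terminal state
for lam in (0.0, 0.5, 1.0):
    print(f"lambda={lam}:", np.round(gae(rewards, values, lam=lam), 3))
```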
Meta-learning, or “learning to learn,” enables agents to acquire skills that facilitate rapid adaptation to previously unseen tasks and environments. This is achieved by training the agent on a distribution of tasks, allowing it to identify underlying patterns and generalize learning strategies. Instead of learning a specific task in isolation, the agent learns an inductive bias – a set of prior assumptions – that guides exploration and accelerates learning in novel situations. Formally, meta-learning seeks to optimize a learning algorithm itself, rather than the parameters of a specific policy, thereby improving sample efficiency and generalization performance across a range of tasks. This contrasts with traditional reinforcement learning where the learning algorithm remains fixed and only policy parameters are adjusted.
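The sketch below shows one concrete way to instantiate that outer/inner structure: a Reptile-style first-order scheme on a toy family of 1-D regression tasks. Both the algorithm choice and the task family are illustrative assumptions of this example, not something taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Task distribution: 1-D linear regression y = slope * x, with a fresh slope per task.
def sample_task():
    slope = rng.uniform(-2.0, 2.0)
    xs = rng.uniform(-1.0, 1.0, size=20)
    return xs, slope * xs

def inner_sgd(w, xs, ys, lr=0.1, steps=5):
    """Ordinary task-specific learning: a few gradient steps on squared error."""
    for _ in range(steps):
        grad = np.mean(2 * (w * xs - ys) * xs)
        w -= lr * grad
    return w

# Outer loop (Reptile-style): nudge the shared initialization toward the adapted weights,
# so that future tasks can be learned from very little task-specific experience.
w_meta, meta_lr = 0.0, 0.05
for it in range(2000):
    xs, ys = sample_task()
    w_adapted = inner_sgd(w_meta, xs, ys)
    w_meta += meta_lr * (w_adapted - w_meta)

print("meta-learned initialization:", round(w_meta, 3))  # close to 0, the mean of the task slopes
```

The point of the example is the two nested loops: the inner loop learns a single task, while the outer loop optimizes the starting point of that learning rather than the solution to any one task.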
Sustaining Motivation: Beyond Transient Rewards
The human brain is remarkably adept at seeking pleasure, yet paradoxically, this very pursuit often undermines sustained motivation. This phenomenon, known as hedonic adaptation, describes the frustrating tendency to experience diminishing returns from rewarding stimuli; what initially brings joy quickly becomes commonplace, requiring ever-increasing levels of stimulation to maintain the same emotional response. This rapid desensitization poses a significant challenge to intrinsic motivation, as the brain’s reward system, when consistently satisfied with static rewards, effectively ceases to be driven by them. Consequently, individuals may struggle to maintain engagement in activities they once found fulfilling, leading to boredom, apathy, and a constant search for novelty – a cycle that can hinder long-term learning and persistent effort. Understanding hedonic adaptation is therefore crucial for designing systems – and fostering environments – that can sustain engagement and promote lasting motivation.
The brain’s reward system often diminishes its response to consistent stimuli, a phenomenon known as habituation, which can undermine motivation when relying on fixed external rewards. Expected Prediction Error (EPE), however, presents a compelling alternative by functioning as a constantly shifting reward signal. Rather than delivering a static dose of dopamine for a completed task, EPE rewards the process of learning itself: the encounter with, and subsequent reduction of, the difference between predicted and actual outcomes. This dynamic feedback loop ensures that rewards are only generated when an agent encounters something genuinely new or surprising, effectively circumventing the rapid desensitization associated with fixed rewards and fostering sustained engagement with the environment. Consequently, EPE promotes a self-regulating system where learning and motivation are intrinsically linked, driving persistent exploration and refinement of an agent’s internal model of the world.
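A toy contrast between a fixed reward and the shifting EPE-style signal, under simple assumptions of my own (a running-average predictor standing in for the agent’s model): the intrinsic reward decays as a repeated stimulus becomes predictable, spikes again when a novel stimulus appears, while the fixed reward stays flat regardless of learning.

```python
import numpy as np

ALPHA = 0.2                       # learning rate of a running-average predictor
prediction, fixed_reward = 0.0, 1.0

print(" step   stimulus   intrinsic |error|   fixed reward")
for t in range(1, 21):
    stimulus = 1.0 if t <= 12 else 3.0     # a familiar stimulus, then a novel one at t=13
    error = stimulus - prediction          # prediction error for this observation
    intrinsic = abs(error)                 # EPE-style reward: large only when surprised
    prediction += ALPHA * error            # the model adapts, so the surprise fades
    print(f"{t:5d}   {stimulus:8.1f}   {intrinsic:17.3f}   {fixed_reward:12.1f}")
```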
The brain appears fundamentally motivated by the discrepancy between expectation and reality, the prediction error, which it actively seeks out and then works to resolve. This isn’t simply about being ‘correct’ but rather a dynamic drive that compels agents, be they biological or artificial, to pursue information that resolves uncertainty. When an outcome deviates from what is predicted, it triggers a signal that fuels exploration of new situations and refinement of internal models of the world. This continuous cycle of prediction, error detection, and model updating isn’t just a mechanism for accurate forecasting; it is, crucially, the engine of long-term learning, allowing agents to adapt to changing environments and continually expand their knowledge base. Seeking out and then reducing prediction error therefore provides an inherent and self-sustaining reward signal, circumventing the pitfalls of fixed rewards and fostering persistent engagement with the surrounding world.
The pursuit of intelligence, as outlined in this exploration of subjective functions, hinges on an agent’s capacity to navigate uncertainty and refine its internal models. This resonates deeply with the sentiment expressed by David Hilbert: “We must be able to argue that man can surely know something.” The article posits that maximizing expected prediction error isn’t merely a computational trick, but a fundamental drive: a constant striving to reduce the unknown. This echoes Hilbert’s assertion; the ability to synthesize objective functions, to build predictive models of the world, is not simply about achieving goals but about establishing a foundation of knowable truths. As the system evolves, each optimization, each refinement of the subjective function, creates new tension points, demanding further adaptation and a more nuanced understanding of the environment. The architecture, therefore, isn’t a static blueprint, but a dynamic response to the inherent unpredictability of existence.
Future Directions
The proposition that agents maximize expected prediction error, rather than some neatly defined reward, feels less like an explanation and more like a shifting of the problem. It does not eliminate the need to define what constitutes a ‘surprising’ state, merely pushes that burden further down the chain. A truly general intelligence will likely require a means of generating subjective functions, of constructing predictive models not just of the world, but of its own internal state and motivational structure. This suggests a hierarchy of prediction, where the ability to anticipate one’s own predictions becomes paramount.
Current frameworks, predicated on Markov Decision Processes, may prove inadequate. The assumption of stationarity, that the world behaves consistently, is a convenient fiction. A more robust approach might involve explicitly modeling uncertainty about the structure of the environment itself, acknowledging that the ‘rules’ are not fixed but are themselves subject to prediction and revision. The elegance of a simple objective function is appealing, but nature rarely favors simplicity over adaptability.
If a design feels clever, it’s probably fragile. The pursuit of increasingly complex architectures risks obscuring the fundamental principles at play. The focus should remain on parsimony, on identifying the minimal set of assumptions necessary to explain intelligent behavior. The true test will not be whether an agent can solve a specific task, but whether it can learn to define its own tasks, to reshape its own goals in response to a changing world.
Original article: https://arxiv.org/pdf/2512.15948.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/