Robots That Ask ‘Why?’

Author: Denis Avetisyan

New research demonstrates a method for robots to actively seek clarification from humans during learning, improving task understanding and reward function accuracy.

Human demonstrations of tasks often omit crucial details, leading learning algorithms to infer incorrect objectives; this work addresses this limitation by actively identifying underspecified features and soliciting targeted demonstrations with explanatory guidance, thereby ensuring accurate reward alignment and improved learning outcomes.

A framework called ASQ enables robots to identify ambiguity in demonstrations and request targeted feedback to recover misaligned rewards through Bayesian inference and feature attention.

Learning from demonstration relies on the assumption of complete supervisory signals, yet human guidance is often imperfect, leading to ambiguous reward functions and misaligned robotic behavior. This paper, ‘Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations’, introduces a novel framework wherein robots actively identify underspecified task features and proactively request targeted corrective demonstrations from users. By leveraging statistical variation across demonstrations, the approach infers which features require further clarification, and then uses natural language to query for specific examples. Could this ability for robots to ‘know what to ask’ unlock more robust and reliable learning from human guidance in complex, real-world scenarios?

The Inherent Ambiguity of Demonstrated Intent

The success of learning from demonstration, where an agent acquires skills by observing an expert, hinges on the demonstrator’s ability to transfer task knowledge. However, effective communication during demonstration isn’t always explicit; crucial nuances, like subtle timing, force application, or prioritization of features, are frequently left unstated. This reliance on implicit knowledge presents a significant challenge, as the learning agent must infer these unarticulated aspects of the task. While a human observer might readily grasp these subtleties through contextual understanding, an artificial intelligence requires mechanisms to detect and interpret the gaps in communicated intent. Consequently, robust imitation learning necessitates systems capable of not only replicating observed actions, but also discerning the underlying rationale and unspoken principles guiding those actions, a process that moves beyond mere behavioral cloning towards a deeper understanding of the demonstrated skill.

A significant obstacle in imitation learning arises when a demonstrator’s actions unintentionally obscure the relative importance of different features within a task. While a human demonstrator intuitively understands which aspects of a situation are critical, this understanding isn’t always transferred during observation. Consequently, the learning algorithm may struggle to discern whether certain features are merely correlated with success or truly causally linked to achieving the desired outcome. This ambiguity effectively creates a ‘valuation problem’ – the system doesn’t know how much weight to assign to each observed feature, leading to suboptimal policies and hindering its ability to generalize to new, slightly different scenarios. The resulting learned behavior can be brittle, failing when faced with conditions where the relative importance of features deviates from what was implicitly demonstrated.

Robust learning systems necessitate the ability to discern unstated assumptions within demonstrated actions. While learning from demonstration excels at transferring explicit procedural knowledge, crucial details regarding why certain actions are taken (the demonstrator’s underlying intent and priorities) often remain unarticulated. These gaps in communicated intent create ambiguity for the learning agent, potentially leading to suboptimal or even incorrect imitation. Consequently, advanced systems are being developed to actively identify these implicit preferences, employing techniques such as inverse reinforcement learning and active questioning to infer the demonstrator’s goals and refine the learning process. Effectively bridging this communicative gap is not merely about replicating observed behavior, but about understanding the reasoning behind it, thereby enabling systems to generalize learned skills to novel situations and adapt to unforeseen circumstances.

Participant demonstration strategies varied, with some focusing on the underspecified feature, others on different aspects, and still others on a combination of both, as evidenced by observations across twelve participants and two tasks per condition.

Detecting Feature Underspecification Through Statistical Rigor

Feature Variability, as utilized in underspecification detection, quantifies the range of values a given feature takes across multiple demonstrations of a task. A high degree of variability in a feature’s values suggests that the demonstrator did not consistently specify that feature during task execution. This inconsistency implies a potential knowledge gap: the system cannot reliably infer the desired value for that feature in novel situations because the demonstrator’s behavior was not sufficiently constrained. Conversely, low feature variability indicates the demonstrator consistently maintained a specific value or range of values, providing a strong signal regarding the desired behavior for that feature.
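As a rough illustration of this idea, the sketch below measures per-feature spread across a handful of demonstrations. It is a minimal sketch, not the paper's implementation: the function name `feature_variability`, the use of the sample standard deviation as the spread measure, and the demonstration values are all assumptions made for the example.

```python
from statistics import stdev

def feature_variability(demos):
    """Return the sample standard deviation of each feature across demonstrations.

    demos: list of equal-length feature vectors, one per demonstration.
    """
    columns = zip(*demos)  # regroup the values feature by feature
    return [stdev(col) for col in columns]

# Hypothetical data: 4 demonstrations, 2 features each.
demos = [
    [0.90, 0.10],
    [0.92, 0.80],
    [0.88, 0.45],
    [0.91, 0.30],
]
spread = feature_variability(demos)

# Index of the feature whose values vary most across demonstrations.
most_variable = max(range(len(spread)), key=spread.__getitem__)
```

In this made-up data the first feature is held almost constant while the second swings widely, so the second feature would be the candidate for a follow-up question.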

Bayesian Model Selection is employed to quantify feature variability as an indicator of underspecification by comparing the evidence for models with and without variance in feature values. Specifically, the method calculates the Bayes factor, representing the ratio of marginal likelihoods for models that allow feature variation versus those that assume fixed values. A significantly higher Bayes factor for the variable model suggests that the observed feature variability is unlikely to have arisen by chance, indicating the demonstrator did not consistently define that feature’s value across demonstrations. This statistical comparison provides an objective measure for identifying features where additional information is required for complete task specification, without the need for manual annotation of data.
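One way to picture such a comparison is with a deliberately simplified stand-in: two Gaussian models of a single feature that differ only in how much spread they allow around an unknown target value. The models, the prior, and every numeric setting below (`tight_sd`, `loose_sd`, `prior_mean`, `prior_sd`) are assumptions chosen for illustration and are not taken from the paper.

```python
import math

def log_marginal(values, noise_sd, prior_mean=0.5, prior_sd=0.5):
    """Log marginal likelihood of values under x_i ~ N(mu, noise_sd^2),
    with the unknown target mu ~ N(prior_mean, prior_sd^2) integrated out."""
    n = len(values)
    d = [x - prior_mean for x in values]
    s2, t2 = noise_sd ** 2, prior_sd ** 2
    total = s2 + n * t2
    # Closed forms for covariance s2*I + t2*(ones matrix):
    # quadratic form via Sherman-Morrison, determinant via its eigenvalues.
    quad = (sum(v * v for v in d) - (t2 / total) * sum(d) ** 2) / s2
    log_det = (n - 1) * math.log(s2) + math.log(total)
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)

def log_bayes_factor(values, tight_sd=0.05, loose_sd=0.3):
    """Log Bayes factor of the loosely controlled model over the tightly controlled one."""
    return log_marginal(values, loose_sd) - log_marginal(values, tight_sd)

# Hypothetical feature values across four demonstrations.
consistent = [0.90, 0.92, 0.88, 0.91]
scattered = [0.10, 0.80, 0.45, 0.30]

lbf_consistent = log_bayes_factor(consistent)
lbf_scattered = log_bayes_factor(scattered)
```

A positive log Bayes factor favors the model that permits variation, flagging the feature as a candidate for underspecification; a negative value favors the tightly controlled model. Working in log space avoids overflow when the evidence is very lopsided.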

Current approaches to identifying underspecification in demonstration data typically require human annotation to label ambiguous or missing information; however, our method circumvents this limitation through automated knowledge gap detection. By analyzing feature variability directly from demonstration data using Bayesian Model Selection, the system identifies features exhibiting high degrees of change without requiring pre-labeled examples. This automated process reduces the dependence on costly and time-consuming manual labeling, enabling scalability and adaptability to new tasks and datasets without the need for extensive human intervention in the knowledge acquisition phase.

Figure 9: Per-participant demonstration feature values across the three user-study conditions. Each column shows one task-relevant feature; rows correspond to the two experimental tasks, and the highlighted subplots mark the feature that the Explanation condition explicitly named as underspecified. Higher feature values correspond to behavior better aligned with the objective. Boxes show the median and IQR, white diamonds mark group means, individual dots mark per-participant means, and the gray lines connect each participant’s three repeated measures across conditions.

Formalizing Demonstrator Rationality Through Mathematical Modeling

Our approach models the demonstrator’s behavior using ‘Boltzmann Rationality’, which assumes that each demonstration is generated with probability proportional to the exponential of its reward under the ‘Reward Function’.
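The standard Boltzmann model can be sketched over a small, finite set of candidate demonstrations. In the sketch below the scalar rationality coefficient `beta`, the function name `boltzmann_probs`, and the reward values are illustrative assumptions; real demonstration spaces are usually far too large to enumerate this way.

```python
import math

def boltzmann_probs(rewards, beta=1.0):
    """Probability of each candidate demonstration, proportional to exp(beta * reward)."""
    scaled = [beta * r for r in rewards]
    m = max(scaled)  # subtract the max before exponentiating for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical rewards for three candidate demonstrations.
rewards = [1.0, 0.5, -0.2]

near_random = boltzmann_probs(rewards, beta=0.0)    # beta = 0: reward is ignored
near_optimal = boltzmann_probs(rewards, beta=10.0)  # large beta: best option dominates
```

With `beta` at zero every candidate is equally likely, and as `beta` grows the probability mass concentrates on the highest-reward demonstration, which is what makes the coefficient a useful measure of how consistently a demonstrator optimizes.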

The Rationality Vector, denoted as β, is a crucial component in quantifying the consistency with which a demonstrator optimizes each feature within a given state space. Specifically, β is a vector where each element represents the weight assigned to a corresponding feature, reflecting the demonstrator’s prioritization. Higher values in β indicate a stronger tendency to optimize that particular feature when making decisions, while lower values suggest a lesser degree of consistent optimization. This vector is learned from observed demonstrations and provides a measurable insight into the demonstrator’s implicit preferences and the relative importance they place on different aspects of the environment or task.

The feature mapping, denoted as φ, is a crucial component of the model, responsible for converting raw state observations into a fixed-length vector representation. This transformation is necessary as the reward function and subsequent rationality modeling operate on quantifiable feature values, not directly on the original state space. φ effectively defines which aspects of the state are considered relevant and how they are numerically encoded. The output of φ, the feature vector, serves as the input to the reward function, determining the estimated reward associated with a given state, and subsequently informs the Boltzmann rationality model by providing the basis for evaluating demonstrator preferences.
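To make the roles of φ and β concrete, here is a minimal sketch for a hypothetical grid-world state. The `State` fields, the two features, the linear reward with weights `theta`, and the way each β entry scales its feature's contribution are all assumptions for illustration; the paper's exact parameterization may differ.

```python
from dataclasses import dataclass

@dataclass
class State:
    # Hypothetical raw state for a grid-world robot: (row, col) positions.
    robot: tuple
    goal: tuple
    hazard: tuple

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def phi(state):
    """Map a raw state to a fixed-length feature vector."""
    return [
        -float(manhattan(state.robot, state.goal)),   # closeness to the goal
        float(manhattan(state.robot, state.hazard)),  # distance from the hazard
    ]

def reward(state, theta):
    """Linear reward: weighted sum of the features."""
    return sum(w * f for w, f in zip(theta, phi(state)))

def scaled_reward(state, theta, beta):
    """Assumed per-feature rationality: feature k contributes beta[k] * theta[k] * phi_k."""
    return sum(b * w * f for b, w, f in zip(beta, theta, phi(state)))

state = State(robot=(0, 0), goal=(2, 1), hazard=(1, 0))
theta = [1.0, 0.5]
features = phi(state)
plain = reward(state, theta)
ignoring_hazard = scaled_reward(state, theta, beta=[1.0, 0.0])
```

Setting a β entry to zero removes that feature's influence on the modeled choice, which is one way to represent a demonstrator who was not paying attention to it.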

Only providing explanations during reward learning consistently improved performance, as demonstrated by a median increase in normalized reward across 100 repetitions and two tasks, while ablating explanation weighting had no significant effect.

Targeted Guidance: Optimizing Learning Through Focused Demonstration

Targeted Demonstration represents a novel strategy for refining artificial intelligence systems when faced with ambiguous or incomplete task specifications. Rather than providing broad examples, this method focuses new demonstrations specifically on the areas where the system exhibits uncertainty or misunderstanding. By precisely addressing these ‘underspecified features’, the system receives focused guidance, allowing it to learn more efficiently and effectively. This approach avoids the inefficiencies of general demonstrations, which may include redundant information, and instead delivers tailored instruction that directly resolves the system’s knowledge gaps, ultimately leading to improved performance and alignment with intended goals.
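The selection step can be sketched as picking the feature with the strongest evidence of underspecification and phrasing a request around it. The function `pick_query`, the score dictionary, the threshold, and the wording of the request are hypothetical; the scores stand in for whatever underspecification measure the system computes.

```python
def pick_query(scores, threshold=0.0):
    """Return a natural-language request for the most underspecified feature.

    scores: dict mapping feature name to an underspecification score
            (higher means more evidence the feature was left unspecified).
    Returns None when no feature exceeds the threshold.
    """
    name, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= threshold:
        return None
    return (
        f"Your demonstrations did not make clear how much '{name}' matters. "
        "Could you show one more example that focuses on it?"
    )

# Hypothetical scores for two features.
scores = {"distance to goal": -5.2, "distance from hazard": 45.5}
query = pick_query(scores)
```

Returning `None` when nothing crosses the threshold keeps the robot from asking questions about features the demonstrations already pinned down.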

The newly developed ASQ framework demonstrably enhances reward learning processes, yielding a normalized reward improvement of 0.021. This result represents a statistically significant advancement when contrasted with both Unguided (-0.018) and Rollout (-0.016) conditions, as confirmed by a p-value of less than 0.001. The observed improvement suggests ASQ facilitates more effective learning signals, allowing agents to more quickly and accurately optimize behavior for desired outcomes. This quantitative gain underscores the framework’s potential for applications requiring robust and efficient reward-based learning, providing a substantial step forward in the field of reinforcement learning algorithms.

User studies corroborate that this approach yields substantial improvements in reward alignment. Participants consistently indicated a stronger correlation between the intended goals of the system and the rewards it ultimately pursued, suggesting a reduction in unintended behaviors or reward hacking. These evaluations, conducted with diverse user groups, revealed a statistically significant preference for systems trained with this targeted demonstration framework compared to baseline conditions. The observed alignment not only enhances the practical utility of the system but also fosters greater trust and predictability in its performance, paving the way for more reliable and human-compatible artificial intelligence.

GridRobot learns despite initial demonstrations lacking complete specification of task features.

The pursuit of robust robotic learning, as demonstrated by the ASQ framework, necessitates a rigorous approach to identifying and resolving ambiguity in human instruction. This loosely echoes Donald Knuth’s observation that “Premature optimization is the root of all evil.” The ASQ system doesn’t simply attempt to make a robot learn from imperfect data; it actively seeks clarification on underspecified aspects of a task before optimizing against them, a form of analytical ‘optimization’ of the learning process itself. By prioritizing the accurate reconstruction of the intended reward function through targeted questioning, the system ensures that subsequent learning isn’t built upon a flawed foundation, mirroring the elegance of a mathematically sound algorithm.

Where Do We Go From Here?

The framework presented addresses a practical difficulty: the inherent ambiguity in human demonstration. However, it merely shifts the burden of precision. While the robot now articulates what is unclear, the resolution remains contingent upon a human providing a logically consistent answer. The true challenge lies not in identifying ignorance, but in constructing a system capable of discerning fundamentally ill-posed problems, situations where any human response would introduce further inconsistencies. A robot that can recognize that a task is, in principle, unsolvable would be a noteworthy achievement, a demonstration of genuine intelligence rather than skillful questioning.

Current implementations rely heavily on Bayesian inference to navigate the space of possible reward functions. While elegant, such approaches are inherently limited by the prior distributions imposed. The selection of these priors represents an assumption, a bias, and a potentially incorrect one. Future work should explore methods for learning these priors directly from data, or, more radically, for operating without them, embracing a degree of ontological uncertainty. A system that acknowledges its own incomplete knowledge, and operates accordingly, would be a significant departure from current paradigms.

Ultimately, the pursuit of inverse reinforcement learning, recovering intent from action, rests on a philosophical assumption: that intent is, in fact, recoverable. It assumes a degree of rationality and consistency in human behavior that may not always exist. The field must confront the possibility that some actions are simply… arbitrary. A truly robust system will not merely learn what a human wants, but whether what a human wants is logically coherent.

Original article: https://arxiv.org/pdf/2605.22986.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-05-25 21:42