Robots That Ask Questions: Bridging Ambiguity with Language and Scene Understanding

Author: Denis Avetisyan


A new framework empowers robots to navigate uncertain environments by combining visual scene analysis with the reasoning power of large language models.

The system enables a robot to interact with its environment by translating observations into a structured scene graph, which the robot then queries to inform its actions or to request further clarification, a process that acknowledges that even sophisticated reasoning relies on grounded perception and the inevitable need for disambiguation.

SG-CoT integrates scene graphs and chain-of-thought prompting to resolve ambiguities in robotic planning for both single and multi-agent systems.

While large language models show promise in robotic control, their performance suffers when faced with ambiguous instructions or partially observable environments. This paper introduces ‘SG-CoT: An Ambiguity-Aware Robotic Planning Framework using Scene Graph Representations’, a novel approach that leverages structured scene graphs to ground LLM reasoning and iteratively resolve uncertainties. By enabling the LLM to query relevant environmental details, SG-CoT significantly improves task planning accuracy and success rates in both single-agent and multi-agent settings. Could this framework unlock more robust and generalizable robotic systems capable of navigating complex, real-world scenarios?


The Inevitable Messiness of Reality: Robotic Planning and Ambiguity

Conventional robotic planning systems frequently encounter difficulties when applied to real-world scenarios due to inherent ambiguity and incomplete data. These systems are typically designed assuming complete knowledge of the environment and task requirements, a condition rarely met outside controlled laboratory settings. Consequently, robots may exhibit unpredictable or unreliable behavior when faced with uncertain sensor readings, partially observable states, or vaguely defined objectives. This performance degradation stems from the planner’s inability to effectively reason about possibilities and make informed decisions under conditions of uncertainty, highlighting a crucial need for more robust and adaptable planning algorithms capable of navigating imperfect information and achieving reliable outcomes in complex, dynamic environments.

Robotic systems frequently encounter ambiguity not simply from a lack of complete knowledge about their surroundings, but also from the inherent imprecision often found in human communication. Environmental uncertainty – a robot’s inability to perfectly perceive its world through sensors – combines with the challenges of interpreting vague or incomplete directives from users. Consequently, successful robotic planning demands more than just sophisticated algorithms; it requires robust methods for clarification. These systems must be capable of actively seeking additional information, whether through re-querying the user, employing probabilistic models to infer intent, or leveraging prior experience to disambiguate potentially conflicting inputs. Effectively addressing this dual source of ambiguity is paramount for robots operating in real-world scenarios where perfect information is rarely, if ever, available.

The successful integration of robots into complex, real-world scenarios, particularly those involving multiple interacting agents, hinges on overcoming limitations in perception and planning. Dynamic environments, such as bustling city streets or collaborative manufacturing facilities, present a constant stream of incomplete and changing information. Robots operating in these partially observable settings must not only react to unforeseen circumstances but also proactively seek clarification and adapt their plans accordingly. Without robust strategies for handling ambiguity, robotic systems risk failure or, worse, unsafe behavior when encountering unexpected obstacles or conflicting instructions from other agents. Therefore, advancements in planning algorithms that prioritize resilience and adaptability are paramount for realizing the full potential of multi-agent robotic systems and enabling their widespread deployment in increasingly complex and unpredictable environments.

The figure illustrates how ambiguity is resolved by contrasting a global, human-provided instruction with a robot's egocentric perspective and corresponding clarification question.

Scene Graphs: Structuring the Chaos

Scene graphs are constructed by Vision-Language Models (VLMs) to provide a structured, machine-readable representation of visual environments. These graphs represent objects as nodes and the relationships between them – such as “on”, “next to”, or “holding” – as edges. This structured format allows LLMs to move beyond simply identifying objects in an image and instead understand the spatial and semantic connections between them, forming a basis for more complex reasoning and interaction with the visual world. The resulting scene graph explicitly defines entities and their attributes, enabling a detailed and contextual understanding of the scene beyond pixel-level data.
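The nodes-and-edges structure described above can be made concrete with a minimal sketch. The class, object names, and relations below are illustrative, not taken from the paper's implementation:

```python
# Minimal sketch of a scene graph: objects as attributed nodes,
# relationships as (subject, relation, object) triples.
class SceneGraph:
    def __init__(self):
        self.nodes = {}   # object id -> attribute dict
        self.edges = []   # (subject, relation, object) triples

    def add_object(self, obj_id, **attrs):
        self.nodes[obj_id] = attrs

    def add_relation(self, subj, relation, obj):
        self.edges.append((subj, relation, obj))

    def query(self, relation):
        """Return all (subject, object) pairs linked by `relation`."""
        return [(s, o) for s, r, o in self.edges if r == relation]

g = SceneGraph()
g.add_object("cup", color="red")
g.add_object("table", material="wood")
g.add_relation("cup", "on", "table")
print(g.query("on"))  # [('cup', 'table')]
```

Serialized as triples, such a graph can be handed directly to an LLM as structured context, which is what makes relational queries like the one above useful as a grounding step.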

Scene graphs facilitate reasoning in Large Language Models (LLMs) by representing visual data as a structured network of objects and their relationships. This structured format moves beyond simple object recognition, enabling LLMs to understand how objects interact within a scene – for example, identifying that a “cat is sitting on a mat” rather than simply recognizing “cat” and “mat” as independent entities. By explicitly defining these relationships – such as spatial positioning (above, below, next to), functional interactions (holding, pushing), or attributes (color, size) – LLMs can perform more complex tasks like answering questions about the scene, predicting future states, or planning actions based on the observed environment. This grounded understanding, derived from the scene graph, is crucial for enabling LLMs to move beyond purely textual reasoning and engage with the physical world represented in visual data.

The construction of accurate scene graphs relies heavily on object detection and relationship prediction capabilities provided by methods like Grounding DINO and Vision-Language Models (VLMs). VLMs such as Gemini-2.5-Flash and Qwen3-VL-2B-Instruct are specifically utilized for this purpose, translating visual input into structured, relational data. Performance metrics demonstrate the efficacy of these models; notably, Gemini-2.5-Flash achieves a 99.0% success rate in generating scene graphs even when provided with incomplete or underspecified instructions, indicating a robust capacity for inference and contextual understanding during scene graph construction.

SG-CoT: Asking the Right Questions

SG-CoT addresses limitations in robotic planning by combining scene graphs with large language model (LLM) reasoning. Traditional LLM-based planners often struggle with ambiguous instructions or incomplete environmental understanding. SG-CoT mitigates these issues by representing the environment as a scene graph – a structured representation of objects and their relationships – which is then processed by the LLM. This integration allows the LLM to ground its reasoning in a concrete understanding of the scene, enabling more accurate interpretation of instructions and improved plan generation. By explicitly leveraging a structured scene representation, SG-CoT enhances the LLM’s ability to resolve ambiguities and generate robust plans, ultimately leading to improved performance in complex planning tasks.

SG-CoT leverages large language models (LLMs) to perform reasoning directly on scene graph representations of environments. This enables the system to identify instances of ambiguity or insufficient information needed for task planning. When uncertainty is detected, the LLM formulates targeted ‘Clarification Question’ prompts based on the scene graph’s contents. These prompts are designed to elicit specific details from a user or external source, effectively resolving ambiguities before a plan is generated. The LLM’s ability to interpret the scene graph allows it to ask questions that are contextually relevant to the environment and the task at hand, rather than relying on generic or pre-defined queries.
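The query-clarify-replan loop described above can be sketched in a few lines. This is a hedged approximation of the idea, not the paper's actual prompt format: `llm` is a stand-in callable, and the `CLARIFY:` prefix convention is an assumption made for the example.

```python
# Hedged sketch of an ambiguity-aware planning loop in the spirit of SG-CoT.
# The "CLARIFY:" protocol and prompt layout are assumptions for illustration.
def plan_with_clarification(llm, scene_graph, instruction, ask_user, max_rounds=3):
    context = f"Scene: {scene_graph}\nInstruction: {instruction}"
    for _ in range(max_rounds):
        reply = llm(context)
        if reply.startswith("CLARIFY:"):
            # The model flagged an ambiguity; relay its targeted question.
            answer = ask_user(reply[len("CLARIFY:"):].strip())
            context += f"\nClarification: {answer}"
        else:
            return reply          # a concrete plan, grounded in the scene
    return None                   # still ambiguous after max_rounds

# Demo with a stubbed LLM that asks once, then commits to a plan.
def fake_llm(context):
    if "Clarification" in context:
        return "PLAN: pick up the red cup"
    return "CLARIFY: Which cup do you mean?"

plan = plan_with_clarification(
    fake_llm, "cup on table; cup on shelf", "pick up the cup",
    ask_user=lambda q: "the red one")
print(plan)  # PLAN: pick up the red cup
```

The bounded `max_rounds` reflects a practical design choice: the system should not question the user indefinitely when ambiguity cannot be resolved.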

SG-CoT enhances planning reliability by actively querying for clarification when encountering ambiguous information within a scene. This is achieved through the implementation of ‘Clarification Question’ prompts, which allow the Large Language Model (LLM) to request specific details before formulating a plan. Evaluation demonstrates a 10% improvement in Correct Question Rate (the percentage of questions that accurately identify the necessary clarifying information) compared to the previous state-of-the-art Inner Monologue approach. This increased accuracy in identifying ambiguities directly contributes to the generation of more robust and dependable plans.

Validation and Benchmarking: Numbers Don’t Lie (Usually)

SG-CoT performance evaluation relies on quantitative metrics, primarily ‘Success Rate’ and ‘Correct Question Rate’, assessed across a range of robotic task evaluations including the LEMMA Benchmark. Success Rate quantifies the percentage of tasks completed successfully according to predefined criteria, while Correct Question Rate measures the accuracy of the system’s internal questioning process used for task decomposition and planning. These metrics provide a standardized method for comparing SG-CoT against baseline and state-of-the-art approaches in both simulated and real-world robotic scenarios, facilitating objective performance analysis and identification of areas for improvement.
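The two metrics reduce to simple ratios over evaluation episodes. The episode records below are invented for illustration; the paper's exact logging format is not specified here:

```python
# Illustrative computation of the two evaluation metrics.
# Success Rate: fraction of episodes completing the task.
# Correct Question Rate: among episodes where a clarification question was
# asked, the fraction of questions that targeted the right missing information.
def success_rate(episodes):
    return sum(e["success"] for e in episodes) / len(episodes)

def correct_question_rate(episodes):
    asked = [e for e in episodes if e["asked_question"]]
    return sum(e["question_correct"] for e in asked) / len(asked)

episodes = [  # fabricated example data
    {"success": True,  "asked_question": True,  "question_correct": True},
    {"success": False, "asked_question": True,  "question_correct": False},
    {"success": True,  "asked_question": False, "question_correct": False},
    {"success": True,  "asked_question": True,  "question_correct": True},
]
print(success_rate(episodes))                     # 0.75
print(round(correct_question_rate(episodes), 3))  # 0.667
```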

Evaluations indicate that the SG-CoT framework demonstrates improved performance compared to conventional methods when processing ambiguous instructions and navigating complex environments. Specifically, experimental results show a 4% increase in overall success rate within single-agent scenarios. Performance gains are more substantial in multi-agent settings, with SG-CoT achieving a 15% improvement in success rate compared to baseline approaches. These gains suggest SG-CoT’s ability to better interpret and execute tasks even with incomplete or unclear information and in situations involving multiple interacting agents.

Performance evaluations indicate that Gemini-2.5-Flash achieved an 86.5% Success Rate when resolving ambiguities related to the environment, demonstrating a high degree of contextual understanding. In multi-agent scenarios, Gemini-2.5-Flash recorded a 75% Success Rate. For comparison, Qwen3-VL-2B-Instruct achieved a 59% Success Rate in the same multi-agent settings. These results, derived from benchmark testing, quantify the relative performance of each model in complex task execution requiring coordination and disambiguation.

Integration of the ‘SayCan’ and ‘SayPlan’ methodologies provides further validation of the SG-CoT framework’s capabilities in challenging robotic environments. ‘SayCan’ enables the agent to assess the feasibility of potential actions given its capabilities and the current environment state, while ‘SayPlan’ facilitates the creation of executable plans based on linguistic instructions. Specifically, these methods address scenarios involving dynamic conditions and partial observability, where complete information about the environment is unavailable. Successful implementation with ‘SayCan’ and ‘SayPlan’ demonstrates SG-CoT’s ability to reason about action possibilities and formulate effective strategies even with incomplete or changing environmental data, thereby enhancing robustness and adaptability in real-world applications.

Towards a More Realistic Robotics: Embracing the Mess

The capacity of Scene-Graph Chain-of-Thought (SG-CoT) to navigate ambiguous situations and actively request clarification represents a substantial advancement for robotic applications. This ability moves beyond simply executing pre-programmed instructions, allowing robots to function more effectively in real-world scenarios characterized by incomplete or uncertain information. In human-robot collaboration, SG-CoT enables more natural and intuitive interactions, as the robot can ask for guidance when faced with unclear commands or unexpected events. Similarly, in autonomous navigation, the system can query its environment – perhaps requesting a re-scan of an area or seeking confirmation of a perceived object – ensuring safer and more reliable operation, even in dynamic and unpredictable settings. This proactive approach to uncertainty significantly broadens the scope of tasks robots can undertake independently and collaboratively.

The system’s proactive capabilities stem from its integration of sophisticated uncertainty estimation methods, notably Conformal Prediction and CLARA. These techniques allow the robotic system to not merely act on data, but to assess the reliability of that data before committing to an action. Conformal Prediction provides statistically valid guarantees on the accuracy of predictions, flagging potentially erroneous outputs, while CLARA – a method for calibrating uncertainty – refines these estimations to be more precise. By quantifying its own confidence, the system can intelligently request clarification when faced with ambiguous situations, re-evaluate its approach, or defer to human oversight, ultimately preventing errors and enhancing robustness in unpredictable environments. This preemptive error management is crucial for safe and reliable operation in complex real-world scenarios.
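To make the conformal-prediction idea concrete, here is a minimal split conformal sketch for classification. This is the standard textbook construction, not the paper's specific CLARA or conformal implementation; the labels and scores are invented:

```python
# Minimal split conformal prediction sketch (standard technique, shown for
# illustration). Given nonconformity scores of the TRUE labels on a held-out
# calibration set, it yields prediction sets with ~(1 - alpha) coverage.
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Finite-sample quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))   # conservative rank correction
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(label_scores, threshold):
    """Keep every candidate label whose nonconformity score passes the bar."""
    return {label for label, s in label_scores.items() if s <= threshold}

# A robot could request clarification whenever the set contains more than
# one label, i.e. the perception is not confident enough to act on.
cal = [0.1, 0.2, 0.15, 0.3, 0.25, 0.05, 0.4, 0.35, 0.12, 0.22]
t = conformal_threshold(cal, alpha=0.2)
labels = prediction_set({"red cup": 0.1, "blue cup": 0.3, "plate": 0.9}, t)
print(sorted(labels))  # ['blue cup', 'red cup']
```

The set-size trigger is what connects statistical calibration to behavior: a singleton set means "act", a larger set means "ask".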

The development of robotic systems capable of thriving in real-world scenarios demands a departure from rigidly programmed behaviors towards adaptability and resilience. This research signifies a crucial step in that direction, outlining a pathway to create robots that don’t merely execute pre-defined tasks, but intelligently navigate uncertainty. By prioritizing operation within complex, dynamic environments – those characterized by unpredictable changes and incomplete information – these systems promise to move beyond controlled laboratory settings. The resultant robots will be equipped to not only perceive and react to unforeseen circumstances, but also to proactively assess their own limitations and seek clarification when necessary, ultimately enabling reliable performance across a broader spectrum of real-world applications and fostering genuine autonomy.

The pursuit of elegant robotic planning frameworks invariably leads to a familiar outcome. SG-CoT, with its integration of scene graphs and large language models to address ambiguity, feels inevitable. It’s a sophisticated attempt to impose order on the chaos of real-world perception, to anticipate every edge case before production inevitably reveals a dozen more. As David Hilbert observed, “We must be able to answer every question that can be formulated.” They’ll call it AI and raise funding, of course, but the core problem remains: the world is stubbornly resistant to perfect representation. This framework attempts to iteratively resolve uncertainties – a noble goal, until someone introduces a reflective surface or an oddly shaped box. It’s just a more complex bash script, really, waiting to be broken by a slightly unexpected input.

What’s Next?

The enthusiasm for grafting large language models onto robotic control systems continues, predictably. SG-CoT, with its scene graph scaffolding, represents a step toward acknowledging that robots, unlike the models themselves, encounter genuine ambiguity. The system manages uncertainty through iterative querying, which is a polite way of saying it asks the model again and again until it gets an answer that doesn’t immediately cause a collision. It works in simulation. The inevitable degradation upon deployment in a dynamic, poorly-lit, and thoroughly illogical warehouse remains to be seen.

The paper correctly identifies partial observability as a core challenge. However, the current approach treats ambiguity as a knowledge gap to be filled, rather than an inherent property of the world. A more robust framework might embrace uncertainty, planning for likely failures and building in graceful degradation. Better one robust, if somewhat slow, robot than a hundred that confidently walk into walls.

Future work will undoubtedly focus on scaling these systems to more complex environments and multi-agent scenarios. The implicit assumption – that more data and larger models will resolve all ambiguity – feels optimistic. Perhaps a more fruitful avenue lies in accepting the limits of prediction and prioritizing verifiable, low-risk actions. The logs, as always, will tell the tale.


Original article: https://arxiv.org/pdf/2603.18271.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
