Robots Learn to Critique Themselves, Improving Social Skills

Author: Denis Avetisyan


New research demonstrates a system allowing robots to independently assess and refine their behavior, leading to more natural and effective human interactions.

This framework generates robot social behaviors by parsing morphology, planning actions, translating them into joint commands, and iteratively refining performance through evaluation with a vision-language model critic.

A novel framework leverages Vision-Language Models for autonomous replanning and refinement of robot social behaviors without requiring human feedback.

Conventional robotic approaches to social interaction struggle with adaptability and often require extensive human guidance. This limitation motivates the research presented in ‘The Robot’s Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning’, which introduces CRISP, a framework enabling robots to autonomously critique and refine their behaviors using Vision-Language Models. By leveraging VLMs as a form of ‘inner critic,’ CRISP facilitates the generation of more natural and contextually appropriate motions across diverse robotic platforms without relying on predefined actions or external feedback. Could this paradigm shift toward self-improvement unlock truly intuitive and seamless human-robot collaboration?


The Nuance of Interaction: Beyond Task Completion

For robots to truly integrate into human environments, they must move beyond simply completing tasks and instead demonstrate socially appropriate behaviors. This necessitates a shift from purely functional programming to incorporating principles of social cognition and etiquette. Research indicates that humans subconsciously evaluate robots based on these cues – factors like maintaining appropriate gaze, respecting personal space, and utilizing suitable nonverbal communication dramatically influence acceptance and trust. A robot perceived as socially awkward or insensitive, even if functionally proficient, can elicit discomfort and hinder effective collaboration. Consequently, developers are increasingly focused on equipping robots with the ability to interpret social signals and respond in ways that align with human expectations, fostering more natural and productive interactions.

Historically, robot control has prioritized precise task completion over the subtleties of human interaction. Conventional methods, reliant on pre-programmed sequences and direct kinematic control, struggle to accommodate the unpredictable nature of social exchange. This often results in robotic movements and responses that, while technically accurate, appear rigid, unnatural, or even unsettling to humans. The lack of nuance extends beyond physical movements; traditional systems typically fail to interpret nonverbal cues like facial expressions or body language, hindering their ability to adapt to the emotional state of an interaction partner. Consequently, establishing a truly intuitive and comfortable engagement – one where a human feels understood and at ease – remains a significant challenge, demanding a shift towards more sophisticated, socially aware control architectures.

Genuine integration also requires that robots move beyond simply executing commands and demonstrate an understanding of social dynamics. A believable interaction isn’t about perfect responses, but rather about appropriate ones – recognizing subtle cues like gaze direction, body language, and even variations in vocal tone. Recent research focuses on equipping robotic systems with the ability to interpret these signals, allowing them to adapt their behavior in real-time and avoid socially awkward or disruptive actions. This involves complex algorithms that process multimodal data – combining visual and auditory information – and map it onto a framework of expected social protocols. The ultimate goal is not to mimic human behavior perfectly, but to create a system that can consistently respond in a manner that feels natural and intuitive to human partners, fostering trust and seamless collaboration.

This framework enables robots to generate contextually appropriate responses to human actions – such as waving back at a greeting or clapping at joyful dancing – by iteratively refining low-level joint control code using a vision-language model (VLM) based on the robot’s structural file.

CRISP: A Framework for Iterative Social Awareness

The Critique-and-Replan for Interactive Social Presence (CRISP) framework operates as a closed-loop system wherein robot behaviors are continuously generated, evaluated, and refined. Initial action proposals are created, then subjected to a critique stage assessing social appropriateness and feasibility. This critique informs a replanning process, modifying the original action proposal to address identified issues. The revised plan is then executed, and the resulting state is fed back into the system, initiating a new cycle of critique and replanning. This iterative process allows the robot to adapt its behavior in real-time based on both environmental factors and social context, promoting more natural and effective interaction.
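The closed loop described above can be sketched in a few lines. This is a minimal, self-contained illustration only: `propose_plan` and `critique` are hypothetical stubs standing in for the LLM planner and VLM critic, and the action names, reward values, and threshold are invented for the example.

```python
# Hypothetical stand-ins for CRISP's LLM planner and VLM critic.
def propose_plan(context, feedback=None):
    """Generate (or revise) a list of high-level action steps."""
    plan = ["turn_toward_person", "raise_arm", "wave_hand"]
    if feedback == "gesture too abrupt":
        plan.insert(1, "slow_approach")  # revise the plan to address the critique
    return plan

def critique(plan):
    """VLM-style critic: return (reward, feedback) for a proposed plan."""
    if "slow_approach" in plan:
        return 0.9, None  # judged socially appropriate
    return 0.4, "gesture too abrupt"

def crisp_loop(context, threshold=0.8, max_iters=5):
    """Generate, critique, and replan until the critic is satisfied."""
    feedback, rewards = None, []
    for _ in range(max_iters):
        plan = propose_plan(context, feedback)
        reward, feedback = critique(plan)
        rewards.append(reward)
        if reward >= threshold:
            break
    return plan, rewards

plan, rewards = crisp_loop("person waves hello")
```

In this toy run the first plan is rejected, the critique is folded into a revised plan, and the loop terminates once the reward clears the threshold – mirroring the generate–evaluate–refine cycle the framework performs with real models.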

The CRISP framework utilizes Large Language Models (LLMs) and Vision-Language Models (VLMs) in a dual role: action generation and social assessment. LLMs are employed to synthesize potential robot behaviors given a specific context, translating high-level goals into actionable commands. Subsequently, VLMs analyze the proposed action, considering both the robot’s physical execution and the surrounding environment, to determine its social appropriateness. This assessment considers factors such as proxemics, gaze direction, and potential for causing discomfort or disruption. The output of the VLM serves as feedback, enabling iterative refinement of the robot’s behavior to maximize social acceptance and minimize negative impacts on human interaction.

The Robot Structure Analyzer is a critical component of the CRISP framework, responsible for determining the kinematic and dynamic limitations of the robotic platform. This analysis includes assessing joint ranges of motion, link lengths, and maximum velocities and accelerations. By providing a detailed understanding of the robot’s physical embodiment, the Analyzer ensures that generated behaviors, proposed by the LLM and VLM components, are within the robot’s operational space and physically realizable, preventing collisions or unstable movements. The output of the Analyzer is a set of constraints used during the action planning phase to filter and modify potential actions before execution.
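As a minimal sketch of how such constraints might be applied, the snippet below clamps proposed joint targets into feasible ranges. The joint names and limit values are invented for illustration; they are not taken from the paper.

```python
# Illustrative joint limits such as a structure analyzer might emit
# (names and ranges are made up for this sketch).
JOINT_LIMITS = {
    "shoulder_pan":  (-2.0, 2.0),   # radians
    "shoulder_lift": (-1.5, 1.5),
    "wrist_roll":    (-3.1, 3.1),
}

def clamp_waypoint(waypoint, limits=JOINT_LIMITS):
    """Clip each joint target into its feasible range."""
    clamped = {}
    for joint, target in waypoint.items():
        lo, hi = limits[joint]
        clamped[joint] = min(max(target, lo), hi)
    return clamped

def is_feasible(waypoint, limits=JOINT_LIMITS):
    """True if every joint target already lies within its range."""
    return all(limits[j][0] <= q <= limits[j][1] for j, q in waypoint.items())

raw = {"shoulder_pan": 3.0, "wrist_roll": -1.0}   # 3.0 rad exceeds the limit
safe = clamp_waypoint(raw)
```

A real analyzer would additionally bound velocities and accelerations, but the filtering principle – reject or project infeasible targets before execution – is the same.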

Replanning iteratively refines an initial trajectory – with modifications at steps 2 and 5 – based on vision-language model (VLM) feedback, as indicated by the reward scores within the revised plan steps.

Validating Progress: Empirical Evidence of Enhanced Interaction

The Robot Behavior Generator initiates action planning by leveraging a Large Language Model (LLM) to interpret the social context of a given situation and infer the underlying intent. This process moves beyond simple task execution by incorporating understanding of the environment and potential interactions. The LLM analyzes the contextual cues to formulate a preliminary plan of action, defining a sequence of robot behaviors intended to achieve a socially appropriate outcome. This initial plan serves as the foundation for subsequent refinement and validation stages within the CRISP framework, ensuring that the robot’s actions align with expected social norms and user expectations.
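A plausible shape for this step is a prompt that asks the language model for a numbered action list, followed by light parsing into discrete steps. The sketch below is hypothetical: `query_llm` is a stub standing in for any chat-completion call, and the action names in its canned reply are invented.

```python
# Hypothetical sketch: turn a social context into an ordered action plan.
def query_llm(prompt):
    """Stub for an LLM call; returns a canned numbered list."""
    return ("1. orient_base toward the person\n"
            "2. raise_right_arm to shoulder height\n"
            "3. oscillate_wrist twice\n")

def plan_from_context(context):
    """Prompt the LLM with the observed context and parse its reply."""
    prompt = (f"A robot observes: {context}.\n"
              "Reply with a numbered list of socially appropriate actions.")
    reply = query_llm(prompt)
    steps = []
    for line in reply.splitlines():
        line = line.strip()
        if line and line[0].isdigit():           # keep "N. ..." lines only
            steps.append(line.split(".", 1)[1].strip())
    return steps

steps = plan_from_context("a person waves hello")
```

The parsed steps then serve as the preliminary plan handed to the evaluation and refinement stages.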

The Motion Evaluator component utilizes a Vision-Language Model (VLM) to determine the social appropriateness of robot actions proposed by the Behavior Generator. This assessment is performed by inputting visual representations of the generated motion – specifically, images – into the VLM, which then provides feedback on whether the action is contextually suitable. This feedback isn’t a simple binary judgment; it offers nuanced detail about potential social violations or awkwardness, allowing for targeted refinement of the robot’s behavior. On average, each action plan was evaluated using 10.2 VLM input images.
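One way to picture this interface is a function that bundles rendered keyframes with an intent description and parses a structured verdict back. The sketch below is purely illustrative: `vlm` is a stub (a real model would inspect the images; here the stub keys off their count), and the score scale and issue strings are invented.

```python
# Hypothetical sketch of the evaluator interface; `vlm` is a stub.
def vlm(prompt, images):
    """Stand-in for a vision-language model call."""
    score = 5 if len(images) >= 8 else 3  # toy heuristic, not real inference
    return {"score": score,
            "issue": None if score >= 4 else "motion unclear"}

def evaluate_motion(keyframes, intent):
    """Ask the (stubbed) VLM to rate a motion and name one issue, if any."""
    prompt = (f"The robot intends to: {intent}. "
              "Rate social appropriateness 1-5 and name one issue, if any.")
    verdict = vlm(prompt, keyframes)
    return verdict["score"], verdict["issue"]

score, issue = evaluate_motion(keyframes=list(range(10)),
                               intent="wave back at a greeting")
```

Returning a score together with a named issue, rather than a pass/fail flag, is what gives the downstream refiner something concrete to act on.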

The Behavior Refiner component modifies the initial action plan generated by the Robot Behavior Generator based on feedback received from the Motion Evaluator. This feedback, assessing social appropriateness, drives adjustments to the robot’s intended actions. The Replanning Cycle then iteratively repeats this process – generating a plan, evaluating it, refining it, and regenerating – allowing the system to continuously improve the social acceptability and overall quality of the robot’s behavior. This iterative loop is central to the CRISP framework’s performance, demonstrated by its superior user preference scores compared to alternative methods.

Evaluations of the CRISP framework demonstrate a substantial improvement in user preference compared to alternative approaches. Specifically, CRISP achieved an average user preference score of 4.5 on a defined scale. This result represents a statistically significant increase over the GenEM framework, which received a score of 3.4, and a baseline system lacking the replanning component, which scored 3.79. These scores indicate that the iterative refinement and validation process, central to CRISP’s design, effectively generates more socially acceptable and preferred robot behaviors as judged by human evaluators.

The CRISP framework demonstrated efficient performance through a limited computational workload. Specifically, the system achieved its results with an average of 10.2 visual input images processed by the VLM for motion evaluation, and required an average of 3.4 iterations of the replanning cycle to refine the robot’s behavior. These metrics indicate a relatively low demand on both visual processing and iterative computation, contributing to the framework’s overall practical viability and scalability.

Generated motions demonstrate that CRISP, with and without replanning, successfully navigates the scenarios, outperforming GenEM as visualized in the results for both the Everyday robot and TIAGo (see Supplementary Video).

Expanding Horizons: Versatility Across Robotic Platforms

The CRISP framework’s versatility has been demonstrated through successful implementation across a remarkably diverse range of robotic platforms. Testing extended beyond simulated environments to include physical robots such as the Stretch 3, a mobile manipulator designed for complex tasks; the TIAGo, known for its service robotics capabilities; the agile Unitree G1 humanoid; and the compact, open-source Open Mini Duck. This broad compatibility signifies a significant advancement, indicating the framework isn’t limited by specific hardware configurations and can be readily adapted to a variety of robotic designs and applications, fostering wider accessibility and potential impact within the robotics community.

The CRISP framework’s remarkable adaptability stems from a core component: the Robot Structure Analyzer. This system automatically extracts crucial kinematic and dynamic information directly from MJCF files – the XML model format of the MuJoCo physics engine, widely used for describing robot models. By parsing these files, the analyzer efficiently identifies joint configurations, link lengths, masses, and other essential parameters, allowing the framework to rapidly adapt its control strategies to a new robot without requiring manual re-programming or extensive calibration. This automated process significantly reduces the time and effort needed to deploy complex robotic tasks on diverse hardware, fostering a level of platform independence previously challenging to achieve in robotics.
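As a minimal illustration of this parsing step, the sketch below extracts joint names and ranges from an inline MJCF fragment using Python’s standard XML parser. The toy model, body names, and joint ranges are invented for the example and do not come from any real robot file.

```python
import xml.etree.ElementTree as ET

# Minimal MJCF fragment standing in for a real robot model file.
MJCF = """
<mujoco model="toy_arm">
  <worldbody>
    <body name="upper_arm">
      <joint name="shoulder" type="hinge" range="-1.57 1.57"/>
      <body name="forearm">
        <joint name="elbow" type="hinge" range="0 2.35"/>
      </body>
    </body>
  </worldbody>
</mujoco>
"""

def extract_joint_limits(mjcf_xml):
    """Map joint name -> (lo, hi) range parsed from an MJCF string."""
    root = ET.fromstring(mjcf_xml)
    limits = {}
    for joint in root.iter("joint"):
        lo, hi = (float(v) for v in joint.get("range", "0 0").split())
        limits[joint.get("name")] = (lo, hi)
    return limits

limits = extract_joint_limits(MJCF)
```

A full analyzer would also read link geometry, masses, and actuator attributes, but the same tree traversal generalizes to those fields.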

The efficacy of the CRISP framework across varied robotic platforms is significantly bolstered by a robust visual feedback system, intricately linked with a keyframe capture technique. This process doesn’t simply observe robotic actions; it meticulously records crucial moments – keyframes – representing optimal or problematic performance. By analyzing these captured instances, the system can quantify discrepancies between intended movements and actual execution, irrespective of the robot’s physical characteristics or kinematic structure. This data-driven approach facilitates targeted adjustments to control parameters, allowing for rapid optimization and ensuring consistently reliable performance even when transferring complex tasks to entirely new robotic bodies. The resulting iterative refinement, guided by visual analysis of keyframes, underpins the framework’s impressive adaptability and broadens its potential application across a diverse range of robotic systems.
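A simple approximation of such a keyframe picker, assuming nothing beyond a trajectory of joint-angle vectors, is to always keep the endpoints and fill the remaining budget with the frames where joint motion changes most. This is a hypothetical heuristic for illustration, not the paper’s method.

```python
# Hypothetical keyframe picker: keep endpoints, then the most "salient"
# frames, where salience is the total joint change since the previous frame.
def select_keyframes(trajectory, budget=5):
    """trajectory: list of joint-angle lists; returns sorted frame indices."""
    if len(trajectory) <= budget:
        return list(range(len(trajectory)))
    salience = [sum(abs(a - b) for a, b in zip(trajectory[i], trajectory[i - 1]))
                for i in range(1, len(trajectory))]
    # Rank interior frames by descending salience (stable for ties).
    ranked = sorted(range(1, len(trajectory)), key=lambda i: -salience[i - 1])
    picked = sorted({0, len(trajectory) - 1, *ranked[: budget - 2]})
    return picked

# A toy 6-frame, 2-joint trajectory: motion happens between frames 1-3.
traj = [[0, 0], [0, 0], [1, 0], [1, 1], [1, 1], [1, 1]]
frames = select_keyframes(traj, budget=4)
```

Selecting frames around motion rather than uniformly is one way to keep the VLM’s image budget small while still capturing the moments that matter for critique.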

The true strength of the CRISP framework lies not simply in its functionality, but in its remarkable platform independence. Successfully deployed across a spectrum of robotic systems – from the Stretch 3 and TIAGo mobile manipulators to the agile Unitree G1 humanoid and the compact Open Mini Duck – the system transcends the limitations of specialized hardware. This adaptability isn’t accidental; it’s a core design principle that unlocks the potential for truly widespread adoption. By removing the barrier of platform-specific coding and integration, CRISP significantly lowers the threshold for robotics implementation across diverse fields, including manufacturing, logistics, healthcare, and even domestic assistance. Consequently, this versatility promises to accelerate innovation and expand the reach of robotic solutions to a far broader range of applications and users than previously feasible.

The proposed method successfully generates social behaviors across a diverse range of robotic platforms, including the Unitree G1 humanoid, the Stretch 3 and TIAGo mobile manipulators, and the compact Open Mini Duck.

The pursuit of seamless human-robot interaction, as detailed in this work, often necessitates layers of complexity. However, CRISP demonstrates a compelling counterpoint. By enabling autonomous replanning through Vision-Language Models, the framework prioritizes refinement through subtraction-identifying and correcting suboptimal behaviors without extensive human intervention. This aligns with a fundamental principle articulated by Blaise Pascal: “The eloquence of the body is more persuasive than the eloquence of the tongue.” The robot, freed from superfluous action, communicates intent more clearly through streamlined, purposeful movement – a testament to the power of focused execution over ornate display. The framework’s efficacy rests not on adding more features, but on achieving precision through iterative self-correction.

What Remains?

The pursuit of autonomous social behavior, as exemplified by CRISP, inevitably encounters the limitations inherent in translating human nuance into algorithmic directives. The framework successfully demonstrates self-refinement, yet the very notion of “appropriateness” remains stubbornly subjective. Future work must confront not merely the execution of social behaviors, but the formalization of their underlying ethical and cultural constraints – a task less about computational power and more about philosophical precision. The current reliance on VLMs, while effective, introduces a dependency on models trained on inherently biased datasets; minimizing this influence will demand novel approaches to data curation and model evaluation.

A crucial simplification lies in the current emphasis on reactive replanning. True social intelligence anticipates, not merely responds. Extending CRISP to incorporate predictive modeling – allowing the robot to simulate potential interactions and proactively adjust its behavior – represents a significant, though complex, challenge. The elegance of autonomous refinement should not obscure the fact that this is still, fundamentally, a search for patterns. The interesting questions lie not in replicating observed behaviors, but in navigating the unpredictable spaces between them.

Ultimately, the measure of success will not be the robot’s ability to appear social, but its capacity to facilitate genuine interaction. The framework’s current focus on low-level control, while necessary, risks obscuring the higher-level goals of communication and collaboration. The disappearance of the author, in this context, is not merely a coding ideal, but a testament to a system that truly understands – and respects – the other.


Original article: https://arxiv.org/pdf/2603.20164.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-23 09:12