Author: Denis Avetisyan
New research assesses how well artificial intelligence grasps nuanced social expectations from both text and images.

This review evaluates the capacity of multimodal large language models to reason about social norms, including complex metanorms, and identifies key areas for improvement in agent-based systems and social intelligence.
While artificial intelligence increasingly demonstrates sophisticated reasoning abilities, understanding nuanced social norms, a capability critical for effective multi-agent systems, remains a significant challenge. This is addressed in ‘Social Norm Reasoning in Multimodal Language Models: An Evaluation’, which investigates the capacity of five multimodal large language models to infer social norms from both textual narratives and visual depictions. The study reveals that models like GPT-4o exhibit promising norm reasoning skills, particularly with text, though performance diminishes with complexity and across modalities, and all models struggle with abstract concepts like metanorms. Can these models be further refined to navigate the complexities of real-world social interactions and truly exhibit social intelligence?
The Subtle Dance of Social Understanding
Contemporary multimodal large language models (MLLMs), despite advancements in processing both text and images, consistently demonstrate limitations in grasping the subtleties of human social interaction. These models often falter when presented with scenarios requiring an understanding of normative behavior – the unwritten rules and expectations that govern how people should act. This isn’t merely a failure of recognizing objects or actions; it’s a deficit in inferring why those actions are appropriate, or inappropriate, within a given social context. Consequently, MLLMs frequently misinterpret situations, producing responses that, while grammatically correct, are socially awkward or illogical – a significant gap between pattern recognition and genuine social intelligence.
Effective social interaction isn’t simply about identifying familiar patterns; it hinges on a deeper comprehension of unwritten social norms. Humans intuitively grasp expectations governing behavior – cues regarding politeness, reciprocity, and fairness – which often aren’t explicitly stated. This capacity relies on an internal model of how others should act, allowing for predictions and appropriate responses even in novel situations. Failing to recognize these subtle rules can lead to misunderstandings and social friction, highlighting that true intelligence demands navigating not just what happens, but why it happens according to established, yet often implicit, societal guidelines. Consequently, assessing artificial intelligence requires moving beyond simple recognition tasks to evaluate its ability to infer and adhere to these complex, unstated expectations.
Assessing an artificial intelligence’s grasp of social intelligence demands more than simple question-and-answer tests; it requires carefully constructed narratives that emphasize the unspoken expectations within human interactions. Researchers are developing dedicated benchmarks – essentially, story-based evaluations – that present scenarios rich with subtle social cues, like implied obligations or breaches of etiquette. These benchmarks move beyond assessing factual recall to probe whether a system can infer appropriate responses based on understanding the normative aspects of a situation – what is considered polite, expected, or taboo. The success of these systems isn’t measured by identifying objects in an image, but by correctly interpreting the motivations and potential consequences within a complex social context, highlighting a critical step towards truly intelligent machines.

Evaluating Social Acumen: A Controlled Experiment
The evaluation process involved subjecting five state-of-the-art Multimodal Large Language Models (MLLMs) – Meta LLaMa-4 Maverick, Gemini 2.0 Flash, GPT-4o, Intern-VL3, and Qwen-2.5VL (72B) – to a series of normative stories. These stories were specifically constructed to present scenarios requiring an understanding of established social conventions. The models’ performance was assessed on their ability to correctly interpret and respond to these situations, providing a quantitative measure of their comprehension of expected behaviors within common social contexts. This testing methodology allowed for a direct comparison of each MLLM’s capacity to process and reason about normative scenarios.
The normative scenarios utilized in this evaluation were constructed to specifically assess an MLLM’s capacity to recognize and interpret widely accepted social conventions. These scenarios presented situations involving common, everyday interactions requiring adherence to established behavioral standards, such as appropriately queuing in a line, arriving at scheduled appointments on time – demonstrating an understanding of punctuality – and refraining from discarding waste improperly – illustrating an avoidance of littering. The consistent presence of these elements allowed for a quantifiable measurement of each model’s ability to correctly identify actions that align with, or deviate from, expected societal behavior.
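An evaluation of this kind can be pictured as a simple scoring loop over labeled scenarios. The sketch below is illustrative only: the field names, the toy stories, and the keyword-based stand-in for a real MLLM call are assumptions, not the paper’s actual schema or prompts.

```python
# Minimal sketch of a text-modality evaluation loop for normative scenarios.
# Scenario fields and the judging rule are illustrative assumptions.

SCENARIOS = [
    {"story": "A new customer walks past everyone waiting and orders first.",
     "norm": "queuing", "violated": True},
    {"story": "A guest arrives at the scheduled time for an appointment.",
     "norm": "punctuality", "violated": False},
    {"story": "Someone drops a wrapper on the pavement and keeps walking.",
     "norm": "littering", "violated": True},
]

def evaluate(model_fn, scenarios):
    """Score a model that maps a story to a predicted (norm, violated) pair."""
    correct = 0
    for s in scenarios:
        norm, violated = model_fn(s["story"])
        if norm == s["norm"] and violated == s["violated"]:
            correct += 1
    return correct / len(scenarios)

# A trivial keyword matcher standing in for a real MLLM API call:
def keyword_model(story):
    if "waiting" in story:
        return "queuing", True
    if "scheduled" in story:
        return "punctuality", False
    return "littering", True

print(evaluate(keyword_model, SCENARIOS))  # 1.0 on this toy set
```

In practice `model_fn` would wrap a call to each MLLM and parse its free-text answer, which is exactly where the human validation described below becomes necessary.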
Evaluations of the multimodal large language models on normative scenarios revealed significant performance differences. GPT-4o achieved the highest overall accuracy, scoring 98.75% on text-based inputs and 92.5% on image-based inputs – a substantial lead over the other models tested. Among the free-to-use models, Qwen-2.5VL (72B) demonstrated the strongest performance, achieving 97.5% accuracy on text-based inputs, indicating a viable open-source alternative, though with lower overall scores than GPT-4o.

Ground Truth: Human Validation of Responses
Human evaluation was employed as the primary method for determining the accuracy of responses generated by Multimodal Large Language Models (MLLMs). This process involved trained human annotators assessing the MLLM outputs against predefined criteria, establishing a ‘ground truth’ dataset. This ground truth served as the benchmark against which automated metrics and different model performances were compared. The subjective nature of many tasks – particularly those requiring social understanding or commonsense reasoning – necessitated human judgment to accurately gauge the appropriateness and correctness of the MLLM’s responses, providing a reliable standard for quantitative analysis and model ranking.
Evaluation of Multimodal Large Language Models (MLLMs) indicated a capacity to recognize and respond appropriately to straightforward social norms, such as offering assistance to elderly individuals or maintaining personal space boundaries. However, performance decreased substantially when presented with scenarios requiring more intricate reasoning or involving ambiguous contextual cues. Models frequently failed to accurately assess the nuances of complex social interactions, demonstrating an inability to generalize basic normative understandings to non-standard or multifaceted situations. This suggests a limitation in their capacity for flexible and context-aware social intelligence.
Analysis of model responses revealed instances of both bias and internal inconsistency in interpreting situational scenarios. Statistical testing, employing a Friedman test, demonstrated a significant difference in performance among the evaluated models (p < 0.001). Specifically, GPT-4o consistently outperformed Gemini 2.0 Flash, Intern-VL3, and LLaMa-4 Maverick, suggesting a greater capacity for consistent and unbiased reasoning; however, all models exhibited limitations necessitating further development of more robust and nuanced reasoning capabilities.
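A Friedman test ranks the models within each scenario and asks whether the rank sums differ more than chance would allow. The pure-Python sketch below computes the test statistic (which is then compared against a chi-square distribution with k−1 degrees of freedom to obtain the p-value); the per-scenario scores are illustrative, not the paper’s data.

```python
# Pure-Python sketch of the Friedman test statistic used to compare models.
# Input: one row per scenario, each row holding the k models' scores.

def friedman_statistic(blocks):
    """Friedman chi-square statistic over n scenarios and k models."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        sorted_vals = sorted(row)
        for j, v in enumerate(row):
            # Average rank for ties: mean of the positions v occupies when sorted.
            positions = [i + 1 for i, s in enumerate(sorted_vals) if s == v]
            rank_sums[j] += sum(positions) / len(positions)
    return 12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3.0 * n * (k + 1)

# Three scenarios, three models; the third model always scores highest.
print(friedman_statistic([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))  # 6.0
print(friedman_statistic([[1, 1, 1]]))  # 0.0 when every model ties
```

In a library setting, `scipy.stats.friedmanchisquare` computes the same statistic and its p-value directly; the hand-rolled version above simply makes the ranking logic explicit.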
Beyond Recognition: Visualizing Social Context
Researchers employed image generation techniques to translate complex normative stories into accessible comic strip formats, aiming to provide richer contextual cues for machine learning models. This approach moved beyond purely textual descriptions by visually representing scenarios and social interactions, effectively supplementing the narratives with a distinct layer of information. By crafting these visual representations, the study sought to diminish ambiguity and enhance comprehension, allowing models to more readily interpret the nuanced social dynamics inherent in each situation. The resulting comic strips weren’t merely illustrations, but carefully constructed visual aids designed to improve the clarity and interpretability of the underlying ethical and social norms being investigated.
The study harnessed the power of visual representation to facilitate more readily accessible interpretations of complex social scenarios. By translating normative stories into comic strip formats, researchers aimed to bypass potential ambiguities inherent in textual descriptions, thereby offering Multimodal Large Language Models (MLLMs) a clearer foundation for reasoning. This approach acknowledges that visual cues often play a critical role in human social cognition, and seeks to replicate that benefit for artificial intelligence. The expectation is that grounding the scenarios in visual contexts allows the MLLMs to more effectively discern relevant information and draw appropriate inferences, ultimately leading to improved performance in tasks requiring nuanced understanding of social dynamics.
Early findings indicate that supplementing language-based prompts with generated imagery can measurably enhance the performance of large language models, especially when navigating nuanced social situations. The addition of visual data appears to provide crucial context for interpreting ambiguous cues, effectively reducing the potential for misinterpretation. This is particularly relevant in scenarios where unspoken rules or subtle body language heavily influence appropriate behavior; the visual component acts as a clarifying element, allowing the model to better discern expected actions and responses. Consequently, the integration of image generation techniques shows promise for improving the reliability and accuracy of these models in tasks demanding sophisticated social reasoning.

The Horizon of Social Intelligence: Beyond Compliance
Advancing artificial intelligence beyond simple norm adherence requires a focus on understanding the underlying rules governing those norms – often referred to as meta-norms. These unwritten rules dictate how norms are enforced, addressing situations where norms conflict or are violated, and fundamentally shaping social order. Research is increasingly directed toward building models capable of recognizing these meta-norms, allowing AI to not just identify acceptable behavior, but also to interpret social cues related to enforcement, reciprocity, and punishment. This ability is crucial for navigating complex social dilemmas, where a rigid application of norms could lead to suboptimal outcomes; instead, a nuanced understanding of meta-normative reasoning will enable AI to adapt its behavior, fostering cooperation and resolving conflicts in a socially intelligent manner.
The development of truly socially intelligent artificial intelligence necessitates a move beyond standardized benchmarks and towards incorporating the nuances of cultural variation and implicit social rules. Current AI models often struggle with behaviors considered commonplace across diverse societies, or those governed by unwritten expectations, because training data frequently prioritizes explicit, universally-defined norms. Expanding benchmark scenarios to include culturally-specific customs – such as differing levels of directness in communication, variations in personal space, or diverse gift-giving etiquette – will be essential. Similarly, incorporating scenarios that demand an understanding of implicit rules – those conveyed through body language, social context, or shared history – will challenge AI to move beyond rote adherence and towards genuine social understanding, ultimately fostering more adaptable and responsible interaction with humans across the globe.
Ultimately, the goal extends beyond simple pattern recognition of established norms: the aim is to build models capable of reasoning about those norms. This involves not just identifying what is considered acceptable behavior, but understanding the underlying principles and motivations that give rise to those norms, and predicting how those norms might shift in different contexts. Such systems would move beyond rote compliance, demonstrating an ability to navigate complex social dilemmas with nuance and foresight. Crucially, this reasoning capacity is intended to facilitate responsible and ethical behavior, allowing AI to adapt its actions in a manner consistent with human values and societal expectations, even when faced with ambiguous or novel situations.
The study’s exploration of multimodal large language models and their capacity for normative reasoning echoes a principle of elegant design. While current models demonstrate a growing aptitude for deciphering social cues from both image and text, the varying performance with increasingly complex scenarios – such as the nuanced understanding of metanorms – highlights the importance of paring back unnecessary complexity. As Linus Torvalds aptly stated, “Talk is cheap. Show me the code.” This research doesn’t simply showcase potential; it pushes for demonstrable, functional social intelligence within these systems, emphasizing what remains after rigorous testing – a core understanding of social norms rather than a superficial imitation of them.
Where Do We Go From Here?
The exercise, as often happens, reveals more of what is not understood than what is. Demonstrating that a model can identify a violated social norm in a static image, or a plainly worded sentence, is hardly equivalent to social intelligence. It is, rather, a testament to pattern matching. The true difficulty lies not in recognizing the norm, but in understanding why it exists, and when its violation is permissible – or even required. The study of metanorms, those norms governing norms, exposes this limitation acutely. A system can identify a rule being broken, but lacks the capacity to judge the validity of the rule itself.
Future work must move beyond mere identification. The focus should shift toward models capable of constructing internal representations of social contexts, and evaluating actions not simply against a fixed set of rules, but against a dynamic, inferred understanding of the situation. The current reliance on large datasets, while providing breadth, risks reinforcing existing biases and obscuring the subtle nuances of human interaction. A smaller, more carefully curated dataset, designed to specifically challenge a model’s reasoning abilities, may prove more fruitful.
Ultimately, the pursuit of social intelligence in machines is not about replicating human behavior, but about understanding the underlying principles that govern it. If a model cannot explain a social phenomenon simply, the fault does not lie with the phenomenon, but with the model itself. The goal is not complexity, but clarity – a ruthless pruning of unnecessary layers, leaving only the essential mechanisms of social reasoning exposed.
Original article: https://arxiv.org/pdf/2603.03590.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-05 20:14