Author: Denis Avetisyan
A new framework allows researchers to rigorously test how AI agents understand and apply principles of fairness by directly translating human subject experiments into AI agent environments.

This paper introduces NormCoRe, a methodology for replicating human social-norm experiments with AI agents, and demonstrates its use in studying fairness principles.
Existing approaches to understanding norms in multi-agent AI often implicitly equate human and artificial intelligence, overlooking crucial differences in their decision-making processes. To address this gap, we introduce ‘Normative Common Ground Replication (NormCoRe): Replication-by-Translation for Studying Norms in Multi-agent AI’, a novel methodological framework that systematically translates human subject experiments into AI agent environments. Through replication of a study on distributive justice, we demonstrate that normative judgments in AI agents can diverge from human baselines and are sensitive to model selection and linguistic framing. This work offers a principled pathway for analyzing norms in MAAI, but how can we best leverage these insights to guide the design of ethically aligned and socially beneficial AI systems?
Unveiling the Algorithmic Mirror: Justice as Code
The increasing reliance on algorithmic decision-making necessitates a rigorous focus on fairness, as these systems, despite appearing objective, can inadvertently amplify existing societal inequities. Algorithms are trained on data that often reflects historical biases – prejudices embedded within past decisions and societal structures – and consequently, these biases can become encoded within the algorithm itself. This can lead to discriminatory outcomes in critical areas like loan applications, hiring processes, and even criminal justice, disproportionately impacting marginalized communities. Ensuring fairness isn’t simply about technical accuracy; it demands a proactive approach to identifying and mitigating bias in data, model design, and evaluation metrics to prevent the perpetuation of systemic disadvantages and promote equitable opportunities for all.
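To make the notion of a fairness evaluation metric concrete, the minimal sketch below computes a demographic parity gap: the difference in favorable-outcome rates between two groups. The data, variable names, and interpretation are illustrative assumptions, not drawn from the paper.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in favorable-outcome rates between two groups.

    y_pred: binary decisions (1 = favorable outcome, e.g. loan approved)
    group:  binary group membership for each individual
    """
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return abs(rate_a - rate_b)

# Invented data: a classifier that approves group 0 far more often.
decisions = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups    = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(demographic_parity_gap(decisions, groups))  # 0.6 -> a large disparity
```

A gap of zero means both groups receive favorable outcomes at the same rate; audits of this kind are one of the simplest proactive checks the paragraph above calls for.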
Distributive justice, at its heart, addresses how a society’s benefits and burdens are shared amongst its members, moving beyond simple equality to consider equity and need. This principle doesn’t necessarily advocate for everyone receiving the same outcome, but rather that any disparities in resource allocation are justifiable and based on fair criteria. Considerations within distributive justice often involve examining whether opportunities – such as access to education, healthcare, or economic advancement – are proportionally available, and whether systems exist to mitigate disadvantages faced by certain groups. Consequently, a focus on distributive justice requires careful analysis of societal structures and the potential for systemic biases that could lead to unfair or unequal outcomes, prompting a search for mechanisms to ensure a more balanced and inclusive distribution of resources and opportunities.
John Rawls’ ‘Veil of Ignorance’ proposes a thought experiment for establishing principles of justice. It asks decision-makers to design societal structures (and, by extension, algorithms) as if they themselves could occupy any position within that society, unaware of their own future characteristics such as wealth, status, or even abilities. This deliberate lack of self-knowledge compels a focus on maximizing the well-being of the least advantaged, as a rational actor would logically protect themselves against the worst possible outcomes. The framework isn’t about predicting actual impartiality, but rather establishing a procedural benchmark; by removing the influence of personal bias during the design phase, the resulting systems aim to distribute benefits and burdens more equitably, fostering a sense of fairness rooted in universalizability and minimizing the potential for systemic disadvantage.
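As a toy rendering of the reasoning the veil encourages, the sketch below contrasts Rawls’ maximin rule (pick the distribution with the best worst case) with a simple maximize-the-average rule. The income figures are invented for illustration.

```python
# Behind the veil, a chooser who might land in any income position
# compares candidate distributions without knowing their own slot.
distributions = {
    "A": [12_000, 30_000, 50_000, 90_000],
    "B": [20_000, 28_000, 40_000, 60_000],
    "C": [5_000, 45_000, 70_000, 120_000],
}

# Maximin: maximize the floor, protecting against the worst outcome.
maximin_choice = max(distributions, key=lambda d: min(distributions[d]))

# Utilitarian baseline: maximize the mean, ignoring the floor.
average_choice = max(
    distributions, key=lambda d: sum(distributions[d]) / len(distributions[d])
)

print(maximin_choice)  # "B": highest floor (20,000)
print(average_choice)  # "C": highest mean, but the lowest floor (5,000)
```

The divergence between the two choices is exactly the kind of principled disagreement that distributive-justice experiments put to participants.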

The Shadows of Representation: Limitations in Human Trials
Human subject research, a cornerstone of many fields, frequently encounters limitations stemming from participant pool composition. Obtaining truly representative samples is challenging due to logistical constraints, recruitment biases, and the inherent difficulty in capturing the diversity of the population. These limitations can manifest as over-representation of specific demographic groups, such as students or volunteers with particular motivations, and under-representation of marginalized or hard-to-reach communities. Consequently, findings derived from these studies may not generalize effectively to the broader population, potentially leading to inaccurate conclusions or biased algorithmic outcomes when applied in real-world scenarios. Careful consideration of sampling methodology and potential biases is therefore crucial in designing and interpreting human subject research.
The disproportionate reliance on participants from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies represents a significant methodological limitation in algorithmic fairness research. This bias stems from the convenience of accessing these populations, but introduces substantial issues with external validity. Cognitive processes, cultural values, and socioeconomic factors vary considerably across populations; therefore, algorithms trained and evaluated primarily on WEIRD samples may exhibit systematic errors or unfair outcomes when deployed in more diverse contexts. This limits the generalizability of findings and can lead to the development of algorithms that perpetuate or exacerbate existing societal inequalities, particularly impacting underrepresented groups who are not adequately reflected in the training data or evaluation metrics.
Replication studies, defined as independent attempts to reproduce prior research findings using identical or similar methodologies, are essential for establishing the robustness and reliability of algorithmic fairness research. However, these studies face significant barriers due to a lack of dedicated funding mechanisms and institutional incentives. While initial research demonstrating potential fairness issues often receives funding, subsequent attempts to verify those findings, or to assess the generalizability of solutions, are frequently overlooked. This underfunding results in a limited number of replication studies, hindering the accumulation of evidence and potentially leading to the propagation of flawed or non-generalizable results within the field. The consequence is a reduced confidence in reported fairness evaluations and a slower pace of progress towards developing truly reliable and equitable algorithms.
Experimental design, while a robust methodology for establishing causal relationships by manipulating and isolating variables, remains susceptible to the limitations inherent in sample bias. Rigorous controls within an experiment can effectively minimize confounding factors and internal validity threats; however, if the participant pool is not representative of the population to which findings are generalized, the external validity of the results is compromised. Even with statistically significant results obtained through controlled experimentation, conclusions may not accurately reflect performance or outcomes across diverse subgroups or real-world scenarios if the initial sample suffers from selection bias, such as the overrepresentation of specific demographic groups or failure to account for relevant contextual factors.

Synthetic Minds: AI Agents as Proxies for Population Diversity
Foundation models, typically large neural networks pre-trained on extensive datasets, provide the computational infrastructure for developing AI agents capable of complex reasoning and decision-making. These models, such as those utilizing transformer architectures, acquire a broad understanding of language, concepts, and relationships during pre-training. This pre-existing knowledge is then leveraged through techniques like fine-tuning or in-context learning to instantiate specific agent behaviors. The scale of these models – often containing billions of parameters – allows them to represent and process information in a manner that facilitates nuanced judgment and the application of learned principles to novel situations, forming the basis for autonomous action and problem-solving within defined parameters.
Large language models (LLMs) enable the creation of AI agent personas by leveraging their capacity to generate text contingent on defined parameters. These parameters, input as part of the prompt, specify characteristics such as demographic information, cultural background, values, and behavioral tendencies. The LLM then utilizes these specifications to consistently generate responses and actions aligned with the designated persona, effectively simulating an individual with predictable attributes. This instantiation process allows for the creation of diverse agent populations, each exhibiting unique characteristics, for use in scenarios requiring varied perspectives or representative decision-making.
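The sketch below shows what such parameterized instantiation can look like in practice. The template fields and the helper function are illustrative assumptions, not the NormCoRe implementation.

```python
# A minimal sketch of persona instantiation by prompting. PERSONA_TEMPLATE
# and build_persona_prompt() are hypothetical stand-ins, not the paper's API.
PERSONA_TEMPLATE = (
    "You are a study participant with the following profile.\n"
    "Age: {age}. Occupation: {occupation}. Household income: {income}.\n"
    "Values you hold strongly: {values}.\n"
    "Answer every question in character, consistently with this profile."
)

def build_persona_prompt(age, occupation, income, values):
    """Render a system prompt that pins the agent to one simulated individual."""
    return PERSONA_TEMPLATE.format(
        age=age, occupation=occupation, income=income, values=", ".join(values)
    )

system_prompt = build_persona_prompt(
    age=42,
    occupation="warehouse worker",
    income="below the national median",
    values=["economic security", "equal opportunity"],
)
# The rendered prompt would be sent as the system message of any
# chat-completion endpoint, followed by the experimental question, e.g.
# "Which income distribution principle should govern this society?"
print(system_prompt)
```

Varying the template fields across a population of agents is what yields the diverse simulated participant pools described above.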
Prompt engineering is the process of designing and refining textual inputs, known as prompts, to guide the output of large language models (LLMs). Because LLMs generate responses based on statistical probabilities derived from their training data, prompts serve as the primary mechanism for influencing their behavior and directing them toward specific tasks or desired outputs. The precision and structure of a prompt directly impact the quality, relevance, and consistency of the LLM’s response; therefore, careful prompt construction, including the specification of context, constraints, and desired format, is essential for reliable and predictable agent behavior. Iterative refinement of prompts, often involving A/B testing and analysis of generated outputs, is a core component of successfully deploying LLMs as AI agents with defined characteristics and decision-making processes.
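One way to operationalize such A/B testing is to sample each prompt variant repeatedly and score how stable the answers are. The sketch below fakes the model outputs to show only the bookkeeping; the variant names, answers, and scoring rule are assumptions for illustration.

```python
from collections import Counter

def consistency(answers):
    """Share of runs that return the modal answer -- a crude stability score."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

# In a real evaluation these lists would come from repeated sampling of an
# LLM under each prompt variant; here they are invented outputs.
samples = {
    "variant_a": ["maximin", "maximin", "average", "maximin", "maximin"],
    "variant_b": ["maximin", "average", "average", "floor", "maximin"],
}

for variant, answers in samples.items():
    print(variant, consistency(answers))
# variant_a 0.8  -> tighter phrasing yields more stable judgments
# variant_b 0.4
```

Sensitivity of this score to wording changes is one concrete signature of the linguistic-framing effects the paper reports.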
The creation of distinct AI agent personas allows for the simulation of diverse perspectives in fairness evaluations. Utilizing the NormCoRe framework, research indicates that groups of these AI agents demonstrated convergence on established fairness principles in 29 out of 33 tested scenarios. This level of agreement is comparable to that observed in 23 out of 34 human groups assessed under the same conditions, suggesting that AI agents can effectively model human reasoning regarding fairness. The ability to systematically generate and evaluate viewpoints through AI agents offers a novel approach to identifying and mitigating potential biases in complex systems.
Analysis of fairness assessments conducted using AI agent groups revealed a disagreement rate of 9.1%, significantly lower than the 20.6% disagreement rate observed in comparable human groups. This indicates a greater degree of homogeneity in the fairness evaluations provided by the AI agents. The observed reduction in disagreement suggests that, under controlled conditions and with defined parameters, AI agents can arrive at more consistent conclusions regarding fairness than human participants, potentially offering a standardized approach to evaluating complex ethical considerations.
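The reported group counts invite a quick back-of-the-envelope check. The sketch below recomputes the convergence rates from the figures above and adds a two-proportion z-test; the test is an added sanity check on this summary’s numbers, not an analysis from the paper, and the 9.1% and 20.6% disagreement rates above are the article’s separately reported metric.

```python
import math

# Convergence counts reported above: 29 of 33 AI-agent groups and
# 23 of 34 human groups reached agreement on a fairness principle.
ai_hits, ai_n = 29, 33
hu_hits, hu_n = 23, 34

p_ai, p_hu = ai_hits / ai_n, hu_hits / hu_n
print(f"AI convergence:    {p_ai:.1%}")   # 87.9%
print(f"Human convergence: {p_hu:.1%}")   # 67.6%

# Hedged extra step (not from the paper): a two-proportion z-test
# using the pooled standard error.
pooled = (ai_hits + hu_hits) / (ai_n + hu_n)
se = math.sqrt(pooled * (1 - pooled) * (1 / ai_n + 1 / hu_n))
z = (p_ai - p_hu) / se
print(f"z = {z:.2f}")  # ~1.99, borderline at the conventional 5% level
```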
The pursuit within NormCoRe, rigorously replicating human studies with AI, mirrors a fundamental act of reverse engineering. It isn’t simply about achieving identical outcomes, but about deconstructing the underlying principles governing human fairness judgments and reassembling them within an artificial system. As Ada Lovelace observed, “The Analytical Engine has no pretensions whatever to originate anything.” This framework doesn’t seek to create fairness, but to meticulously translate and test whether existing norms, as observed in human behavior, can be faithfully reproduced in an artificial agent. The subtle differences highlighted by NormCoRe, which accounts for AI ‘decision-making’ versus human cognition, reveal the ‘design sins’ inherent in attempting such a translation, illuminating where the replication falters and demanding deeper analysis of the original system.
Beyond the Baseline
The NormCoRe framework, while a step toward systematizing the interrogation of norms in artificial intelligence, merely exposes the depth of the undertaking. Replication-by-translation forces a valuable confrontation with assumptions embedded within experimental design – assumptions rarely questioned when subjects are human, but critical when applied to systems operating on fundamentally different principles. The apparent success in mirroring fairness judgments isn’t a destination, but a calibration point; a demonstration that mirroring is possible, not that the underlying principles are understood.
Future work must embrace the inevitable divergence. To treat AI agents as imperfect copies of humans is to miss the opportunity to explore the space of possible normative systems. What novel forms of fairness, cooperation, or even conflict emerge when agents aren’t constrained by the quirks of human cognition? The true value lies not in reproducing human bias, but in identifying the architectural constraints that produce any normative behavior, regardless of substrate.
One anticipates challenges in scaling this approach. The translation process, presently a careful manual exercise, demands automation. More fundamentally, the framework tacitly assumes norms are discoverable through behavioral observation. It remains an open question whether certain ethical principles are genuinely ‘emergent’ or are, instead, externally imposed – a distinction that may prove crucial when these agents inevitably reshape the landscapes they inhabit.
Original article: https://arxiv.org/pdf/2603.11974.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/