Author: Denis Avetisyan
New research reveals that distortions in the internal dynamics of artificial intelligence systems can lead to disproportionate reliance on certain types of data, causing AI to exhibit predictable biases when processing information from multiple sources.
![The study characterizes transformer dynamics through a physical testbed, a multi-oscillator system predicting Lorenz chaotic time series, and quantifies modality preference using dynamical SHAP values, [latex]\phi(Y) - \phi(X)[/latex], represented as directional arrows ranging from -90° to 90° to indicate the relative contribution of each input modality. Analysis of prediction accuracy, visualized through embedding spaces at low ([latex](\beta_{self}, \beta_{cross}) = (10^{-4}, 10^{-4})[/latex]) and high ([latex](\beta_{self}, \beta_{cross}) = (10^{0}, 10^{0})[/latex]) attention levels, specifically examining time-series data between t=50 and t=70, reveals how attention mechanisms influence the model’s reliance on different input modalities and the resulting prediction error.](https://arxiv.org/html/2602.20624v1/figures/fig6_0111.jpg)
A physics-based approach utilizing dynamical systems modeling identifies how imbalances in self-attention mechanisms within transformer networks drive cross-modal bias in multimodal large language models.
Despite advances in multimodal large language models, subtle distortions in their complex interactions can introduce systematic biases that undermine fairness. This paper, ‘Physics-based phenomenological characterization of cross-modal bias in multimodal models’, proposes a novel approach to understanding these biases by characterizing transformer dynamics through a physics-based phenomenological lens. We demonstrate that cross-modal biases aren’t simply representational issues, but emerge from the model’s internal dynamics: specifically, imbalanced attention levels that reinforce modality dominance, as revealed through analyses of emotion classification and chaotic time-series prediction. Can this physics-inspired framework provide a pathway toward more robust and equitable multimodal AI systems?
The Illusion of Intelligence: Peeking Behind the Curtain
The swift development of Multimodal Large Language Models (MLLMs) represents a significant leap in artificial intelligence, yet a crucial challenge persists: a limited understanding of how these models actually function. While MLLMs demonstrate impressive abilities to process and relate information across various inputs (text, images, and audio), the internal mechanisms driving these capabilities remain largely a black box. Researchers are actively working to unravel the complex interplay of algorithms and data representations within MLLMs, but the sheer scale and intricacy of these models make full transparency difficult to achieve. This opacity isn’t merely an academic concern; it directly impacts the ability to identify and mitigate potential biases, ensure the reliability of predictions, and ultimately, build trust in these increasingly powerful AI systems. The continued advancement of MLLMs, therefore, hinges not only on improving performance, but also on illuminating their inner workings.
The core difficulty in advancing Multimodal Large Language Models (MLLMs) resides in deciphering the process of cross-modal integration – how these systems genuinely synthesize information arriving through distinct sensory channels. Unlike unimodal models processing solely text, MLLMs must correlate visual features from images, acoustic patterns from audio, and semantic content from text, establishing relationships that underpin a unified representation of the input. Researchers hypothesize that attention mechanisms play a crucial role, allowing the model to dynamically prioritize relevant features across modalities, but the precise weighting and interplay remain elusive. Determining whether the model is leveraging genuine conceptual understanding or simply identifying statistical correlations between modalities is a significant hurdle, as is pinpointing how conflicting information across modalities is resolved to form a coherent internal world model.
The increasing complexity of multimodal large language models presents a significant challenge to ensuring trustworthy artificial intelligence. While these models demonstrate impressive capabilities in processing diverse data types, the internal mechanisms driving their decisions remain largely a ‘black box’. This opacity isn’t merely a matter of intellectual curiosity; it directly impedes efforts to identify and mitigate potential biases embedded within the model’s training data or architecture. Consequently, diagnosing the source of inaccurate or unfair predictions becomes exceedingly difficult, hindering improvements to model reliability and potentially leading to the perpetuation of harmful stereotypes or discriminatory outcomes. Addressing this lack of transparency is therefore critical for responsible development and deployment of MLLMs across sensitive applications.

Benchmarks and Blind Spots: The Illusion of Understanding
Emotion classification benchmarks, such as the CREMA-D dataset, are utilized to evaluate the capacity of Multimodal Large Language Models (MLLMs) to accurately interpret and integrate emotional signals conveyed through multiple modalities – typically audio, visual, and textual inputs. CREMA-D consists of actors portraying various emotions through speech, allowing researchers to quantify an MLLM’s performance in recognizing these emotions across different modalities and in combined multimodal contexts. Assessment involves measuring the model’s accuracy in classifying the expressed emotion, providing a quantitative metric for its ability to process and understand nuanced emotional cues present in the data. This standardized evaluation facilitates comparison between different MLLM architectures and training methodologies, pinpointing strengths and weaknesses in multimodal emotional understanding.
Recent evaluations of Multimodal Large Language Models (MLLMs), including Qwen2.5-Omni and Gemma 3n, demonstrate a susceptibility to modality bias. This phenomenon describes the tendency of these models to disproportionately favor information from a single input modality – either visual or textual – when generating responses. Observed performance variations indicate that models may prioritize the more readily available or superficially dominant modality, even when that modality contains less relevant or accurate information for the given task. This reliance can lead to inaccurate or incomplete outputs, particularly in scenarios where the combined information from all modalities is crucial for correct reasoning or decision-making.
Prompt-based perturbation assesses Multimodal Large Language Model (MLLM) bias by systematically altering input modalities and observing the resulting changes in predictions. This technique quantifies modality reliance using the SHAP difference [latex]\phi(Y) - \phi(X)[/latex], where [latex]\phi[/latex] denotes the SHAP value attributed to modality Y or X. The difference is mapped to an angle between -90° and 90°: a value of 90° signifies complete reliance on modality Y, -90° indicates complete reliance on modality X, and values close to 0° suggest a more balanced integration of both modalities in the MLLM’s decision-making process, providing a measurable metric for identifying and quantifying modality bias.
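The angle metric above can be sketched in a few lines. The specific normalization that maps a SHAP difference onto the -90° to 90° range is an illustrative assumption here, not the paper’s exact formula:

```python
import numpy as np


def modality_dominance_angle(phi_x: float, phi_y: float) -> float:
    """Map SHAP attributions for two modalities to an angle in [-90, 90] degrees.

    +90 deg -> prediction relies entirely on modality Y,
    -90 deg -> entirely on modality X, 0 deg -> balanced.
    The normalization below is an illustrative choice (assumption).
    """
    total = abs(phi_x) + abs(phi_y)
    if total == 0.0:
        return 0.0  # neither modality received any attribution
    # Triangle inequality guarantees the result stays within [-90, 90].
    return 90.0 * (phi_y - phi_x) / total


print(modality_dominance_angle(0.2, 0.2))  # balanced attributions -> 0.0
print(modality_dominance_angle(0.0, 0.5))  # Y-only attribution    -> 90.0
```

In a perturbation study, one would recompute this angle after masking or altering each modality and track how far it drifts toward either extreme.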

The Ghost in the Machine: Modeling the Internal Clockwork
The Transformer architecture, central to modern Multimodal Large Language Models (MLLMs), processes input data through sequential layers of self-attention and feedforward networks. This iterative application of non-linear transformations can be formally described as a discrete-time dynamical system, where the hidden states of each layer represent the system’s state at a given time step. Input data initializes this state, and subsequent layers update it based on learned parameters – the weights and biases within the network. Analyzing the Transformer in this framework allows for the application of tools from dynamical systems theory, such as stability analysis and bifurcation theory, to understand its behavior and potential limitations. This perspective shifts the focus from static parameter analysis to the temporal evolution of information within the network, treating the layers not as isolated functions but as interconnected components of a complex, evolving system.
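Viewed this way, a stack of residual layers is a discrete-time map applied repeatedly to the hidden state, with the layer index playing the role of time. A toy sketch, using random stand-in weights (all dimensions and scales here are illustrative assumptions, not the paper’s setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # hidden dimension (assumption)
n_layers = 12  # each layer is one time step of the discrete map

# Random per-layer matrices standing in for learned weights (assumption).
Ws = [rng.normal(scale=0.3, size=(d, d)) for _ in range(n_layers)]


def layer_step(h: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One step of the discrete-time map: a residual update through a nonlinearity."""
    return h + np.tanh(W @ h)


h = rng.normal(size=d)        # state initialized by the input embedding
trajectory = [h.copy()]
for W in Ws:                  # depth-wise iteration = temporal evolution
    h = layer_step(h, W)
    trajectory.append(h.copy())

# The state norm over "time" is one simple observable of the dynamics;
# stability analysis would examine how perturbations of h grow or shrink here.
norms = [np.linalg.norm(x) for x in trajectory]
```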
The Multi-Oscillator Model posits that each layer within a Transformer architecture can be approximated as a coupled oscillator, with the amplitude and phase of each oscillator representing the layer’s activation state and information processing stage, respectively. These oscillators are interconnected, mirroring the feedforward and attention mechanisms of the Transformer; connection weights are derived from the corresponding weight matrices in the neural network. This surrogate model allows for the transformation of high-dimensional activation spaces into a lower-dimensional oscillatory space, facilitating analysis of the Transformer’s internal dynamics using techniques from dynamical systems theory. The model’s parameters are initialized based on the pre-trained weights of the Transformer, and subsequent simulations aim to replicate the behavior of the original network in response to given inputs, offering a computationally efficient alternative for investigation.
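A minimal version of such a surrogate can be written as Kuramoto-type phase oscillators with a weighted coupling matrix standing in for the network’s learned weights. The frequencies, coupling scale, and step size below are illustrative assumptions, not the paper’s calibrated values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16                                   # one oscillator per modeled layer/unit (assumption)
omega = rng.normal(1.0, 0.1, n)          # natural frequencies (assumption)
K = rng.normal(scale=0.1, size=(n, n))   # coupling weights, stand-in for attention/FFN weights


def kuramoto_step(theta: np.ndarray, dt: float = 0.01) -> np.ndarray:
    """One Euler step of Kuramoto-type phase dynamics with weighted coupling.

    d(theta_i)/dt = omega_i + sum_j K_ij * sin(theta_j - theta_i)
    """
    coupling = (K * np.sin(theta[None, :] - theta[:, None])).sum(axis=1)
    return theta + dt * (omega + coupling)


theta = rng.uniform(0.0, 2.0 * np.pi, n)
for _ in range(1000):
    theta = kuramoto_step(theta)

# Order parameter r in [0, 1]: degree of synchronization across oscillators,
# a coarse analogue of coherent information flow through the layers.
r = abs(np.exp(1j * theta).mean())
```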
By modeling the Transformer as a dynamic system of interconnected oscillators, researchers can quantitatively analyze information propagation through the network. This analysis involves tracking the ‘energy’ or activation levels within each oscillator – representing a layer – to determine how signals are amplified, attenuated, or redirected. Identifying layers with consistently high activation or those exhibiting oscillatory behavior outside of expected parameters indicates potential bottlenecks where information flow is restricted. Conversely, layers with consistently low activation suggest underutilization or a lack of contribution to the overall computation. Furthermore, instability – manifested as diverging or chaotic oscillatory patterns – can signal issues with gradient flow during training or susceptibility to adversarial inputs, allowing for targeted architectural or training modifications to improve robustness and performance.

Dissecting the System: Tracing the Flow of Influence
The Multi-Oscillator Model utilizes graph-based connectivity inspired by the Watts-Strogatz small-world network to represent interactions between oscillators. This network is parameterized with a degree [latex]k[/latex] of 10, indicating each oscillator connects to ten others, and a rewiring probability [latex]p[/latex] of 0.01. The rewiring probability defines the likelihood of randomly reconnecting an oscillator’s edges, creating shortcuts that reduce the average path length between oscillators while maintaining a relatively high clustering coefficient. This specific configuration ([latex]k=10[/latex], [latex]p=0.01[/latex]) balances local connectivity with global information propagation, mirroring observed dynamics in large language models and enabling the study of information flow within the modeled system.
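The Watts-Strogatz construction is simple enough to sketch directly: start from a ring lattice and rewire each edge with probability [latex]p[/latex]. The degree and rewiring probability below match those stated above, while the network size of 100 nodes is an illustrative assumption:

```python
import random


def watts_strogatz(n: int, k: int, p: float, seed: int = 0) -> dict:
    """Build a Watts-Strogatz small-world graph as an adjacency map of sets.

    Start from a ring lattice where each node links to its k nearest
    neighbours (k even), then rewire each lattice edge with probability p.
    """
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    # Ring lattice: connect each node to k/2 neighbours on each side.
    for i in range(n):
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    # Rewire each original lattice edge with probability p.
    for i in range(n):
        for j in range(1, k // 2 + 1):
            if rng.random() < p:
                old = (i + j) % n
                candidates = [v for v in range(n) if v != i and v not in adj[i]]
                if candidates:  # replace the shortcut target, preserving edge count
                    new = rng.choice(candidates)
                    adj[i].discard(old); adj[old].discard(i)
                    adj[i].add(new); adj[new].add(i)
    return adj


# Configuration used in the paper: degree k = 10, rewiring probability p = 0.01.
g = watts_strogatz(n=100, k=10, p=0.01)
mean_degree = sum(len(v) for v in g.values()) / len(g)
```

Because rewiring replaces an edge rather than deleting it, the mean degree stays at [latex]k[/latex] while a small number of long-range shortcuts appear.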
Dynamic SHAP values, when applied to the Multi-Oscillator Model, provide a quantifiable assessment of each input modality’s contribution to the model’s output at each processing layer. Unlike global SHAP values which offer an aggregate explanation, dynamic SHAP values are computed for individual instances, enabling a layer-specific understanding of feature importance. This allows for the identification of which modalities are most influential at early layers versus later stages of processing. The resulting values are expressed as a contribution score, indicating the degree to which a specific modality’s features impact the model’s prediction for a given input, thereby facilitating a detailed analysis of information flow within the network.
Layer Normalization and Feedforward Operations, integral components of the Multi-Oscillator Model, facilitate detailed examination of internal information processing by standardizing activations and enabling non-linear transformations at each layer. Layer Normalization improves training stability and accelerates convergence by reducing internal covariate shift, while Feedforward Operations – typically consisting of fully connected layers and non-linear activation functions – allow the model to learn complex relationships between features. Analyzing the interaction of these operations with dynamic SHAP values reveals how information propagates through the network, identifying which modalities and layers contribute most significantly to specific outputs and enabling a mechanistic understanding of the model’s decision-making process.
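A minimal pre-norm residual block shows how these two operations compose; the dimensions, weight scales, and choice of ReLU are illustrative assumptions:

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each feature vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)


def feedforward(x, W1, b1, W2, b2):
    """Position-wise feedforward network: linear -> ReLU -> linear."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2


rng = np.random.default_rng(0)
d, d_ff = 8, 32  # model and hidden widths (assumption)
W1, b1 = rng.normal(scale=0.1, size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.1, size=(d_ff, d)), np.zeros(d)

x = rng.normal(size=(4, d))                         # 4 tokens, 8 features each
y = x + feedforward(layer_norm(x), W1, b1, W2, b2)  # pre-norm residual block
```

Per-layer SHAP contributions can then be read against outputs like `y`, relating how much each modality’s features survive the normalization and nonlinear mixing at that layer.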
Network analysis, when applied to Multi-Oscillator Models of Multimodal Large Language Models (MLLMs), provides a mechanistic understanding of performance limitations and biases. By quantifying the contribution of each modality at each layer via dynamic SHAP values and analyzing the model’s graph-based connectivity – specifically utilizing a Watts-Strogatz network with a degree of 10 and rewiring probability of 0.01 – researchers can pinpoint the origins of systematic errors. This allows for the identification of which modalities disproportionately influence specific layers and, consequently, contribute to biased outputs or reduced performance on particular tasks. The resulting granular insights move beyond correlational observations, offering a causal explanation for observed behaviors within the MLLM.

Beyond Performance: Towards Trustworthy and Accountable AI
Multimodal Large Language Models (MLLMs) often exhibit biases stemming from the relative strengths of different input modalities – for example, prioritizing visual information over textual data. Research indicates that a deeper comprehension of the internal dynamics governing these models is crucial for addressing such modality biases and fostering improved generalization capabilities. By analyzing how MLLMs integrate and weigh information from various sources, scientists are developing techniques to recalibrate these internal processes, ensuring a more balanced and reliable assessment of input data. This approach doesn’t simply aim for higher accuracy; it focuses on building models that learn how to learn across modalities, leading to systems capable of adapting to novel situations and minimizing reliance on spurious correlations within specific data types. Ultimately, understanding these dynamics unlocks the potential for MLLMs to move beyond pattern recognition and achieve a more robust and human-like understanding of the world.
Current multimodal large language models (MLLMs) are often perceived as inscrutable ‘black boxes’, accepting inputs and producing outputs with little understanding of the internal processes driving those results. However, a shift is occurring, informed by philosophical concepts such as Representationalism and Phenomenology. Representationalism posits that knowledge is constructed through internal representations of the external world, while Phenomenology emphasizes the importance of subjective experience and conscious awareness. By applying these principles to MLLM research, scientists are moving beyond simply observing what these models do, and instead focusing on how they internally represent and process information. This approach allows for a deeper understanding of the model’s reasoning, enabling the development of techniques to interpret its decisions, identify potential biases, and ultimately build more transparent and trustworthy AI systems. It’s a move from treating MLLMs as opaque predictors to understanding them as systems constructing internal ‘worlds’ based on multimodal input.
Rigorous evaluation of multimodal large language models necessitates testing beyond standard datasets, and recent work emphasizes validation against chaotic systems like the Lorenz system to assess true robustness. This approach challenges the models with inherently unpredictable data, forcing them to generalize beyond memorization and identify underlying patterns. Prediction accuracy within these chaotic time-series is quantified using Normalized Mean Squared Error (NMSE), providing a sensitive metric for evaluating the model’s ability to maintain reliable performance even when faced with extreme sensitivity to initial conditions. By subjecting these models to such demanding tests across diverse tasks, researchers aim to build AI systems capable of consistent and trustworthy predictions, even in complex and uncertain environments.
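Both ingredients of this evaluation, a Lorenz trajectory and the NMSE metric, are easy to sketch. The Euler integrator and the persistence baseline below are simplifying assumptions for illustration, not the paper’s setup:

```python
import numpy as np


def lorenz_trajectory(n_steps, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Integrate the Lorenz system with a simple Euler scheme."""
    xyz = np.array([1.0, 1.0, 1.0])
    out = np.empty((n_steps, 3))
    for t in range(n_steps):
        x, y, z = xyz
        xyz = xyz + dt * np.array([sigma * (y - x), x * (rho - z), x * y - beta * z])
        out[t] = xyz
    return out


def nmse(pred: np.ndarray, target: np.ndarray) -> float:
    """Normalized mean squared error: MSE divided by the target variance."""
    return float(np.mean((pred - target) ** 2) / np.var(target))


truth = lorenz_trajectory(2000)
# A trivial persistence "model" predicts the previous state; any learned
# predictor should score well below this baseline on one-step-ahead NMSE.
persistence = truth[:-1]
score = nmse(persistence, truth[1:])
```

Because NMSE divides by the target’s variance, a score of 1.0 corresponds to predicting the mean, which makes the metric comparable across trajectories of different scales.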
The pursuit of artificial intelligence extends beyond mere performance metrics; a central ambition involves crafting systems demonstrably aligned with human understanding and ethical considerations. This necessitates a shift from opaque ‘black box’ models toward designs prioritizing transparency and interpretability, allowing users to discern the reasoning behind AI-driven conclusions. Such alignment isn’t simply about avoiding harmful outputs, but fostering trust and enabling meaningful collaboration between humans and machines. Ultimately, the development of truly robust AI hinges on building systems that are not only capable of intelligent action, but also accountable, explainable, and reflective of core human values, ensuring beneficial integration into society.
The pursuit of ever-more-complex multimodal models feels remarkably like chasing a phantom. This paper, with its focus on transformer dynamics and cross-modal bias, simply confirms what experience already suggests: elegant theory often buckles under the weight of real-world data. The researchers attempt to model these biases using concepts from dynamical systems (the Lorenz system, specifically), a move that, while mathematically intriguing, feels like applying advanced physics to a leaky faucet. As Donald Davies observed, “It is always unwise to overestimate the power of any particular technology.” The article demonstrates that imbalanced attention leads to modality dominance, but the fundamental truth remains: production systems will always find novel ways to introduce chaos, regardless of how meticulously balanced the initial conditions appear. The drive for ‘revolutionary’ architectures inevitably creates tomorrow’s tech debt.
What’s Next?
The invocation of dynamical systems theory, specifically the Lorenz system, to explain cross-modal bias is… ambitious. One suspects production data will prove far messier than any elegantly constrained attractor. While the paper highlights the importance of balanced attention, it merely postpones the inevitable. The fundamental problem isn’t how biases manifest in transformer dynamics, but that they will. It’s a question of when, not if, a sufficiently complex dataset will expose a new, equally subtle, and thoroughly predictable failure mode.
Future work will undoubtedly focus on methods to ‘correct’ these distortions – more loss functions, more regularization, more architectural tweaks. The field will chase ‘fairness’ and ‘alignment’ as though these were achievable states, rather than temporary reprieves from the chaos inherent in mapping ambiguous sensory input onto discrete language tokens. The authors correctly identify attention as a critical lever, but true progress requires acknowledging that scaling up current architectures simply amplifies existing flaws.
Ultimately, this work serves as a useful demonstration that everything new is old again, just renamed and still broken. The search for truly robust multimodal understanding may not lie in refining these models, but in accepting their inherent limitations – and perhaps, finally, questioning the premise that such understanding is even possible.
Original article: https://arxiv.org/pdf/2602.20624.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-02-25 13:35