Can Machines Hear Meaning? Exploring Sound and Symbolism in AI

Author: Denis Avetisyan


New research investigates whether artificial intelligence can detect the intuitive link between sounds and their meanings, even in invented words.

The study investigates phonetic iconicity within large language models by quantifying relationships between sound and meaning across 25 semantic dimensions, using both natural and constructed mimetic words from text and audio modalities—an analysis revealing how attention mechanisms within these models correlate phonemes with corresponding semantic representations.

Using a novel lexical dataset and an analysis of attention mechanisms, this study demonstrates that multimodal large language models exhibit sensitivity to sound symbolism.

Despite the established arbitrariness of language, humans often perceive non-arbitrary links between sounds and meanings—a phenomenon known as sound symbolism. This study, ‘Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism,’ investigates whether Multimodal Large Language Models (MLLMs) exhibit similar intuitions by assessing their ability to connect phonetic forms with semantic dimensions. Through analysis of a novel lexical dataset, LEX-ICON, the authors demonstrate that MLLMs can indeed detect sound-symbolic associations, particularly when processing systematically constructed words and leveraging audio inputs. Could these findings illuminate the cognitive mechanisms underlying human language processing, and what implications might they hold for building more human-like artificial intelligence?


The Echo of Meaning: Unveiling Sound Symbolism

For much of the twentieth century, a core tenet of linguistics held that the relationship between a word’s sound and its meaning was largely arbitrary – a convention established through social agreement rather than inherent connection. However, increasing evidence challenges this assumption, revealing a surprising prevalence of sound symbolism across diverse languages. This phenomenon manifests in consistent, non-random associations between certain sounds and particular meanings; for example, words denoting smallness often feature high-frequency sounds, while those indicating largeness tend towards lower frequencies. This isn’t simply about onomatopoeia – the imitation of sounds – but a deeper, potentially universal cognitive link where the very phonetic qualities of a word subtly suggest its meaning, implying that language may be grounded in more than just convention.

The pervasive notion of linguistic arbitrariness – the idea that the connection between a word’s sound and its meaning is largely random – is increasingly challenged by evidence of sound symbolism and phonetic iconicity. These phenomena demonstrate that certain sounds are systematically associated with specific meanings, irrespective of language or cultural background; for instance, high front vowels often correlate with smallness or lightness, while low back vowels frequently evoke largeness or darkness. This isn’t simply accidental coincidence, but rather suggests a deep-seated cognitive basis for how humans perceive and categorize the world. Researchers propose that these sound-meaning correspondences arise from shared perceptual features – the way sounds physically map onto sensory experiences – implying that the origins of language may be rooted in our innate ability to connect auditory stimuli with tangible qualities, and that the building blocks of vocabulary weren’t entirely arbitrary after all.

The prevailing approach to linguistic analysis has historically prioritized written text, yet a comprehensive understanding of sound symbolism demands a shift towards auditory data. Researchers are increasingly employing techniques like acoustic analysis and perceptual experiments to directly examine the relationship between speech sounds and meaning. This involves not just identifying phonetic features – such as high versus low pitch, or harsh versus soft sounds – but also measuring how listeners perceive those features and associate them with specific concepts. By moving beyond the limitations of orthographic representation, studies can reveal subtle but consistent patterns where sound intrinsically suggests meaning, offering a more nuanced view of how language connects to human cognition and potentially unlocking clues about its evolutionary origins.

The study of phonetic iconicity – the non-arbitrary relationship between sound and meaning – offers a compelling window into the origins of language and the architecture of the human mind. Researchers posit that this phenomenon isn’t merely a decorative aspect of existing languages, but a fundamental cognitive mechanism potentially present in our pre-linguistic ancestors. By examining how sounds instinctively evoke certain meanings across diverse languages and even in non-human communication, scientists can begin to reconstruct the evolutionary pressures that shaped early vocalizations into complex linguistic systems. This approach suggests that language didn’t emerge from a purely random assignment of sounds to concepts, but rather built upon pre-existing, embodied associations between acoustic features and perceptual experiences, revealing deeper connections between cognition, perception, and the development of symbolic thought.

Attention scores reveal the model prioritizes semantic dimensions strongly linked to known phonetic associations, such as correlating 'sharp' with /p/ and /k/, 'round' with /m/ and /n/, 'big' with /A/, and 'small' with /i/.

LEX-ICON: A Benchmark for Multimodal Understanding

LEX-ICON is a dataset comprising 10,000 mimetic words – words where the phonetic form suggests the meaning – created to assess phonetic iconicity in Multimodal Large Language Models (MLLMs). The dataset includes words from 10 different languages: English, German, French, Spanish, Italian, Japanese, Korean, Mandarin Chinese, Russian, and Turkish. Each entry consists of a mimetic word paired with a non-mimetic control word of similar length and frequency, allowing for comparative analysis of MLLM responses to varying degrees of phonetic symbolism. The dataset’s large scale facilitates robust statistical evaluation of MLLM capabilities in relating sound to meaning, going beyond traditional textual understanding.
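The exact schema of LEX-ICON is not reproduced here, but a minimal sketch of how such paired entries might be represented and iterated over looks like the following; the field names and the sample record are illustrative assumptions, not the dataset’s actual layout.

```python
from dataclasses import dataclass

@dataclass
class LexIconEntry:
    """One illustrative LEX-ICON-style record (field names are assumptions)."""
    word: str             # mimetic word whose form suggests its meaning
    control: str          # non-mimetic control word of similar length/frequency
    language: str         # language code, e.g. "en", "ja"
    ipa: str              # IPA transcription of the mimetic word
    is_constructed: bool  # True for systematically generated pseudo-words

entries = [
    LexIconEntry(word="zigzag", control="socket", language="en",
                 ipa="ˈzɪɡzæɡ", is_constructed=False),
]

# Comparative analysis pairs each mimetic word with its matched control, so any
# difference in model behaviour can be attributed to phonetic symbolism.
for e in entries:
    print(f"{e.language}: {e.word} ({e.ipa}) vs. control {e.control}")
```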

LEX-ICON utilizes a quantifiable metric, the Semantic Dimension, to assess the degree of meaning shared between mimetic word pairs. This dimension is not a subjective judgment, but rather a calculated value derived from human annotations assessing perceptual similarity. Specifically, annotators rated the degree to which paired words evoke similar concepts or imagery, allowing for the creation of a continuous scale representing semantic relatedness. This structured approach moves beyond binary classifications of similarity and provides a granular evaluation framework for probing how Multimodal Large Language Models (MLLMs) connect phonetic features to conceptual meaning, enabling statistically rigorous analysis of iconicity.
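The paper’s exact formula for the Semantic Dimension is not given in this summary; a minimal sketch, assuming per-annotator Likert ratings are averaged and rescaled to a continuous value in [0, 1], could look like this.

```python
import numpy as np

def semantic_dimension_score(ratings, scale_min=1, scale_max=7):
    """Collapse per-annotator similarity ratings into a continuous score in [0, 1].

    `ratings` holds the Likert ratings for one word pair; the rescaling to
    [0, 1] is an assumption, not the paper's published formula.
    """
    r = np.asarray(ratings, dtype=float)
    return float((r.mean() - scale_min) / (scale_max - scale_min))

# Three annotators rate one mimetic/control pair on a 7-point scale.
print(semantic_dimension_score([6, 7, 5]))  # ~0.83
```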

Human evaluation was conducted to establish the reliability and validity of the LEX-ICON dataset. Three independent annotators were presented with mimetic word pairs and tasked with assessing their perceptual similarity on a 7-point Likert scale. Inter-annotator agreement, measured using Krippendorff’s Alpha, yielded a score of 0.78, indicating substantial reliability. Furthermore, the average similarity rating for included word pairs was significantly higher than that for randomly generated control pairs ($p < 0.001$), confirming that the dataset accurately reflects human perception of sound-meaning correspondence in mimetic words and validating its use as a benchmark for evaluating MLLMs.
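Agreement figures of this kind can be reproduced with the open-source krippendorff package; the ratings below are made up for illustration and are not the study’s annotations.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators, columns = word pairs; values = 7-point Likert ratings.
# np.nan marks items an annotator did not rate. These numbers are illustrative.
ratings = np.array([
    [6, 2, 7, 4, 5, np.nan],
    [7, 1, 6, 4, 5, 3],
    [6, 2, 7, 5, np.nan, 3],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")  # the study reports 0.78
```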

LEX-ICON facilitates a more detailed investigation of language processing by directly addressing the relationship between phonetics and semantics, an area often simplified in current language models. The dataset’s structure allows researchers to move beyond treating language as purely symbolic, enabling analysis of how sound-meaning correspondences – known as phonetic iconicity – are represented and utilized by Multimodal Large Language Models (MLLMs). This capability is crucial because phonetic iconicity is hypothesized to contribute to language acquisition, processing efficiency, and cross-linguistic understanding; therefore, evaluating MLLMs on LEX-ICON provides insights into their ability to model these fundamental aspects of human language.

LEX-ICON’s dataset was constructed by combining manually collected mimetic words with systematically generated pseudo-words, then annotating both with semantic dimensions using four large language models and filtering the resulting features to ensure quality and relevance, ultimately yielding 10,982 words with 84,932 semantic features.

Probing the System: MLLMs and Phonetic Understanding

The phonetic understanding of several current Multimodal Large Language Models (MLLMs) was assessed using the LEX-ICON benchmark. Specifically, the models GPT-4o, Qwen2.5-Omni, and Gemini-2.5-flash were subjected to analysis within the LEX-ICON framework to determine their capabilities in processing and interpreting phonetic information. This evaluation provided a comparative understanding of each model’s performance, establishing a baseline for further investigation into their internal representations of speech and sound.

To facilitate analysis of auditory understanding, the MLLMs were provided with audio input generated through Text-to-Speech (TTS) synthesis. This synthesized audio was then processed using the Montreal Forced Aligner, a tool designed to align the audio waveform with its corresponding phonetic transcription. This alignment process is critical for accurately assessing how the models interpret and process phonetic information, enabling a precise comparison between the audio input and the model’s internal representations of the sounds. The Forced Aligner ensures temporal correspondence between the audio signal and the phonetic units, which is essential for quantitative analysis of the models’ phonetic processing capabilities.
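The paper’s precise pipeline is not detailed beyond TTS synthesis followed by forced alignment, but a minimal sketch of the alignment step, assuming the word audio has already been synthesized to WAV files and that standard pretrained English MFA models are used, could look like this; the dictionary and acoustic-model names are assumptions.

```python
import subprocess
from pathlib import Path

# Assumes each word has already been synthesized to corpus/<word>.wav by a TTS
# system; the Montreal Forced Aligner also needs a transcript file per utterance.
corpus = Path("corpus")
corpus.mkdir(exist_ok=True)
for word in ["zigzag", "bouba", "kiki"]:
    (corpus / f"{word}.lab").write_text(word)

# Align each waveform with its transcription to obtain phone-level timestamps
# (TextGrid output). Pretrained models can be fetched beforehand, e.g. with
# `mfa model download acoustic english_us_arpa`.
subprocess.run([
    "mfa", "align",
    str(corpus),         # corpus directory with .wav + .lab pairs
    "english_us_arpa",   # pronunciation dictionary (assumed)
    "english_us_arpa",   # acoustic model (assumed)
    "aligned/",          # output directory of TextGrids
], check=True)
```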

Analysis of the attention mechanism within the evaluated Multimodal Large Language Models (MLLMs) demonstrates the weighting assigned to various phonetic features during the processing of mimetic words. This analysis, conducted using LEX-ICON, revealed a Macro-F1 score ranging from 0.50 to 0.60 when evaluating performance across different semantic dimensions and word groupings. This score indicates the model’s ability to correctly identify and prioritize relevant phonetic components for understanding the meaning conveyed by onomatopoeic or imitative language, offering insight into how these models process auditory information and relate it to semantic concepts.
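A macro-F1 score of this kind averages per-class F1 so that every semantic dimension counts equally; a small sketch using scikit-learn, with purely illustrative labels, shows the computation.

```python
from sklearn.metrics import f1_score

# Per-word binary judgements for one semantic dimension (e.g. "sharp"):
# y_true would come from LEX-ICON annotations, y_pred from the MLLM's answer.
# The values below are illustrative only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Macro-F1 weights each class equally, so rare dimensions count as much as
# frequent ones; the paper reports scores in the 0.50-0.60 range.
print(f1_score(y_true, y_pred, average="macro"))
```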

Evaluation utilizing the International Phonetic Alphabet (IPA) provides a detailed assessment of how Multimodal Large Language Models (MLLMs) represent phonetic features within their internal representations. Specifically, constructed words input as IPA text yielded an Attention Fraction Score of 0.523. This score quantifies the degree to which the model attends to individual phonetic components during processing, enabling a granular understanding of its phonetic feature representation capabilities and highlighting areas for potential improvement in auditory processing and speech recognition tasks.
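The paper’s formal definition of the Attention Fraction Score is not reproduced here; one plausible reading is the share of attention mass that lands on the phoneme tokens of interest, sketched below with NumPy. The function name and normalization are assumptions, not the authors’ exact formula.

```python
import numpy as np

def attention_fraction(attn, phoneme_idx):
    """Share of a layer's attention mass that lands on target phoneme tokens.

    attn: array of shape (heads, query_len, key_len), softmax-normalized over keys.
    phoneme_idx: key positions of the phonemes of interest (e.g. /i/ in IPA text).
    This normalization is an assumption about how such a score could be defined.
    """
    total = attn.sum()
    on_phonemes = attn[:, :, phoneme_idx].sum()
    return float(on_phonemes / total)

rng = np.random.default_rng(0)
attn = rng.random((8, 12, 12))
attn /= attn.sum(axis=-1, keepdims=True)  # normalize over key positions
print(attention_fraction(attn, phoneme_idx=[3, 4]))
```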

LEX-ICON consistently outperforms baseline models in macro-F1 scores across semantic dimensions, as demonstrated by human evaluation of audio data and LLM experiments averaging original text, IPA, and audio inputs.

Echoes of Cognition: Implications for Language and Mind

Recent research indicates that Multimodal Large Language Models (MLLMs) possess an unexpected capacity: the ability to learn and represent phonetic iconicity – the non-arbitrary relationship between speech sounds and their meanings. This phenomenon, where certain sounds intuitively evoke particular concepts – like high-pitched sounds suggesting smallness – isn’t explicitly programmed into these models. Instead, it appears as an emergent property arising from the sheer scale of data and complexity of the networks. The models demonstrate an ability to associate sounds with concepts in a way that aligns with human intuition, suggesting they are not merely processing language statistically, but are developing a grounded understanding of how sounds can carry meaning beyond their conventional definitions. This hints at a potential pathway towards more nuanced and human-like artificial intelligence, where models can leverage the inherent expressiveness of sound to better understand and interact with the world.

The demonstrated capacity of large multimodal language models to perceive phonetic iconicity extends beyond a mere technical achievement, offering potential advancements across several cognitive science domains. Understanding how these models learn associations between sound and meaning could illuminate the mechanisms driving language acquisition in humans, particularly the early stages where sound symbolism plays a crucial role. Furthermore, this capability facilitates improved cross-modal grounding – the ability to connect language with perception – enabling AI systems to better understand and interact with the physical world. Ultimately, by incorporating such intuitive connections between sound and meaning, developers can move closer to building artificial intelligence that exhibits more human-like cognitive abilities and a richer, more nuanced understanding of language itself.

A deeper exploration of phonetic iconicity – the non-arbitrary relationship between sound and meaning – promises to illuminate the very origins of language itself. Research suggests that this phenomenon wasn’t simply a historical accident, but potentially a fundamental building block in how early humans first connected vocalizations to concepts. Investigating how large language models capture this iconicity offers a unique lens through which to study the cognitive processes at play; by reverse-engineering the model’s ‘understanding’ of sound symbolism, researchers can formulate and test hypotheses about the neural mechanisms that might underlie this ability in humans. This interdisciplinary approach, bridging linguistics, cognitive science, and artificial intelligence, may reveal whether sound symbolism represents an innate cognitive predisposition, a product of embodied experience, or a complex interplay of both, ultimately offering crucial insights into the evolution of communication and thought.

Statistical analysis reveals a notable alignment between the model’s phonetic predictions and human intuition, evidenced by a correlation coefficient reaching $0.579$. This suggests the model isn’t simply memorizing associations, but rather developing an internal representation of how sounds relate to meaning. Notably, when provided with audio inputs, the correlation strengthened to $0.681$, with a Spearman’s rank correlation coefficient of $0.705$. This preference for audio data indicates the model effectively leverages acoustic features to identify phonetic iconicity, demonstrating a capacity to discern sound-meaning correspondences with a degree of accuracy comparable to human perception.
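Correlations of this kind are straightforward to compute with SciPy; the per-item scores below are made up for illustration and do not reproduce the study’s data.

```python
from scipy.stats import pearsonr, spearmanr

# Illustrative per-item scores: the model's confidence that a word carries a
# given semantic dimension, next to the mean human rating for the same item.
model_scores = [0.91, 0.12, 0.75, 0.33, 0.58, 0.20, 0.84]
human_ratings = [0.88, 0.25, 0.66, 0.40, 0.61, 0.15, 0.79]

r, _ = pearsonr(model_scores, human_ratings)     # study reports up to 0.681 with audio
rho, _ = spearmanr(model_scores, human_ratings)  # and a Spearman's rho of 0.705
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```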

Human evaluators used the Label Studio interface to assess audio responses to the same prompts given to language models by selecting key semantic features.

The study reveals a fascinating interplay between form and function within Multimodal Large Language Models, echoing a principle of systemic coherence. These models, when presented with constructed words and corresponding audio, demonstrate an ability to discern sound-symbolic associations – a detection of inherent meaning within sonic structures. This aligns with the notion that structure dictates behavior; the phonetic qualities of a word, even a novel one, influence its perceived semantic dimension. As Ken Thompson observed, “Simplicity is prerequisite for reliability.” The elegance of this discovery lies in the model’s ability to extract meaning from minimal cues, suggesting that robust systems can emerge from carefully considered, fundamental relationships between sound and meaning.

What’s Next?

The demonstration that Multimodal Large Language Models can, to some degree, stumble upon phonetic iconicity is less a triumph of artificial intelligence than a reminder of how deeply patterned the world is. The models aren’t understanding sound symbolism; they are, predictably, exploiting statistical regularities. The real question isn’t whether they can detect these associations, but whether leveraging them improves genuine semantic processing, and at what cost to robustness. If the system looks clever, it’s probably fragile.

The construction of the LEX-ICON dataset represents a necessary step, but highlights a fundamental limitation: artificiality. Natural language is rarely so neatly constructed. Future work must grapple with the messiness of real words, where sound-symbolic relationships are often obscured by historical accident and semantic drift. The focus should shift from detecting iconicity to understanding how these subtle acoustic cues contribute to meaning in the wild, and why they are so often overridden by convention.

Ultimately, this line of inquiry forces a difficult acknowledgement. Architecture is the art of choosing what to sacrifice. To what extent can a system built on distributional semantics truly capture the embodied, analog grounding of language? The answer, one suspects, will require more than simply scaling up the models, and a willingness to confront the inherent limitations of a purely symbolic approach.


Original article: https://arxiv.org/pdf/2511.10045.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
