Author: Denis Avetisyan
Researchers are developing new AI frameworks to bridge the gap between visual understanding and the nuances of online humor.

A novel framework leverages hierarchical reasoning and preference modeling to improve meme generation and align AI outputs with human comedic sensibilities.
Generating genuinely humorous content remains a significant challenge for multimodal models, demanding nuanced reasoning beyond simple image-caption associations. This paper introduces HUMOR, a novel framework detailed in ‘From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme’, which guides visual language models through hierarchical chain-of-thought reasoning and aligns them with human preferences via group-wise reinforcement learning. By fostering diverse reasoning paths and leveraging comparative judgment, HUMOR demonstrably improves meme quality and offers a generalizable paradigm for human-aligned, open-ended multimodal generation. Could this approach unlock more intuitive and engaging interactions with AI across a wider range of creative tasks?
The Elusive Algorithm of Amusement
The pursuit of artificial humor presents a unique challenge, extending far beyond simple algorithmic pattern matching. While AI can readily identify statistically common comedic structures – a punchline following a setup, for example – replicating genuine amusement requires a deeper comprehension of context, social cues, and the unexpected. Current systems often struggle with ambiguity, failing to recognize irony, sarcasm, or the subtle deviations from norms that frequently underpin humor. True comedic effect stems from a violation of expectation, but an AI must first understand those expectations to skillfully subvert them. This necessitates a move beyond identifying surface-level patterns towards modeling the complex cognitive processes involved in human perception and appreciation of wit, a task that demands advancements in areas like common sense reasoning and emotional intelligence.
Approaches that prioritize surface-level pattern recognition over context and intent fare especially badly on memes, which are heavily culture-dependent. Current methods typically dissect memes into component parts (image, text, and associated metadata) and attempt to replicate successful combinations, but they struggle to grasp why a particular meme resonates with an audience. The subtle interplay of irony, satire, and shared cultural references, which is crucial to a meme’s success, is frequently missed, leading to outputs that are either nonsensical or simply lack comedic impact. Decoding humor requires understanding not just what is being said but how it is being said and, crucially, what unstated assumptions and cultural knowledge underpin the joke. That level of understanding remains elusive for most current AI systems.

Deconstructing the Comedic Framework: A Hierarchical Approach
Hierarchical Chain-of-Thought (CoT) reasoning is implemented by dissecting meme generation into two distinct stages: template-level intent and context-level grounding. Template-level intent focuses on defining the core communicative purpose of the meme – the underlying message or concept it aims to express – independent of specific imagery or topical references. Context-level grounding then translates this abstract intent into a concrete realization, selecting appropriate visual templates and incorporating relevant contextual information to create a fully formed meme. This separation enables a structured approach, allowing the model to first establish what the meme should convey before determining how to express it, and facilitating more coherent and targeted meme creation.
For example, a Template-Level Intent might be to express frustration, celebrate success, or highlight irony. Context-Level Grounding then operationalizes that intent by specifying the concrete details, including imagery, specific phrasing, and relevant cultural references, needed to convey it. Because the two stages are separate, the meme’s underlying message and its surface-level presentation can be adjusted independently, which yields more focused and logically consistent outputs than approaches that generate both at once.
Decomposition of meme generation into distinct intent and grounding phases facilitates a shift from stochastic meme creation to targeted content production. Prior methods often relied on combining visual and textual elements without a predefined communicative goal, resulting in outputs lacking coherence or relevance. By explicitly defining the intended message – the template-level intent – before determining specific contextual details, the system can prioritize outputs aligned with the desired meaning. This structured approach enables the generation of memes tailored to specific prompts or communicative objectives, increasing the probability of creating impactful and logically sound content rather than relying on chance combinations.
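The intent-then-grounding split can be sketched in a few lines of Python. This is only an illustration of the two-stage idea, not the paper’s implementation: the `MemeDraft` class, the `hierarchical_meme` function, the prompt wording, and the `generate` callable (a stand-in for any text-producing model) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MemeDraft:
    intent: str     # stage 1: template-level intent
    grounding: str  # stage 2: context-level grounding
    caption: str    # final meme text


def hierarchical_meme(template_desc: str, context: str,
                      generate: Callable[[str], str]) -> MemeDraft:
    """Run a two-stage reasoning chain, then write the caption.

    `generate` is a hypothetical stand-in for a model call: it takes a
    prompt string and returns generated text.
    """
    # Stage 1: decide what the meme should convey, with no topical context.
    intent = generate(
        f"Template: {template_desc}\n"
        "Describe the communicative intent this template supports."
    )
    # Stage 2: ground the abstract intent in the concrete context.
    grounding = generate(
        f"Intent: {intent}\nContext: {context}\n"
        "Explain how to realize the intent for this context."
    )
    # Final step: produce the caption from both reasoning stages.
    caption = generate(
        f"Intent: {intent}\nGrounding: {grounding}\nWrite the caption."
    )
    return MemeDraft(intent, grounding, caption)
```

Keeping the context out of the first prompt is the point of the design: the intent is fixed before any topical detail can influence it, so the same intent can later be grounded in different contexts.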

Learning from the Collective Human Palate
Pairwise Preference Learning is utilized to train a Reward Model by presenting human evaluators with pairs of memes and recording which they prefer. This comparative data is then used to optimize the Reward Model, enabling it to predict human judgments of meme quality with increased accuracy. The model learns to assign higher scores to memes that humans consistently prefer over others, effectively capturing subjective qualities like humor and originality through direct comparison rather than absolute scoring. This approach bypasses the need for predefined quality metrics and allows the model to adapt to nuanced human preferences, leading to improved performance in generating engaging content.
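A common way to train a reward model from pairwise choices is a Bradley-Terry style loss, in which the probability that one meme is preferred depends on the difference between the two scores. The article does not say which loss HUMOR uses, so the sketch below is an assumption for illustration; `pairwise_loss` and its parameter names are hypothetical.

```python
import math


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred meme wins.

    Under a Bradley-Terry model, P(preferred wins) = sigmoid(margin),
    where margin is the score difference. The loss is -log(sigmoid(margin)),
    written here in a numerically stable form.
    """
    margin = score_preferred - score_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the reward model scores the human-preferred meme further above the rejected one, and grows roughly linearly when the model ranks the pair the wrong way round.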
The Reward Model utilizes the Expected Borda Count (EBC) to synthesize individual human preferences into a unified ranking of meme quality. EBC is a method of aggregating rankings where each item receives points based on its position in each individual’s preference list; higher ranked items receive more points. Specifically, if n is the number of items being ranked, an item in first place receives n-1 points, second place n-2 points, and so on, down to zero points for last place. The EBC then calculates the expected value of these points across all individual rankings, providing a single, statistically robust score for each meme. This aggregation method effectively captures the collective human judgment, weighting preferences based on ranking position rather than simply counting first-place votes.
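The point scheme described above can be worked through in code. The sketch below averages Borda points over several complete rankings, following the description in this article; the paper’s exact EBC formulation may differ, and the function name `borda_scores` and the sample data are made up for illustration.

```python
from collections import defaultdict


def borda_scores(rankings):
    """Average Borda points per item across rankings (best item first).

    Assumes every ranking orders the same set of items. With n items,
    first place earns n - 1 points and last place earns 0.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            totals[item] += n - 1 - position
    return {item: total / len(rankings) for item, total in totals.items()}


# Three hypothetical raters ranking three memes.
scores = borda_scores([
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["a", "c", "b"],
])
```

Here meme `a` collects 2 + 1 + 2 = 5 points over three rankings, for an average of 5/3, while `b` averages 1 and `c` averages 1/3. Every position contributes to the score, which is why the result differs from simply counting first-place votes.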
The system’s ability to learn human preferences directly correlates to improved meme generation quality, as demonstrated by human evaluation metrics. Specifically, memes generated using this preference-driven approach consistently achieved the highest scores in four key areas: humor, readability, relevance to prompts, and originality. This performance indicates the Reward Model effectively captures nuanced aspects of human comedic taste and translates those learnings into meme content. The resulting memes are not only perceived as funnier but also more coherent, contextually appropriate, and novel, suggesting a significant advancement over models lacking this preference-learning component.

Orchestrating Diversity and Impact: A Systemic Approach
The HUMOR framework is a complete system for generating memes aligned with human preferences. It relies on Group-wise Relative Policy Optimization (GRPO) to refine the meme generator. Rather than scoring each output in isolation, GRPO evaluates a group of candidate memes together and updates the generator according to how each candidate fares relative to the others in its group. In GRPO-style methods the group is normally a set of candidates produced for the same input, not an audience demographic, which fits the framework’s reliance on comparative judgment: the reward signal indicates which of several comparable memes people prefer, and the generator shifts toward the preferred ones. By learning from these relative comparisons, HUMOR aims to move beyond generic meme creation and deliver content that is both original and appealing to human readers.
To ensure a stable and controlled learning process, the Group-wise Relative Policy Optimization (GRPO) algorithm incorporates KL Divergence constraints. This technique functions as a regulatory mechanism, preventing the meme generator from undergoing abrupt or radical shifts in its meme creation strategy. By limiting the extent to which the policy can deviate from previous iterations, KL Divergence safeguards against the generation of nonsensical or irrelevant content during training. Essentially, it encourages gradual improvements rather than unpredictable leaps, fostering a more robust and reliable system capable of consistently producing high-quality, diverse, and engaging memes while maintaining a degree of predictability in its creative output.
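Two ingredients of GRPO-style training can be sketched briefly: normalizing rewards within a group of candidates, and estimating the KL term from log-probabilities. This is a simplified illustration under stated assumptions, not HUMOR’s training code. It assumes the KL term is measured against a fixed reference policy (a common choice), it leaves out the clipped policy-ratio objective, and the function names and the `eps` parameter are hypothetical.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled candidates.

    Each candidate's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_estimate(logp_new: float, logp_ref: float) -> float:
    """Per-sample, non-negative estimate of KL(new || ref).

    Uses exp(d) - d - 1 with d = logp_ref - logp_new, evaluated on a
    sample drawn from the new policy. It is zero when the two policies
    assign the sample the same probability.
    """
    diff = logp_ref - logp_new
    return math.exp(diff) - diff - 1.0


# Rewards for three candidate memes generated from the same input.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Candidates that score above their group’s average receive a positive advantage and are reinforced, those below receive a negative one, and the KL estimate grows as the updated generator drifts from the reference, which is what keeps the updates gradual.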
The meme generation process relies on Vision-Language Models, enabling the creation of visual content from textual prompts. Crucially, the system doesn’t simply prioritize novelty; it actively assesses the diversity of generated memes using a metric called Context-Swap Distance. This approach yielded the highest diversity scores among all tested models, resulting in a stream of original and engaging content. Beyond diversity, the system also maximizes a ‘Human Rate’ – an estimation, derived from the VLM itself, of how closely the generated memes resemble content created by humans. This dual focus on diversity and human-likeness ensures the output isn’t just varied, but also resonates with audience expectations, creating memes that are both creative and relatable.
The pursuit of genuinely open-ended generation, as exemplified by HUMOR’s framework, echoes a sentiment attributed to David Hilbert: “We must be able to answer the question: What are the ultimate foundations of mathematics?” Just as Hilbert sought bedrock principles, this research delves into the foundational elements of humor (preference modeling and hierarchical reasoning) to build systems capable of more than mere imitation. The framework doesn’t simply generate memes; it attempts to understand the underlying principles that make them effective, recognizing that architecture, or in this case a generative model, without a robust understanding of its governing principles is ultimately fragile and ephemeral. This pursuit of understanding, even in the seemingly lighthearted domain of meme generation, reveals a commitment to building systems that age gracefully, capable of adapting and evolving beyond initial constraints.
What’s Next?
The pursuit of computational humor, as exemplified by frameworks like HUMOR, reveals a fundamental tension. Each iteration, every commit, is a record in the annals, and every version a chapter in a longer, perhaps unending, saga. The immediate gains in meme generation, while notable, merely postpone the inevitable reckoning with the subjective and ephemeral nature of what humans find funny. Preference modeling, even group-wise, is a snapshot of a moving target, a temporary bulwark against the tide of changing tastes.
The limitations are not technical, precisely. They are inherent to the task. The system addresses open-ended generation, but the ‘open’ itself is an illusion. The space of possible memes is vast, yet constrained by cultural context, historical precedent, and the ever-shifting boundaries of acceptability. Delaying fixes for the nuances of timing, irony, and unexpected juxtaposition is a tax on ambition.
Future work will likely focus on increasingly sophisticated methods for capturing and adapting to these dynamic preferences. But the deeper question remains: can a system truly understand humor, or merely approximate its surface features? Perhaps the most fruitful avenue lies not in perfecting the punchline, but in accepting the inherent imperfections, the delightful awkwardness, that often lie at the heart of what we find amusing.
Original article: https://arxiv.org/pdf/2512.24555.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-01-04 05:38