Drawing Together: An AI That Collaborates on Visual Stories

Author: Denis Avetisyan


Researchers have created an embodied AI system capable of partnering with artists to co-create drawings, pushing the boundaries of human-robot creative collaboration.

The system functions as a voice-activated drawing machine, detailed in Appendix A, demonstrating a practical application of speech-based control in automated creation.

This work details the development and evaluation of ‘Companion,’ an embodied AI leveraging large language models and in-context learning for collaborative visual storytelling.

While artificial intelligence is often positioned as a tool for automation, its potential for genuine creative collaboration remains largely unexplored. This paper details the development of ‘An Embodied Companion for Visual Storytelling’, an integrated robotic and large language model system designed to co-create visual narratives with human artists. Through bidirectional interaction and real-time sketching, the system demonstrates a distinct aesthetic identity, validated by art-world experts, and a capacity for synergistic creative expression. Could this approach redefine the roles of artist and machine, unlocking new frontiers in embodied AI and artistic practice?


The Illusion of Agency: Democratizing Art, or Just Automating Skill?

Historically, the creation of art – whether painting, sculpture, or performance – has been fundamentally constrained by the need for extensive human skill and dedicated practice. This reliance on manual dexterity and years of training inherently limits who can fully realize their artistic visions, creating a barrier to entry for many potential creators. Furthermore, even for skilled artists, the physical demands and time investment associated with traditional techniques restrict the scale at which art can be produced. This limitation on both accessibility and scalability hinders widespread creative exploration and the rapid prototyping of artistic concepts, ultimately impacting the evolution of art itself.

An innovative Embodied AI System is presented, designed to fundamentally alter access to artistic creation. By uniting the conceptual power of Large Language Models with the dexterity of robotic hardware, the system transcends the limitations of traditional art forms requiring extensive manual skill. This integration allows users to translate abstract ideas into tangible artworks without needing specialized training in painting, sculpting, or other disciplines. The result is a platform that democratizes artistic expression, empowering a wider range of individuals to explore their creativity and realize their visions through a physical medium – effectively removing barriers to entry and fostering a new era of accessible artistry.

The Embodied AI System fundamentally reimagines the artistic process by dissolving the traditional boundary between idea and execution. Previously, translating a conceptual vision into a tangible artwork demanded significant technical skill and physical dexterity; this system bypasses those requirements, allowing artists to directly manifest their concepts through robotic means. By integrating large language models with precise robotic control, the system facilitates a novel form of artistic collaboration – not between human artists, but between human ideation and automated physical realization. This enables explorations beyond the limitations of human capability, potentially unlocking new aesthetic possibilities and broadening participation in creative endeavors as the system empowers individuals to realize artistic visions regardless of their technical proficiency.

The functionality of this Embodied AI hinges on the meticulous orchestration of robotic components, notably Dynamixel Servos, which provide the nuanced and accurate movements required for physical artistic expression. These servos aren’t simply actuators; they offer positional feedback, enabling the system to ‘know’ where its ‘limbs’ are in space and adjust accordingly – crucial for tasks demanding precision. Managing this complexity is the YARP Framework, a robust software platform designed for robot control and data communication. YARP facilitates seamless interaction between the Large Language Model – responsible for the creative direction – and the physical hardware, translating conceptual commands into concrete actions. This integration allows for dynamic adjustments during the creative process, ensuring the robotic system can respond to evolving artistic intentions and overcome real-world challenges inherent in physical creation.
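The closed-loop behavior that positional feedback enables can be sketched in miniature. The following is a purely illustrative Python sketch, with a hypothetical `Servo` class standing in for the real Dynamixel hardware and YARP middleware, neither of which is modeled here:

```python
from dataclasses import dataclass

# Hypothetical sketch of closed-loop servo control. The real system uses
# Dynamixel servos under the YARP framework; everything here is illustrative.

@dataclass
class Servo:
    position: float = 0.0  # degrees, as reported by positional feedback

    def read_position(self) -> float:
        return self.position

    def step_toward(self, target: float, max_step: float = 2.0) -> None:
        # Move at most max_step degrees per control tick.
        error = target - self.position
        self.position += max(-max_step, min(max_step, error))

def move_to(servo: Servo, target: float, tolerance: float = 0.1) -> int:
    """Drive the servo until its feedback says it is within tolerance."""
    ticks = 0
    while abs(servo.read_position() - target) > tolerance:
        servo.step_toward(target)
        ticks += 1
    return ticks

servo = Servo()
ticks = move_to(servo, 45.0)
```

The key point is the read-then-adjust cycle: because the servo reports where it actually is, the controller can converge on a target pose rather than blindly issuing open-loop commands.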

The robot’s attempt to artistically direct a scene, as demonstrated by its elongated arm reaching for the feather, reveals a deviation from the user’s intended character repositioning.

From Prompt to Pixel: The LLM as Artistic Director

The Large Language Model (LLM) functions as the core operational unit within the automated drawing system. It receives initial prompts, interprets the desired visual output, and subsequently manages all subsequent stages of artwork creation. This includes selecting appropriate drawing methods from its internal knowledge base, translating those methods into actionable commands for the robotic arm, and iteratively refining the drawing based on internal evaluations or external feedback. The LLM does not simply generate an image; it actively directs the entire physical drawing process, acting as an intermediary between the user’s intention and the final artwork produced by the robotic system. All components – visual vocabulary, drawing methods, and robotic control – are integrated and sequenced by the LLM to achieve the specified drawing task.
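The orchestration described above, in which the LLM plans, executes through the arm, and iterates, can be outlined as a simple control loop. The `StubLLM` and `StubTools` classes below are toy stand-ins for illustration, not the system's real LLM or robot stack:

```python
from dataclasses import dataclass, field

# Schematic of the LLM-directed drawing loop; all classes are toy stand-ins.

class StubLLM:
    def plan(self, prompt):
        # Pretend the LLM chose a stroke sequence for this prompt.
        return ["pen_down", "circle", "pen_up"]

    def evaluate(self, canvas):
        # Satisfied once a full stroke sequence is on the canvas.
        return len(canvas) >= 3

    def refine(self, plan, canvas):
        return plan + ["hatch"]

@dataclass
class StubTools:
    canvas: list = field(default_factory=list)

    def execute(self, command):
        self.canvas.append(command)

def run_drawing_task(prompt, llm, tools, max_rounds=3):
    """The LLM directs the whole loop: plan, execute, evaluate, refine."""
    plan = llm.plan(prompt)
    for _ in range(max_rounds):
        for command in plan:
            tools.execute(command)
        if llm.evaluate(tools.canvas):
            break
        plan = llm.refine(plan, tools.canvas)
    return tools.canvas

result = run_drawing_task("draw a sun", StubLLM(), StubTools())
```

The structure, not the stub logic, is the point: the LLM sits between intention and hardware at every stage, which is what distinguishes this design from one-shot image generation.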

In-Context Learning (ICL) enables the Large Language Model (LLM) to modify its drawing style without requiring explicit retraining. This adaptation is achieved by providing the LLM with a set of example drawings during the prompt; these examples demonstrate the desired aesthetic or technique. The LLM then analyzes these provided visuals and their associated instructions – effectively learning the requested style from the context of the prompt itself. This allows for dynamic control over the output, enabling the LLM to generate drawings in a variety of styles based solely on the examples given, rather than being limited to a pre-defined set of capabilities.

In-Context Learning (ICL) utilizes a curated Visual Vocabulary – a database comprising example drawings paired with corresponding Drawing Methods – to direct the Large Language Model’s (LLM) image generation. Each entry within the Visual Vocabulary explicitly links visual characteristics of a drawing to the specific robotic arm instructions, or Drawing Methods, required to reproduce those characteristics. The LLM references this vocabulary during the drawing process, identifying similarities between a desired output and existing examples, and then extrapolates the associated Drawing Methods to create new visual content. This allows the LLM to adapt its drawing style and execute complex visuals without requiring explicit retraining, effectively translating textual prompts into a sequence of robotic arm movements defined within the Visual Vocabulary.
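One way to picture how the Visual Vocabulary feeds In-Context Learning is as prompt assembly: each vocabulary entry pairs a drawing description with its Drawing Method, and both are placed into the LLM's context before the new task. The entry fields and wording below are assumptions for illustration, not the paper's actual schema:

```python
# Illustrative visual-vocabulary entries; fields and method strings are
# invented to show the idea, not taken from the paper.
VISUAL_VOCABULARY = [
    {"subject": "cat",
     "description": "loose, scribbly outline drawn in a single closed stroke",
     "method": "pen_down; spiral(radius=5); wavy_line(len=30); pen_up"},
    {"subject": "tree",
     "description": "zig-zag canopy over a straight double-line trunk",
     "method": "pen_down; zigzag(len=40); line(len=25, angle=270); pen_up"},
]

def build_icl_prompt(task: str) -> str:
    """Prepend vocabulary examples (drawing plus method) to the new task."""
    lines = ["You are a robot artist. Learn the style from these examples:"]
    for entry in VISUAL_VOCABULARY:
        lines.append(f"- {entry['subject']}: {entry['description']}")
        lines.append(f"  method: {entry['method']}")
    lines.append(f"Now produce a drawing method for: {task}")
    return "\n".join(lines)

prompt = build_icl_prompt("a scribbly dog")
```

Because the style lives in the prompt rather than in the model weights, swapping the vocabulary swaps the style, with no retraining involved.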

The system utilizes Robot Drawing Tools as an interface between the Large Language Model (LLM) and the robotic arm, translating the LLM’s conceptual drawing instructions into actionable motor commands. These tools facilitate precise control over the arm’s movements, including pen-down/pen-up actions, trajectory planning, and force application, enabling the creation of lines, curves, and shading. Communication occurs via a defined API, allowing the LLM to specify drawing parameters such as line length, angle, and speed. The robotic arm then interprets these commands, executing the desired motions to physically realize the artwork on a drawing surface, with feedback mechanisms ensuring accuracy and consistency throughout the drawing process.
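A minimal sketch of such a drawing-tool interface is shown below. The command names (`pen_down`, `line`) echo the pen-up/pen-down and trajectory control described above, but they are illustrative and not taken from the system's actual API:

```python
import math

# Hypothetical drawing-tool layer between the LLM and the arm.

class DrawingTools:
    def __init__(self):
        self.x, self.y = 0.0, 0.0
        self.pen_is_down = False
        self.trace = []  # points visited while the pen is down

    def pen_down(self):
        self.pen_is_down = True
        self.trace.append((self.x, self.y))

    def pen_up(self):
        self.pen_is_down = False

    def line(self, length: float, angle_deg: float):
        """Move in a straight line; record the endpoint if drawing."""
        rad = math.radians(angle_deg)
        self.x += length * math.cos(rad)
        self.y += length * math.sin(rad)
        if self.pen_is_down:
            self.trace.append((self.x, self.y))

tools = DrawingTools()
tools.pen_down()
tools.line(10, 0)   # 10 units to the right
tools.line(10, 90)  # 10 units up
tools.pen_up()
```

An interface at this granularity lets the LLM speak in terms of lines, angles, and pen state while the robot layer handles the motor-level realization.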

In-context learning enhances drawing stylization for previously seen subjects, with the greatest improvement achieved when drawing methods are included in the example prompts.

Human-Machine Collaboration: A New Artistic Paradigm, or Just a Convenient Illusion?

The system integrates Speech Recognition technology to enable direct voice control of the image generation process, bypassing traditional text-based input methods. This allows users to issue instructions and refine images through spoken commands, facilitating a more fluid and interactive collaborative experience. The Speech Recognition module transcribes audio input into text prompts that are then processed by the Large Language Model (LLM). This reduces the cognitive load on the user, enabling quicker iteration and exploration of creative ideas, and supports real-time adjustments based on spoken feedback. The system currently supports a defined vocabulary of artistic commands and parameters for voice control.

The Large Language Model (LLM) operates under a defined set of System Instructions which establish the parameters for its artistic output. These instructions detail the LLM’s designated artistic persona – specifying characteristics like style or preferred medium – and impose constraints on its generative process, such as limitations on color palettes, subject matter, or complexity. Critically, System Instructions also articulate specific goals for the generated artwork, defining desired narrative elements, emotional tone, or compositional objectives; these goals ensure the LLM’s output remains aligned with the overall creative vision and facilitates directed artistic exploration.
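A hypothetical example of what such System Instructions might look like; the wording below is invented for illustration and is not the prompt used in the paper:

```python
# Invented system-instruction text in the spirit described above.
SYSTEM_INSTRUCTIONS = """
You are Companion, a robot artist drawing alongside a human.
Persona: playful sketcher with a loose, scribbly line style.
Constraints:
- Black ink only; hatching is the only permitted shading.
- Keep each drawing under 40 strokes.
Goals:
- Advance the shared story with every new drawing.
- Respond to the human's spoken feedback before adding new elements.
""".strip()
```

Persona, constraints, and goals each map to one block of the instruction text, which is what keeps the model's output aligned with a directed creative vision rather than drifting freely.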

The system facilitates visual storytelling by generating a sequence of drawings that represent a narrative. This is achieved through iterative prompting and refinement, where human input guides the Large Language Model (LLM) to create images that progress a story arc. The LLM doesn’t produce a single image, but rather a Sequential Representation – a series of connected visuals designed to convey a narrative from beginning to end. Each drawing in the sequence builds upon the previous one, establishing continuity and contributing to the overall storytelling effect. The system’s capability extends beyond static image generation to actively support the creation of multi-panel visual narratives.
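The sequential build-up can be sketched as a prompt that carries the story forward into each new panel. The helper below is illustrative only, not the system's actual prompting scheme:

```python
# Illustrative panel-prompting helper; wording is invented.

def next_panel_prompt(story_so_far, user_direction):
    """Fold the accumulated story and the human's direction into one prompt."""
    return (f"Story so far: {' '.join(story_so_far)}\n"
            f"User says: {user_direction}\n"
            "Draw the next panel so it continues this narrative.")

story = ["A small bird finds a feather."]
prompt = next_panel_prompt(story, "the bird should fly toward the moon")
```

Carrying prior panels into each prompt is what gives the sequence continuity: every drawing is generated with the full narrative context, not in isolation.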

The generated drawings represent a collaborative artistic process where the Large Language Model (LLM) functions as a generative tool responding to human direction. The LLM’s capabilities are utilized to create visual content, but the ultimate artistic outcome is shaped by human input, which includes prompts, instructions, and iterative refinement. This is evidenced by the drawings’ composition, subject matter, and style, all of which are directly influenced by the human collaborator’s creative choices. The resulting imagery is therefore not solely the product of the LLM, but rather a synthesis of algorithmic generation and human artistic intent, demonstrating a shared creative effort.

Simulated drawings created by the Companion agent using the ‘Scribbly’ vocabulary demonstrate that incorporating methods significantly improves drawing quality compared to generating them without such methods.

Evaluating the Output: Does Algorithmic Consistency Equate to Artistic Merit?

The generative system doesn’t simply produce images; it crafts drawings possessing a demonstrably unique aesthetic identity. This isn’t a matter of random variation, but rather the consistent expression of a particular sketching style: a discernible set of visual characteristics that define the work. Analysis reveals a cohesive approach to line quality, composition, and subject representation, resulting in outputs that feel unified despite variations in content. This emergent style isn’t explicitly programmed, but arises from the interplay of the system’s underlying algorithms and training data, demonstrating an ability to synthesize and express a novel visual language. The consistency of this style is not merely observational; it’s been confirmed through rigorous evaluation, suggesting the system reliably produces artwork with a recognizable and distinct character.

The artistic merit of the system’s generated drawings underwent a stringent evaluation process utilizing the Consensual Assessment Technique, or CAT. This methodology bypasses subjective opinion by gathering assessments from multiple expert raters, ensuring a statistically robust judgment of quality. Participants, all experienced in visual arts, independently scored the drawings based on aesthetic appeal and artistic skill. The resulting data was then analyzed to determine the degree of consensus among the evaluators; a high level of agreement signifies a clear and reliable assessment of the artwork’s quality, independent of individual biases. This rigorous approach provides compelling evidence supporting the system’s capacity to produce not just technically proficient, but genuinely artistically valuable imagery.

Rigorous evaluation of the system’s artistic output, employing the Consensual Assessment Technique, demonstrates a significant capacity for generating drawings perceived as both visually compelling and artistically meaningful. Experts consistently rated the generated artwork, resulting in a consensus score of 6.0 out of 7 – a strong indication of broad agreement on the aesthetic quality achieved. This score isn’t merely numerical; it represents a judgment from trained observers that the system successfully navigates the complexities of artistic expression, producing work that resonates beyond purely technical proficiency and approaches genuine aesthetic value. The findings suggest the system isn’t simply replicating styles, but rather synthesizing them into novel creations that are demonstrably appreciated by those with discerning artistic eyes.

The system’s creative capabilities are significantly bolstered by integration with the Gemini API, allowing for nuanced content generation and a demonstrably consistent aesthetic style. Expert evaluation of the resulting artwork reveals a low standard deviation of 0.81 in aesthetic ratings, confirming a strong consensus amongst reviewers regarding the distinct visual characteristics of the generated drawings. This level of agreement indicates that the system doesn’t simply produce random outputs, but rather consistently embodies a recognizable and cohesive artistic identity, achieving a reliable standard of quality in its creative endeavors and highlighting the effective synergy between the LLM and the Gemini API.
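The reported consensus statistics, a mean near 6.0 and a standard deviation near 0.81 on a 7-point scale, are straightforward to compute from rater scores. The scores below are invented to illustrate the calculation and are not the study's data:

```python
import statistics

# Invented rater scores (1-7 scale) chosen only to illustrate how mean
# and standard deviation are computed from CAT-style expert ratings.
rater_scores = [6, 7, 5, 6, 6, 7, 5]

mean_score = statistics.mean(rater_scores)   # central tendency of the panel
spread = statistics.stdev(rater_scores)      # sample standard deviation
```

A high mean with a low spread is the signature of consensus: the experts not only rate the work highly, they largely agree with one another.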

Progressive iterations of the Gemini model, from Gemini 1.5-flash to Gemini 2.5-pro-preview, demonstrate a clear evolution in drawing capabilities when completing a complex visual task involving artistic interpretation, storytelling, and shading.

The pursuit of collaborative creation, as demonstrated by ‘Companion’, inevitably reveals the chasm between theoretical elegance and practical implementation. The system attempts to bridge human intent with robotic execution, a commendable effort, yet one bound to encounter the limitations of current technology. As Carl Friedrich Gauss observed, “If I speak for my own benefit, I am foolish.” This rings true; the promise of LLMs and robotic art is captivating, but the reality of in-context learning and achieving truly seamless human-robot interaction will likely demand far more refinement. The inevitable quirks and unforeseen challenges aren’t failures, merely indicators that even the most sophisticated framework will eventually confront the messy realities of production use. One anticipates a healthy backlog of edge cases.

Sooner or Later, It Breaks

The pursuit of an ‘embodied companion’ for creative endeavors is, predictably, hitting the edges of what’s actually useful. This work demonstrates a functional system, certainly, but anyone who’s shipped anything knows ‘functional’ is merely the baseline for inevitable chaos. The elegance of Large Language Models directing robotic arms feels precarious. The system currently relies heavily on carefully curated prompts and relatively constrained drawing tasks. Scaling this to genuinely open-ended visual storytelling? Expect a lot of digital scribbles that resemble abstract frustration. It’s a beautifully complex way to generate slightly unusual doodles.

The real challenge isn’t making the robot draw, it’s making it understand (or at least convincingly simulate understanding) the intent behind the human’s artistic direction. In-context learning, while clever, is still a patch on genuine comprehension. Future work will inevitably involve wrestling with ambiguity, correcting for misinterpretations, and, crucially, building in graceful failure modes. Because if a robotic arm flails wildly during a ‘creative burst’, at least it’s predictably flailing.

Ultimately, this field is building increasingly sophisticated tools for a task humans have managed for millennia. It’s not about replacing artists; it’s about creating a new category of technical debt. It’s important to remember: it’s not creation, it’s just meticulously logged instructions for future digital archaeologists to decipher.


Original article: https://arxiv.org/pdf/2603.05511.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
