Author: Denis Avetisyan
A new AI assistant streamlines live commerce by handling viewer questions and crafting compelling product descriptions on the fly.

This paper introduces Click-to-Ask, an AI system leveraging multimodal learning and reinforcement learning to provide interactive Q&A and copywriting support for live streaming commerce.
The increasing demands of live streaming commerce often strain broadcasters' ability to simultaneously prepare compelling content and respond to viewer inquiries. To address this, we present ‘Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA’, a novel system that leverages multimodal learning and event-based memory to automate copywriting and facilitate real-time question answering. This AI assistant significantly reduces streamer workload by pre-processing product information and enabling quick, informed responses to audience questions during broadcasts. Could this approach unlock new levels of engagement and efficiency within the rapidly evolving landscape of live e-commerce?
The Inevitable Chaos of Live Commerce: Why Old Systems Fail
The dynamic nature of live commerce presents a unique challenge to customer service; viewers pose questions at a relentless pace, expecting immediate and precise answers. Traditional methods, such as relying on human representatives or keyword-based chatbots, quickly become overwhelmed by this volume and complexity. These systems often struggle to keep up with the rapid-fire Q&A, leading to delayed responses, irrelevant information, and ultimately, lost sales. The sheer velocity of inquiries demands a fundamentally new approach capable of processing natural language in real-time and delivering accurate product details before viewer attention wanes, highlighting the critical need for intelligent assistance within the live commerce landscape.
Current live commerce support systems often falter when faced with the dynamic and nuanced questions posed by viewers. These systems typically rely on keyword matching or pre-defined scripts, proving inadequate for understanding the context of a question within the rapidly evolving conversation. Consequently, they frequently deliver irrelevant or incomplete product information, frustrating potential customers and hindering sales. The challenge isn't simply identifying keywords, but discerning the viewer's intent: are they comparing features, seeking usage advice, or inquiring about compatibility? Existing retrieval methods struggle to connect these implicit needs with the vast and complex details of available products, creating a significant bottleneck in the live shopping experience and demonstrating a clear need for more intelligent assistance.
The fast-paced nature of live commerce presents a significant challenge: swiftly and accurately addressing viewer inquiries about products. Traditional methods often fall short, unable to keep up with the volume and nuance of questions. To overcome this, a novel system has been developed to directly connect viewer questions with detailed product information. This approach leverages advanced question recognition capabilities, achieving a Question Recognition Accuracy (QRA) of 0.913. This high degree of accuracy ensures that the system correctly identifies the intent behind each question, enabling it to retrieve the most relevant product details and provide viewers with the information they need, ultimately enhancing the live shopping experience and driving conversions.

Building a Reliable Knowledge Base: A Sisyphean Task
A comprehensive and structured product knowledge base is foundational to generating effective responses. Robust Product Information Integration establishes this base by systematically collecting, organizing, and preparing product data for use by downstream applications. This process moves beyond simple data aggregation to create a relational structure, enabling efficient retrieval of specific information and facilitating a nuanced understanding of product features, specifications, and functionalities. The resulting knowledge base serves as the single source of truth, ensuring consistency and accuracy in all automated interactions and reducing the potential for errors stemming from outdated or conflicting information.
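The idea of a single, relationally structured source of truth can be sketched in a few lines. The `ProductRecord` fields and the lookup API below are illustrative assumptions, not the system's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ProductRecord:
    """One entry in the structured product knowledge base (fields are illustrative)."""
    product_id: str
    name: str
    specs: dict = field(default_factory=dict)    # e.g. {"battery": "5000 mAh"}
    features: list = field(default_factory=list) # short selling points

class KnowledgeBase:
    """Single source of truth: an id-keyed store with simple field lookup."""
    def __init__(self):
        self._records = {}

    def upsert(self, record: ProductRecord):
        # Later writes overwrite earlier ones, so stale data cannot linger.
        self._records[record.product_id] = record

    def lookup(self, product_id: str, spec: str):
        rec = self._records.get(product_id)
        return rec.specs.get(spec) if rec else None

kb = KnowledgeBase()
kb.upsert(ProductRecord("p1", "Phone X", specs={"battery": "5000 mAh"}))
print(kb.lookup("p1", "battery"))  # -> 5000 mAh
```

The upsert-overwrite convention is what makes the store a "single source of truth": there is never more than one live record per product, so conflicting versions cannot coexist.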
The Information Integration Module employs Chain-of-Thought Prompting, a technique that encourages the Qwen3 8B Large Language Model to break down complex information requests into a series of intermediate reasoning steps. This allows the LLM to more effectively extract relevant data, filter out noise, and synthesize cohesive information from multiple sources. The Qwen3 8B model, with 8 billion parameters, provides a balance between performance and computational efficiency for this task, enabling the module to process and integrate information at scale. The prompting strategy guides the LLM to not simply provide answers, but to articulate the reasoning behind its conclusions, enhancing the accuracy and reliability of the integrated knowledge.
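A minimal sketch of how such a chain-of-thought extraction prompt might be assembled. The template wording is an assumption for illustration; the actual prompts used with Qwen3 8B are not given in the article:

```python
def build_cot_prompt(raw_sources: list, question: str) -> str:
    """Compose a chain-of-thought extraction prompt for an LLM such as Qwen3 8B.
    The instruction wording here is illustrative, not the paper's template."""
    sources = "\n\n".join(
        f"[Source {i + 1}]\n{s}" for i, s in enumerate(raw_sources)
    )
    return (
        f"{sources}\n\n"
        f"Question: {question}\n"
        "Think step by step:\n"
        "1. List the facts in each source relevant to the question.\n"
        "2. Discard facts that conflict with a more authoritative source.\n"
        "3. Merge the remaining facts into one consistent answer.\n"
        "Show your reasoning for steps 1-3, then state the final answer."
    )

prompt = build_cot_prompt(
    ["Spec sheet: battery 5000 mAh", "Listing page: weight 180 g"],
    "What is the battery capacity?",
)
```

The key point is that the prompt demands intermediate reasoning (list, filter, merge) rather than a bare answer, which is what lets the model's conclusions be audited for accuracy.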
Data ingestion utilizes the MinerU Engine coupled with Automatic Speech Recognition (ASR) to process a variety of input formats. The MinerU Engine facilitates the extraction of text and data from complex document types, specifically including multi-formatted PDF files. Simultaneously, the ASR component converts audio files into text, enabling the integration of spoken information into the knowledge base. This combined approach ensures that data from both visual and auditory sources is captured and prepared for further processing by the Qwen3 8B LLM, regardless of the original file type or complexity.
Automated Content Creation: A Necessary Evil
The Offline Copywriting Module utilizes a Large Vision-Language Model to autonomously generate promotional content. This model processes both visual and textual inputs, enabling the creation of materials designed for both aesthetic presentation and informative accuracy. The system is capable of producing a variety of content formats suitable for diverse marketing channels, and is intended to streamline content creation workflows by reducing the need for manual copywriting and design efforts. The model's architecture allows it to understand the relationship between images and text, ensuring consistent and relevant messaging across all promotional materials.
The implementation of Prohibited-term Purification, utilizing the Qwen3 8B language model, addresses regulatory compliance by actively identifying and removing prohibited terminology from generated content. Prior to integration, content exhibited an average of 4.36 prohibited terms; post-integration, this figure has been reduced to 0.36, representing a significant improvement in adherence to established guidelines. This purification process operates automatically, minimizing manual review and ensuring consistent application of compliance standards across all generated materials.
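The article describes purification as LLM-driven via Qwen3 8B; as a rough stand-in, the same contract (clean text in, compliant text plus a hit count out) can be shown with a lexicon-based filter. The term list and replacements are invented for illustration:

```python
import re

# Simplified stand-in for the LLM-driven purification described above:
# flagged phrases and their replacements are illustrative, not a real compliance list.
PROHIBITED = {
    "best in the world": "highly rated",
    "guaranteed cure": "may help",
    "100% effective": "effective in our tests",
}

def purify(text: str):
    """Replace prohibited phrases and report how many were removed."""
    hits = 0
    for term, safe in PROHIBITED.items():
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        text, n = pattern.subn(safe, text)
        hits += n
    return text, hits

clean, n = purify("The best in the world serum, 100% effective!")
# -> ("The highly rated serum, effective in our tests!", 2)
```

A per-text hit count like `n` is also what makes the reported before/after metric (4.36 prohibited terms down to 0.36 on average) straightforward to compute.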
The Online Interactive Q&A Module utilizes a Click-based Chatbot architecture coupled with Event-based Historical Memory to facilitate real-time responses to user inquiries. This system dynamically accesses and processes information based on both explicit user clicks and a record of preceding interactions, enabling contextually relevant answers. Performance metrics indicate a Response Quality (RQ) score of 0.876, suggesting a high degree of accuracy and user satisfaction in the delivered responses. The combination of click data and historical memory allows the chatbot to move beyond simple keyword matching and provide more nuanced and informative answers.
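The interplay of click signal and event memory can be sketched as follows. `EventMemory`, the 300-second recency horizon, and the context format are assumptions for illustration, not the system's actual design:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Event:
    t: float      # stream timestamp, seconds
    caption: str  # auto-generated description of the segment

class EventMemory:
    """Event-based historical memory: recent segment captions, oldest first."""
    def __init__(self, horizon: float = 300.0):
        self.horizon = horizon  # how far back (seconds) counts as "recent"
        self.events = deque()

    def add(self, event: Event):
        self.events.append(event)

    def recent(self, now: float):
        return [e.caption for e in self.events if now - e.t <= self.horizon]

def answer_context(memory: EventMemory, product_info: str, now: float) -> str:
    """Assemble the context handed to the chatbot when a viewer clicks a product.
    The layout is a guess at the idea, not the system's actual prompt."""
    history = "; ".join(memory.recent(now)) or "no recent events"
    return f"Product: {product_info}\nRecent stream events: {history}"
```

Conditioning the answer on both the clicked product and what just happened on stream is what lifts responses beyond keyword matching: the same question can be answered differently depending on which demonstration the viewer just watched.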
Event Segmentation: Dividing the Chaos into Manageable Pieces
Streaming Event Segmentation operates by dissecting a continuous live video stream into discrete, manageable units representing logically distinct events. This process isn't simply temporal chopping; the system analyzes the video data to identify natural breakpoints based on scene changes, object interactions, or other significant occurrences. The resulting segmented stream forms an Event-based Historical Memory, providing a structured record of the video's content that facilitates efficient indexing, retrieval, and analysis. Each segment, representing a self-contained event, is stored with associated metadata, enabling targeted access to specific moments within the larger video timeline and supporting applications like video summarization and content-based search.
Event segmentation relies on a dual-analysis approach, employing both Optical Flow and ViT Feature Similarity to pinpoint significant moments within a video stream. Optical Flow algorithms track the movement of pixels between frames, detecting abrupt changes indicative of scene transitions or action initiation. Concurrently, ViT (Vision Transformer) Feature Similarity analyzes the semantic content of frames, calculating the similarity between feature vectors extracted from successive frames; a substantial decrease in similarity signals a likely event boundary. Combining these two methods – motion-based detection via Optical Flow and content-based detection via ViT – provides a robust mechanism for identifying both visually dynamic and semantically distinct segments within the continuous video stream.
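The dual test (motion spike or semantic-similarity drop) reduces to a simple per-frame rule. The sketch below assumes precomputed per-frame feature vectors and mean optical-flow magnitudes; the thresholds are illustrative, not values from the paper:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def find_boundaries(features, motion, sim_thresh=0.8, motion_thresh=2.0):
    """Mark frame i as an event boundary when its ViT-style feature similarity
    to frame i-1 drops below sim_thresh (semantic change), or its mean
    optical-flow magnitude exceeds motion_thresh (motion spike)."""
    return [
        i for i in range(1, len(features))
        if cosine(features[i - 1], features[i]) < sim_thresh
        or motion[i] > motion_thresh
    ]

# Frame 2's features are nearly orthogonal to frame 1's: a semantic boundary.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
print(find_boundaries(feats, [0.0, 0.1, 0.5]))  # -> [2]
```

Using an OR of the two cues matches the motivation in the text: optical flow catches visually dynamic cuts that ViT features may miss, while feature similarity catches semantic shifts with little pixel motion.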
The Knowledge Extraction Accelerator (KEA) processes segmented video events to produce descriptive captions, thereby populating the Event-based Historical Memory. KEA employs a combination of computer vision and natural language processing techniques to analyze each segment and automatically generate concise and relevant textual descriptions. This captioning process is optimized for efficiency, leveraging hardware acceleration to minimize latency and maximize throughput, enabling real-time or near real-time enrichment of the historical record. The generated captions include object identification, action recognition, and scene descriptions, providing a searchable and accessible metadata layer for each event segment.
The Illusion of Progress: Continuous Refinement and the Endless Cycle
The Click-based Chatbot's ability to provide helpful and pertinent responses undergoes continuous refinement through Group Relative Policy Optimization. This technique doesn't simply assess whether a response is correct, but rather evaluates it relative to other potential options, learning to prioritize those most likely to resonate with viewers. By comparing the performance of various response strategies within groups, the chatbot identifies subtle improvements in relevance and accuracy, avoiding the pitfalls of simply selecting the most statistically probable answer. This nuanced approach allows the system to adapt to the dynamic context of live commerce, enhancing user engagement by consistently delivering increasingly appropriate and valuable information.
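At its core, Group Relative Policy Optimization scores each sampled response against its sibling responses for the same query. A minimal sketch of that group-relative advantage (omitting the policy-gradient update and KL regularization of the full algorithm) is:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sampled response in a group is scored relative
    to the group's mean and spread, so the policy learns 'better than siblings'
    rather than 'high absolute reward'. Simplified sketch of the core idea."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid div-by-zero for flat groups
    return [(r - mean) / std for r in rewards]

advs = group_relative_advantages([0.2, 0.5, 0.8])
# mean is 0.5: the middle response gets advantage 0, the extremes are symmetric
```

Because advantages are normalized within each group, a query where every candidate scores well contributes no spurious gradient; only relative quality differences drive learning, which is the "evaluated relative to other options" behavior described above.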
The Click-based Chatbot's capabilities are significantly enhanced through its foundation in the Qwen2.5-VL 7B Large Language Model, enabling a dynamic learning process from each user interaction. This allows the chatbot to not only respond to immediate queries, but also to adapt to evolving viewer preferences and identify emerging trends in real-time. Performance metrics demonstrate a substantial improvement over the baseline Qwen2.5-VL 7B; the chatbot achieves a Response Quality (RQ) score of 0.876, a marked increase from the baseline's 0.536, and a Question Recognition Accuracy (QRA) of 0.913, considerably exceeding the initial 0.512. These results highlight the chatbot's capacity for continuous refinement and its potential to deliver increasingly accurate and engaging experiences.
The chatbot's performance isn't static; it's built upon a cycle of continuous refinement designed to elevate the user experience during live commerce. Each interaction serves as a learning opportunity, allowing the system to adapt not only to individual viewer preferences but also to evolving trends within the live shopping environment. This iterative approach ensures that responses become increasingly relevant and accurate over time, fostering greater engagement and ultimately maximizing the potential for successful transactions. By consistently learning and adapting, the chatbot moves beyond simply responding to queries and actively contributes to a more dynamic and rewarding experience for viewers, solidifying its role as a key component of the live commerce ecosystem, at least until the next "revolutionary" framework arrives.
The pursuit of seamless live streaming commerce, as detailed in this paper, feels predictably ambitious. Click-to-Ask attempts to automate engagement, leveraging multimodal learning and event segmentation to anticipate viewer questions. It's a clever system, certainly, but one built on the assumption that anticipating needs is superior to reacting to them. Fei-Fei Li observed, "AI is not about replacing humans; it's about empowering them." This rings particularly true here. The system aims to reduce streamer workload, but history suggests production will inevitably uncover edge cases: prohibited terms slipping through purification, unexpected questions derailing the carefully constructed Q&A. It's a sophisticated framework, destined to become tomorrow's tech debt, naturally. Still, it functions, for now.
The Inevitable Friction
This work, predictably, solves a problem production will immediately find a way to exacerbate. Reducing streamer workload is admirable, yet the system's reliance on predefined event segments feels… optimistic. Live commerce isn't a series of neat boundaries; it's a controlled chaos. The elegance of offline copywriting will be tested by the sheer volume of edge cases, the improvised pitches, the accidental brand missteps. Expect a rapid accumulation of "proof of life": unforeseen prompts, corrupted outputs, and the perpetual need for human intervention.
The real challenge isn't building a smarter assistant, but accepting that no assistant can anticipate every variation in human interaction. The pursuit of multimodal learning, while sound, will inevitably encounter data scarcity for niche products or emergent trends. Prohibited term purification is a temporary truce; adversarial attacks will evolve, and the system's understanding of nuance will always lag behind the creativity of those attempting to circumvent it.
Future iterations will likely focus on increasingly sophisticated error masking and strategies to gracefully degrade performance under load. The goal won't be perfection (that's a ghost chase) but resilience. The legacy of this work won't be a flawlessly automated stream, but a collection of carefully managed failures. A memory of better times, perhaps, when the problem seemed simpler.
Original article: https://arxiv.org/pdf/2603.18649.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/