Give Your GUI a Hand: Flow-Based Models Master Screen Control

Author: Denis Avetisyan


Researchers are leveraging the power of flow-based generative models to create AI agents capable of surprisingly dexterous interactions with graphical user interfaces.

The system, ShowUI-π, produces remarkably fluid and human-like trajectories across diverse applications, consistently adhering to specified paths and demonstrating an ability to navigate complex instructions with precision.

ShowUI-π demonstrates continuous control for GUI automation, achieving state-of-the-art performance on the new ScreenDrag benchmark.

Achieving truly human-like automation demands agents capable of nuanced, continuous interaction, a capability hindered by current GUI automation’s reliance on discrete click predictions. To address this, we present ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands, introducing a novel approach leveraging flow-based models for continuous control of graphical user interfaces. Our method demonstrates superior performance on a newly introduced benchmark, ScreenDrag, outperforming existing agents – including proprietary solutions and Gemini-2.5-CUA – while maintaining a compact model size of only 450M parameters. Will this advance in continuous control unlock a new era of intuitive and efficient human-computer interaction?


The Illusion of Seamless Control

Conventional graphical user interface (GUI) automation typically handles user inputs as isolated events – a click happens, then a drag, each treated as a distinct command. This segmented approach creates a barrier to truly natural interaction, as it fails to recognize the inherent continuity in many human actions. Consider drawing a line or making a fine adjustment to an image; these aren’t series of individual clicks, but fluid motions. By categorizing clicks and drags as separate entities, existing automation systems struggle to replicate the seamlessness of human control, often resulting in jerky, imprecise, or inefficient performance when confronted with tasks demanding nuanced manipulation and continuous feedback.

Current GUI automation systems often falter when faced with tasks demanding fine motor control. Treating actions like drawing, sculpting, or even precise slider adjustments as simply a series of clicks and drags introduces noticeable discontinuities, hindering natural interaction. These discrete approximations struggle to replicate the fluidity of human input, resulting in jerky movements and inaccurate outcomes. The limitations become particularly evident in creative applications or simulations where subtle variations are crucial; a digital artist, for example, relies on continuous pressure and velocity to define brushstrokes, a nuance lost when those actions are broken down into isolated events. Consequently, achieving genuinely intuitive control requires a fundamentally different approach, one that seamlessly integrates both discrete and continuous action types.

The development of genuinely intuitive artificial agents hinges on a consistent action representation, moving beyond the traditional separation of discrete actions – like clicks – and continuous control – such as dragging. Current systems often treat these as fundamentally different operations, creating a barrier to natural interaction and limiting an agent’s ability to seamlessly handle tasks demanding fine motor skills or nuanced adjustments. A unified framework proposes to model all actions along a single spectrum, allowing agents to interpret and execute both point-and-click selections and fluid, continuous movements with equal facility. This consistency not only simplifies the agent’s internal logic but also facilitates more effective learning from human demonstrations, ultimately enabling more natural and efficient human-agent collaboration.
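
One way to picture such a unified representation (a minimal sketch, not the paper's actual encoding) is to treat every interaction as a cursor trajectory, where a click degenerates to a single point and a drag becomes a dense sequence of points with the button held down:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Action:
    """A GUI action expressed as a cursor trajectory.

    A click is a trajectory with a single (x, y) point; a drag is a dense
    sequence of points sampled along the gesture. Button state is carried
    per point, so press, hold, and release all fall out of the same format.
    """
    points: List[Tuple[float, float]]   # normalized screen coordinates in [0, 1]
    pressed: List[bool]                 # mouse-button state at each point

def click(x: float, y: float) -> Action:
    # A discrete click is just a degenerate, single-point trajectory.
    return Action(points=[(x, y)], pressed=[True])

def drag(path: List[Tuple[float, float]]) -> Action:
    # A continuous drag holds the button down along the whole path.
    return Action(points=path, pressed=[True] * len(path))
```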

ShowUI-π leverages a large language model with cross-attention to an action expert, enabling the generation of unified action sequences that seamlessly handle both discrete clicks and continuous drag gestures.

ShowUI-π: Stitching the Discrete and Continuous Together

ShowUI-π addresses the limitations of existing GUI automation agents by providing a unified framework for handling both discrete actions, such as mouse clicks, and continuous actions, like dragging. Traditional agents typically treat these action types separately, requiring distinct logic for each. This agent is specifically designed for continuous trajectory control, enabling it to generate smooth and natural interactions by treating all user interface manipulations as continuous trajectories in a shared action space. This unified approach simplifies the agent’s architecture and improves its ability to handle complex, multi-step interactions that combine both discrete and continuous elements, resulting in more robust and versatile GUI control.

ShowUI-π employs a pre-trained vision language model, SmolVLM, to interpret user intent from visual input. This model is integrated with a Transformer architecture which processes the visual information and generates corresponding action sequences. The Transformer utilizes self-attention mechanisms to weigh the importance of different visual features, enabling the agent to effectively map user actions to specific GUI elements and their associated functionalities. The combination of SmolVLM and the Transformer allows ShowUI-π to understand complex user requests and translate them into a series of executable actions within the graphical user interface.
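
As a rough schematic of how such a pairing might be wired (module names and dimensions here are illustrative assumptions, not the released implementation), the VLM's hidden states can serve as keys and values for a compact action expert that cross-attends to them before regressing per-step cursor coordinates:

```python
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Toy action head that cross-attends to VLM hidden states.

    `vlm_hidden` stands in for hidden states from a vision-language backbone
    (e.g., a SmolVLM-class model); dimensions and interfaces are assumptions.
    """
    def __init__(self, d_model: int = 512, n_heads: int = 8, horizon: int = 16):
        super().__init__()
        self.query = nn.Parameter(torch.randn(horizon, d_model))  # one query per future step
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_xy = nn.Linear(d_model, 2)  # predict (x, y) per step

    def forward(self, vlm_hidden: torch.Tensor) -> torch.Tensor:
        # vlm_hidden: (batch, seq_len, d_model) produced by the VLM.
        batch = vlm_hidden.shape[0]
        q = self.query.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(q, vlm_hidden, vlm_hidden)
        return self.to_xy(attended)  # (batch, horizon, 2) cursor trajectory

# Usage with dummy features standing in for VLM output:
hidden = torch.randn(1, 64, 512)
trajectory = ActionExpert()(hidden)   # shape: (1, 16, 2)
```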

Flow Matching is employed by ShowUI-π as the method for generating action trajectories, offering advantages in stability and precision over conventional trajectory-generation approaches. Rather than classifying the next click position, the agent learns a time-dependent velocity field that transports samples from a simple base distribution, such as Gaussian noise, toward the distribution of desired action trajectories along an ordinary differential equation. Training reduces to regressing the predicted velocity onto a simple interpolation target between noise and the ground-truth trajectory, and at inference the learned field is integrated over a handful of steps to produce a complete action sequence. This yields smooth, natural-looking interactions, minimizes jitter, and ensures consistent performance across various GUI elements and interaction types.
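
In the simplest conditional flow-matching setup (a generic sketch under common assumptions, not ShowUI-π's actual training code; the `model(x_t, t, cond)` signature is hypothetical), the network predicts the constant velocity that carries a noise sample to the target trajectory, and inference integrates that prediction with a few Euler steps:

```python
import torch
import torch.nn as nn

def flow_matching_loss(model: nn.Module, cond: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching on action trajectories.

    target: (batch, horizon, 2) ground-truth cursor trajectories.
    cond:   conditioning features, e.g. hidden states from the VLM.
    """
    noise = torch.randn_like(target)                     # sample from the base distribution
    t = torch.rand(target.shape[0], 1, 1)                # random interpolation time in [0, 1]
    x_t = (1.0 - t) * noise + t * target                 # linear path between noise and data
    v_target = target - noise                            # velocity of that straight path
    v_pred = model(x_t, t, cond)                         # model predicts the velocity field
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def sample_trajectory(model: nn.Module, cond: torch.Tensor, horizon: int = 16, steps: int = 10) -> torch.Tensor:
    """Integrate the learned velocity field from noise to an action trajectory."""
    x = torch.randn(cond.shape[0], horizon, 2)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0], 1, 1), i * dt)
        x = x + dt * model(x, t, cond)                   # simple Euler step along the ODE
    return x
```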

ShowUI-π enables fine-grained, closed-loop cursor control by processing task queries and visual observations with a VLM to generate intermediate hidden states, which an action expert then uses to predict actions that update the environment.

Rigorous Evaluation: Separating Signal from Noise

The ScreenDrag Dataset was utilized for evaluating ShowUI-π, as it provides a standardized benchmark specifically constructed for assessing GUI agents operating on continuous drag-based tasks. This dataset consists of a curated set of drag interactions designed to test an agent’s ability to accurately and reliably perform actions requiring sustained, precise mouse movements within a graphical user interface. The continuous nature of the tasks differentiates ScreenDrag from benchmarks focused on discrete actions, necessitating evaluation metrics sensitive to trajectory quality rather than simple action completion. The dataset’s design allows for quantitative measurement of performance across a range of drag interaction scenarios.

Performance assessment of ShowUI-π utilized both offline and online evaluation methodologies. Offline evaluation involved analyzing static trajectories generated by the agent, allowing for a detailed examination of path planning and execution without real-time interaction. Complementing this, online evaluation employed a closed-loop interaction paradigm where the agent directly interacted with the environment, providing insights into its adaptability and robustness in dynamic scenarios. This dual approach ensured a comprehensive performance characterization, capturing both the theoretical accuracy of the agent’s planning and its practical efficacy in a live operating environment.
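
A closed-loop episode, sketched here with hypothetical `env` and `agent` interfaces rather than the benchmark's real harness, alternates observation and action until the task resolves, whereas offline scoring only compares a pre-generated trajectory against a reference path:

```python
def run_closed_loop_episode(env, agent, instruction: str, max_steps: int = 50) -> bool:
    """Online evaluation: the agent acts on a fresh observation at every step.

    `env` and `agent` are placeholder interfaces; in practice the evaluation
    harness renders screenshots and applies the predicted cursor actions.
    """
    obs = env.reset(instruction)                     # initial screenshot + task query
    for _ in range(max_steps):
        action_chunk = agent.act(obs, instruction)   # predicted cursor trajectory segment
        obs, done, success = env.step(action_chunk)  # environment applies the drag
        if done:
            return success
    return False                                     # ran out of steps without finishing
```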

Performance of the ShowUI-π agent was quantitatively assessed using two primary metrics: Trajectory Error and Task Success Rate. Trajectory Error measures the deviation of the agent’s drag path from the ideal path, while Task Success Rate indicates the percentage of drag tasks completed correctly. On the ScreenDrag benchmark, ShowUI-π achieved a Task Success Rate of 26.98%, representing the proportion of instances where the agent successfully completed the designated drag operation as defined by the benchmark’s criteria. These metrics provide objective data for comparing ShowUI-π’s performance against other GUI automation agents and evaluating improvements in future iterations.
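
The two metrics can be made concrete along the following lines (a plausible reading of their definitions, since the benchmark's exact formulas are not reproduced in this article): trajectory error averages the pointwise deviation between predicted and reference paths after resampling them to a common length, and success rate is simply the fraction of completed tasks.

```python
import numpy as np

def resample(path: np.ndarray, n: int) -> np.ndarray:
    """Resample a (k, 2) polyline to n points by linear interpolation over arc length."""
    seg = np.linalg.norm(np.diff(path, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])
    targets = np.linspace(0.0, cum[-1], n)
    xs = np.interp(targets, cum, path[:, 0])
    ys = np.interp(targets, cum, path[:, 1])
    return np.stack([xs, ys], axis=1)

def trajectory_error(pred: np.ndarray, ref: np.ndarray, n: int = 100) -> float:
    """Mean pointwise distance (in pixels) between resampled predicted and reference paths."""
    p, r = resample(pred, n), resample(ref, n)
    return float(np.linalg.norm(p - r, axis=1).mean())

def task_success_rate(outcomes: list) -> float:
    """Fraction of drag tasks judged successful by the benchmark's criteria."""
    return sum(outcomes) / len(outcomes)

# Example: a predicted diagonal drag versus a slightly offset reference.
pred = np.array([[0, 0], [50, 50], [100, 100]], dtype=float)
ref = np.array([[0, 5], [50, 55], [100, 105]], dtype=float)
print(trajectory_error(pred, ref))   # ~5.0 pixels
```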

The ScreenDrag pipeline automatically generates GUI interaction data by parsing UI elements, proposing drag instructions with expected metadata changes via an LLM, and verifying those changes through trajectory synthesis and rule-based validation.
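
One way such rule-based verification might look (purely illustrative; the element names, metadata fields, and tolerance below are assumptions, not the pipeline's actual rules) is a check that the parsed UI metadata changed the way the LLM-proposed instruction promised:

```python
def verify_drag(before: dict, after: dict, expected: dict, tol: float = 0.05) -> bool:
    """Check that a synthesized drag produced the metadata change the instruction promised.

    `before`/`after` are parsed UI states keyed by element id, and `expected`
    maps element ids to target property values, e.g. {"volume_slider": {"value": 0.8}}.
    All element and property names here are hypothetical.
    """
    for element_id, props in expected.items():
        for prop, target in props.items():
            prior = before.get(element_id, {}).get(prop)
            actual = after.get(element_id, {}).get(prop)
            if actual is None or actual == prior:
                return False                          # element missing, or the drag had no effect
            if isinstance(target, (int, float)):
                if abs(actual - target) > tol:        # numeric properties: tolerance check
                    return False
            elif actual != target:                    # categorical properties: exact match
                return False
    return True

# Example: the instruction asked to drag the volume slider from 0.2 to 0.8.
before = {"volume_slider": {"value": 0.2}}
after = {"volume_slider": {"value": 0.79}}
print(verify_drag(before, after, {"volume_slider": {"value": 0.8}}))  # True
```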

Beyond the Benchmarks: Implications for a More Automated Future

Evaluations demonstrate ShowUI-π’s robust capabilities, consistently exceeding the performance of established baseline agents, including Gemini-2.5-CUA, OpenCUA, and Operator, across a suite of standardized metrics. This outperformance isn’t limited to a single area; the agent showcases superior results in crucial aspects of GUI automation, indicating a fundamental advancement in its ability to interpret and interact with user interfaces. Specifically, its success rate in completing tasks within a closed-loop environment consistently surpasses that of competing models, highlighting a marked improvement in task completion and reliability. These findings suggest that ShowUI-π represents a substantial step forward in creating agents capable of autonomous GUI control and, consequently, automating a broader range of digital processes.

ShowUI-π demonstrated a statistically significant advantage over the Gemini-2.5-CUA agent in real-time task completion, achieving a 4.8% higher success rate in online, closed-loop interactions. This improvement signifies that ShowUI-π reliably navigates and completes graphical user interface tasks with greater consistency when operating within a dynamic, interactive environment. The closed-loop setup, where the agent receives immediate feedback from its actions, highlights ShowUI-π’s robust ability to adapt and correct course during task execution, leading to a demonstrably higher rate of successful outcomes compared to its competitor and suggesting enhanced resilience in practical application scenarios.

Evaluations reveal ShowUI-π demonstrates remarkable precision in graphical user interface (GUI) automation, achieving an average trajectory error (ATE) of just 159.05 pixels. This metric signifies the average deviation between the agent’s intended cursor path and its actual movement, highlighting its ability to closely follow desired actions. Crucially, the agent attains a trajectory endpoint accuracy of 78.55%, representing the highest performance among all models tested and indicating a strong capacity to reliably reach specified target locations within the GUI. These results suggest ShowUI-π not only navigates interfaces effectively, but also executes actions with a level of accuracy that surpasses existing approaches to automated GUI interaction.

ShowUI-π distinguishes itself through its capacity to seamlessly integrate both discrete and continuous action spaces, a capability that dramatically expands the scope of automatable graphical user interface (GUI) tasks. Traditional automation methods often struggle with the nuanced control required for actions like dragging, scrolling, or precisely positioning elements – movements demanding continuous adjustments. By effectively managing these continuous actions alongside discrete selections, such as button clicks or menu choices, ShowUI-π unlocks the potential to automate previously intractable processes. This versatility extends beyond simple task completion, enabling the agent to handle complex GUI interactions requiring fine motor control and adaptive behavior, paving the way for broader applications in robotic process automation and assistive technologies.

The development of ShowUI-π extends beyond simple performance benchmarks, offering considerable promise for advancements in multiple fields. Accessibility stands to benefit significantly, as the agent’s capacity to navigate graphical user interfaces autonomously could empower individuals with motor impairments to interact with computers more effectively. Simultaneously, the technology facilitates broader applications in robotic process automation, enabling the streamlining of repetitive digital tasks and freeing human workers for more complex endeavors. Perhaps most profoundly, this research informs the design of future user interfaces; by demonstrating an agent’s ability to interpret and respond to GUI elements, developers can create systems that are not only functional but also more intuitive and user-friendly, ultimately bridging the gap between human intention and machine execution.

This data-driven closed-loop online evaluation process allows models to continuously act on observations without requiring complex operating system or software setups, thereby improving reproducibility.

The pursuit of increasingly sophisticated GUI automation, as demonstrated by ShowUI-π, feels predictably iterative. This paper presents a new benchmark and a flow-based model attempting ‘dexterous manipulation’ of screen elements, but one suspects each refinement simply adds another layer to the eventual tech debt. Yann LeCun once stated, “Backpropagation is the dark art of deep learning.” This sentiment applies equally well to crafting increasingly complex agents for tasks like dragging and drawing; the elegance of the architecture will inevitably collide with the messy reality of production environments. The ScreenDrag benchmark may highlight improvements, but it won’t prevent the inevitable ‘we’ll fix it later’ moment when deployed in a real-world application.

The Inevitable Friction

ShowUI-π represents a predictable escalation. The pursuit of ‘dexterous’ GUI agents will inevitably encounter the same limitations as any effort to model complex, messy systems. The ScreenDrag benchmark, while a useful artifact, merely defines a local maximum of solvable problems. Production environments do not offer curated trajectories; they offer edge cases, unexpected window arrangements, and the baffling decisions of users. Tests, as always, are a form of faith, not certainty.

The logical extension of this work isn’t simply ‘more drag’. It’s acknowledging the inherent cost of abstraction. Flow-based models offer a compelling structure, but each node represents a potential point of failure. Scaling these systems will demand not just algorithmic refinement, but a ruthless pragmatism regarding what can be automated reliably. The goal shouldn’t be to replicate human dexterity, but to surpass human tolerance for frustrating, repetitive tasks – a significantly lower bar.

Ultimately, the true measure of success won’t be elegant trajectory generation. It will be the number of Monday mornings saved from inexplicable UI failures. One suspects that metric will remain stubbornly elusive.


Original article: https://arxiv.org/pdf/2512.24965.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
