Author: Denis Avetisyan
Researchers have developed a wearable assistant that proactively guides users through procedural tasks using audio and motion sensing.

This work presents a privacy-preserving, edge-computing approach leveraging IMU and audio processing with LoRA finetuning for real-time activity recognition and proactive conversational assistance.
Existing real-time conversational assistants for procedural tasks commonly rely on computationally expensive and privacy-compromising video input. This work introduces a ‘Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU’ that uniquely leverages lightweight audio and inertial measurement unit (IMU) data from wearable devices to understand user context and provide step-by-step guidance. Through a novel User Whim Agnostic (UWA) LoRA finetuning method, we demonstrate a >30% improvement in F-score and a 16x speedup, enabling efficient edge device implementation. Could this approach unlock a new paradigm for accessible, privacy-preserving assistance in complex manual procedures?
The Inevitable Shift: From Cloud Dependency to On-Device Intelligence
Contemporary artificial intelligence assistants frequently depend on a connection to remote servers, or the cloud, to process requests and deliver responses. This architecture introduces noticeable latency – a delay between a user’s query and the assistant’s reply – hindering truly seamless interaction. More significantly, reliance on cloud connectivity raises substantial privacy concerns, as sensitive user data must be transmitted and stored on external servers. Each request potentially exposes personal information to interception or unauthorized access, prompting questions about data security and control. The need for faster response times and enhanced privacy is driving the development of alternative approaches that prioritize on-device data processing, keeping information secure and readily available without external dependencies.
Truly intelligent assistance transcends the limitations of reactive systems; it necessitates anticipating user needs before they are even articulated. This shift demands a fundamental change in where artificial intelligence operates. Relying on cloud-based processing introduces unacceptable delays and compromises user privacy, hindering the seamless and immediate support crucial for proactive functionality. Instead, the computational burden must shift to the device itself – wearables, smartphones, and other edge computing platforms. On-device processing enables real-time analysis of sensor data, contextual awareness, and personalized predictions, allowing AI to offer timely suggestions, automate tasks, and provide support without waiting for network connectivity or explicit commands. This move toward decentralized, proactive AI is not simply about convenience; it’s about creating a genuinely intuitive and helpful digital experience that seamlessly integrates into daily life.
The realization of truly intelligent, proactive assistance hinges on deploying sophisticated AI models directly onto wearable devices. This presents a formidable challenge, as these devices are fundamentally resource-constrained – limited in processing power, memory, and battery life. Consequently, researchers are actively pushing the boundaries of edge AI, developing novel techniques in model compression, quantization, and efficient neural network architectures. These innovations aim to drastically reduce the computational demands of AI algorithms without significant performance degradation. Success in this area will unlock a future where personalized assistance is always available, operates with minimal latency, and crucially, preserves user privacy by eliminating the need to transmit data to the cloud for processing.

Decoding Intent: Multi-Sensor Data and Activity Recognition
The system infers user activities by integrating data streams from two primary sensor types: audio and Inertial Measurement Units (IMUs). IMUs, which include accelerometers and gyroscopes, provide data regarding the device’s motion and orientation in space. Simultaneously, audio input captures environmental sounds and user-generated vocalizations. This multi-sensor approach allows for a more comprehensive understanding of user behavior than relying on a single data source, as audio can corroborate or clarify motion-based activity recognition, and vice versa. The raw data from both sensor types is pre-processed and then fed into a dedicated neural network for activity classification.
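To make that pipeline concrete, below is a minimal sketch of a two-stream fusion classifier in PyTorch. The window length, channel counts, feature dimensions, and class count are illustrative assumptions; the paper does not publish its exact architecture.

```python
# A minimal sketch of the two-stream audio+IMU fusion described above.
# All sizes (window length, mel bands, class count) are assumptions.
import torch
import torch.nn as nn

class FusionActivityClassifier(nn.Module):
    def __init__(self, n_classes: int = 8):
        super().__init__()
        # IMU branch: 6 channels (3-axis accelerometer + 3-axis gyroscope)
        self.imu_net = nn.Sequential(
            nn.Conv1d(6, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        # Audio branch: 64 log-mel bands over the same time window
        self.audio_net = nn.Sequential(
            nn.Conv1d(64, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        # Late fusion: concatenate the two embeddings, then classify
        self.head = nn.Linear(32 + 32, n_classes)

    def forward(self, imu: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # imu: (batch, 6, T_imu); audio: (batch, 64, T_audio)
        fused = torch.cat([self.imu_net(imu), self.audio_net(audio)], dim=-1)
        return self.head(fused)

model = FusionActivityClassifier()
logits = model(torch.randn(1, 6, 200), torch.randn(1, 64, 100))  # one window
```

The key design point is late fusion: each modality is embedded independently, so a degraded audio signal only weakens one branch rather than corrupting the shared representation.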
The system’s core activity recognition capability is implemented through a Neural Network trained on publicly available datasets, specifically SAMoSA (Sensing Activities with Motion and Subsampled Audio) and BoxLift. SAMoSA provides data representing a range of daily activities performed in a natural setting, while BoxLift focuses on lifting and carrying actions. Utilizing these datasets allows the network to learn patterns in both accelerometer and gyroscope data from IMUs, as well as acoustic features extracted from audio input, to classify user actions. The network architecture and training parameters are optimized for real-time performance and accurate categorization of activities relevant to the system’s intended applications.
The current Activity Detection F-score is 0.63, indicating moderate performance in identifying user activities. This score is comparatively lower than those achieved for other tracked activities within the system. Analysis indicates the primary limiting factor is the size of the training dataset used for activity recognition; insufficient data volume restricts the neural network’s ability to generalize and accurately classify a wider range of activity instances. Further data collection and augmentation strategies are planned to address this limitation and improve the model’s performance on activity detection tasks.
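For reference, the F-score is the harmonic mean of precision and recall. A quick illustration with invented labels:

```python
# F-score on made-up activity-detection labels; values are illustrative only.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 1]  # 1 = target activity occurred
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]  # recognizer output
print(f1_score(y_true, y_pred))    # 2 * P * R / (P + R), about 0.67 here
```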
The system’s ability to integrate audio and Inertial Measurement Unit (IMU) data significantly improves activity detection reliability in challenging acoustic conditions. This multi-sensor approach mitigates the impact of ambient noise and signal interference, allowing for more accurate classification of user actions. Specifically, the combination of auditory cues – such as speech or tool use – with motion data from the IMUs provides redundant information, enhancing robustness. This enables the system to maintain accurate contextual awareness even when audio signals are degraded or obscured, resulting in a more dependable user experience.
Qwen2.5: Tailoring Language Models for Proactive Assistance
The conversational core of the system utilizes the Qwen2.5 series of large language models, chosen for their capacity to facilitate natural language understanding and generation. These models are designed to process and respond to user inputs in a manner that mimics human conversation, enabling engaging and intuitive interactions. Qwen2.5’s architecture allows for complex contextual understanding, supporting multi-turn dialogues and personalized responses. The selection of this model family prioritizes fluency, coherence, and the ability to maintain consistent conversational threads, crucial for a wearable assistive device.
LoRA (Low-Rank Adaptation) and UWA (User Whim Agnostic) finetuning techniques are implemented to tailor the Qwen2.5 base models to the constraints of wearable devices and the inherent variability of user interactions. LoRA reduces the number of trainable parameters by learning low-rank matrices that represent changes to the model’s weights, minimizing computational demands and storage requirements. UWA finetuning further hardens the models against diverse and unpredictable user inputs, training them to remain useful regardless of how a user deviates from the expected procedure. This combined approach allows for efficient adaptation to the unique characteristics of wearable computing environments and enhances the model’s robustness in real-world scenarios.
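As a rough illustration of the LoRA half of this recipe, the sketch below attaches low-rank adapters to a small Qwen2.5 checkpoint using Hugging Face PEFT. The model size, rank, and target modules are assumptions chosen for illustration, and the UWA objective itself, being the paper's own contribution, is not reproduced here.

```python
# Minimal LoRA setup with Hugging Face PEFT. Hyperparameters are assumed;
# the paper's UWA ("User Whim Agnostic") training objective is not shown.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

config = LoraConfig(
    r=8,                                  # rank of the low-rank updates
    lora_alpha=16,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base
```

Because only the adapter matrices are trained, the update that must be stored and loaded on-device is a small fraction of the full model's weights.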
Finetuning the Qwen2.5 models yielded a substantial increase in True Negative Rate (TNR), a metric indicating the model’s ability to correctly identify situations where user assistance is not required and, consequently, refrain from generating a response. This improvement demonstrates the model’s enhanced capacity to avoid unnecessary or intrusive interventions. A higher TNR is critical for wearable applications, minimizing distractions and preserving battery life by ensuring the assistant remains silent unless a genuine need for support is detected. The finetuning process effectively taught the model to discriminate between scenarios requiring action and those where maintaining silence is the optimal behavior.
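TNR itself is simple to compute from the assistant's silence decisions. A minimal illustration with invented labels:

```python
# True Negative Rate on "should the assistant stay silent?" decisions.
# Labels are invented for illustration: 0 = no help needed, 1 = help needed.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 0, 0]

tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
print(tn / (tn + fp))  # TNR: fraction of quiet moments correctly left alone
```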
Finetuning the Qwen2.5 models resulted in measurable gains in instruction entailment, a metric assessing the logical connection between user requests and the assistant’s generated responses. Specifically, improved entailment scores indicate the finetuned models are more capable of producing instructions that are directly relevant to and logically follow from the user’s preceding conversation and identified context. This enhancement signifies a move towards higher quality responses, reducing instances of off-topic or unhelpful guidance and increasing the overall utility of the conversational assistant by ensuring generated instructions directly address the user’s needs as understood from the conversation history.
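One plausible way to score such entailment offline is with an off-the-shelf natural language inference model acting as a judge. The snippet below uses the public roberta-large-mnli checkpoint as a stand-in; the paper's exact metric may be computed differently.

```python
# Hypothetical entailment check: does the generated instruction follow from
# the conversational context? roberta-large-mnli is a public NLI checkpoint
# used here purely as an illustrative judge.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")
premise = ("The user asked which screws attach the table legs; "
           "step 3 of the manual specifies M6 bolts.")
hypothesis = "Fasten each leg to the tabletop using the M6 bolts."
print(nli({"text": premise, "text_pair": hypothesis}))  # expect ENTAILMENT
```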
The assistant’s proactive functionality is achieved through contextual awareness derived from two primary data sources: real-time activity recognition and the user’s conversation history. The system identifies the user’s current activity – such as walking, running, or stationary – and uses this information to anticipate potential needs. Simultaneously, the assistant maintains a record of previous interactions, allowing it to understand user preferences and tailor responses accordingly. This combined data enables the assistant to offer relevant suggestions or information without explicit prompts, providing a more intuitive and efficient user experience. The system prioritizes assistance only when deemed necessary based on both activity context and conversational history, preventing unnecessary interruptions.
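A hedged sketch of that gating logic follows; the context fields, confidence threshold, and deviation rule are assumptions rather than the paper's actual design.

```python
# Illustrative gating logic: intervene only when activity context and
# conversation history jointly suggest the user needs help.
from dataclasses import dataclass, field

@dataclass
class AssistantContext:
    current_activity: str               # output of the audio+IMU recognizer
    activity_confidence: float          # recognizer confidence in [0, 1]
    history: list = field(default_factory=list)  # recent dialogue turns

def should_intervene(ctx: AssistantContext, expected_activity: str) -> bool:
    """Speak up only when context and history jointly justify it."""
    confident = ctx.activity_confidence >= 0.7   # assumed threshold
    deviated = ctx.current_activity != expected_activity
    # Don't repeat a hint delivered within the last few turns.
    already_said = f"hint:{expected_activity}" in ctx.history[-3:]
    return confident and deviated and not already_said
```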
The Qwen2.5 models are deployed directly on the Qualcomm Dragonwing IQ9 processor to facilitate real-time performance and data privacy. This on-device execution bypasses the need for cloud connectivity, minimizing latency and ensuring consistent functionality even without an internet connection. The Dragonwing IQ9’s architecture provides the necessary computational resources for efficient model inference, enabling the assistant to respond quickly to user requests and maintain a seamless interactive experience. This approach also enhances user privacy by keeping all conversational data and processing localized on the wearable device.
Validation Through Action: The Table Assembly Task and Beyond
The Table Assembly Task served as a key evaluation metric due to its inherent complexity as a procedural manual task. This task requires the accurate execution of multiple, sequentially dependent steps, demanding precise instruction following and error recovery. The assembly process was chosen to simulate a real-world scenario where users require step-by-step guidance, and any deviation from the correct procedure can lead to failure or necessitate corrective action. Performance was assessed by measuring the assistant’s ability to accurately guide a user through each stage of the assembly, track progress, and provide relevant assistance when errors occurred, effectively testing its capabilities in a complex, multi-step environment.
To rigorously evaluate the assistant’s performance in a simulated environment, realistic user activity logs were generated utilizing GPT-4o. These logs detailed a sequence of actions a user might take during a table assembly procedure, including both correct and erroneous steps. The generated logs served as input to the assistant, allowing for assessment of its ability to provide timely and accurate guidance based on anticipated user behavior. By simulating a range of user interactions – encompassing successful step completions, hesitations, and mistakes – we created a robust testing framework independent of direct human evaluation during initial stages of development and iteration.
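A minimal sketch of such log generation with the OpenAI Python client is shown below; the prompt wording and log format are assumptions, as the paper does not publish its exact prompting setup.

```python
# Hypothetical prompt for synthesizing a table-assembly activity log.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Simulate a timestamped activity log of a user assembling a table "
            "from a 10-step manual. Include two mistakes (a skipped step and "
            "a wrong screw) and one hesitation. One event per line."
        ),
    }],
)
print(response.choices[0].message.content)
```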
The Step Tracker component is a critical element in maintaining contextual awareness during task assistance. It functions by continuously monitoring the user’s progress through a predefined procedural manual, identifying the currently active step. This synchronization enables the assistant to deliver guidance specifically relevant to that step, avoiding premature or irrelevant instructions. The component utilizes activity log data – simulating user actions – to determine the current step and proactively offer assistance, ensuring the user remains on track and receives timely support. Accurate step identification is achieved through a combination of activity log analysis and comparison against the procedural manual’s defined steps, thereby maximizing the effectiveness of the assistance provided.
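The following is a hedged sketch of such a tracker: advance when the recognized activity matches the active step, flag a deviation otherwise. The matching rule is an assumption; the actual component may be considerably richer.

```python
class StepTracker:
    """Track progress through an ordered procedural manual (illustrative)."""

    def __init__(self, manual_steps: list[str]):
        self.steps = manual_steps
        self.index = 0  # currently active step

    @property
    def current_step(self) -> str:
        return self.steps[self.index]

    def observe(self, activity: str) -> str:
        """Consume one recognized activity and report tracker status."""
        if activity == self.current_step:
            self.index = min(self.index + 1, len(self.steps) - 1)
            return f"completed: {activity}; next: {self.current_step}"
        return f"deviation: expected '{self.current_step}', saw '{activity}'"

tracker = StepTracker(["attach legs", "insert screws", "tighten screws"])
print(tracker.observe("attach legs"))     # advances to "insert screws"
print(tracker.observe("tighten screws"))  # out of order, flagged
```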
User Whim Agnostic (UWA) finetuning demonstrably improved the assistant’s ability to accurately retrieve and present crucial procedural instructions. Specifically, UWA training resulted in a statistically significant increase in the recall of key steps within the table assembly procedure, as measured by precision and recall metrics on a held-out validation set. Furthermore, the model exhibited enhanced performance in identifying and responding to simulated user errors, enabling more effective mistake-correction dialogues. This improvement is attributed to UWA’s ability to refine the model’s grasp of task-relevant information and appropriate response generation even when user behavior departs from the expected script.
Human evaluations of the assistant’s performance during the Table Assembly Task utilized a rating scale of 1 to 4, with 4 representing the highest level of satisfaction. Results from these evaluations consistently yielded high scores across all measured metrics, indicating a satisfactory user experience. Specifically, evaluators reported the assistant’s guidance was generally clear, helpful, and appropriately timed, contributing to successful task completion. The consistency of high ratings suggests the implemented UWA finetuning and Step Tracker component effectively addressed usability concerns and facilitated a positive interaction for users attempting the procedural task.
Evaluation of the assistant during a Table Assembly Task, utilizing generated activity logs and a Step Tracker component, confirms the technical viability of providing proactive, on-device assistance for complex, multi-step procedures. This assistance, enhanced through User Whim Agnostic (UWA) finetuning, demonstrably improves recall of critical instructions and facilitates effective error correction. Human evaluations yielded high satisfaction ratings, suggesting a pathway towards a more intuitive and seamless user experience for tasks requiring detailed, sequential guidance without reliance on external resources.

Beyond Assistance: Envisioning a Future of Seamless Integration
Continued development centers on broadening the scope of tasks the assistant can perform, moving beyond pre-defined parameters to encompass a more dynamic range of user needs. This necessitates a significant emphasis on robustness; the system will be engineered to gracefully manage unanticipated inputs and novel situations, leveraging techniques like few-shot learning and reinforcement learning from human feedback. Researchers aim to equip the assistant with the capacity to not simply respond to requests, but to proactively interpret context and adapt its behavior, thereby minimizing user intervention and maximizing seamless integration into daily life. The ultimate goal is an intelligent companion capable of handling the unpredictable nature of real-world interactions with increased reliability and efficiency.
Efforts to broaden accessibility are concentrating on streamlining how these complex AI models function on devices with limited processing power and memory. Utilizing the Open Neural Network Exchange (ONNX) format allows the model to operate efficiently across a wider range of hardware platforms, bypassing vendor-specific restrictions. Simultaneously, research into advanced compression techniques, including quantization and pruning, aims to significantly reduce the model’s size without substantial performance degradation. These optimizations are crucial for deploying the technology on smartphones, embedded systems, and other resource-constrained devices, ultimately bringing sophisticated AI capabilities to a far wider audience and fostering truly ubiquitous intelligent assistance.
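A minimal sketch of that export-and-compress path: serialize a toy PyTorch model to ONNX, then apply dynamic int8 quantization with ONNX Runtime. The filenames and the model itself are illustrative.

```python
import torch
import torch.nn as nn
from onnxruntime.quantization import quantize_dynamic, QuantType

# Toy stand-in for a real classifier head; the deployed models are far larger.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8)).eval()
torch.onnx.export(model, torch.randn(1, 64), "classifier.onnx")

# Weights are stored as int8; activations are quantized on the fly.
quantize_dynamic("classifier.onnx", "classifier.int8.onnx",
                 weight_type=QuantType.QInt8)
```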
The convergence of on-device artificial intelligence with cloud computing promises a shift from reactive to proactive assistance. Future systems will not simply respond to commands, but will leverage locally processed data combined with the vast resources of the cloud to anticipate user needs before they are explicitly stated. This synergy allows for personalized experiences, adapting to individual patterns and preferences while benefiting from continuously updated knowledge and complex reasoning capabilities hosted remotely. The result is an intelligent companion capable of streamlining daily tasks, offering timely suggestions, and ultimately empowering individuals to accomplish more through seamless, intuitive interactions – a future where technology fades into the background, augmenting human potential rather than demanding constant attention.
The culmination of this research suggests a paradigm shift in human-computer interaction, moving beyond reactive commands to a proactive and anticipatory relationship with technology. This isn’t simply about faster processing or more accurate responses; it envisions a future where digital tools seamlessly integrate into daily life, understanding user intent and offering assistance before it’s explicitly requested. Such a transition promises not only increased efficiency in task completion but also a fundamental empowerment of individuals, allowing them to focus on creativity and complex problem-solving rather than navigating technological limitations. The potential extends to accessibility, offering intuitive interfaces for users of all abilities, and ultimately redefining the very nature of how people engage with the digital world.
The pursuit of a proactive conversational assistant, as detailed in this work, highlights an inherent truth about all systems: they are temporary accommodations against entropy. The research focuses on computationally efficient methods – LoRA finetuning and edge computing – to preserve functionality over time, but these are merely strategies to delay the inevitable degradation. As Donald Knuth observed, “Premature optimization is the root of all evil,” but equally true is that all optimization is eventually overcome. The system described here, leveraging audio and IMU data, isn’t about achieving perpetual stability, but about extending graceful operation within the constraints of a changing environment and limited resources, acknowledging that even the most elegant design will eventually succumb to the pressures of time.
What Lies Ahead?
This work, while demonstrating a functional confluence of audio, inertial measurement, and localized adaptation, merely postpones the inevitable entropy of any complex system. The current architecture, reliant on finetuned models, carries the weight of the past – every parameter a decision made in a transient data landscape. Future iterations must acknowledge this; true longevity will not come from perpetually larger models, but from architectures that gracefully degrade, that prioritize continued function over absolute fidelity.
The emphasis on edge computing is a strategically sound, if obvious, maneuver. Preserving user privacy is less a feature and more a recognition that centralized data silos are, by their nature, unsustainable. However, the field faces a critical juncture: can robust activity recognition be achieved without an increasingly granular understanding of intent? Or will the pursuit of predictive accuracy inevitably lead to systems that anticipate, and therefore constrain, the very behaviors they are designed to support?
Ultimately, the true measure of success will not be the system’s immediate performance, but its capacity to adapt to unforeseen circumstances. Slow change, informed by continuous observation and iterative refinement, preserves resilience. Every abstraction carries a cost, and the challenge lies in minimizing that cost over the lifespan of the system – a lifespan that, like all things, will ultimately succumb to the relentless march of time.
Original article: https://arxiv.org/pdf/2602.15707.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/