Seeing is Sorting: AI-Powered Textile Recognition for Automation

Author: Denis Avetisyan


A new approach leverages digital twins and advanced visual AI to dramatically improve the accuracy and efficiency of automated textile sorting systems.

The textile inspection process is formalized as a flowchart, enabling systematic evaluation and categorization of fabric defects to ensure quality control.

This review evaluates nine visual language models within a digital twin framework for garment classification and foreign object detection in robotic textile handling.

Achieving robust automation in textile recycling remains challenging due to the inherent deformability of garments and the need to identify foreign objects in cluttered environments. This work, ‘Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems’, presents a robotic sorting system integrating a digital twin, multimodal perception, and semantic reasoning via Visual Language Models (VLMs). Benchmarking nine VLMs across 223 scenarios, the Qwen model family consistently achieved the highest accuracy (up to 87.9%) in both garment classification and foreign object detection. Could this approach pave the way for scalable, autonomous textile sorting solutions in realistic industrial settings and contribute to a more circular economy?


The Inherent Challenges of Automated Garment Logistics

The fashion industry’s reliance on manual garment sorting represents a significant operational hurdle, consistently creating bottlenecks within supply chains and driving up labor costs. Despite advancements in logistics, the sheer volume of clothing requiring individual inspection, folding, and categorization necessitates a vast workforce. This process isn’t simply about speed; it demands nuanced judgment to identify garment types, sizes, colors, and potential defects. Consequently, facilities often struggle to keep pace with demand, especially during peak seasons, and are heavily burdened by the expense of maintaining a large, skilled labor pool. The economic implications are substantial, prompting a search for automated solutions that can mitigate these challenges and improve overall efficiency in apparel handling.

Automated garment handling presents a significant challenge to current robotic systems due to the inherent variability in fabric properties, garment shapes, and unpredictable states of presentation. Existing computer vision algorithms often falter when confronted with crumpled, overlapping, or partially obscured clothing, leading to misclassifications and failed manipulations. Furthermore, the ability to reliably detect and avoid foreign objects – such as pens, receipts, or other items inadvertently mixed with garments – remains a critical limitation. While advancements in deep learning have improved object recognition, these systems often require extensive training datasets specific to each garment type and struggle to generalize to unseen variations or unexpected obstructions. Consequently, robust and adaptable vision and manipulation capabilities are essential to bridge the gap between laboratory demonstrations and real-world deployment in dynamic and unstructured environments.

The logistical complexities of garment handling necessitate robotic systems exhibiting a high degree of adaptability. Unlike structured manufacturing environments, clothing presents immense variability in material, shape, and flexibility – a crumpled t-shirt differs drastically from a stiff denim jacket. Consequently, a robot designed for one garment type often falters with another. Furthermore, automated systems must contend with unexpected obstacles – stray tags, misplaced items, or even entirely foreign objects mixed within the clothing stream – which demand real-time replanning and robust error recovery. Achieving true automation, therefore, hinges not merely on identifying and grasping garments, but on developing intelligent systems capable of dynamically adjusting to unpredictable conditions and a constantly changing array of textures, forms, and obstructions.

Achieving fully automated garment handling hinges on a triumvirate of robotic competencies: accurate classification, precise object localization, and reliable grasp planning. Systems must first identify what an item is – a t-shirt versus trousers, for instance – and then pinpoint its exact position and orientation within a cluttered environment. This localization isn’t merely about detecting the garment’s presence; it demands a detailed understanding of its shape, potential deformations, and overlap with other objects. Finally, a robust grasp plan must be formulated, accounting for the garment’s material properties, weight distribution, and the need to avoid damage during manipulation. Without seamless integration of these three capabilities, robotic systems will struggle to consistently and efficiently process the unpredictable variety inherent in real-world garment handling scenarios, limiting their practical application and economic viability.

The models successfully identified garments in a cluttered environment containing additional clothing and distracting items, as demonstrated by their responses aligning with the ground truth.

Vision-Language Models: The Algorithmic Core of Garment Sorting

Visual Language Models (VLMs) facilitate semantic understanding of unstructured garment piles by processing visual data and associating it with learned language representations. This allows the robotic system to not only categorize garment types – such as t-shirts, pants, or socks – but also to identify foreign objects present within the pile, like plastic bottles, metal cans, or other non-garment items. The VLM achieves this through training on large datasets of images paired with textual descriptions, enabling it to recognize objects and their attributes based on visual features and contextual understanding, which then informs the robot’s manipulation and sorting decisions.

Vision Transformers (ViTs) form the foundational architecture for our Visual Language Models (VLMs), processing visual input through a self-attention mechanism that identifies relationships between image patches. This contrasts with convolutional neural networks, which rely on local receptive fields. The ViT decomposes an image into a sequence of patches, embedding each patch and feeding the sequence into a transformer encoder. The resulting embeddings represent a global understanding of the image content. These embeddings are then used to generate actionable insights for the robotic system, including garment type classification and the identification of foreign objects within the garment pile, enabling the robot to determine appropriate manipulation strategies.
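The patchification step described above can be illustrated with a toy sketch: a small image grid is cut into non-overlapping patches and flattened into a token sequence. All sizes and values here are arbitrary; a real ViT would additionally project each patch through a learned linear embedding before the transformer encoder.

```python
# Toy illustration of ViT patch decomposition. Patch size and the 4x4
# "image" are invented for demonstration only.

def patchify(image, patch=2):
    """Split a 2D grid (H x W) into non-overlapping flattened patches."""
    h, w = len(image), len(image[0])
    patches = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            patches.append([image[i + di][j + dj]
                            for di in range(patch) for dj in range(patch)])
    return patches

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
print(patchify(img))  # 4 patches: [[1, 2, 5, 6], [3, 4, 7, 8], ...]
```

Each flattened patch then plays the role of one token in the transformer's input sequence.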

Our system utilizes Visual Language Models (VLMs) to concurrently address garment classification and the detection of unexpected items within the garment stream. This is achieved through a unified framework where the VLM processes visual input to categorize garments based on predefined classes – such as shirt, pants, or sock – and simultaneously identifies objects that do not belong to these established categories, performing what is known as zero-shot foreign object detection. This capability eliminates the need for prior training data on potential foreign objects; the VLM leverages its understanding of semantic relationships to recognize anomalies based on contextual reasoning. The VLM’s output provides both a classification label for garments and a detection signal indicating the presence of non-garment items, enabling the robotic system to respond appropriately.
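The unified classification-plus-detection query described above could be structured roughly as follows. This is a minimal sketch: the prompt wording, category list, response format, and parsing logic are all assumptions for illustration, not the paper's actual protocol, and the VLM reply is mocked rather than produced by a real model.

```python
# Hypothetical prompt/response pattern for combined garment classification
# and zero-shot foreign object detection. Category names and the response
# convention ("foreign_object=yes/no") are invented placeholders.

GARMENT_CLASSES = ["t-shirt", "pants", "sock", "jacket"]

def build_prompt(classes):
    """Compose a single query covering both tasks."""
    return (
        "Classify the garment in the image as one of: "
        + ", ".join(classes)
        + ". If any non-garment item (e.g. a pen or bottle) is visible, "
        "append '; foreign_object=yes', otherwise '; foreign_object=no'."
    )

def parse_response(text, classes):
    """Extract (garment_class, foreign_object_present) from a VLM reply."""
    label = next((c for c in classes if c in text.lower()), "unknown")
    foreign = "foreign_object=yes" in text.lower()
    return label, foreign

# Example with a mocked VLM reply:
reply = "This is a t-shirt; foreign_object=yes"
print(parse_response(reply, GARMENT_CLASSES))  # ('t-shirt', True)
```

The key point is that one query yields both outputs, so no separate detector trained on foreign-object examples is needed.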

Integrating Visual Language Models (VLMs) with robotic manipulation enables a sorting process that dynamically adjusts to variations in input. Unlike traditional robotic sorting systems reliant on pre-programmed parameters and rigid object recognition, VLM integration allows the robot to interpret visual data and contextual cues to identify and categorize garments and foreign objects without requiring explicit retraining for each new item. This adaptability stems from the VLM’s ability to generalize from learned concepts, facilitating successful manipulation of previously unseen objects and handling variations in garment appearance, pose, and pile configuration. Consequently, the system demonstrates increased robustness and efficiency in unstructured environments compared to systems employing fixed algorithms.

Our framework utilizes two UR7e robots equipped with Robotiq grippers and CapTac fingertips, alongside two desktop PCs with Nvidia RTX 3060 graphics cards, a professional Nvidia H200X, and two Intel Realsense cameras to enable grasp detection and object classification.

A Complete Robotic System for Precise Garment Manipulation

The robotic system employs a Universal Robots UR7e collaborative robot for garment handling due to its six degrees of freedom and payload capacity, enabling complex manipulation tasks. Operationally, the UR7e receives guidance from a Visual Language Model (VLM), which provides real-time data regarding garment location, orientation, and type. This VLM-derived information is utilized to generate precise robot trajectories for picking, placing, and sorting garments, ensuring accurate and reliable manipulation throughout the handling process. The integration of the VLM minimizes the need for pre-programmed paths and allows the robot to adapt to variations in garment presentation and environmental conditions.

The robotic system incorporates a Digital Twin environment built on the MoveIt motion planning framework. This virtual replica of the physical workspace enables pre-programmed trajectory simulation and optimization prior to execution on the UR7e robot. Utilizing MoveIt’s capabilities, the system can computationally assess potential collision scenarios, refine grasping approaches, and minimize cycle times. This pre-validation reduces the risk of physical errors, improves operational efficiency, and allows for rapid adaptation to changes in garment presentation or system configuration without requiring physical re-programming or testing.

Capacitive tactile sensors are integrated into the robotic system to provide real-time feedback regarding the success of a grasp and to detect potential object loss during manipulation. These sensors measure changes in capacitance caused by physical contact, allowing the system to determine if an object is securely held. Data from the tactile sensors is used to trigger adaptive grasping behaviors; for example, the robot can adjust its grip force or re-grasp an object if slippage is detected. This feedback loop significantly improves the system’s robustness by mitigating failures caused by imperfect initial grasps or disturbances during handling, ultimately increasing the reliability of the garment sorting process.
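The slip-triggered grip adjustment described above can be sketched as a simple feedback loop. The sensor readings, threshold, and force increment below are invented for illustration; the real system's control parameters and sensor interface are not published.

```python
# Minimal sketch of a tactile feedback loop: when the normalized capacitance
# signal drops below a slip threshold, grip force is increased up to a cap.
# All constants are illustrative assumptions.

SLIP_THRESHOLD = 0.2   # normalized reading below which contact is weakening
MAX_FORCE = 1.0        # normalized gripper force ceiling

def adjust_grip(readings, force=0.4, step=0.1):
    """Raise grip force once per detected slip event, capped at MAX_FORCE."""
    for r in readings:
        if r < SLIP_THRESHOLD:          # contact weakening: object slipping
            force = min(force + step, MAX_FORCE)
    return force

# Two slip events (0.1 and 0.15) each raise the force by one step.
print(adjust_grip([0.5, 0.1, 0.4, 0.15]))
```

In practice such a loop would run continuously during transport, with a re-grasp attempt triggered if the force cap is reached without regaining stable contact.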

Evaluation of nine Visual Language Models (VLMs) from five distinct model families was conducted to determine optimal garment and foreign object identification accuracy. Testing encompassed 219 garments and non-garment items, with the Qwen model family consistently demonstrating superior performance compared to other evaluated models. This rigorous testing process confirmed the Qwen family's reliability in accurately classifying objects, which is critical for the robotic system's sorting capabilities and overall operational efficiency.
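Aggregating per-model accuracy over such a benchmark could look like the sketch below. The model names and trial outcomes are fabricated placeholders; only the headline figure (Qwen at up to 87.9%) comes from the article itself.

```python
# Sketch of per-model accuracy aggregation across benchmark scenarios.
# Trial data here is invented for demonstration.
from collections import defaultdict

def accuracy_by_model(trials):
    """trials: iterable of (model_name, correct: bool) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for model, correct in trials:
        totals[model] += 1
        hits[model] += int(correct)
    return {m: hits[m] / totals[m] for m in totals}

trials = [("qwen", True), ("qwen", True), ("qwen", False),
          ("other", True), ("other", False)]
print(accuracy_by_model(trials))
```

A real benchmark would additionally separate accuracy by task (garment classification vs. foreign object detection), since a model can trade off one against the other.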

The experimental setup is mirrored by a corresponding digital twin visualization in RViz, enabling real-time monitoring and analysis.

Enhancing Robustness Through Precise 3D Reconstruction and Segmentation

Object segmentation plays a crucial role in enabling robots to interact with unstructured garment piles by creating a precise geometric understanding of each item. This technique doesn’t simply identify objects, but delineates their individual boundaries within the cluttered scene, effectively transforming a chaotic pile into a collection of distinct, analyzable forms. By assigning each pixel to a specific garment, the system builds a detailed representation of each item’s shape, size, and orientation. This granular level of detail is then leveraged to predict how each garment will deform under a potential grasp, allowing for more informed grasp planning and minimizing the risk of collisions that often plague robotic manipulation of deformable objects. Consequently, the robot can differentiate between layers, identify potential snag points, and ultimately, achieve a more reliable and efficient handling of the garment pile.

The detailed information derived from object segmentation serves as the foundation for comprehensive 3D reconstruction of the garment pile. By analyzing the segmented data – identifying individual garment boundaries and spatial relationships – algorithms can generate a complete three-dimensional model of the scene. This isn’t merely a visual representation; the model captures precise geometric information, including depth, volume, and surface normals. Consequently, the robot gains a nuanced understanding of the pile’s structure, allowing it to plan grasps that avoid collisions with unseen objects and navigate the complex arrangement of clothing with greater accuracy and efficiency. The resulting 3D model provides a virtual representation of the physical world, enabling the robot to perform simulated grasp attempts and refine its strategies before interacting with the actual garment pile.

The system’s ability to adapt to shifting garment piles hinges on a technique called grayscale image subtraction. By continuously comparing successive images of the pile, even minute changes in the arrangement – a slight shift, a fold, or a newly exposed edge – are highlighted. This isn’t merely about identifying that something changed, but precisely where and how. This differential information is then fed directly into the grasp planning algorithm, allowing the robot to dynamically adjust its approach. Instead of relying on a static understanding of the pile, the system essentially ‘feels’ the changes in real-time, enabling it to anticipate potential collisions and refine its grip for a more secure and reliable grasp, even with deformable and unpredictable materials.
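The differencing operation described above is simple to sketch. The example uses plain nested lists so it stays dependency-free; a real pipeline would operate on camera frames with a tuned intensity threshold, and both the threshold and the pixel values below are arbitrary.

```python
# Illustrative grayscale image subtraction for pile-change detection:
# pixels whose intensity changed by more than a threshold are flagged.
# Threshold and pixel values are invented for demonstration.

def change_mask(prev, curr, threshold=30):
    """Return a binary mask marking pixels that changed between frames."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_p, row_c)]
        for row_p, row_c in zip(prev, curr)
    ]

prev = [[100, 100], [100, 100]]
curr = [[100, 180], [100, 100]]   # one garment edge shifted
print(change_mask(prev, curr))    # [[0, 1], [0, 0]]
```

The resulting mask localizes exactly where the pile changed, which is the differential signal fed into grasp replanning.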

The integration of 3D reconstruction, object segmentation, and grayscale image subtraction dramatically improves a robot’s ability to manipulate unstructured garment piles. Previously, robots struggled with the inherent variability in clothing – differing fabrics, shapes, and the tendency to conform to the contours of the pile itself, leading to failed grasps and potential damage. By building a detailed geometric understanding of each garment and the overall pile structure, the system anticipates potential collisions and adapts its grasp planning accordingly. This approach not only increases success rates across a diverse range of clothing types, from stiff denim to delicate silk, but also minimizes common failure modes like snagging, bunching, or inadvertently pulling the wrong item, ultimately enabling more reliable and efficient automated handling of laundry and textiles.

Towards Collaborative Garment Sorting with Vision-Language-Action Models

The next phase of development centers on equipping robotic systems with the ability to seamlessly transfer garments between each other – a capability known as robot handover. This isn’t simply about passing an object; it demands precise coordination and understanding of garment properties to ensure a secure and damage-free exchange. Implementing this collaborative sorting approach envisions multiple robots working in concert, each specializing in a particular stage of the process, such as identification, folding, or categorization. The anticipated outcome is a significantly more scalable and robust garment sorting solution, capable of handling large volumes of textiles with increased speed and adaptability compared to single-robot systems. This multi-robot interaction promises to unlock a new level of automation within the textile industry, addressing current limitations in efficiency and throughput.

The core of enabling robotic collaboration in garment sorting lies in the development of sophisticated Vision-Language-Action (VLA) models. These models move beyond simple object recognition by integrating visual input with natural language instructions to determine the appropriate physical actions. Effectively, a VLA model must not only ‘see’ a garment and ‘understand’ commands like “move the blue shirt to bin three,” but also translate that understanding into the precise motor commands required for a robotic arm to grasp, lift, and place the item accurately. This necessitates a nuanced understanding of both the visual characteristics of textiles – their drape, flexibility, and potential for entanglement – and the semantics of language used to describe sorting criteria. The successful implementation of VLA models promises a bridge between high-level task specifications and the low-level control of robotic manipulators, allowing for adaptable and intelligent garment handling systems.

The envisioned future of garment sorting extends beyond the capabilities of a single robotic system, and this research actively pursues a scalable solution through multi-robot collaboration. By enabling multiple robots to work in concert, the system aims to significantly increase throughput and adapt to fluctuating demands within textile processing facilities. This collaborative approach not only addresses the limitations of single-robot setups – such as handling diverse garment types or processing large volumes – but also introduces inherent redundancy, improving the system’s robustness against individual robot failures. The development focuses on distributing tasks intelligently, allowing robots to specialize in certain actions or garment categories, ultimately creating a highly efficient and adaptable sorting pipeline capable of meeting the evolving needs of the textile industry.

The development of automated garment handling represents a significant step toward revolutionizing the textile industry, promising substantial reductions in labor costs and improvements in overall efficiency. Current methods often rely heavily on manual sorting and handling, which are both expensive and prone to error; however, fully automated systems, enabled by advances in artificial intelligence and robotics, offer a pathway to streamline these processes. These adaptable systems can respond dynamically to variations in garment types, sizes, and conditions, optimizing workflows and minimizing waste. Ultimately, this research supports the creation of resilient supply chains and positions the textile industry for increased productivity and competitiveness in a rapidly evolving global market.

The pursuit of automated textile sorting, as detailed in this study, demands an uncompromising fidelity to accuracy. The system’s reliance on Visual Language Models – specifically the Qwen family’s demonstrated superiority in garment classification and foreign object detection – exemplifies this principle. This aligns with Fei-Fei Li’s observation: “AI is not about replacing humans; it’s about augmenting human capabilities.” The digital twin approach doesn’t merely automate a task; it elevates the process through precise visual understanding, mirroring a human’s discerning eye and creating a system where every classification and detection is a logical consequence of the input, rather than a probabilistic approximation. The emphasis on provable accuracy, as evidenced by the comparative VLM performance, establishes a foundation built on mathematical purity, a harmony of symmetry and necessity in operation.

Beyond the Sorting Bin

The consistent performance of the Qwen model family, while empirically demonstrable, begs a critical question: what constitutes ‘understanding’ in a Visual Language Model? The system accurately classifies and detects, yet lacks a provable, mathematically rigorous foundation for its judgments. The pursuit of higher accuracy, while valuable, should not eclipse the necessity for formal verification. A model that proves its ability to differentiate wool from polyester, rather than merely exhibiting a high success rate on a dataset, represents a fundamental advancement.

Future work must address the limitations inherent in relying solely on visual data. Tactile sensing, coupled with a formal language describing material properties, offers a path towards a truly robust system. The current reliance on large datasets, while yielding impressive results, risks overfitting to spurious correlations. A more elegant solution would involve incorporating prior knowledge – the known physics of fabric behavior, for example – into the model’s architecture. Such an approach, though computationally demanding, aligns with the principle that a correct solution should require no approximation.

Ultimately, the goal extends beyond automating textile sorting. It necessitates building machines that ‘understand’ materials, not merely recognize patterns. The demonstrated digital twin framework provides a valuable platform, but its true potential will only be realized when coupled with formal methods and a commitment to mathematical purity. The elegance of a solution, after all, resides not in its complexity, but in its provable correctness.


Original article: https://arxiv.org/pdf/2603.05230.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-07 17:41