Building with Words: AI Learns to Assemble Physical Objects

Author: Denis Avetisyan

Researchers have developed a new system that uses artificial intelligence to design and create functional craft assemblies from simple text prompts.

Leveraging large language models, the system iteratively refines craft assembly proposals through prompting and verification, enabling a resilient process where failed attempts trigger re-prompting to ensure successful completion.

This work introduces Prompt2Craft, a framework leveraging large language models and physics simulation to achieve high success rates in 3D geometric and functional craft assembly.

While robotic assembly typically relies on pre-defined parts, replicating the human capacity for improvisational creation from available objects remains a challenge. This paper introduces ‘Prompt2Craft: Generating Functional Craft Assemblies with LLMs’, a novel framework that leverages large language models to address the ‘craft assembly task’: building accurate representations of target objects using a limited set of dissimilar components. Our approach achieves high success rates in generating functional and visually coherent crafts, validated through physics simulation and 3D geometric reasoning. Could this paradigm shift enable robots to not only assemble, but create in open-world environments?

The Elegance of Assembly: Beyond Simple Recognition

The construction of three-dimensional craft assemblies from images presents a significant challenge beyond mere object recognition; it necessitates a complex understanding of component relationships and spatial reasoning. Unlike identifying a chair in a photograph, building a model of a chair from an image demands that the system infer how individual parts connect, their relative positions, and the overall structural logic. This process isn’t about labeling objects, but about decomposing a visual representation into a set of buildable components and then reasoning about their interdependencies, a task requiring an appreciation for both the aesthetic intent of the design and the physical constraints governing stable construction. Consequently, a system must move beyond identifying what is present in the image to understanding how it should be built, effectively bridging the gap between visual perception and physical realization.

Existing computational approaches to 3D assembly often falter when tasked with interpreting images and constructing stable, physical models. These methods frequently prioritize geometric reconstruction – accurately mirroring the visual form – without adequately addressing principles of physics and engineering. Consequently, generated structures may appear correct in a rendering, yet prove inherently unstable or impossible to build in the real world, lacking the necessary support for their own weight or exhibiting collisions between components. The challenge lies not simply in recognizing parts, but in inferring the underlying structural relationships and ensuring the resulting digital model respects constraints like gravity, material strength, and contact mechanics – a level of reasoning that traditional computer vision and graphics pipelines struggle to provide.

Automated craft assembly presents a unique computational challenge, requiring systems to navigate the complex interplay between visual appeal and physical feasibility. A successful solution transcends mere object recognition; it necessitates an understanding of aesthetic preferences – the desired overall look and feel of the finished product – while simultaneously enforcing the laws of physics. The system must not only identify components within an image but also deduce how those parts relate spatially, and then construct a stable, three-dimensional structure capable of withstanding real-world forces. This demands an integrated approach, one that balances the intangible qualities of design with the concrete requirements of structural integrity, effectively bridging the gap between visual intention and physical realization.

Our method leverages large language models to generate craft assembly designs, which are then validated for functional correctness through physics simulation.

Orchestrating Creation: LLMs as Digital Artisans

Off-the-shelf, multi-modal Large Language Models (LLMs) are utilized as the core planning component for automated craft assembly. These LLMs, pre-trained on extensive datasets of text and images, provide the reasoning capability necessary to interpret assembly requirements and generate sequential instructions. By employing existing LLMs, the system avoids the substantial computational costs and data requirements associated with training a custom model. The LLM receives high-level assembly goals and, through its inherent understanding of objects and relationships, formulates a plan detailing the necessary steps to construct the final product. This approach allows for flexible adaptation to different assembly tasks without requiring model retraining, relying instead on the LLM’s generalized knowledge and reasoning abilities.

The implementation utilizes a Template Format to direct the Large Language Model (LLM) in generating assembly plans. This format predefines the constituent parts of the assembly and explicitly specifies their expected spatial relationships to one another. By constraining the LLM’s output to adhere to this predefined structure, the system ensures that generated plans are not only syntactically correct but also semantically valid with respect to the physical construction process. The template functions as a schema, guiding the LLM to populate it with actionable steps while maintaining a coherent and buildable sequence of operations. This approach minimizes the generation of illogical or physically impossible assembly instructions.

Constraining Large Language Model (LLM) outputs through a defined Template Format ensures generated assembly plans adhere to semantic relationships between components. This structured guidance moves beyond simple textual instruction, producing plans that explicitly detail parts and their intended connections – information critical for downstream 3D construction processes. By enforcing a consistent, predictable output structure, the system facilitates automated interpretation of the plan, enabling robotic systems or simulation software to directly utilize the LLM’s output as a build sequence without requiring further natural language processing or ambiguous interpretation.
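To make the idea of a template concrete, the sketch below shows one hypothetical way such a schema could be represented and parsed. The class names and every field (`name`, `primitive`, `size`, `attached_to`, `relation`) are assumptions chosen for illustration; the article does not specify the actual Prompt2Craft template.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple
import json


@dataclass
class Part:
    """One component of the craft. All field names are hypothetical."""
    name: str                             # e.g. "seat"
    primitive: str                        # e.g. "box" or "cylinder"
    size: Tuple[float, float, float]      # extents along x, y, z
    attached_to: Optional[str] = None     # name of the part this one connects to
    relation: Optional[str] = None        # e.g. "below", "on_top_of"


@dataclass
class AssemblyPlan:
    """A target object plus the parts and relations proposed for it."""
    target: str
    parts: List[Part] = field(default_factory=list)

    def to_json(self) -> str:
        # Serialize to the structured text an LLM would be asked to produce.
        return json.dumps(asdict(self), indent=2)


def parse_plan(text: str) -> AssemblyPlan:
    """Parse a reply that is expected to follow the template."""
    data = json.loads(text)
    # JSON has no tuple type, so sizes come back as lists and are converted.
    parts = [Part(**{**p, "size": tuple(p["size"])}) for p in data["parts"]]
    return AssemblyPlan(target=data["target"], parts=parts)


stool = AssemblyPlan(
    target="stool",
    parts=[
        Part("seat", "box", (0.4, 0.4, 0.05)),
        Part("leg", "cylinder", (0.05, 0.05, 0.45),
             attached_to="seat", relation="below"),
    ],
)
round_trip = parse_plan(stool.to_json())
```

Because the plan is plain structured data, downstream stages can read parts and relations directly instead of interpreting free-form text.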

Our method consistently generates more structured 3D crafts compared to baselines like PartCrafter*, which rely on part separation and scene object selection, and TripoSG, which directly infers the model from masked images.

From Blueprint to Reality: Rigorous Validation Protocols

Format Validation is the initial stage of our validation pipeline, ensuring all generated parts conform to pre-defined specifications. This process verifies that each component adheres to the established template, including correct data types, acceptable value ranges for parameters such as dimensions and material properties, and the presence of all required fields. Validation checks are performed against a schema defining the permissible structure and content of each part’s data representation. Any component failing these checks is flagged and excluded from subsequent stages, preventing the propagation of invalid data and ensuring data integrity throughout the design process. This stage operates independently of physical simulation, focusing solely on syntactic and structural correctness.
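A format check of this kind might look like the following minimal sketch. The required fields, the allowed primitive names, and the size limit `MAX_EXTENT` are all assumptions made for the example, not values reported for Prompt2Craft.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"name": str, "primitive": str, "size": list}
ALLOWED_PRIMITIVES = {"box", "cylinder", "sphere"}   # assumed vocabulary
MAX_EXTENT = 2.0                                     # assumed upper bound per axis


def validate_part(part: Dict[str, Any]) -> List[str]:
    """Return a list of format errors; an empty list means the part is valid."""
    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in part:
            errors.append(f"missing field '{key}'")
        elif not isinstance(part[key], expected):
            errors.append(f"field '{key}' should be {expected.__name__}")
    if errors:
        return errors  # value checks below need well-typed fields
    if part["primitive"] not in ALLOWED_PRIMITIVES:
        errors.append(f"unknown primitive '{part['primitive']}'")
    size = part["size"]
    in_range = all(
        isinstance(v, (int, float)) and 0 < v <= MAX_EXTENT for v in size
    )
    if len(size) != 3 or not in_range:
        errors.append("size must be three positive numbers within range")
    return errors


def split_valid(parts: List[Dict[str, Any]]) -> Tuple[list, list]:
    """Separate parts that pass from parts that are flagged and excluded."""
    valid, flagged = [], []
    for part in parts:
        problems = validate_part(part)
        if problems:
            flagged.append((part, problems))
        else:
            valid.append(part)
    return valid, flagged


good = {"name": "seat", "primitive": "box", "size": [0.4, 0.4, 0.05]}
bad = {"name": "leg", "primitive": "cone", "size": [0.05, -1, 0.45]}
valid, flagged = split_valid([good, bad])
```

Returning error messages rather than a bare boolean keeps the reasons for rejection available to later stages.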

Collision Validation is a critical step following Format Validation, designed to identify and resolve geometric intersections between the digitally generated components of the assembled craft. This process utilizes algorithms to detect instances where parts occupy the same space, which would result in an impossible physical configuration. Detected collisions trigger adjustments to component positioning or geometry, ensuring that all parts are spatially separated and can theoretically coexist within the final assembly. The system does not merely flag overlaps; it actively works to resolve them before proceeding to the more computationally intensive Physics Simulation stage, thereby improving overall process efficiency and preventing simulation errors.
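The detection half of this stage can be illustrated with an axis-aligned bounding box (AABB) overlap test. This is a simplified stand-in chosen for the example: the article does not say which collision algorithm the system uses, the `Box` type is hypothetical, and the sketch only reports overlaps rather than resolving them.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by its center and full extents (hypothetical type)."""
    name: str
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]

    def bounds(self, axis: int) -> Tuple[float, float]:
        half = self.size[axis] / 2
        return self.center[axis] - half, self.center[axis] + half


def overlap(a: Box, b: Box, tolerance: float = 1e-6) -> bool:
    """True if the boxes interpenetrate by more than `tolerance` on every axis.

    Parts that merely touch (shared face, zero penetration) do not count.
    """
    for axis in range(3):
        a_min, a_max = a.bounds(axis)
        b_min, b_max = b.bounds(axis)
        if min(a_max, b_max) - max(a_min, b_min) <= tolerance:
            return False  # separated, or only touching, along this axis
    return True


def find_collisions(parts: List[Box]) -> List[Tuple[str, str]]:
    """Check every pair of parts and return the names of colliding pairs."""
    return [(a.name, b.name) for a, b in combinations(parts, 2) if overlap(a, b)]


seat = Box("seat", (0.0, 0.0, 0.475), (0.4, 0.4, 0.05))
leg_ok = Box("leg", (0.0, 0.0, 0.225), (0.05, 0.05, 0.45))   # top touches seat
leg_bad = Box("leg", (0.0, 0.0, 0.25), (0.05, 0.05, 0.45))   # pokes into seat
```

The tolerance matters in practice: parts in a valid assembly are expected to touch, so a test with zero tolerance would flag every joint as a collision.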

Physics Simulation employs the NVIDIA Isaac Sim platform to evaluate the assembled craft for stability and operational functionality. This stage utilizes a physics engine to model realistic environmental interactions, including gravity, friction, and aerodynamic forces. The simulation subjects the craft to a range of predefined motions and loads, assessing joint stress, center of mass stability, and potential failure points. Data generated includes telemetry on actuator performance, structural deflection, and overall system responsiveness, allowing for iterative design refinement prior to physical prototyping. Validation metrics focus on ensuring the craft maintains a stable configuration and can successfully execute its intended maneuvers within defined operational parameters.
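A full physics simulation is beyond a short example, and the sketch below does not use Isaac Sim or its API. It illustrates only one ingredient mentioned above, center of mass stability, with a crude static check: the combined center of mass must project inside the rectangular footprint of the parts resting on the ground. Uniform density, rigidly joined box-shaped parts, and a ground plane at z = 0 are all assumptions of the example.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (center, full extents); hypothetical representation


def center_of_mass(boxes: List[Box]) -> Vec3:
    """Volume-weighted average of box centers (uniform density assumed)."""
    total = 0.0
    weighted = [0.0, 0.0, 0.0]
    for center, size in boxes:
        mass = size[0] * size[1] * size[2]
        total += mass
        for i in range(3):
            weighted[i] += mass * center[i]
    return (weighted[0] / total, weighted[1] / total, weighted[2] / total)


def support_footprint(
    boxes: List[Box], ground_z: float = 0.0, eps: float = 1e-6
) -> Optional[Tuple[float, float, float, float]]:
    """x/y bounding rectangle of parts whose bottom face lies on the ground."""
    touching = [
        (c, s) for c, s in boxes if abs((c[2] - s[2] / 2) - ground_z) <= eps
    ]
    if not touching:
        return None
    x_min = min(c[0] - s[0] / 2 for c, s in touching)
    x_max = max(c[0] + s[0] / 2 for c, s in touching)
    y_min = min(c[1] - s[1] / 2 for c, s in touching)
    y_max = max(c[1] + s[1] / 2 for c, s in touching)
    return x_min, x_max, y_min, y_max


def is_statically_stable(boxes: List[Box]) -> bool:
    """True if the center of mass projects inside the support footprint."""
    footprint = support_footprint(boxes)
    if footprint is None:
        return False  # nothing rests on the ground
    cx, cy, _ = center_of_mass(boxes)
    x_min, x_max, y_min, y_max = footprint
    return x_min <= cx <= x_max and y_min <= cy <= y_max


leg = ((0.0, 0.0, 0.2), (0.1, 0.1, 0.4))
centered_top = ((0.0, 0.0, 0.425), (0.5, 0.5, 0.05))
offset_top = ((0.4, 0.0, 0.425), (0.5, 0.5, 0.05))
```

A check like this can reject obviously top-heavy designs cheaply, but it says nothing about joint strength or dynamic behavior, which is where a simulator is needed.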

Our method most frequently fails due to collisions between robotic parts.

Measuring Success: Functional Integrity and Aesthetic Fidelity

Functional success represents the foundational measure of performance in automated craft assembly, extending beyond mere completion to assess the resulting structure’s integrity and operational capability. This metric rigorously evaluates whether the assembled craft exhibits stability – resisting collapse or deformation – and validity, confirming adherence to design specifications. Critically, functional success demands the craft not only look complete, but demonstrably perform its intended function, whether that involves supporting a load, moving through a defined range of motion, or interacting with its environment. By prioritizing this holistic assessment, the framework moves beyond superficial evaluations to provide a robust indicator of genuine assembly achievement and reliability.

The newly developed LLM-based framework demonstrated an initial ability to guide automated craft assembly, achieving a functional success rate of 63.4%. This figure, while representing a significant step towards fully automated creation, was determined by evaluating whether the assembled virtual craft was structurally stable, geometrically valid, and capable of fulfilling its intended design purpose. The framework’s capacity to independently generate assembly instructions and oversee their execution, even at this early stage, highlights the potential for LLMs to revolutionize the field of digital fabrication and design automation, offering a promising foundation for further refinement and improvement.

The initial automated craft assembly process demonstrated a functional success rate of 63.4%; however, integrating a re-prompting system, coupled with feedback from a dedicated validation pipeline, raised the success rate to 85.0%, a gain of 21.6 percentage points. This represents a noteworthy advancement in the field of automated assembly, as the system iteratively refines its approach based on its initial errors. The validation pipeline identifies specific failure points – such as component instability or invalid configurations – and communicates this information back to the language model, which then adjusts its instructions accordingly. This iterative process of prompting, validation, and refinement highlights the potential for closed-loop systems to significantly enhance the reliability and efficiency of complex automated tasks, paving the way for more sophisticated and autonomous manufacturing processes.
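The prompt, validate, and re-prompt cycle described here can be sketched as a simple loop. Everything in the example is hypothetical: `call_llm` stands in for whatever model API is used, the validators return plain error strings, and the wording of the feedback prompt and the limit of three attempts are assumptions rather than details from the paper.

```python
from typing import Callable, List, Optional, Tuple

# A validator returns an error message, or None if the plan passes.
Validator = Callable[[str], Optional[str]]


def generate_with_reprompting(
    prompt: str,
    call_llm: Callable[[str], str],
    validators: List[Validator],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Ask for a plan, validate it, and re-prompt with feedback on failure.

    Returns (plan, attempts_used); plan is None if every attempt failed.
    """
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        plan = call_llm(current_prompt)
        error = None
        for check in validators:       # run stages in order, stop at first failure
            error = check(plan)
            if error is not None:
                break
        if error is None:
            return plan, attempt
        # Feed the specific failure back to the model for the next attempt.
        current_prompt = (
            f"{prompt}\n\nPrevious attempt failed: {error}\n"
            "Please revise the plan."
        )
    return None, max_attempts


def fake_llm(prompt: str) -> str:
    """Stub model: produces a colliding plan until it receives feedback."""
    return "legs below seat" if "failed" in prompt else "legs overlap seat"


def collision_check(plan: str) -> Optional[str]:
    """Stub validator that rejects any plan mentioning an overlap."""
    return "collision between legs and seat" if "overlap" in plan else None


plan, attempts = generate_with_reprompting(
    "Build a stool.", fake_llm, [collision_check]
)
```

With the stub model, the first plan is rejected by the collision check and the second, produced after feedback, is accepted.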

Initial attempts at automated craft assembly frequently encountered issues stemming from collisions between components, representing 69.1% of all failures. These instances weren’t simply dead ends, however; the framework was designed to learn from these errors. By re-prompting the language model with specific feedback derived from collision detection and subsequent simulation, the system substantially reduced these failures. This iterative process, leveraging insights from simulated physics, allowed for adjustments to placement and orientation, ultimately boosting the success rate in avoiding collisions to 77.1%. The ability to self-correct based on virtual testing highlights a key advancement in the robustness and practicality of automated assembly techniques.

Rigorous evaluation of the proposed framework involved direct comparison with the established PartCrafter system, revealing notable advancements in automated craft assembly. The study assessed performance across two critical dimensions: functional success, ensuring the assembled creation is stable and capable of performing its intended purpose, and aesthetic fidelity, measuring the visual quality and alignment with design specifications. Results indicate that the new approach consistently surpasses PartCrafter in both areas, achieving a higher rate of successfully assembled, structurally sound crafts, and demonstrably improved visual appeal as judged by established metrics. This benchmark performance highlights the potential for automated systems to not only replicate but also enhance the quality and complexity of physical creations.

Towards Intelligent Creation: Adaptive Assembly and Error Correction

Future development centers on embedding robust error handling directly within the Large Language Model’s assembly process. This involves creating automated systems capable of identifying inconsistencies, logical fallacies, or physically impossible configurations within the generated output. Rather than relying on external validation, the model will incorporate internal checks, akin to a self-editing function, that pinpoint flawed elements. Once detected, these errors won’t simply be flagged; they will trigger corrective actions, such as re-evaluating planning steps or generating alternative solutions. This proactive approach promises to dramatically improve the reliability and precision of the LLM’s creations, minimizing the need for human intervention and maximizing the potential for fully autonomous complex task completion.

The system’s ability to self-correct and improve hinges on the implementation of re-prompting techniques, effectively establishing a continuous cycle of refinement. Following an initial planning phase executed by the Large Language Model, validation results are analyzed to identify shortcomings or inconsistencies in the generated output. These findings then inform a revised prompt, which is reintroduced to the model, guiding it toward a more accurate and robust solution. This iterative process, akin to a closed-loop feedback system, allows the model to learn from its mistakes and progressively enhance the quality of its assembly plans without requiring explicit human intervention. The anticipated outcome is a dynamically adaptive framework capable of addressing challenges and optimizing performance in real-time, ultimately paving the way for fully automated and error-resistant crafting processes.

The development of an adaptive assembly framework signifies a pivotal step towards automating traditionally intricate crafting processes. This system transcends simple automation by dynamically adjusting its approach based on real-time validation, allowing for the creation of complex products with minimal human intervention. Beyond simply replicating existing designs, this framework promises to unlock novel possibilities in design and manufacturing, enabling the rapid prototyping and production of customized goods and entirely new product categories. The ability to autonomously assemble components, coupled with error correction, fosters a pathway towards on-demand manufacturing, reduced waste, and ultimately, a paradigm shift in how goods are conceived, created, and delivered to consumers.

The pursuit of functional craft assembly, as demonstrated in this work, echoes a fundamental principle of elegant design. It isn’t simply about achieving a desired outcome, but how that outcome is realized. As David Marr observed, “Representation is the key; if you have the right representation, everything else follows.” This research embodies that sentiment, utilizing large language models to construct not just any assembly, but one validated by physics simulation: a representation grounded in reality. The framework’s success hinges on a precise and coherent internal representation of spatial relationships and functional constraints, allowing the LLM to ‘reason’ about the assembly process and generate viable, aesthetically pleasing results. A good interface, in this case the prompt engineering, is invisible to the user yet felt in the high success rate of functional crafts.

Future Directions

The successful marriage of large language models and physics simulation, as demonstrated, feels less like an arrival and more like a carefully constructed overture. The current framework excels at generating assemblies, but elegance, that subtle harmony between form and function, remains elusive. Future work must address not simply whether an assembly stands, but how it stands. A truly insightful system would discern, without explicit prompting, the most efficient, stable, and visually pleasing solution, not merely a functional one.

A critical limitation lies in the inherent ambiguity of the “Craft Assembly Task” itself. The framework presently relies on external validation; it doesn’t inherently understand what constitutes a good design. The field should move toward systems capable of internal critique, of assessing their own creations against principles of structural integrity and aesthetic balance. This necessitates incorporating more sophisticated geometric reasoning and a deeper understanding of material properties.

Ultimately, the challenge is not to build machines that mimic human craft, but to surpass it. To achieve this, the focus must shift from prompt engineering, a somewhat clumsy method of communication, toward systems that can autonomously discover and refine design principles. Only then will the promise of truly intelligent craft assembly be realized, and the resulting creations will whisper, not shout, their inherent logic.

Original article: https://arxiv.org/pdf/2512.04568.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/