Author: Denis Avetisyan
A new study introduces a benchmark and model designed to help robots understand where people typically store household items, bridging a critical gap in commonsense reasoning.

Researchers present NOAM, a novel approach and dataset for evaluating spatial and semantic understanding of object storage in vision-language models.
Despite advances in robotics, inferring the likely storage location of everyday household items remains a significant challenge for artificial intelligence. This limitation is addressed in ‘Break Out the Silverware — Semantic Understanding of Stored Household Items’, which introduces a new benchmark dataset and model, NOAM, designed to evaluate and improve commonsense reasoning in service robots. NOAM, a hybrid vision-language agent, significantly improves prediction accuracy, approaching human-level performance in identifying hidden storage locations within domestic environments. Could this integrated approach pave the way for more intuitive and effective robot assistance in our homes?
Beyond Sight: Unveiling the Hidden Logic of Space
Contemporary Vision-Language Models (VLMs), despite remarkable progress in image recognition and natural language processing, frequently falter when tasked with inferring the presence or location of objects not directly visible within an image. This limitation stems from a reliance on directly observed features rather than an ability to reason about the likely contents of a scene based on contextual clues and everyday knowledge. While these models excel at identifying what is present, they struggle with determining what is likely to be present, hindering their application in realistic scenarios where partial observability is the norm. Consider a kitchen scene: a VLM might easily identify a stove and a refrigerator, but struggle to predict the likely location of pots and pans stored inside cabinets – a task requiring an understanding of typical household arrangements and object affordances. This inability to extrapolate beyond the immediately visible poses a significant barrier to deploying VLMs in practical applications like robotics, assistive technology, and truly intelligent virtual assistants.
The Stored Household Item Challenge presents a unique test for vision-language models by deliberately obscuring objects from direct view. Instead of identifying items directly visible in an image, these models must infer the likely location of hidden household goods – a loaf of bread in a pantry, a mop in a utility closet, or a kettle within cupboards. This forces a shift from simple object recognition to a deeper understanding of spatial relationships and everyday contexts. Success isn’t about ‘seeing’ the object, but rather ‘knowing’ where it would logically be stored given the surrounding environment, effectively measuring a model’s ability to apply commonsense reasoning to visual data and predict unobserved elements within a scene.
Addressing the limitations of current vision-language models requires a significant leap in commonsense reasoning – the ability to understand the world not just as it is seen, but as it generally is. Existing approaches often excel at identifying objects directly present in an image, yet falter when asked to infer the presence or location of hidden items, or to predict how objects interact within a scene. This deficiency stems from a reliance on correlational learning – recognizing patterns in observed data – rather than a deeper understanding of physical laws, spatial relationships, and everyday human behavior. Consequently, models struggle with tasks demanding inference – determining what is likely to be true even without direct evidence – hindering their ability to function reliably in dynamic, real-world scenarios where complete information is rarely available.

NOAM: Reconstructing Reality from Fragments
NOAM departs from traditional storage prediction methods by formulating the task as structured text reasoning. Instead of directly processing visual data to infer the contents of obscured containers, NOAM leverages the capabilities of large language models (LLMs) by converting the storage scenario into a textual problem. This involves representing the storage environment and object relationships as text, allowing the LLM to apply its pre-trained reasoning skills to predict the presence or absence of hidden items. By framing storage prediction as a text-based inference task, NOAM enables the utilization of LLMs, which excel at understanding and manipulating language, to address a traditionally vision-centric challenge.
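To make this framing concrete, a minimal sketch of serializing a perceived storage scene into a textual problem statement might look as follows; the data structure and field names are illustrative assumptions, not NOAM's actual representation.

```python
from dataclasses import dataclass

@dataclass
class DetectedContainer:
    """A closed storage container detected in the scene (names are illustrative)."""
    label: str       # e.g. "upper cabinet", "drawer"
    location: str    # coarse spatial description, e.g. "next to the stove"

def scene_to_text(containers: list[DetectedContainer], target_item: str) -> str:
    """Serialize a perceived kitchen scene into a textual problem statement
    that a language model can reason over."""
    lines = [f"- a {c.label} {c.location}" for c in containers]
    return (
        "The kitchen contains the following closed storage locations:\n"
        + "\n".join(lines)
        + f"\nWhere is the {target_item} most likely stored?"
    )

# Example: three containers, one query object.
scene = [
    DetectedContainer("upper cabinet", "above the counter"),
    DetectedContainer("drawer", "next to the stove"),
    DetectedContainer("pantry", "beside the refrigerator"),
]
print(scene_to_text(scene, "silverware"))
```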
Prompt Engineering within NOAM’s architecture directs the language model to interpret visual data by formulating specific textual prompts. These prompts aren’t simply descriptions of the scene; they are structured queries designed to elicit inferences about object containment and relationships. The model receives input describing the visual context – for example, “a red ball is near a blue box” – and the prompt instructs it to reason about potential hidden objects, such as “what is likely inside the blue box?”. By framing the storage prediction task as a question-answering process guided by these prompts, NOAM effectively converts visual information into a textual reasoning problem solvable by a large language model, allowing it to move beyond simple pixel-based recognition.
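A hedged sketch of such a structured prompt is shown below; the template wording is invented for illustration, and `llm_complete` is a placeholder for whatever text-completion backend is used, not NOAM's published prompts.

```python
PROMPT_TEMPLATE = """You are a household assistant robot.
Scene description:
{scene_text}

Answer with the single most plausible storage location from the list above,
then give a one-sentence justification."""

def build_storage_prompt(scene_text: str) -> str:
    """Wrap the serialized scene in an instruction that elicits containment reasoning."""
    return PROMPT_TEMPLATE.format(scene_text=scene_text)

def predict_storage(scene_text: str, llm_complete) -> str:
    """`llm_complete` is a hypothetical callable mapping a prompt string to a
    completion string (any local or hosted language model would do)."""
    return llm_complete(build_storage_prompt(scene_text))
```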
NOAM addresses the challenge of predicting the presence of occluded items by explicitly representing the relationships between objects and their containers. Traditional methods struggle with hidden objects due to reliance on direct visual perception; NOAM, however, models these relationships as structured knowledge. This allows the system to infer the likely presence of an item based on the known contents of its container and the relationships between contained objects, even if the item itself is not visible. Specifically, the model reasons about whether an object should be present given the context, rather than attempting to directly detect it, effectively circumventing the limitations of visual input alone.
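As a toy illustration of this kind of container-object reasoning, the sketch below uses a hand-written table of typical contents; NOAM derives such relationships from its language model rather than a fixed lookup, so the table and container names here are purely hypothetical.

```python
# Illustrative commonsense priors over container contents; NOAM infers these
# relationships via language-model reasoning, not a hard-coded dictionary.
TYPICAL_CONTENTS = {
    "drawer next to the stove": {"silverware", "spatula", "whisk"},
    "upper cabinet above the counter": {"plates", "glasses", "mugs"},
    "pantry beside the refrigerator": {"bread", "cereal", "canned goods"},
}

def likely_container(item: str) -> str | None:
    """Infer where an occluded item should be, based on container-object
    relationships rather than direct visual detection."""
    for container, contents in TYPICAL_CONTENTS.items():
        if item in contents:
            return container
    return None

assert likely_container("silverware") == "drawer next to the stove"
```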

Grounding Perception: Bridging the Digital and Physical Worlds
NOAM utilizes the Grounding-DINO and Segment Anything Model (SAM) to achieve precise object and container identification within real-world kitchen environments. Grounding-DINO facilitates the linking of textual descriptions to visual elements, enabling the system to locate specific objects by name. SAM then performs detailed segmentation, creating pixel-level masks that delineate object boundaries and distinguish them from the background. This combination allows NOAM to not only detect the presence of objects, but also to define their spatial extent and shape, which is crucial for reasoning about containment and relationships within the kitchen scene.
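A minimal detect-then-segment sketch of this pipeline, assuming the Hugging Face transformers integration of Grounding-DINO and Meta's segment_anything package; the checkpoints, text queries, and file paths are illustrative rather than the paper's configuration.

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from segment_anything import sam_model_registry, SamPredictor

image = Image.open("kitchen.jpg").convert("RGB")  # illustrative input image

# 1) Grounding-DINO: link text phrases to bounding boxes.
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
detector = AutoModelForZeroShotObjectDetection.from_pretrained("IDEA-Research/grounding-dino-base")
inputs = processor(images=image, text="a cabinet. a drawer. a refrigerator.", return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)
detections = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)[0]

# 2) SAM: refine each detected box into a pixel-level mask.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # checkpoint path is a placeholder
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))
for box, label in zip(detections["boxes"], detections["labels"]):
    masks, _, _ = predictor.predict(box=box.numpy(), multimask_output=False)
    print(label, "mask pixels:", int(masks[0].sum()))
```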
The integration of vision models, specifically Grounding-DINO and SAM, provides NOAM with the capacity to establish a correspondence between textual references and visual elements within a scene. This visual grounding allows NOAM to interpret spatial prepositions – such as “to the left of” or “inside” – and understand relationships between objects and containers. By identifying and segmenting objects, the system can determine their positions relative to each other and to defined areas within the kitchen environment. This spatial understanding is crucial, as it enables NOAM to connect textual instructions – like “put the apple in the bowl” – to the actual visual arrangement of items, forming the basis for accurate reasoning about object locations and potential storage spaces.
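To show how bounding boxes can be turned into spatial predicates such as "left of" or "inside", the following functions are a simple sketch; the coordinate convention and containment threshold are assumptions rather than NOAM's implementation.

```python
# Boxes are (x_min, y_min, x_max, y_max) in image coordinates.
Box = tuple[float, float, float, float]

def is_left_of(a: Box, b: Box) -> bool:
    """a is left of b if a's right edge lies left of b's left edge."""
    return a[2] < b[0]

def is_inside(a: Box, b: Box, tol: float = 0.9) -> bool:
    """a is 'inside' b if at least `tol` of a's area falls within b."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    a_area = (a[2] - a[0]) * (a[3] - a[1])
    return a_area > 0 and (ix * iy) / a_area >= tol

apple = (120, 200, 160, 240)
bowl = (100, 180, 220, 260)
print(is_inside(apple, bowl))  # True: the apple's box lies within the bowl's box
```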
NOAM’s ability to infer the storage locations of occluded objects relies on the synergistic integration of visual perception and linguistic reasoning. By processing visual data from models like Grounding-DINO and SAM, NOAM identifies and segments objects and containers within a scene. This visual understanding is then combined with textual information – such as object properties and common sense knowledge about storage habits – to generate hypotheses about likely storage locations. The system evaluates these hypotheses based on the spatial relationships identified in the visual data, ultimately predicting the most probable location of hidden objects even when they are not directly visible. This combined approach allows NOAM to move beyond simple object detection and perform a higher-level inference about object states and locations.
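The combined step could be sketched as a ranking over detected containers, with `llm_score` as a hypothetical plausibility function standing in for the model's actual reasoning backend.

```python
def rank_storage_locations(item, containers, llm_score):
    """Order detected containers by LLM-estimated plausibility of holding `item`.
    `llm_score` is a hypothetical callable mapping a question to a value in [0, 1]."""
    scored = [
        (llm_score(f"How likely is the {item} stored in the {c}? Answer 0-1."), c)
        for c in containers
    ]
    return [c for _, c in sorted(scored, reverse=True)]

# Usage: the container list would come from the grounded detections above.
# best = rank_storage_locations("kettle", ["upper cabinet", "drawer", "pantry"], llm_score)[0]
```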

The Human Benchmark: Measuring Intelligence Through Observation
The foundation of robust artificial intelligence lies in realistic evaluation, and the Stored Household Item Challenge addresses this need with a meticulously constructed dataset. This dataset wasn’t generated synthetically, but rather through careful annotation by human evaluators who identified likely storage locations for everyday objects. This approach ensures the benchmark reflects the complexities of real-world spatial reasoning, moving beyond simplified simulations. By leveraging human insights, the challenge provides a nuanced and reliable measure of a model’s ability to understand and predict how people organize their belongings – a critical step towards building AI systems capable of truly intelligent interaction with human environments.
The accuracy of NOAM’s predicted storage locations is assessed through a metric known as Intersection over Union, or IoU. This quantitative measure determines the overlap between the predicted bounding box of an object’s storage location and the ground truth – the actual, annotated location provided by human annotators. Specifically, IoU is the area of intersection between these two boxes divided by the area of their union: $\text{IoU} = \frac{\text{Area}_{\text{intersection}}}{\text{Area}_{\text{union}}}$. A higher IoU score indicates greater accuracy, with a perfect score of 1.0 signifying a complete overlap. Utilizing IoU allows for a precise and objective evaluation of NOAM’s performance, moving beyond simple correctness to capture the degree of spatial alignment between prediction and reality, and enabling meaningful comparisons to human-level performance on the Stored Household Item Challenge.
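In code, the metric reduces to a few lines; the sketch below assumes axis-aligned boxes in (x_min, y_min, x_max, y_max) form.

```python
def iou(pred, gt):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = pred_area + gt_area - inter
    return inter / union if union > 0 else 0.0

# A perfect prediction scores 1.0; disjoint boxes score 0.0.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```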
The NOAM model recently achieved a 23% accuracy score on the Stored Household Item Challenge, a result that signifies considerable progress in artificial intelligence’s ability to reason about object permanence and spatial relationships. This performance isn’t simply a matter of numerical achievement; it demonstrates that NOAM is closing the gap with human capabilities in a complex task – predicting where objects are stored when out of sight. Critically, the gap between NOAM’s 23% accuracy and the least accurate human annotator (27%) is smaller than the gap between that annotator and the next most accurate one (36%). This suggests that NOAM isn’t merely mimicking patterns, but is developing a rudimentary form of understanding about how humans conceptualize storage locations, offering a promising step towards more intuitive and human-like AI systems.

The pursuit of artificial intelligence capable of genuine commonsense hinges on dismantling assumptions about how objects relate to their environments. This work, focused on household item storage, exemplifies that principle. It isn’t enough for a system to recognize a fork; it must infer where a fork is likely to be found. As Henri Poincaré observed, “Mathematics is the art of giving reasons, and mathematical reasoning is distinct from reasoning in other sciences.” This distinction mirrors the challenge presented by NOAM; it demands a system move beyond pattern recognition to embody the reasoned understanding of spatial relationships and everyday physics that humans possess. Every exploit starts with a question, not with intent, and this benchmark seeks to expose the gaps in AI’s ability to ask – and answer – those fundamental questions about the world.
Where Do We Go From Here?
The creation of a benchmark, even one focused on the ostensibly mundane task of locating the silverware, invariably exposes the brittle nature of current systems. NOAM, and similar architectures, represent a local maximum – a successful exploit of comprehension within a narrowly defined problem space. However, the underlying assumptions regarding object affordances and spatial relationships remain largely unexamined. True robustness will require moving beyond datasets constructed with human biases; the robot doesn’t care why a spatula is usually near the stove, only that it is.
A critical next step involves fracturing the problem. Current vision-language models treat ‘kitchen’ as a monolithic entity. Future iterations must decompose it into functional regions – the ‘cooking zone,’ the ‘cleaning station,’ the ‘food preparation area’ – and understand how object storage is dictated by these sub-spaces. This isn’t merely about spatial reasoning; it’s about reverse-engineering the implicit choreography of human domestic life.
Ultimately, the real challenge isn’t building a system that mimics human object storage, but one that transcends it. A genuinely intelligent system shouldn’t simply know where things usually are; it should be able to anticipate where they should be, given an evolving environment and unforeseen circumstances. That requires a level of predictive modeling that current architectures, focused on pattern recognition, are ill-equipped to handle.
Original article: https://arxiv.org/pdf/2512.23739.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
See also:
- Clash Royale Best Boss Bandit Champion decks
- Mobile Legends January 2026 Leaks: Upcoming new skins, heroes, events and more
- Vampire’s Fall 2 redeem codes and how to use them (June 2025)
- Clash Royale Furnace Evolution best decks guide
- Best Hero Card Decks in Clash Royale
- Mobile Legends: Bang Bang (MLBB) Sora Guide: Best Build, Emblem and Gameplay Tips
- Best Arena 9 Decks in Clash Royale
- Clash Royale Witch Evolution best decks guide
- Dawn Watch: Survival gift codes and how to use them (October 2025)
- Brawl Stars December 2025 Brawl Talk: Two New Brawlers, Buffie, Vault, New Skins, Game Modes, and more