Mapping the City with AI: How Models Are Learning to Navigate the Real World

Author: Denis Avetisyan


Researchers are pushing the boundaries of artificial intelligence by challenging models to understand and navigate complex urban environments using only visual and textual cues.

A new benchmark, CityNav, assesses multimodal models’ ability to perform embodied navigation in realistic city settings, revealing that explicitly verbalizing a path significantly improves performance by leveraging internal world knowledge.

Despite advances in artificial intelligence, reliably navigating real-world environments remains a significant challenge for embodied agents. This work, ‘City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs’, introduces CityNav, a benchmark designed to evaluate multimodal large language models (MLLMs) in complex, knowledge-intensive urban settings. Our evaluations reveal that current state-of-the-art MLLMs struggle with this task, but a novel method called Verbalization of Path, which explicitly elicits internal spatial reasoning, substantially improves navigation success. Can leveraging an agent’s inherent world knowledge unlock truly robust and adaptable navigation capabilities in the wild?


Navigating the Unknown: The Challenge of Sparse Grounding

Conventional navigation technologies often depend on pre-existing, highly detailed maps and constant external localization signals – such as GPS – to determine position and plan routes. While effective in familiar surroundings, this reliance creates a critical vulnerability when encountering novel or changing environments. These systems struggle when deprived of precise, pre-mapped data or reliable external cues, exhibiting limited adaptability to unforeseen obstacles or alterations in the landscape. The inherent rigidity of these approaches contrasts sharply with human spatial reasoning, which allows for flexible path planning and intuitive navigation even in completely uncharted territory, highlighting a significant gap in the capabilities of current automated systems.

Successfully navigating real-world urban landscapes requires more than simply recognizing landmarks; it demands sparsely grounded visual navigation – the ability to chart a course and respond to unforeseen obstacles using only limited, unlabelled visual input. Unlike systems reliant on pre-mapped environments or detailed annotations, this approach mirrors human spatial reasoning, where individuals formulate plans and adapt to surroundings based on incomplete visual cues. This presents a formidable challenge to artificial intelligence, forcing algorithms to develop a deeper understanding of spatial relationships, object affordances, and predictive modeling of environmental changes – essentially, learning to ‘understand’ a space rather than merely ‘memorize’ it. The capacity for sparsely grounded navigation is therefore a key indicator of an AI’s ability to generalize and operate autonomously in dynamic, unpredictable settings.

Current artificial intelligence systems often struggle with real-world navigation because they frequently rely on memorized routes and visual patterns rather than true spatial understanding. This reliance on memorization proves brittle when faced with unfamiliar environments or unexpected changes, highlighting a critical limitation in their ability to generalize. Successfully navigating dynamic, real-world spaces demands a shift towards systems capable of abstracting environmental features, forming robust internal representations of space, and proactively planning paths based on those representations – capabilities that move beyond simple pattern recognition and approach genuine cognitive reasoning. Achieving this requires developing algorithms that prioritize understanding the underlying principles of spatial relationships, rather than merely storing and recalling specific visual experiences, thereby enabling more flexible and adaptable navigation in complex settings.

MLLMs as Spatial Interpreters: A Foundation for Intelligent Agents

Sparsely grounded navigation, a key challenge in robotics and agent AI, arises from the limited availability of precise location data and reliance on incomplete or ambiguous environmental information. Multimodal Large Language Models (MLLMs) address this by fusing visual input – typically from cameras or depth sensors – with textual data such as map information, street signs, or natural language instructions. This integration allows the MLLM to infer relationships between perceived visual features and semantic concepts, effectively enriching the environmental representation beyond what is directly observable. By leveraging pre-trained knowledge from large datasets, MLLMs can reason about unobserved areas, interpret ambiguous cues, and generalize to novel environments, thereby improving navigation performance in data-scarce scenarios.
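
To make this fusion concrete, the sketch below shows how a single street-level observation might be bundled with textual context into one multimodal prompt. It assumes an OpenAI-style chat message schema; the helper function and its field choices are illustrative assumptions, not the paper’s actual interface.

```python
# A minimal sketch of packaging one navigation observation for an MLLM,
# assuming an OpenAI-style chat message schema. The helper and field names
# are illustrative, not the paper's actual interface.
import base64


def build_observation_message(image_path: str, instruction: str, heading_deg: float) -> dict:
    """Bundle one street-level image with textual context into a chat message."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    return {
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Goal: {instruction}\nCurrent heading: {heading_deg:.0f} degrees.\n"
                     "Describe what you see and choose the next move."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }
```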

Multimodal Large Language Models (MLLMs) demonstrate the capacity to utilize pre-existing knowledge acquired during training on extensive datasets to perform complex navigational tasks. This capability extends beyond simple object recognition; MLLMs can infer relationships between objects, understand contextual cues within urban scenes – such as traffic signals or pedestrian crossings – and apply learned reasoning to generate feasible routes. By integrating visual input with their internal knowledge base, these models can make informed decisions regarding path selection, obstacle avoidance, and adherence to navigational rules, even in dynamic and unstructured environments. This allows for operation without explicit, task-specific training for each new environment.

The efficacy of Multimodal Large Language Models (MLLMs) in navigation tasks is fundamentally dependent on their capacity to establish a correspondence between visual input and semantic understanding. This “grounding” process involves interpreting raw pixel data from cameras or other sensors and associating it with meaningful concepts, object recognition, and spatial relationships. Successful grounding enables the MLLM to move beyond simply detecting objects to understanding their relevance to navigation – for example, recognizing a “crosswalk” not just as a pattern of stripes, but as a permissible location to traverse a street. This semantic understanding is then translated into discrete, actionable steps – such as “turn left at the next intersection” or “proceed forward until reaching the traffic light” – which drive the agent’s movement through the environment. Without robust grounding and subsequent translation into executable actions, the MLLM cannot reliably navigate even relatively simple environments.
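
The sketch below illustrates that final translation step: a small, assumed action space and a parser that maps the model’s free-text decision onto an executable move. The action names and parsing rules are placeholders for illustration, not the benchmark’s actual interface.

```python
# Illustrative sketch of translating an MLLM's free-text decision into a
# discrete action space. Action names and parsing rules are assumptions;
# the benchmark's real action space may differ.
from enum import Enum


class Action(Enum):
    FORWARD = "forward"
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    STOP = "stop"


def parse_action(model_output: str) -> Action:
    """Map the model's verbalized decision onto an executable action."""
    text = model_output.lower()
    if "stop" in text or "arrived" in text:
        return Action.STOP
    if "left" in text:
        return Action.TURN_LEFT
    if "right" in text:
        return Action.TURN_RIGHT
    return Action.FORWARD  # default: keep moving along the current edge
```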

Enhancing Navigational Acuity: Methods for Improved Performance

Employing Chain-of-Thought reasoning and Verbalization of Path enhances MLLM navigation by explicitly detailing the rationale behind each movement. Chain-of-Thought prompts the MLLM to break down complex navigational challenges into a series of intermediate steps, allowing for more systematic and traceable decision-making. Verbalization of Path extends this by requiring the MLLM to articulate its intended route before and during execution, creating an interpretable log of its progress. This approach not only improves accuracy but also facilitates debugging and analysis of navigational failures, leading to more robust performance in complex environments.
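
As a rough illustration, the prompt template below sketches how a Verbalization-of-Path style instruction might be phrased: describe the scene, spell out the full intended route, and only then commit to a single action. The wording is an assumption for illustration, not the prompt used in the paper.

```python
# A hedged sketch of a Verbalization-of-Path style prompt. It only illustrates
# the idea of asking the model to spell out its intended route before acting;
# it does not reproduce the paper's actual prompt.
VOP_PROMPT = """You are navigating a city using street-level views.
Goal: {goal}

Think step by step:
1. Describe the landmarks visible in the current view.
2. Verbalize the full path you believe leads to the goal
   (e.g. "head north two blocks, then turn right at the church").
3. State the single next action (forward / turn_left / turn_right / stop)
   that is consistent with that verbalized path.
"""


def make_vop_prompt(goal: str) -> str:
    return VOP_PROMPT.format(goal=goal)
```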

Markovian Memory and Previous Visit Tracking enhance an agent’s navigational consistency by creating a localized environmental representation. This is achieved by storing information about recently visited locations and associated observations as a state, allowing the agent to estimate future states based solely on the current state – a core principle of Markovian decision processes. Previous Visit Tracking specifically records locations the agent has already explored, preventing redundant traversal of the same areas and facilitating efficient path planning. This combination ensures the agent maintains spatial awareness, reduces navigational errors, and improves overall performance in complex environments by avoiding loops and optimizing route selection.
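
A minimal sketch of this idea, assuming panorama identifiers as node names, might look like the following: the agent carries only a compact current state plus a set of visited nodes, and filters out moves that would revisit one.

```python
# Minimal sketch of Markovian memory with previous-visit tracking: the next
# state depends only on the current one, and visited nodes are masked out to
# avoid loops. Node identifiers are assumed to be street-view panorama IDs.
from dataclasses import dataclass, field


@dataclass
class NavState:
    current_node: str
    heading_deg: float
    visited: set = field(default_factory=set)

    def step(self, next_node: str, new_heading: float) -> "NavState":
        """Advance to the next node; the new state depends only on this one."""
        self.visited.add(self.current_node)
        return NavState(next_node, new_heading, self.visited)

    def unvisited_neighbors(self, neighbors: list) -> list:
        """Filter candidate moves to avoid redundant traversal (loops)."""
        return [n for n in neighbors if n not in self.visited]
```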

The implementation of a ‘Decision History’ allows the agent to utilize prior navigational choices as contextual data for subsequent path planning, facilitating adaptive behavior and improved route selection. Specifically, the ‘Verbalization of Path’ mechanism, which records and references this decision history, has demonstrated a 92% success rate in completing long-range urban navigation tasks. This performance represents a significant advancement compared to existing navigational methods lacking such contextual awareness and learning capabilities.
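
One hedged way to realize such a history is a rolling log of past plans and actions that gets rendered back into the prompt as context; the record format and window size below are illustrative choices, not details from the paper.

```python
# Sketch of a rolling decision history fed back into the prompt as context
# for the next step. The record format and window size are illustrative.
from collections import deque


class DecisionHistory:
    def __init__(self, max_steps: int = 20):
        self.records = deque(maxlen=max_steps)  # keep only the recent window

    def log(self, step: int, verbalized_path: str, action: str) -> None:
        self.records.append(f"step {step}: planned '{verbalized_path}', took '{action}'")

    def as_context(self) -> str:
        """Render past decisions as text the MLLM can condition on."""
        return "Previous decisions:\n" + "\n".join(self.records)
```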

Validating Intelligence: Benchmarking with CityNav

The CityNav dataset leverages the extensive imagery available through Google Street View and its associated graph representation, the Google Street View Graph, to create a highly realistic environment for evaluating the performance of navigation agents. This dataset differs from synthetic environments by offering real-world visual complexity, including variations in lighting, weather, and urban scenes. The Google Street View Graph provides a structured representation of navigable routes, enabling the creation of complex navigational tasks and scenarios. The scale of both the imagery and graph allows for robust testing of agent generalization capabilities and provides a challenging benchmark for assessing progress in visual navigation and reasoning.
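
The sketch below shows one plausible way such a graph could be represented, with panoramas as nodes and navigable links as heading-annotated edges; the schema is an assumption made for illustration, not CityNav’s actual format.

```python
# Illustrative sketch of a street-view graph: panoramas as nodes, navigable
# links as edges annotated with headings. The schema is an assumption about
# how such a graph could be represented, not CityNav's format.
import networkx as nx

g = nx.DiGraph()
g.add_node("pano_A", lat=40.7128, lon=-74.0060)
g.add_node("pano_B", lat=40.7130, lon=-74.0058)
g.add_edge("pano_A", "pano_B", heading_deg=45.0)   # move NE from A to B
g.add_edge("pano_B", "pano_A", heading_deg=225.0)  # and back


def candidate_moves(graph: nx.DiGraph, node: str) -> list:
    """Enumerate the outgoing links the agent can choose between at a node."""
    return [(nbr, graph[node][nbr]["heading_deg"]) for nbr in graph.successors(node)]
```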

Evaluation of navigation performance utilizes a comparative analysis of several large multimodal language models (MLLMs), specifically GPT-4, Gemini, and Qwen-2.5VL. These models are subjected to a suite of navigational tasks within the CityNav dataset, encompassing scenarios designed to assess their ability to interpret visual input and generate appropriate navigational instructions. Performance metrics focus on successful completion of the tasks, with the dataset providing a standardized benchmark for quantifying the efficacy of each model’s navigational reasoning and instruction-following capabilities. Variations in scenario complexity and environmental conditions are included to provide a comprehensive assessment of model robustness.
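
A typical aggregate metric for this kind of evaluation is the fraction of episodes in which the agent stops within some radius of the goal; the sketch below assumes such a success radius purely for illustration.

```python
# Sketch of an aggregate navigation metric: the fraction of episodes ending
# within a success radius of the goal. The radius and episode fields are
# assumptions for illustration, not the benchmark's exact definition.
def success_rate(episodes: list, radius_m: float = 25.0) -> float:
    """Each episode is a dict with 'final_dist_m' (distance to goal at stop)."""
    if not episodes:
        return 0.0
    successes = sum(1 for ep in episodes if ep["final_dist_m"] <= radius_m)
    return successes / len(episodes)


# Example: three episodes, two of which end within 25 m of the goal.
print(success_rate([{"final_dist_m": 10.0}, {"final_dist_m": 80.0}, {"final_dist_m": 5.0}]))
```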

Performance gains achieved through iterative refinement, specifically utilizing a ‘Reflection’ mechanism with models like GPT-5, have been quantitatively demonstrated using the CityNav dataset. Implementation of the Verbalization of Path (VoP) technique resulted in a 92% success rate for navigational tasks, a substantial improvement over the 15% baseline, indicating that the ability to iteratively refine plans and responses significantly enhances the efficacy of MLLMs in complex navigational scenarios.
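
The following sketch gives a sense of how a reflection loop of this kind could be structured, with hypothetical `ask_model` and `run_episode` callables standing in for the MLLM call and the environment rollout; neither is part of the benchmark’s actual API.

```python
# Minimal sketch of a reflection loop: after a failed attempt, the agent
# critiques its own verbalized path and retries. `ask_model` and `run_episode`
# are hypothetical callables, not part of the benchmark's actual API.
def navigate_with_reflection(goal, ask_model, run_episode, max_attempts=3):
    """Try, reflect on failure, and retry with the critique folded into the prompt."""
    critique = ""
    for _ in range(max_attempts):
        plan = ask_model(f"Goal: {goal}\n{critique}\nVerbalize the path, then act.")
        success, trace = run_episode(plan)  # hypothetical simulator rollout
        if success:
            return plan
        # Ask the model to critique its own verbalized path before retrying.
        critique = ask_model(
            "The attempt failed. Trace:\n" + trace +
            "\nExplain the likely error in the verbalized path and how to fix it."
        )
    return None  # no successful plan within the attempt budget
```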

Charting the Course: Future Directions in Autonomous Navigation

Advancing autonomous navigation hinges on developing more sophisticated reasoning capabilities within artificial intelligence. Current systems often struggle with unpredictable environments due to limitations in how they process information and recall past experiences. Researchers are now focusing on innovative memory architectures, moving beyond simple data storage to systems that can prioritize, associate, and retrieve relevant information with greater efficiency. Concurrently, attention mechanisms are being refined to allow agents to selectively focus on the most crucial aspects of a scene, filtering out distractions and enhancing processing speed. These combined efforts aim to create AI that doesn’t just react to stimuli, but actively understands its surroundings and anticipates future events, ultimately enabling truly robust and adaptable navigation in the real world.

The capacity for autonomous agents to navigate unpredictable real-world scenarios hinges on their ability to move beyond learned data and incorporate broader understanding. Current systems often struggle with situations not explicitly encountered during training; integrating external knowledge sources, such as knowledge graphs or readily available databases, offers a potential solution. More critically, equipping these agents with common-sense reasoning – the intuitive understanding of physical laws, social norms, and everyday objects – promises to bridge the gap between data processing and genuine comprehension. This allows an agent to, for example, infer that a closed door likely presents an obstacle, or that a wet surface might be slippery, even without prior experience. By combining readily accessible information with an inherent grasp of the world’s basic principles, future autonomous systems can exhibit far greater resilience and adaptability when faced with the inevitable complexities of genuine navigation.

The long-term vision driving research in autonomous navigation extends beyond simply enabling robots to move from point A to point B. It centers on developing artificial intelligence agents capable of fluidly integrating into human environments and providing meaningful assistance across diverse applications. These agents are envisioned to perform tasks ranging from logistical support in warehouses and delivery services to providing companionship and aid to individuals with mobility challenges. Successful realization of this goal necessitates not only sophisticated navigational abilities, but also a capacity for understanding complex human intentions, adapting to dynamic surroundings, and operating safely and reliably in unpredictable real-world scenarios, ultimately fostering a collaborative relationship between humans and intelligent machines.

The pursuit of robust navigation within multimodal large language models, as demonstrated by CityNav, necessitates a deep understanding of how these systems internally represent and utilize world knowledge. It’s a process akin to discerning patterns – a core tenet of insightful analysis. Yann LeCun aptly observes, “Everything we do is pattern recognition.” This resonates strongly with the ‘Verbalization of Path’ method, which forces the model to articulate its reasoning, revealing the underlying patterns it employs for spatial understanding. By explicitly eliciting this internal knowledge, researchers can effectively diagnose weaknesses and refine the model’s ability to navigate complex urban environments, ultimately improving its capacity for reliable spatial reasoning.

Beyond the Street View: Charting a Course for Embodied Intelligence

The introduction of CityNav, and the observed efficacy of prompting models to explicitly ‘think aloud’ via verbalized paths, highlights a curious pattern. Performance gains aren’t necessarily about seeing more, but about structuring what is already known. This suggests a fundamental limitation in current multimodal large language models: a reliance on correlation over genuine spatial understanding. The benchmark itself, while a valuable step towards realism, remains constrained by the fidelity – and inherent biases – of the Google Street View data. What lies beyond the captured image? What assumptions are embedded within the very act of creating a navigable digital twin of a city?

Future work must address the problem of incomplete knowledge. Models currently excel at interpolating between observed states, but struggle with true extrapolation – envisioning pathways not directly visible within the training data. Investigating methods for incorporating prior maps, architectural blueprints, or even abstract city planning data could reveal whether these models can truly reason about urban spaces, or are simply sophisticated pattern-matching engines. The ‘Verbalization of Path’ technique is promising, but its effectiveness relies on the model’s ability to accurately articulate its internal reasoning. Can this articulation be reliably assessed, and more importantly, can it be used to diagnose and correct flawed spatial understandings?

Ultimately, the pursuit of embodied intelligence in complex environments demands a shift in evaluation. Metrics focused solely on successful navigation miss the crucial point: the quality of the understanding that underpins that navigation. Perhaps the true benchmark lies not in reaching a destination, but in the ability to convincingly explain why a particular route was chosen, accounting for both observed features and unobserved possibilities.


Original article: https://arxiv.org/pdf/2512.15933.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
