Seeing Around Corners: AI Tracks Vehicles Across Multiple Cameras

Author: Denis Avetisyan


Researchers have developed a novel AI system that uses map data to follow vehicles even when they move out of view of individual security cameras.

SPOT, a map-guided large language model agent, achieves unsupervised multi-camera tracking by integrating spatial reasoning and trajectory prediction.

Reliable multi-camera tracking remains challenging due to inevitable blind spots and fragmented trajectories in complex CCTV environments. This paper introduces ‘SPOT!: Map-Guided LLM Agent for Unsupervised Multi-CCTV Dynamic Object Tracking’, a novel approach leveraging large language models and spatial reasoning to maintain continuous vehicle tracking even without prior training. By integrating map information with observed vehicle movements and employing a beam search strategy, SPOT accurately predicts likely re-appearance locations after blind spots. Could this map-guided LLM framework unlock more robust and scalable solutions for real-time surveillance and autonomous navigation?


Decoding Complexity: The Challenge of Multi-Camera Tracking

Conventional object tracking systems often falter when faced with the complexities of real-world surveillance. Dynamic environments – bustling city streets or crowded shopping malls – introduce unpredictable motion and frequent changes in scene layout, challenging algorithms designed for static backgrounds. Critically, the presence of occlusions – where objects are temporarily hidden behind others or by environmental features – disrupts tracking continuity. This issue is significantly exacerbated when leveraging multiple, uncoordinated CCTV cameras; each camera provides a limited, often disjointed, view of the scene, making it difficult to maintain a consistent object identity across camera perspectives. The lack of synchronization and differing viewpoints introduce data association problems, requiring sophisticated algorithms to correctly link fragmented observations and reconstruct complete trajectories, a task that remains a substantial hurdle in reliable multi-camera tracking.

Seamless continuation of object tracking is paramount in applications demanding constant situational awareness, such as intelligent traffic management and public safety surveillance. Yet current methods frequently falter amid the complexities of dynamic environments – crowded city streets, rapidly changing weather – and struggle to maintain object identities through occlusions or abrupt movements, leaving fragmented trajectories and lost data. This breakdown in continuous tracking hinders accurate analysis, impedes proactive responses to incidents, and ultimately diminishes the effectiveness of systems designed to ensure safety and optimize resource allocation.

Current multi-camera tracking systems often struggle when an object moves outside the field of view of one camera, relying heavily on immediate visual confirmation from each sensor. This limitation stems from a difficulty in establishing robust spatial reasoning; the systems lack the ability to infer an object’s likely position based on its previous trajectory and an understanding of the physical environment. Consequently, tracking is frequently interrupted when an object is temporarily obscured or moves between camera perspectives, hindering continuous monitoring. Advancements require algorithms capable of predicting object paths, effectively extrapolating beyond the immediate sensor range and leveraging contextual information – such as known road layouts or pedestrian walkways – to maintain a consistent track even when visual data is limited or absent.

SPOT: Weaving Spatial Logic with Large Language Models

SPOT addresses the challenge of tracking objects across multiple CCTV cameras without manual annotation by combining Large Language Models (LLMs) with map-based spatial data. This integration enables unsupervised multi-camera tracking, eliminating the need for pre-defined trajectories or labeled training data. The system leverages the LLM’s reasoning capabilities, augmented by map information detailing camera positions and road networks, to infer object movement and maintain tracking continuity. Unlike traditional tracking methods reliant on visual features alone, SPOT’s approach allows it to reason about potential object locations and re-identify targets even after temporary occlusions or when transitioning between camera views.

SPOT employs Retrieval-Augmented Generation (RAG) to enhance the Large Language Model’s (LLM) capacity for spatial understanding. This process involves retrieving pertinent map data – including road network layouts and precise camera positions – from a vector database and presenting it to the LLM as contextual information. The retrieved data is formatted as text prompts, allowing the LLM to reason about geographic relationships and object locations within the environment. Specifically, RAG facilitates the LLM’s ability to correlate camera views, understand road connectivity, and determine plausible object trajectories based on the provided map context, without requiring explicit training on spatial data.
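To make the mechanism concrete, here is a minimal, self-contained sketch of the retrieval step – not the authors' implementation. It stands in for the vector database with a crude word-overlap score, and every camera and street name is invented; in SPOT, the snippets would be embedded and searched with a proper vector index before being handed to the LLM.

```python
# Minimal RAG sketch (illustrative only): map facts are stored as text
# snippets, the most relevant ones are retrieved for the vehicle's recent
# movement, and the result is formatted into the LLM prompt.

MAP_SNIPPETS = [
    "Camera CAM_03 faces north at the Elm St / 5th Ave intersection.",
    "Elm St connects to Oak Rd via a one-way segment heading east.",
    "Camera CAM_07 covers the Oak Rd roundabout from the south side.",
]

def score(query: str, snippet: str) -> float:
    """Crude relevance score: word overlap between query and snippet."""
    q, s = set(query.lower().split()), set(snippet.lower().split())
    return len(q & s) / max(len(q | s), 1)

def build_prompt(observation: str, k: int = 2) -> str:
    """Retrieve the top-k map snippets and embed them in the prompt."""
    ranked = sorted(MAP_SNIPPETS, key=lambda s: score(observation, s),
                    reverse=True)
    context = "\n".join(ranked[:k])
    return (f"Map context:\n{context}\n\n"
            f"Observation: {observation}\n"
            f"Where is the vehicle most likely to re-appear next?")

print(build_prompt("Vehicle last seen on Elm St heading east past CAM_03"))
```

In a production setting the overlap score would be replaced by embedding similarity over a vector database, but the flow – retrieve map facts, format them as text, prepend them to the query – is the same.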

SPOT’s path inference and predictive capabilities are achieved by representing the tracked environment as a graph of Waypoints – specific, identifiable locations within the camera network’s field of view. These Waypoints are defined within a consistent World Coordinate System, allowing the system to calculate object trajectories and extrapolate movement beyond individual camera perspectives. This coordinate system enables SPOT to reason about an object’s position relative to the entire monitored area, even when the object temporarily leaves the view of one or more cameras – effectively addressing Blind Spots. By analyzing the sequence of visited Waypoints and applying path prediction algorithms, SPOT estimates the likely future location of an object, enabling continuous, unsupervised tracking across multiple cameras.
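The idea can be illustrated with a toy waypoint graph; the layout, coordinates, and camera coverage below are invented for illustration. A breadth-first walk over road edges bounds where a vehicle that vanished at one waypoint can plausibly re-surface within a few hops:

```python
# Waypoints carry world coordinates; edges follow the road network. When a
# vehicle vanishes at a waypoint, graph neighbors bound its re-appearance.

from collections import deque

# (x, y) positions in a shared world coordinate system, all values invented.
WAYPOINTS = {"A": (0, 0), "B": (50, 0), "C": (50, 40), "D": (100, 40)}
ROADS = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
COVERED = {"A", "D"}  # waypoints inside some camera's field of view

def reachable_covered(start: str, max_hops: int = 3) -> list[str]:
    """Covered waypoints reachable from `start` within `max_hops` edges."""
    seen, frontier, hits = {start}, deque([(start, 0)]), []
    while frontier:
        node, hops = frontier.popleft()
        if node in COVERED and node != start:
            hits.append(node)
        if hops < max_hops:
            for nxt in ROADS[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, hops + 1))
    return hits

print(reachable_covered("A"))  # ['D']: the only covered re-appearance point
```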

Validating Intelligence: Performance in Simulated Environments

Performance evaluation of SPOT was conducted within the CARLA simulator, a widely used open-source platform for autonomous driving research. This environment allowed for precise control over variables such as lighting, weather, and traffic density, facilitating repeatable, statistically meaningful testing of tracking accuracy and robustness. The CARLA simulator provides ground truth data for object positions and velocities, enabling quantitative assessment of tracking errors and a comparative analysis against baseline methods. Rigorous testing within this controlled environment ensured reliable measurement of SPOT’s performance characteristics independent of real-world complexities.
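For readers unfamiliar with the simulator, ground truth of this kind can be read directly from CARLA's Python API. The brief sketch below shows the general pattern and assumes a CARLA server running locally on its default port; it is not the paper's evaluation script.

```python
# Minimal sketch of pulling ground-truth vehicle states from CARLA's Python
# API -- the kind of reference data used to score tracking error.

import carla

client = carla.Client("localhost", 2000)   # default host/port
client.set_timeout(5.0)
world = client.get_world()

for vehicle in world.get_actors().filter("vehicle.*"):
    transform = vehicle.get_transform()     # exact world-frame pose
    velocity = vehicle.get_velocity()       # exact velocity vector (m/s)
    print(vehicle.id,
          round(transform.location.x, 1),
          round(transform.location.y, 1),
          round(velocity.x, 1), round(velocity.y, 1))
```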

Comparative analysis against a Heuristic Baseline revealed SPOT’s superior performance in maintaining consistent target tracking. The baseline system exhibited a significantly higher frequency of track losses and discontinuities, particularly in scenarios involving occlusions or rapid movements. SPOT, conversely, demonstrated improved tracking continuity, minimizing instances where the target was no longer reliably localized. Quantitative metrics showed that SPOT achieved a demonstrable reduction in lost tracks, validating its enhanced robustness and its ability to maintain target identification over extended periods relative to the baseline approach.

Performance evaluations incorporated multiple Large Language Models (LLMs) to assess SPOT’s flexibility across varying reasoning aptitudes. Testing with DeepSeek-chat and Llama-3 revealed that DeepSeek-chat yielded the lowest error rates; specifically, the model achieved an Average Displacement Error (ADE) of 3.64 meters and a Frontal Displacement Error (FDE-X) of 3.16 meters during tracking experiments. These metrics demonstrate the system’s ability to maintain accurate positional estimates even when leveraging different LLM inference capabilities.
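For reference, ADE is conventionally the mean L2 distance between predicted and ground-truth positions over all timesteps, and FDE (conventionally, Final Displacement Error) the distance at the final timestep; the FDE-X reported here presumably measures that final error along a single axis. A short sketch of the standard definitions, with invented numbers:

```python
# Conventional ADE/FDE computation for trajectory prediction. The toy
# trajectories below are illustrative, not the paper's data.

import numpy as np

def ade(pred: np.ndarray, truth: np.ndarray) -> float:
    """Average Displacement Error: mean L2 distance over all timesteps."""
    return float(np.mean(np.linalg.norm(pred - truth, axis=1)))

def fde(pred: np.ndarray, truth: np.ndarray) -> float:
    """Final Displacement Error: L2 distance at the last timestep."""
    return float(np.linalg.norm(pred[-1] - truth[-1]))

truth = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]])   # metres
pred  = np.array([[0.5, 0.0], [5.0, 1.0], [12.0, 1.5]])

print(f"ADE = {ade(pred, truth):.2f} m, FDE = {fde(pred, truth):.2f} m")
```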

SPOT utilizes Beam Search as a path exploration and selection mechanism to predict CCTV locations following a period of obscured visibility, termed a Blind Spot. This algorithm maintains a set of likely paths, iteratively evaluating and pruning them based on predicted probabilities. The implementation assesses multiple hypotheses concurrently, allowing for efficient exploration of the solution space. Quantitative evaluation demonstrates a 70% Top-1 success rate, indicating that the correct CCTV location was identified as the most probable outcome in 70% of tested Blind Spot scenarios.
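The mechanics are those of standard beam search. The sketch below uses invented transition log-probabilities where SPOT would use scores derived from the LLM's map-guided reasoning, and keeps only the two best partial paths at each step:

```python
# Illustrative beam search over candidate re-appearance paths.

import math

# log-probabilities of moving between waypoints (toy values)
TRANS = {
    ("A", "B"): math.log(0.7), ("A", "C"): math.log(0.3),
    ("B", "D"): math.log(0.6), ("B", "E"): math.log(0.4),
    ("C", "D"): math.log(0.5), ("C", "F"): math.log(0.5),
}

def beam_search(start: str, steps: int, width: int = 2):
    beams = [([start], 0.0)]                      # (path, log-prob)
    for _ in range(steps):
        expanded = []
        for path, lp in beams:
            nexts = [(b, p) for (a, b), p in TRANS.items() if a == path[-1]]
            if not nexts:                         # dead end: keep path as-is
                expanded.append((path, lp))
            for node, p in nexts:
                expanded.append((path + [node], lp + p))
        beams = sorted(expanded, key=lambda x: x[1], reverse=True)[:width]
    return beams

for path, lp in beam_search("A", steps=2):
    print(" -> ".join(path), f"p={math.exp(lp):.2f}")
```

Pruning to a fixed beam width is what keeps the hypothesis set tractable as the road network branches; widening the beam trades compute for a better chance that the true path survives pruning.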

Beyond Observation: Charting a Course Towards Anticipatory Systems

SPOT’s core innovation extends far beyond merely following objects; it establishes a framework for anticipating events within a monitored space. By reasoning about the spatial relationships between entities and predicting their likely paths, the system can proactively identify potentially hazardous situations – such as a pedestrian entering a restricted zone or an object moving against the flow of traffic – before they fully unfold. This capability enables a shift from reactive alerts to preventative measures, and extends to anomaly detection, flagging unusual behaviors or patterns that deviate from established norms. The system doesn’t simply report what is, but rather assesses what might be, creating opportunities for automated intervention or heightened situational awareness in complex environments.

The confluence of large language models and detailed map information presents a remarkably adaptable and expandable approach to object tracking within challenging urban landscapes. This system doesn’t simply register positions; it leverages the semantic understanding of LLMs – recognizing what is being tracked – and combines this with precise geospatial data. This allows for tracking even when objects are temporarily obscured, predicting movement based on contextual understanding of the environment – such as road networks or pedestrian walkways – and seamlessly scaling to cover vast areas with numerous cameras. Unlike traditional tracking systems reliant on rigid programming, this integration fosters a degree of flexibility, enabling the system to learn and adapt to evolving urban dynamics and novel situations without extensive recalibration, promising a future where intelligent tracking anticipates rather than merely records movement.

Continued development of SPOT aims to move beyond static map data and embrace the complexities of real-world urban environments. Researchers are actively integrating real-time sensor inputs – such as those from lidar and radar – to allow the system to adapt to unpredictable events and dynamic changes in pedestrian or vehicle movement. Furthermore, exploration into generative models promises to refine SPOT’s predictive capabilities, allowing it to not simply forecast likely trajectories, but to anticipate a range of possible futures and proactively identify potential anomalies or incidents before they unfold. This shift towards a more adaptive and anticipatory system represents a significant step towards truly intelligent urban surveillance and incident management.

Effective camera handover in multi-camera tracking systems, such as SPOT, fundamentally relies on a precise understanding of each camera’s field of view (FOV). This isn’t simply about knowing the extent of the visual range; it’s about calculating the geometric overlap between adjacent cameras and accurately predicting when an object will transition from one camera’s FOV to another. Without this precise spatial awareness, tracking continuity is compromised, leading to dropped objects or inaccurate trajectories. The system must account for lens distortion, camera positioning, and even potential obstructions to reliably determine if an object remains within the overall monitored area, ensuring a seamless transfer of tracking responsibility between cameras and maintaining consistent observation across a wide environment.
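In two dimensions, the core geometric test reduces to checking whether a point falls inside a camera's viewing wedge, defined by position, heading, angular width, and range. The sketch below uses invented camera parameters and ignores lens distortion and obstructions, which a deployed system would also have to model:

```python
# Geometric sketch of a 2-D field-of-view test used for camera handover.

import math

def in_fov(cam_pos, cam_heading_deg, fov_deg, max_range, point) -> bool:
    """True if `point` lies inside the camera's viewing wedge."""
    dx, dy = point[0] - cam_pos[0], point[1] - cam_pos[1]
    dist = math.hypot(dx, dy)
    if dist > max_range:
        return False
    bearing = math.degrees(math.atan2(dy, dx))
    # smallest signed angle between bearing and camera heading
    off = (bearing - cam_heading_deg + 180) % 360 - 180
    return abs(off) <= fov_deg / 2

cam_a = dict(cam_pos=(0, 0), cam_heading_deg=0, fov_deg=90, max_range=60)
cam_b = dict(cam_pos=(80, 0), cam_heading_deg=180, fov_deg=90, max_range=60)

vehicle = (55, 5)   # approaching the seam between the two cameras
print("A sees it:", in_fov(point=vehicle, **cam_a))   # True
print("B sees it:", in_fov(point=vehicle, **cam_b))   # True
```

When both wedges report the vehicle, it sits in the overlap region where handover should occur; when neither does, it has entered a blind spot and the predictive machinery described above takes over.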

The development of SPOT demonstrates a powerful synergy between spatial understanding and predictive reasoning. By integrating map data with large language models, the system transcends simple object detection, achieving robust tracking even when visual information is limited. This approach mirrors the importance of identifying underlying patterns to decipher complex systems. As Andrew Ng aptly stated, “Machine learning is about learning the mapping from inputs to outputs.” SPOT exemplifies this by learning the mapping between visual inputs, spatial data, and predicted vehicle trajectories, effectively navigating blind spots and showcasing the potential of LLMs in dynamic environments. The beam search algorithm further refines this mapping, ensuring accurate and consistent tracking.

Where Do the Roads Lead?

The successful integration of spatial maps with large language models, as demonstrated by SPOT, feels akin to equipping a navigating organism with a more complete proprioceptive sense. It’s no longer merely seeing motion, but understanding it within a persistent, relational framework. Yet, the illusion of complete tracking remains just that – an illusion. Blind spots, while mitigated, still represent information loss, and the beam search, while effective, introduces a computational cost that scales with environmental complexity. Future work must address this inherent tension between comprehensive awareness and efficient processing.

A compelling direction lies in exploring the system’s capacity for predictive mapping. Rather than reacting to observed trajectories, could SPOT anticipate vehicle movements based on learned patterns of behavior – essentially, modeling the ‘intent’ of objects within the scene? This pushes the boundary towards genuine spatial reasoning, moving beyond mere correlation to something resembling causal understanding. Furthermore, the current reliance on pre-existing maps presents a limitation; a truly robust system should be capable of incrementally building and refining its spatial knowledge – a form of autonomous cartography.

Ultimately, the quest for perfect tracking feels like chasing a thermodynamic ideal – achievable only in a closed system. Real-world environments are inherently noisy and unpredictable. Perhaps the most fruitful avenue for future research isn’t striving for flawless reconstruction of trajectories, but instead developing methods for gracefully handling uncertainty – for quantifying and communicating the limits of perception, and building systems that are robust to inevitable errors.


Original article: https://arxiv.org/pdf/2512.20975.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-12-27 19:11