Author: Denis Avetisyan
A new approach empowers AI agents to autonomously discover and integrate data from complex sources by reasoning about what the data means, not just what it contains.

This paper introduces Metadata Reasoner, an agentic system that surpasses traditional vector search for data discovery and integration by leveraging metadata reasoning over complex data lakes.
As large language model-driven agents tackle increasingly complex data-intensive tasks, identifying relevant data sources from vast and often noisy data lakes remains a critical bottleneck. This paper introduces ‘An Agentic Approach to Metadata Reasoning’, presenting the Metadata Reasoner, an agentic system that autonomously discovers and selects optimal datasets by reasoning over available metadata. Through empirical evaluation on benchmarks like KramaBench and a novel synthetic dataset, we demonstrate that this approach achieves state-of-the-art accuracy, exceeding baseline methods by a substantial margin and maintaining robustness against redundant or low-quality tables. Could agentic metadata reasoning unlock a new paradigm for efficient and reliable data integration in complex analytical workflows?
The Allure and Illusion of Data Lakes
Despite the allure of democratized data access, contemporary data lakes often present a paradox of potential versus practicality. While designed to consolidate diverse information sources, these repositories frequently become bogged down by inherent complexity. The very characteristics that define a data lake – its scale, variety, and velocity of incoming data – contribute to significant quality issues. Data arrives in numerous formats, often lacking consistent structure or clear metadata, necessitating extensive cleaning and transformation efforts. This, in turn, hinders analytical efficiency, as data scientists spend considerable time preparing data rather than deriving meaningful insights. The result is a substantial increase in costs and a frustrating slowdown in the delivery of data-driven solutions, ultimately diminishing the return on investment for organizations embracing this technology.
The promise of rapid insight from data lakes often clashes with the reality of protracted data discovery processes. Traditional methods, reliant on manual exploration or predefined schemas, falter when confronted with the sheer volume and diverse formats characteristic of these repositories. This struggle isn’t merely a matter of inconvenience; it directly translates to escalating costs as data scientists spend valuable time locating, understanding, and preparing data instead of analyzing it. The inherent heterogeneity – encompassing structured, semi-structured, and unstructured data from numerous sources – demands increasingly complex and time-consuming queries and transformations. Consequently, organizations find themselves facing delayed project timelines, diminished return on investment, and a growing backlog of untapped data potential, effectively negating the very benefits a data lake is designed to provide.
The reliability of analytical outcomes from modern data lakes is frequently compromised by inherent data quality issues. Specifically, the accumulation of Data Noise – irrelevant or erroneous entries – alongside pervasive Data Redundancy, where the same information is stored multiple times, creates a distorted view of underlying patterns. Critically, this is often compounded by incomplete or missing Schema Information, meaning the structure and meaning of the data are poorly defined or inconsistently applied. Consequently, analytical processes must contend with inaccuracies, increased computational demands for data cleaning, and the potential for fundamentally flawed conclusions, ultimately diminishing the value derived from the data lake investment.
The escalating complexity of modern data lakes necessitates a move beyond manual data source selection towards systems capable of intelligent automation. Current analytical inefficiencies stem from the sheer volume and variety of data, often requiring extensive effort to identify relevant and reliable sources. Automated approaches leverage metadata analysis, data profiling, and machine learning algorithms to assess data quality, lineage, and relevance to specific analytical tasks. This proactive selection process minimizes the impact of data noise and redundancy, ensuring analysts work with curated datasets. Ultimately, intelligent automation not only accelerates insight generation but also reduces costs associated with data preparation and improves the trustworthiness of derived results, allowing organizations to fully realize the potential of their data lake investments.

A New Paradigm: Intelligent Metadata Reasoning
The Metadata Reasoner is an LLM-Driven Autonomous Agent designed to automate the identification of data sources required for specific analytical tasks. This agent operates by evaluating available metadata to determine the minimal set of data sources – tables, views, or files – that collectively contain the information necessary to fulfill a given analytical request. Unlike traditional approaches requiring manual data source selection, the Metadata Reasoner dynamically assesses data relevance and sufficiency, aiming to reduce redundant data access and streamline analytical workflows. The agent’s autonomous nature allows it to operate without direct user intervention in the data source selection process, increasing efficiency and scalability.
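The idea of identifying a minimal set of data sources can be illustrated with a simple greedy set-cover sketch. This is an illustrative stand-in for the agent's LLM-driven selection, not the paper's actual algorithm; the table names and schemas below are invented.

```python
# Sketch: greedily choose a small set of tables whose columns jointly cover
# the fields an analytical task needs. Hypothetical stand-in for the agent's
# reasoning step; not the Metadata Reasoner's real selection procedure.
def select_minimal_sources(required_fields, table_schemas):
    """Greedy set cover over table schemas; returns (chosen, uncovered)."""
    remaining = set(required_fields)
    chosen = []
    while remaining:
        # Pick the table covering the most still-uncovered fields.
        best = max(table_schemas,
                   key=lambda t: len(remaining & set(table_schemas[t])))
        gain = remaining & set(table_schemas[best])
        if not gain:
            break  # no table covers the remaining fields
        chosen.append(best)
        remaining -= gain
    return chosen, remaining

schemas = {
    "galaxies": ["name", "redshift", "morphology"],
    "obs_runs": ["run_id", "name", "exposure"],
    "staff":    ["person", "role"],
}
chosen, uncovered = select_minimal_sources({"name", "redshift", "exposure"}, schemas)
print(chosen)  # → ['galaxies', 'obs_runs']
```

In the actual system this decision is made by an LLM reasoning over richer metadata, but the goal is the same: cover the task's information needs with as few sources as possible.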
The Metadata Reasoner utilizes both Attached Metadata – pre-existing descriptive information accompanying datasets, such as schema details and lineage – and On-the-Fly Metadata, dynamically generated attributes like data profiles and statistical summaries. This combination enables a relevance assessment that goes beyond the limitations of static data catalogs, which typically rely on manually curated descriptions. By continuously evaluating data characteristics during query processing, the agent can identify suitable data sources based on current analytical needs, even if those sources weren’t explicitly indexed or tagged in a traditional catalog. This dynamic approach ensures a more comprehensive and accurate evaluation of data relevance, improving the efficiency of analytical workflows.
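On-the-Fly Metadata of the kind described above can be sketched as a lightweight per-column data profile. The column names and profile fields below are illustrative assumptions, not the paper's implementation.

```python
# Sketch: generating "on-the-fly" metadata (a data profile) for a table,
# complementing attached metadata such as schema descriptions.
# Profile fields (null fraction, distinct count, numeric summaries) are
# illustrative choices, not the paper's actual profiler.
import statistics

def profile_table(rows, columns):
    """Compute a lightweight per-column profile from sampled rows."""
    profile = {}
    for i, col in enumerate(columns):
        values = [r[i] for r in rows if r[i] is not None]
        entry = {
            "null_fraction": 1 - len(values) / len(rows),
            "distinct_count": len(set(values)),
        }
        if values and all(isinstance(v, (int, float)) for v in values):
            entry["min"] = min(values)
            entry["max"] = max(values)
            entry["mean"] = statistics.mean(values)
        profile[col] = entry
    return profile

rows = [(1, "NGC-1275", 0.017), (2, "M87", 0.004), (3, None, 0.158)]
prof = profile_table(rows, ["id", "object_name", "redshift"])
print(prof["redshift"]["max"])  # → 0.158
```

Profiles like this can be computed at query time, which is what lets the agent assess sources that were never indexed in a static catalog.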
The agentic workflow utilizes discrimination-oriented metadata to generate embeddings that emphasize distinguishing features of each table within a data lake. This approach moves beyond simple keyword matching by focusing on characteristics that differentiate tables, such as unique column combinations, data distributions, or functional dependencies. These embeddings are then used to perform semantic searches, allowing the agent to identify tables that are conceptually relevant to a given analytical task, even if they do not share explicit keywords with the query. The resulting increase in search precision minimizes the retrieval of irrelevant data sources, optimizing the data source selection process and reducing computational overhead.
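The embedding-and-search step can be sketched as follows. A real system would use a learned embedding model; here a bag-of-words vector stands in for the embedding so the ranking logic is runnable, and the table names and descriptions are invented.

```python
# Sketch: ranking tables by similarity between a query and
# discrimination-oriented metadata descriptions. Bag-of-words vectors are a
# toy stand-in for learned embeddings.
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Descriptions emphasize what distinguishes each table, not just keywords.
table_metadata = {
    "obs_runs": "telescope observation runs with exposure time and filter band",
    "galaxies": "galaxy catalog with redshift morphology and luminosity columns",
    "staff":    "observatory staff roster with names and contract dates",
}

def rank_tables(query, metadata):
    q = embed(query)
    scored = [(cosine(q, embed(desc)), name) for name, desc in metadata.items()]
    return [name for score, name in sorted(scored, reverse=True)]

print(rank_tables("which galaxies have the highest redshift", table_metadata)[0])
```

With learned embeddings, the same pipeline surfaces conceptually related tables even when the query shares no literal tokens with their descriptions.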
Automated data source selection addresses a critical bottleneck in data lake utilization by minimizing the manual effort currently required to identify relevant datasets for analysis. Traditional approaches necessitate significant time from data engineers and analysts to locate, validate, and prepare data, often resulting in underutilization of available resources. By autonomously assessing data relevance based on metadata and employing an agentic workflow, the system streamlines this process, reducing the time to insight and enabling broader access to data lake assets. This automation is projected to increase analytical productivity and unlock the full potential of data lakes, allowing organizations to derive greater value from their data investments by accelerating data-driven decision-making.

Validation with KramaBench: Real-World Performance
KramaBench is a benchmark designed to assess data reasoning capabilities in scenarios reflecting real-world data environments. It simulates the complexities of production data lakes by incorporating messy, incomplete, and diverse datasets sourced across multiple domains. The benchmark’s construction prioritizes realistic data characteristics, including schema variations, data quality issues, and the presence of irrelevant information, to provide a robust evaluation of systems like the Metadata Reasoner. This contrasts with simplified benchmarks that often lack the nuance of actual data landscapes, making KramaBench a more stringent and representative test of performance in practical applications.
The BIRD Dataset served as the foundation for evaluating the Metadata Reasoner’s performance in a realistic setting. This dataset was synthetically scaled to increase its volume and complexity, simulating the characteristics of large-scale data environments. Furthermore, Data Partitioning techniques were applied to the BIRD Dataset, introducing the fragmented and distributed nature common in real-world data lakes. This augmentation ensured the benchmark accurately reflected the challenges associated with data discovery and access in complex, partitioned data repositories, going beyond simple, consolidated datasets.
Text-to-SQL accuracy serves as the primary performance metric, specifically quantifying the correctness of SQL queries generated from natural language inputs. SQL Execution Accuracy is determined by executing the generated SQL query against the underlying database and verifying that the returned results match the expected ground truth. This evaluation method assesses not only the syntactic correctness of the SQL, but also its semantic validity in retrieving the appropriate data. A higher SQL Execution Accuracy indicates a more reliable and precise ability to translate user queries into actionable database commands, directly reflecting the effectiveness of the Metadata Reasoner in understanding and interpreting data requests.
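The execution-accuracy check described above can be sketched with SQLite: run the generated query and the ground-truth query against the same database and compare result sets. The schema and queries below are illustrative, not drawn from BIRD or KramaBench.

```python
# Sketch: SQL execution accuracy — a generated query is correct if it returns
# the same rows as the gold query (order-insensitive). Illustrative schema.
import sqlite3

def execution_match(conn, generated_sql, gold_sql):
    """True if both queries return the same rows, ignoring row order."""
    try:
        gen = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute counts as incorrect
    gold = conn.execute(gold_sql).fetchall()
    return sorted(gen) == sorted(gold)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE galaxies (name TEXT, redshift REAL)")
conn.executemany("INSERT INTO galaxies VALUES (?, ?)",
                 [("M87", 0.004), ("NGC-1275", 0.017)])

gold = "SELECT name FROM galaxies WHERE redshift > 0.01"
print(execution_match(conn, "SELECT name FROM galaxies WHERE redshift > 0.010", gold))  # → True
print(execution_match(conn, "SELECT name FROM galaxies", gold))                         # → False
```

Comparing executed results rather than SQL text is what makes the metric tolerant of syntactically different but semantically equivalent queries.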
Evaluation on the KramaBench benchmark demonstrates the Metadata Reasoner’s superior performance in data selection, achieving an average F1-score of 83.16%. This result represents a substantial improvement over baseline methods; vector search attained an F1-score of 50.77%, while Pneuma achieved 45.12% under the same testing conditions. The F1-score metric assesses the harmonic mean of precision and recall, providing a balanced measure of the system’s ability to accurately identify relevant data within the KramaBench dataset.
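The F1 metric for data selection can be made concrete with a few lines. The table names in the example are invented; the formula is the standard harmonic mean of precision and recall over selected versus ground-truth tables.

```python
# Sketch: F1-score for data-source selection — harmonic mean of precision
# (fraction of selected tables that are relevant) and recall (fraction of
# relevant tables that were selected).
def selection_f1(selected, relevant):
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)

# The agent picked two of three relevant tables plus one spurious table:
# precision = recall = 2/3, so F1 = 2/3.
print(selection_f1({"galaxies", "obs_runs", "staff"},
                   {"galaxies", "obs_runs", "filters"}))
```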

Beyond Automation: The Future of Data Exploration
The process of data exploration is frequently hindered by the considerable time investment needed to locate appropriate data sources and translate natural language questions into functional database queries. Recent advancements directly address these bottlenecks through automation; by intelligently selecting relevant data sources and significantly improving the accuracy of text-to-SQL conversion, systems can now drastically reduce the effort required for initial data investigation. This acceleration not only streamlines the workflow for data analysts but also empowers a broader range of users to independently access and interpret data, fostering a more data-driven approach to decision-making across organizations. The resulting efficiency gains allow teams to focus less on technical hurdles and more on extracting meaningful insights, ultimately unlocking the full potential of available information.
The capacity to accelerate data exploration directly translates into a competitive advantage for organizations across all sectors. By diminishing the time required to locate and interpret critical information, businesses can respond more swiftly to evolving market conditions and emerging opportunities. This isn’t merely about efficiency; it’s about fostering a culture where decisions are consistently informed by evidence, rather than intuition. A data-driven organization, empowered by rapid insights, can optimize processes, identify previously unseen trends, and ultimately, allocate resources with greater precision, leading to improved outcomes and sustained growth. The resulting agility allows for more effective risk management, innovative product development, and a deeper understanding of customer needs, cementing a position as a leader in its field.
Current data exploration often relies on methods like vector and semantic search to identify relevant information, but these approaches can be limited in their ability to prioritize the most pertinent results. A ranking-based retrieval system addresses this by intelligently re-ordering search outputs, ensuring that data with the highest probability of relevance appears first. This integration doesn’t simply find more data; it enhances both precision, minimizing irrelevant results, and recall, maximizing the capture of all relevant information. By combining the broad discovery capabilities of vector and semantic search with a focused ranking mechanism, data scientists can significantly reduce the time spent sifting through results and accelerate the process of deriving meaningful insights from complex datasets.
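A two-stage pipeline of this kind can be sketched as follows. Both scoring functions here are toy heuristics standing in for a real embedding search and a real re-ranker; the corpus and names are invented.

```python
# Sketch: ranking-based retrieval — a recall-oriented first stage produces
# candidates, then a precision-oriented second stage re-orders them.
# Both scorers are heuristic stand-ins, not real embedding/LLM calls.
def vector_search(query, corpus, k=3):
    """Stage 1: candidate retrieval (raw token overlap as a proxy)."""
    q = set(query.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return [name for name, _ in scored[:k]]

def rerank_score(query, description):
    """Stage 2 stand-in: overlap normalized by description length,
    penalizing long, generic descriptions."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / len(d) if d else 0.0

def retrieve(query, corpus, k=3):
    candidates = vector_search(query, corpus, k)
    return sorted(candidates,
                  key=lambda name: rerank_score(query, corpus[name]),
                  reverse=True)

corpus = {
    "galaxies": "galaxy redshift catalog",
    "notes": "misc notes about galaxy redshift and staff and budget and meetings",
    "staff": "staff roster",
}
print(retrieve("galaxy redshift", corpus, k=2))  # → ['galaxies', 'notes']
```

Here stage 1 cannot distinguish the focused catalog from the grab-bag notes table (both overlap the query equally), while the re-ranker promotes the focused source, which is the precision gain the paragraph describes.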
Recent evaluations demonstrate a significant advancement in data exploration capabilities, particularly within complex astronomical and biological datasets. A novel Metadata Reasoner achieved an F1-score of 72.31% when applied to astronomy data, more than doubling the performance of traditional vector search methods, which registered a score of 32.80%, and substantially exceeding the 27.70% achieved by Pneuma. This enhanced performance extends to the BIRD dataset, where the Reasoner attained an impressive 85.5% F1-score, again significantly surpassing the 30.0% of vector search. Further validation came through SQL Execution Accuracy, with the system achieving 71.28% – a marked improvement over the baseline of 56.38%, indicating a superior ability to accurately translate complex queries into actionable data retrieval.
The pursuit of seamless data integration, as explored in this work, inevitably reveals the limitations of any fixed architecture. Metadata Reasoner, with its agentic approach to data discovery, doesn’t build a solution so much as cultivate one: a system adapting to the inherent chaos of complex data lakes. It’s a pragmatic acceptance of change, acknowledging that dependencies will always outlive the technologies intended to manage them. As G.H. Hardy observed, ‘The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.’ This agentic system, therefore, doesn’t presume complete understanding of the metadata landscape, but rather navigates it with a cautious, evolving strategy: a compromise frozen in time, yet capable of refreezing.
What Lies Ahead?
The pursuit of autonomous data discovery, as exemplified by this work, does not resolve the fundamental tension inherent in any system built upon information. It merely relocates the points of failure. Metadata Reasoner proposes an elegant choreography of agents, reasoning over descriptions of data, but the descriptions themselves remain brittle approximations of reality. The system’s success is predicated on the assumption that metadata accurately reflects the underlying data, a prophecy perpetually on the verge of being disproven by data drift, schema evolution, and the inevitable human errors in curation. Order is just cache between two outages.
Future iterations will undoubtedly focus on robustness – on agents that can detect, and even correct, flawed metadata. But the deeper challenge lies in acknowledging that complete accuracy is an asymptotic goal. The system will never truly understand the data; it can only become increasingly adept at navigating its representations. The real innovation will not be in refining the reasoning process, but in accepting, and gracefully degrading around, inevitable ambiguity.
There are no best practices, only survivors. The long game isn’t about building perfect data catalogs or knowledge graphs; it’s about building systems that can adapt to their imperfections. The next frontier isn’t more powerful agents, but more humble ones: systems that recognize their limitations and prioritize resilience over absolute truth. Architecture is how one postpones chaos, not defeats it.
Original article: https://arxiv.org/pdf/2604.20144.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-24 00:12