AI-Generated Databases: Training Smarter Systems with Synthetic Workloads

Author: Denis Avetisyan


Researchers are leveraging the power of generative artificial intelligence to automatically create realistic database workloads, accelerating the development and optimization of key database components.

This review explores how generative AI, particularly large language models, can address challenges in data acquisition and scalability for training learned database systems, focusing on areas like cardinality estimation and query optimization.

Despite advances in deep learning for database optimization, acquiring sufficient, high-quality training data remains a critical bottleneck. This paper, ‘Automated Training of Learned Database Components with Generative AI’, investigates the potential of generative AI models to automatically synthesize realistic database workloads for training learned components like cardinality estimators and query optimizers. Initial findings suggest these models can effectively augment existing datasets, improving the adaptability of database techniques, but challenges around scalability and accurate labeling persist. Could generative AI fundamentally reshape how we develop and deploy intelligent database systems, moving beyond reliance on scarce, real-world data?


The Illusion of Static Workloads

Conventional database performance evaluation frequently employs static workloads – pre-defined sets of queries repeatedly executed against a system. However, this approach often fails to accurately reflect the dynamic and varied demands of real-world applications. These static benchmarks typically represent a narrow slice of potential query patterns, neglecting the long-tailed distributions and unpredictable access patterns characteristic of production environments. Consequently, optimization efforts guided by static workloads may yield misleading results, leading to systems that perform well under contrived conditions but falter when confronted with the complexity and unpredictability of genuine user activity. This discrepancy between benchmark performance and real-world behavior underscores the critical need for more representative and dynamic testing methodologies.

Existing benchmarks compound the problem by presenting a limited and artificial view of actual data access patterns. These standardized tests emphasize a narrow range of query types and data distributions, so database systems tuned to excel on them may perform poorly against genuine user queries, which involve diverse data relationships, intricate filtering criteria, and unpredictable access frequencies. Because static benchmarks cannot model the long-tailed distributions and evolving patterns inherent in production databases, optimization strategies derived from them are often flawed and can leave significant performance bottlenecks in place.

Robust database systems demand evaluation beyond static benchmarks; the true test lies in their ability to handle dynamically generated workloads mirroring real-world complexity. Current testing methodologies often fall short, failing to account for the unpredictable diversity of query patterns encountered in production environments. Consequently, optimization efforts based on these limited tests can be misdirected, leading to suboptimal performance when faced with authentic user demands. The creation of representative workloads, ones that evolve in both structure and volume, is therefore paramount for effective database development and tuning, enabling developers to proactively identify and address performance bottlenecks before they impact users and ensuring systems can reliably scale with growing data and increasing query loads.

Synthesizing Complexity: Generative Models for Database Workloads

Generative models, specifically those based on large language models, provide an automated method for creating database queries that vary in complexity and structure. Our research demonstrates the feasibility of using these models to produce SQL queries without manual construction. This is achieved by prompting the model with schema information and desired query characteristics, allowing it to generate syntactically correct and semantically valid queries. The generated queries are not limited to simple select statements; they include joins, aggregations, and other complex operations, enabling the creation of workloads that more closely resemble real-world database access patterns. The scale of query generation is significantly higher than traditional methods, facilitating the exploration of a wider range of potential database operations and edge cases.

Utilizing large language models, such as GPT, enables the programmatic generation of SQL queries that are syntactically correct and conform to a defined database schema. This is achieved through prompting strategies that incorporate schema information and desired query characteristics. Furthermore, these models can be conditioned to produce queries exhibiting realistic access patterns – including variations in query complexity, table joins, and data filtering – by training on existing query logs or through reinforcement learning techniques. The resulting queries are not merely valid SQL, but also statistically representative of workloads commonly observed in production database systems, facilitating more robust performance evaluation and system optimization.
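
To make the prompting strategy concrete, the sketch below shows one plausible way to assemble such a prompt. The `llm_complete` helper, the schema text, and the prompt wording are all illustrative placeholders rather than details from the paper; any LLM API could stand behind the helper.

```python
# Sketch: prompting an LLM to generate schema-valid SQL.
# `llm_complete` is a hypothetical wrapper around whatever model API is in
# use; it takes a prompt string and returns the model's text completion.

SCHEMA = """
CREATE TABLE customers (id INT, region TEXT, signup_date DATE);
CREATE TABLE orders (id INT, customer_id INT, total DECIMAL, placed_at DATE);
"""

PROMPT_TEMPLATE = """You are a SQL workload generator.
Given this schema:
{schema}
Write one syntactically valid SQL query that {characteristics}.
Return only the SQL, with no explanation."""

def generate_query(llm_complete, characteristics: str) -> str:
    """Ask the model for a single query with the requested properties."""
    prompt = PROMPT_TEMPLATE.format(schema=SCHEMA, characteristics=characteristics)
    return llm_complete(prompt).strip()

# Example call, steering the model toward a join with aggregation:
# sql = generate_query(my_llm, "joins orders to customers and sums total per region")
```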

Traditional database benchmarking relies on static, pre-defined query sets which offer limited coverage of potential workload variations. Programmatic query generation, utilizing techniques like generative models, allows for the automated creation of a significantly larger and more diverse range of queries. This approach enables the exploration of a vastly expanded workload space, including queries with varying complexity, data access patterns, and selectivity. By systematically altering parameters within the generation process, researchers and developers can simulate scenarios not easily captured by manually crafted benchmarks, leading to more robust and comprehensive system evaluation and optimization. This capability is crucial for identifying performance bottlenecks and ensuring database systems can handle real-world, unpredictable workloads.
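
A minimal sketch of such a parameter sweep follows; the workload dimensions and value ranges are invented for illustration, not drawn from the paper's experiments.

```python
import itertools
import random

# Sketch: sweeping generation parameters to cover a wider workload space
# than a fixed benchmark would. Dimension names and ranges are illustrative.

JOIN_COUNTS = [0, 1, 2, 3]                      # joins per query
PREDICATE_COUNTS = [1, 2, 4, 8]                 # filter predicates per query
SELECTIVITY_TARGETS = [0.001, 0.01, 0.1, 0.5]   # rough fraction of rows matched

def workload_specs():
    """Enumerate the cross-product of workload dimensions."""
    for joins, preds, sel in itertools.product(
            JOIN_COUNTS, PREDICATE_COUNTS, SELECTIVITY_TARGETS):
        yield {"joins": joins, "predicates": preds, "target_selectivity": sel}

specs = list(workload_specs())
random.shuffle(specs)  # interleave easy and hard cases across a test run
print(f"{len(specs)} distinct workload configurations to generate")
```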

Programmatically generated workloads facilitate the training and evaluation of learned database components, such as learned index structures, query optimizers, and caching mechanisms. Traditional benchmarks often lack the diversity required to fully assess these components across a representative range of access patterns. By utilizing generated queries, developers can expose learned components to a broader spectrum of workload characteristics, improving generalization and performance. Specifically, generated workloads allow for controlled experimentation, enabling precise measurement of component behavior under varying conditions and facilitating iterative refinement of learning algorithms. This approach moves beyond static evaluation, enabling data-driven optimization and ultimately boosting overall system performance metrics like query throughput and latency.
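
One common way to score a learned cardinality estimator on such a workload is the q-error metric, the larger of the over- and under-estimation ratios. The sketch below assumes a hypothetical `estimator` object and ground-truth labels obtained elsewhere (for example, by executing the queries).

```python
# Sketch: scoring a learned cardinality estimator on a generated workload
# with the q-error metric. `estimator.estimate` and the true cardinalities
# are stand-ins for the component under test and its ground-truth labels.

def q_error(estimated: float, actual: float) -> float:
    """q-error >= 1; a value of 1.0 means a perfect estimate."""
    estimated = max(estimated, 1.0)  # clamp to avoid division by zero
    actual = max(actual, 1.0)
    return max(estimated / actual, actual / estimated)

def evaluate(estimator, workload):
    """Summarize per-query q-errors over (sql, true_cardinality) pairs."""
    errors = sorted(q_error(estimator.estimate(sql), true_card)
                    for sql, true_card in workload)
    return {
        "median": errors[len(errors) // 2],
        "p95": errors[int(len(errors) * 0.95)],
        "max": errors[-1],
    }
```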

The Foundation of Optimization: Selectivity and Cardinality Estimation

Accurate selectivity and cardinality estimation are foundational to query optimization processes within database management systems. Selectivity, representing the fraction of rows that satisfy a given predicate, and cardinality, the number of rows in a result set, directly impact the choice of execution plans. Incorrect estimations can lead to suboptimal plans, resulting in significantly increased query response times and resource consumption. The query optimizer uses these estimations to evaluate the cost of different execution strategies – such as index scans versus full table scans – and selects the plan with the lowest predicted cost. Consequently, even small inaccuracies in selectivity or cardinality can propagate through the optimization process, leading to substantial performance degradation, particularly for complex queries involving multiple joins and filters.
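
A small worked example, with invented numbers and a deliberately simplistic cost model, shows how a selectivity estimate propagates into the optimizer's choice between an index scan and a full table scan.

```python
# Worked example (invented numbers): selectivity drives cardinality,
# and cardinality drives the optimizer's cost comparison.

total_rows = 10_000_000                 # rows in the table
selectivity = 0.002                     # fraction matching the predicate
cardinality = total_rows * selectivity  # expected result rows: 20,000

# Simplistic cost model: an index scan pays a random-access penalty per
# matching row; a full scan pays a small sequential cost per stored row.
index_scan_cost = cardinality * 4.0     # 80,000
full_scan_cost = total_rows * 0.01      # 100,000

best = "index scan" if index_scan_cost < full_scan_cost else "full scan"
print(f"estimated rows: {cardinality:,.0f} -> choose {best}")
# Misestimate selectivity at 0.2 and the same arithmetic flips the choice,
# making the index scan 80x more expensive than the full scan.
```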

Generative models, including those based on the GPT architecture, offer a data-driven approach to predicting query selectivity. These models are trained on datasets characterizing data distributions and indexing configurations, allowing them to learn the relationship between query predicates and the resulting fraction of rows returned. Input features typically include predicate types, data types, distinct value counts, minimum and maximum values, and index statistics such as the number of unique keys and the depth of the B-tree. The trained model then outputs a predicted selectivity, represented as a probability between 0 and 1, indicating the estimated proportion of rows that satisfy the query. Accuracy is improved by training on diverse query workloads representative of the expected operational load and by incorporating information about composite indexes and data skew.
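
The paper discusses GPT-style generative models; the sketch below swaps in a plain gradient-boosted regressor purely to illustrate the feature-to-selectivity mapping. The features mirror those listed above, and the training data is random noise standing in for real catalog statistics and observed selectivities.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Sketch: learning predicate selectivity from catalog-style features.
# The data here is random noise purely to show the shapes; a real setup
# would extract features from catalog statistics and label each row with
# a selectivity observed on the actual data.

rng = np.random.default_rng(0)
n = 5_000

# Feature columns: [distinct_count, min_val, max_val, predicate_width, has_index]
X = np.column_stack([
    rng.integers(1, 1_000_000, n),   # distinct values in the column
    rng.uniform(0, 100, n),          # column minimum
    rng.uniform(100, 1_000, n),      # column maximum
    rng.uniform(0, 500, n),          # width of the range predicate
    rng.integers(0, 2, n),           # whether an index exists
])
y = rng.uniform(0, 1, n)             # observed selectivity in [0, 1]

model = GradientBoostingRegressor().fit(X, y)
pred = np.clip(model.predict(X[:3]), 0.0, 1.0)  # keep outputs in [0, 1]
print(pred)
```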

Parameterized queries and text-to-SQL techniques significantly improve workload generation by moving beyond static query examples. Parameterization allows a single query template to represent numerous distinct queries through variable substitution, increasing the diversity of the generated workload without requiring the manual creation of each variation. Text-to-SQL methods, which translate natural language questions into SQL queries, automate this process further, enabling the creation of a wider range of queries based on different semantic expressions of the same underlying data requests. Combining these approaches produces workloads that more accurately reflect the distribution and complexity of real-world database usage, which is essential for robust training and evaluation of query optimization components.
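
A minimal sketch of the parameterization idea: one template plus small value pools already yields dozens of distinct queries. All table names and values here are invented.

```python
import random

# Sketch: one parameterized template expanding into many distinct queries.

TEMPLATE = (
    "SELECT {agg}({measure}) FROM orders "
    "WHERE region = '{region}' AND total > {threshold}"
)

POOLS = {
    "agg": ["SUM", "AVG", "COUNT", "MAX"],
    "measure": ["total", "quantity"],
    "region": ["EU", "NA", "APAC"],
    "threshold": [10, 100, 1000],
}

def instantiate(template: str, pools: dict, k: int):
    """Draw k random instantiations of the template."""
    for _ in range(k):
        yield template.format(**{name: random.choice(vals)
                                 for name, vals in pools.items()})

for sql in instantiate(TEMPLATE, POOLS, 3):
    print(sql)
```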

The creation of synthetic datasets, generated through techniques like parameterized queries and text-to-SQL, enables the training and validation of machine learning models designed for database systems. These datasets provide a controlled environment for evaluating learned components, such as query optimizers and index selectors, without reliance on production data or manual labeling. The ability to generate varied and representative workloads allows for robust testing across a range of query patterns and data distributions, directly demonstrating the feasibility of applying generative AI, as detailed in our research, to improve database performance and automation.
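
One plausible way to obtain ground-truth labels for generated queries is simply to execute them, as sketched below against an in-memory SQLite database. The schema and data are invented, and SQLite stands in for whatever engine the learned components actually target.

```python
import sqlite3

# Sketch: labeling generated queries with true cardinalities by executing
# them against a (small, sampled) database.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, ["EU", "NA", "APAC"][i % 3], float(i % 500)) for i in range(10_000)],
)

def label(sql: str) -> int:
    """True result cardinality, usable as a training label."""
    return conn.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]

generated = [
    "SELECT * FROM orders WHERE region = 'EU' AND total > 250",
    "SELECT * FROM orders WHERE total < 10",
]
dataset = [(sql, label(sql)) for sql in generated]
print(dataset)
```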

A System in Perpetual Refinement: Expanding Workloads and Continuous Improvement

The capacity to expand existing database workloads is significantly enhanced through the application of generative models. These models don’t simply replicate existing queries; they synthesize entirely new ones, often designed to probe the limits of the database system and expose previously unseen performance characteristics. This augmentation process introduces challenging queries that go beyond typical usage patterns, effectively stress-testing the database and its optimization strategies. By continually adding these synthetic queries to the workload, database systems can adapt and improve their ability to handle complex and varied data requests, leading to a more robust and efficient overall performance. The result is a dynamic system capable of evolving alongside increasing data volumes and user demands.

The iterative process of workload expansion doesn’t simply increase the volume of database queries, but actively cultivates a system for continuous improvement. Each new, challenging query generated serves as a real-world test case, exposing areas where learned database components – such as indexing structures or query planners – can be refined. This feedback loop allows the database to adapt and optimize its strategies, effectively ‘learning’ from its experiences with diverse query patterns. Consequently, the system doesn’t reach a static state of optimization; instead, it undergoes perpetual refinement, bolstering performance and efficiency as it encounters a broader spectrum of data requests and analytical tasks. The result is a database that becomes increasingly adept at handling complex queries, paving the way for more sophisticated data analysis and application performance.
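
The feedback loop itself can be summarized in a few lines. In the sketch below, the generator, labeler, learner, and evaluator are trivial stubs, present only so the loop's shape runs end to end; the p95 q-error mirrors the evaluation sketch earlier.

```python
import random

# Sketch of the expand-train-evaluate loop described above. The four
# helpers are stubs standing in for the real generator, labeler, learner,
# and evaluator.

def generate_queries(n):            # stub: each "query" is just an id
    return [f"Q{random.randrange(10_000)}" for _ in range(n)]

def label(query):                   # stub: ground-truth cardinality
    return random.randint(1, 1_000_000)

def retrain(model, workload):       # stub: "model" is the workload size
    return len(workload)

def evaluate(model, workload):      # stub: tail error shrinks with data
    return {"p95": 10.0 / (1 + model / 500)}

model, workload = 0, []
for round_ in range(5):
    fresh = generate_queries(200)                  # synthesize new queries
    workload += [(q, label(q)) for q in fresh]     # attach ground truth
    model = retrain(model, workload)               # refit on the grown set
    p95 = evaluate(model, workload)["p95"]
    print(f"round {round_}: {len(workload)} queries, p95 q-error {p95:.2f}")
    if p95 < 2.0:                                  # stop once tail error is low
        break
```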

A comprehensive understanding of database performance hinges on identifying and resolving bottlenecks, and a systematic exploration of diverse query patterns proves crucial in this endeavor. By moving beyond typical, well-understood queries and actively generating a wider range of inputs, including edge cases and unusual access patterns, researchers can expose hidden limitations within the database system. This proactive approach allows for targeted optimization efforts, focusing on the specific areas that exhibit performance degradation under stress. The ability to anticipate and address these bottlenecks before they impact real-world users leads to a more robust, scalable, and efficient database solution, ultimately improving overall system responsiveness and data accessibility.

Analysis of query generation time, detailed in Table 2, showed that total generation time grows with the number of queries processed, as expected. Less expected was that the average time per query fell slightly as the workload grew. This counterintuitive finding suggests the process becomes more efficient at scale, possibly due to caching mechanisms or algorithms that exploit repeated patterns. Further investigation is needed to fully explain and exploit the effect, but these early results suggest the architecture is well-positioned to handle continuously expanding workloads without significant performance degradation.

The pursuit of efficient database systems, as outlined in the study, benefits from a relentless simplification of process. The work demonstrates a move away from laborious, manual workload creation towards automated synthetic data generation. This aligns with a core principle: a system that needs extensive manual intervention has already failed to achieve its potential. Tim Berners-Lee aptly stated, “The Web is more a social creation than a technical one.” This applies here; the generative AI models facilitate a more natural, scalable ‘social’ interaction with database optimization, removing layers of complexity and fostering a more intuitive system. The reduction in human effort isn’t merely about convenience, but about achieving a more elegant and effective solution.

What Lies Ahead?

The automation of database component training, as demonstrated, merely shifts the locus of complexity. The generative models themselves, the LLMs, become the new bottleneck. Their capacity to faithfully represent real-world workload distributions remains an open question, and the pursuit of ‘realistic’ synthetic data risks replicating existing biases at scale. The elegance of automated training is diminished if the trained components merely amplify imperfections already present in the generative process.

Future work must address the inherent opacity of these models. A focus on interpretability, understanding why a generative model produces a particular workload, is crucial. The current emphasis on volume, generating ever-larger datasets, should yield to a concern for quality: ensuring the data meaningfully exercises the target database components. The ultimate metric isn’t the size of the synthetic dataset, but the demonstrable improvement in database performance under true, unpredictable conditions.

The ambition of fully automated optimization carries a subtle irony. It seeks to remove human intervention, yet relies on increasingly complex, and ultimately human-designed, generative systems. The goal should not be to replace human expertise, but to augment it: to provide tools that allow database architects to focus on fundamental design principles, rather than the tedious task of workload creation. The simplification of the process should reveal, not obscure, the underlying truths.


Original article: https://arxiv.org/pdf/2512.20271.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
