Scaling Agentic AI: A New System for Complex ML Pipelines

Author: Denis Avetisyan


Researchers have developed Stratum, a system infrastructure designed to efficiently manage and execute large-scale machine learning workflows driven by intelligent agents.

Agentic Pipeline Search, despite its intent, frequently devolves into the wasteful execution of numerous machine learning pipelines that ultimately fail to optimize for the desired outcome.

Stratum optimizes Python-based workflows as lazily evaluated directed acyclic graphs and executes them across heterogeneous hardware backends for massive agent-centric ML workloads.

The increasing prevalence of large language model-driven agentic pipeline search presents a fundamental mismatch with existing machine learning systems, which prioritize human-interactive workflows over massively parallel execution. To address this challenge, we introduce Stratum: A System Infrastructure for Massive Agent-Centric ML Workloads, a unified system that decouples pipeline execution from agent planning and optimizes Python-based workflows as lazily evaluated directed acyclic graphs. Stratum seamlessly integrates with existing libraries and efficiently executes pipelines across heterogeneous backends using a novel Rust-based runtime, achieving up to 16.6x speedup in preliminary experiments. Can this infrastructure unlock a new era of scalable, autonomous machine learning pipeline development and optimization?


Unraveling the Bottlenecks: Why Modern ML Pipelines Struggle

Modern machine learning workflows frequently depend on Python-based libraries for data manipulation and model building, yet these systems can encounter limitations when confronted with intricate data transformations and the demands of large datasets. While offering flexibility, the sequential nature of many Python operations hinders parallel processing, creating performance bottlenecks as data volume increases. Consequently, pipelines struggle to efficiently handle the preprocessing steps necessary for robust model training, such as cleaning, feature engineering, and data type conversions. This reliance on serial computation and the overhead of Python’s dynamic typing often restricts scalability, requiring significant manual optimization to achieve acceptable performance with real-world, complex data.

The pursuit of optimized machine learning pipelines frequently necessitates manual integration and refinement of individual components, a process vividly illustrated by systems such as Weld. While offering granular control, this approach proves remarkably time-consuming, demanding substantial engineering effort to connect and tune disparate data processing steps. Consequently, even minor alterations to the pipeline, such as accommodating new data sources or refining existing transformations, can trigger a cascade of manual adjustments and debugging. This inherent fragility significantly hinders rapid iteration, delaying experimentation and ultimately slowing the pace of model development; the potential for human error in these complex, hand-crafted systems further exacerbates the issue, introducing subtle bugs that can compromise model accuracy and reliability.

The increasing complexity of modern machine learning pipelines encounters a substantial bottleneck when processing diverse data types. Traditional systems, while effective with homogenous tabular data, struggle to efficiently integrate and transform the varied formats present in multimodal datasets – encompassing images, text, and numerical features. This disparity arises from the need for specialized processing logic for each modality, often requiring extensive manual intervention to ensure compatibility and optimal performance. Consequently, data scientists spend considerable time addressing these integration challenges rather than focusing on model development and refinement, significantly slowing down the iterative process and hindering the potential of advanced machine learning applications. The inability to seamlessly handle diverse data not only impacts development speed but also limits the scope of models that can effectively leverage the richness of information available in real-world datasets.

Agentic pipeline search, as demonstrated on an AIDE workload, reveals a correlation between code changes and resource utilization (CPU/Memory).

The Rise of the Autonomous Pipeline: Agentic AI and the LLM Revolution

Agentic AI signifies a fundamental change in machine learning (ML) pipeline development through the implementation of LLM-Backed MLE Agents. These agents utilize large language models (LLMs) to automate tasks traditionally performed by data scientists and ML engineers, encompassing the entire pipeline lifecycle from data ingestion to model deployment. Unlike conventional approaches requiring extensive manual coding and configuration, agentic AI enables autonomous pipeline creation and execution, dynamically adjusting to data characteristics and specified objectives. This automation is achieved by allowing the agent to independently make decisions regarding data transformation, feature engineering, model selection, and hyperparameter tuning, effectively shifting the paradigm from manual pipeline construction to automated, agent-driven processes.

Agentic AI systems utilize Large Language Model (LLM)-backed agents to automate both data profiling and machine learning pipeline search. Data profiling involves the autonomous analysis of datasets to determine data types, ranges, missing values, and statistical distributions, eliminating the need for manual exploratory data analysis. Pipeline search then leverages these insights to independently configure and evaluate various ML pipeline architectures – including feature engineering, model selection, and hyperparameter tuning – against defined performance metrics. This automated process substantially reduces the manual effort traditionally required for both tasks, resulting in accelerated development cycles and increased efficiency in building and deploying machine learning models.
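The profiling step described above can be sketched in plain Python. This is a minimal illustration, not Stratum's or any agent framework's actual API: the `profile_column` function and its output keys are hypothetical, but they capture the kind of summary (inferred type, missing values, ranges, distribution statistics) an agent would feed into pipeline search.

```python
import math
from collections import Counter

def profile_column(values):
    """Summarize one column: inferred type, missing count, and basic stats."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    if present and all(isinstance(v, (int, float)) for v in present):
        mean = sum(present) / len(present)
        var = sum((v - mean) ** 2 for v in present) / len(present)
        return {"type": "numeric", "missing": missing,
                "min": min(present), "max": max(present),
                "mean": mean, "std": math.sqrt(var)}
    # Fall back to a categorical summary for non-numeric columns.
    return {"type": "categorical", "missing": missing,
            "cardinality": len(set(present)),
            "top": Counter(present).most_common(1)}

ages = [34, 51, None, 28, 45]
print(profile_column(ages))
```

A real profiler would also sample large columns and handle mixed or malformed types, but the principle is the same: the output dictionary becomes machine-readable input to the search strategy rather than a chart for a human analyst.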

Automated exploration of machine learning pipeline configurations utilizes algorithms to systematically test different combinations of data preprocessing steps, feature engineering techniques, model architectures, and hyperparameters. This process adapts to varying data characteristics by incorporating data profiling results – such as statistical distributions, missing values, and data types – as inputs to the search strategy. Performance goals, defined through metrics like accuracy, precision, recall, or F1-score, guide the evaluation of each configuration, allowing the agent to prioritize and refine the pipeline search space. Consequently, the system can identify optimal or near-optimal pipelines more efficiently than manual or grid-search approaches, especially in high-dimensional configuration spaces.
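The configuration-search loop can be made concrete with a toy grid search. Everything here is illustrative: the search space is invented, and `evaluate` is a stand-in for actually fitting a pipeline and measuring a validation metric, which is where the real cost lies.

```python
import itertools

# Hypothetical search space; a real agent would derive it from profiling results.
search_space = {
    "imputation": ["mean", "median"],
    "scaling": ["none", "standard"],
    "max_depth": [3, 5, 8],
}

def evaluate(config):
    """Stand-in for fitting a pipeline and measuring validation accuracy."""
    score = 0.70
    score += 0.05 if config["imputation"] == "median" else 0.0
    score += 0.04 if config["scaling"] == "standard" else 0.0
    score += {3: 0.00, 5: 0.03, 8: 0.01}[config["max_depth"]]
    return score

def grid_search(space):
    keys = list(space)
    # Enumerate every combination and keep the highest-scoring configuration.
    best = max((dict(zip(keys, combo))
                for combo in itertools.product(*space.values())),
               key=evaluate)
    return best, evaluate(best)

best_config, best_score = grid_search(search_space)
print(best_config, round(best_score, 2))
```

An agentic searcher differs from this exhaustive loop mainly in the selection policy: rather than enumerating the full product space, it uses profiling results and prior scores to prune and prioritize candidates, which is what makes high-dimensional spaces tractable.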

Stratum: Engineering Efficiency for Agentic Pipeline Execution

Stratum is a machine learning system engineered for the efficient execution of large-scale agentic pipeline searches. Unlike general-purpose ML frameworks, Stratum is specifically designed to handle the computational demands of iterative pipeline construction and evaluation common in agent-based systems. This specialization allows for optimizations targeting pipeline-centric workloads, resulting in increased throughput and reduced latency when compared to adapting existing systems. The system’s architecture prioritizes the rapid instantiation and execution of numerous pipeline variations, facilitating exhaustive search within a defined problem space and enabling the discovery of high-performing solutions at scale.

Stratum utilizes Domain-Specific Languages (DSLs) to enable full-program compilation and optimization of agentic pipelines. Specifically, the system supports integration with existing DSLs including SystemML, SystemDS, OptiML, KeystoneML, and DAPHNE. This approach allows Stratum to move beyond per-operator optimization and perform optimizations across the entire pipeline, leveraging the DSL’s inherent knowledge of the underlying computational patterns and enabling techniques like operator fusion, memory management, and parallelization strategies tailored to the specific DSL being utilized. By compiling the entire pipeline, Stratum avoids runtime overhead associated with interpreting individual operations and facilitates more aggressive optimization passes.
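Operator fusion, one of the full-program optimizations mentioned above, can be sketched in a few lines. This is a deliberately simplified model, not Stratum's compiler: adjacent element-wise operations are composed into a single function so the data is traversed once instead of once per operator.

```python
def fuse(ops):
    """Compose a list of unary element-wise functions into one function,
    so a chain of operators becomes a single pass over each value."""
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return fused

# Illustrative element-wise operators standing in for pipeline stages.
scale = lambda v: v * 2.0
shift = lambda v: v + 1.0
clip  = lambda v: min(v, 10.0)

pipeline = fuse([scale, shift, clip])
data = [0.5, 3.0, 7.0]
# Single traversal of the data, applying the fused operator to each element.
print([pipeline(v) for v in data])
```

In a compiled DSL the payoff is larger than in this sketch: fusion avoids materializing intermediate arrays between stages, which matters far more at scale than the function-call overhead shown here.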

Stratum employs Directed Acyclic Graphs (DAGs) to represent batches of pipelines, enabling parallelization and efficient scheduling of operations. This DAG-based representation is coupled with a Rust runtime environment, chosen for its memory safety, speed, and concurrency features, facilitating optimized execution. Further performance gains are achieved through the incorporation of both Logical Optimization, which rewrites pipeline expressions to more efficient forms, and Operator Selection, which dynamically chooses the most appropriate execution strategy for each operation within the pipeline, based on data characteristics and system resources.
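A lazily evaluated DAG of the kind described above can be sketched in a few lines of Python. The `Node` class is hypothetical and far simpler than Stratum's Rust runtime, but it shows the two key properties: building the graph runs no computation, and shared subgraphs are computed once via a cache during evaluation.

```python
class Node:
    """A lazily evaluated DAG node: records an operation and its inputs."""
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs

    def evaluate(self, cache=None):
        cache = {} if cache is None else cache
        if id(self) not in cache:  # shared subgraphs are computed only once
            args = [n.evaluate(cache) for n in self.inputs]
            cache[id(self)] = self.fn(*args)
        return cache[id(self)]

# Build a small pipeline graph; nothing executes until evaluate() is called.
load      = Node(lambda: [1.0, 2.0, 3.0])
normalize = Node(lambda xs: [x / max(xs) for x in xs], load)
square    = Node(lambda xs: [x * x for x in xs], normalize)
# `normalize` feeds two consumers, so the graph is a DAG rather than a chain.
total     = Node(lambda a, b: sum(a) + sum(b), normalize, square)

print(total.evaluate())
```

Deferring execution this way is what makes logical optimization possible: because the whole graph is known before anything runs, a runtime can rewrite it, fuse operators, or choose per-operator execution strategies before committing to a schedule.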

System validation was performed using the AIDE Agent to generate diverse, real-world workloads representative of agentic pipeline execution scenarios. Testing demonstrated that Stratum is capable of achieving pipeline execution rates exceeding thousands per second. This performance level was consistently maintained across the generated workloads, confirming the system’s scalability and efficiency in handling high-throughput agentic tasks. The AIDE Agent’s workload generation allowed for benchmarking under conditions simulating practical application demands, providing a reliable measure of Stratum’s operational capacity.

Beyond Automation: Charting the Future of Intelligent Pipelines

Stratum introduces a novel approach to machine learning development through autonomous pipeline search, effectively automating the entire process from initial data ingestion to a fully trained model. The system employs intelligent agents capable of independently constructing complete data science pipelines, eliminating the need for extensive manual configuration and experimentation. These agents not only select appropriate data preprocessing steps and feature engineering techniques but also determine the optimal machine learning algorithms and hyperparameters for a given task. This end-to-end automation significantly reduces the time and expertise required to deploy effective models, allowing data scientists to focus on higher-level strategic initiatives and fostering a more rapid cycle of innovation in the field of machine learning.

The adaptability of Stratum extends beyond traditional datasets, encompassing both structured tabular data and the complexities of multimodal inputs like images, text, and audio. This broad compatibility significantly widens the potential for automated machine learning solutions; previously, building pipelines for varied data types demanded specialized expertise and bespoke configurations. By seamlessly integrating with diverse data modalities, Stratum democratizes access to end-to-end automation, enabling researchers and practitioners to rapidly prototype and deploy models across a wider range of applications – from financial forecasting with structured data to image recognition and natural language processing with unstructured inputs – all without extensive manual intervention.

Stratum’s architecture incorporates a sophisticated parallelization planning module, designed to overcome the computational bottlenecks inherent in complex machine learning pipelines. This component dynamically analyzes pipeline structures and intelligently distributes tasks across distributed computing systems. By identifying independent operations and dependencies, Stratum maximizes resource utilization and minimizes overall execution time. The system doesn’t simply divide the workload; it strategically assigns tasks to optimize data transfer and communication overhead, ensuring efficient scalability even with increasingly intricate models and large datasets. This capability moves beyond simple automation to enable the practical deployment of advanced ML solutions that were previously limited by computational constraints, fostering a pathway towards truly scalable and adaptive intelligence.
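The core idea of that planning step, running dependency-free stages concurrently, can be sketched with Python's standard `concurrent.futures`. The stage functions and data are invented for illustration; Stratum's own planner and distributed scheduling are not shown.

```python
from concurrent.futures import ThreadPoolExecutor

# Two independent feature-extraction stages: neither depends on the other's
# output, so a planner is free to schedule them in parallel.
def tabular_features(rows):
    return [sum(r) for r in rows]

def text_features(docs):
    return [len(d.split()) for d in docs]

rows = [[1, 2], [3, 4]]
docs = ["a short doc", "another longer example doc"]

with ThreadPoolExecutor(max_workers=2) as pool:
    f_tab = pool.submit(tabular_features, rows)
    f_txt = pool.submit(text_features, docs)
    # The join stage waits on both branches before combining their results.
    combined = list(zip(f_tab.result(), f_txt.result()))

print(combined)
```

A production planner does more than this: it weighs data-transfer costs when assigning stages to machines and may co-locate stages that share large intermediates, but the dependency analysis that licenses the parallelism is the same.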

The development of Stratum signals a potential paradigm shift in machine learning, fostering systems capable of independent optimization and refinement. By minimizing the requirement for human-driven pipeline construction and tuning, the framework enables a more fluid and responsive approach to model development. This reduction in manual intervention not only accelerates the iterative process of experimentation but also unlocks opportunities to address previously intractable problems, particularly those involving rapidly evolving datasets or complex, multi-faceted analyses. Consequently, innovation within the field is poised to accelerate as researchers and practitioners can focus on high-level problem definition and interpretation, rather than the intricacies of pipeline engineering, ultimately leading to more intelligent and adaptable machine learning solutions.

Stratum’s architecture embodies a philosophy of systematic exploration, mirroring a sentiment often attributed to Alan Kay: “The best way to predict the future is to invent it.” The system doesn’t merely execute ML pipelines; it actively searches for optimal configurations through agentic workflows. This approach aligns with the idea that understanding, and improving, reality requires actively manipulating its components. By representing these pipelines as lazily evaluated Directed Acyclic Graphs, Stratum facilitates a form of controlled experimentation, allowing agents to probe the solution space and, ultimately, construct a more desirable future for ML workloads. It’s not about passively accepting the limitations of current systems, but about rewriting the code itself.

What’s Next?

Stratum, in its attempt to tame the chaos of agentic pipelines, exposes a fundamental truth: the infrastructure itself becomes the constraint. The system confesses its design sins not in errors, but in the very optimizations it achieves. Lazily evaluated directed acyclic graphs, while elegant, merely shift the burden of failure: from immediate crash to subtly incorrect results propagated through a vast, opaque network. The true challenge isn’t scaling execution, but verifiable correctness at scale.

Further investigation must address the inevitable pathologies of agentic search. What safeguards prevent an agent from discovering a locally optimal, yet globally useless, workflow? How does one audit the reasoning of a pipeline constructed by a constantly evolving, self-modifying program? The system’s reliance on Python, while pragmatic, invites the usual performance demons; the question isn’t if they will surface, but when, and how drastically they will reshape the system’s cost function.

Ultimately, Stratum’s value lies not in its current capabilities, but in the questions it forces one to ask. It is a provocation: a demonstration that building systems for intelligence requires not just faster execution, but a deeper understanding of the limits of automation itself. The next iteration must abandon the illusion of control, embracing instead a framework for constrained exploration, where failure is not a bug, but a necessary data point in the reverse-engineering of intelligence.


Original article: https://arxiv.org/pdf/2603.03589.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-06 03:02