Time Series Without the Hassle: A New Approach to Stream Computing

Author: Denis Avetisyan


A novel framework unifies batch and streaming computation for time series data, prioritizing data consistency and efficient performance.

A complex financial trading pipeline, encompassing feature engineering, machine learning prediction, portfolio optimization, and order execution, is fundamentally constrained by cross-sectional dependencies that require simultaneous access to all assets; the system's minimum latency is therefore determined by the critical path within the processing graph.

Causify DataFlow enables high-performance machine learning on time series through causality, point-in-time idempotency, and tiling.

Traditional machine learning workflows often struggle to transition from finite, batch-processed datasets to the continuous stream of unbounded time-series data, introducing issues of causality and reproducibility. This paper presents ‘Causify DataFlow: A Framework For High-performance Machine Learning Stream Computing’, a novel system designed to unify batch and streaming computation through a directed acyclic graph-based execution model. By enforcing point-in-time idempotency and strict causality, ensuring outputs depend only on preceding context, DataFlow allows models developed offline to execute identically in real-time production without code modification. Could this framework unlock a new era of reliable and scalable real-time analytics and decision-making systems?


The Inevitable Lag: Why Batch Just Doesn’t Cut It

Historically, data analysis relied on batch processing – collecting data over a period, then processing it as a discrete unit. However, the sheer velocity and volume of contemporary data streams – think social media feeds, sensor networks, or financial transactions – overwhelm this approach. By the time batch processes complete, the insights derived are often outdated and lack relevance for time-sensitive applications. This lag between data generation and analysis creates a critical challenge, as decisions informed by stale data can lead to missed opportunities or incorrect conclusions. Modern systems require a shift toward continuous computation, capable of processing data as it arrives to deliver actionable intelligence in near real-time, effectively circumventing the limitations of traditional methods.

The rise of real-time applications – from fraud detection and algorithmic trading to personalized recommendations and IoT sensor networks – presents a fundamental challenge to traditional data processing architectures. These systems require continuous computation on unbounded data streams, meaning data arrives without a defined beginning or end, and potentially at infinite velocity. This contrasts sharply with batch processing, designed for finite datasets, and necessitates a shift towards systems capable of handling data as it arrives, rather than accumulating it for later analysis. Achieving this demands innovative approaches to state management, fault tolerance, and scalability, as maintaining consistent and accurate results across a perpetually updating dataset introduces significant architectural hurdles. Consequently, developers are increasingly focused on stream processing frameworks and distributed systems designed to ingest, process, and analyze data with minimal latency, effectively turning data into actionable insights as it flows through the system.

DataFlow: Declarative Computation, Finally

DataFlow employs a declarative programming paradigm for constructing streaming applications, meaning developers specify what computations should be performed rather than how to perform them. This is achieved through the definition of a directed acyclic graph (DAG) where nodes represent operations and edges define data dependencies between those operations. Each node receives data from its predecessors, performs a specified transformation, and passes the results to its successors. The DAG structure allows DataFlow to automatically parallelize computations and optimize data flow, enabling efficient and scalable stream processing. By explicitly defining these dependencies, the system can determine the correct execution order and handle out-of-order data arrival without requiring explicit state management in the application code.
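
To make the declarative style concrete, the sketch below builds a tiny DAG of dataframe transformations in plain Python. The `Dag` and `Node` classes and the node names are illustrative assumptions for this article, not DataFlow's actual API; they only show how declaring nodes and edges leaves scheduling and parallelization to the engine.

```python
# Minimal sketch of a declarative DAG over dataframes.
# The Dag/Node classes and node names are illustrative, not DataFlow's API.
from typing import Callable, Dict, List
import pandas as pd


class Node:
    def __init__(self, name: str, func: Callable[..., pd.DataFrame], inputs: List[str]):
        self.name = name      # unique identifier of the operation
        self.func = func      # transformation: input dataframes -> output dataframe
        self.inputs = inputs  # upstream node names, i.e. the edges of the DAG


class Dag:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add_node(self, name: str, func: Callable[..., pd.DataFrame], inputs=()) -> None:
        self.nodes[name] = Node(name, func, list(inputs))

    def run(self, sources: Dict[str, pd.DataFrame]) -> Dict[str, pd.DataFrame]:
        # Evaluate nodes in dependency order via memoized recursion.
        results: Dict[str, pd.DataFrame] = dict(sources)

        def evaluate(name: str) -> pd.DataFrame:
            if name not in results:
                node = self.nodes[name]
                results[name] = node.func(*(evaluate(dep) for dep in node.inputs))
            return results[name]

        for name in self.nodes:
            evaluate(name)
        return results


# Declarative usage: specify *what* to compute; scheduling is left to the engine.
dag = Dag()
dag.add_node("returns", lambda px: px.pct_change(), inputs=["prices"])
dag.add_node("volatility", lambda r: r.rolling(20).std(), inputs=["returns"])

prices = pd.DataFrame(
    {"AAPL": range(100, 160)},
    index=pd.date_range("2024-01-01", periods=60, freq="1min"),
)
outputs = dag.run({"prices": prices})
print(outputs["volatility"].tail())
```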

DataFlow’s fundamental operational unit is the time-indexed dataframe, representing a continuous stream of data associated with specific timestamps. This structure allows DataFlow to process data incrementally as it becomes available, avoiding the need to load entire datasets into memory. Each dataframe within the stream contains data points and their corresponding timestamps, which are utilized to maintain data order and enable time-based operations such as windowing and aggregation. The system continuously monitors incoming data, appending new dataframes to the stream and triggering computations on these incrementally updated streams, thus facilitating real-time or near real-time data processing.
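
The idea can be illustrated with plain pandas (not DataFlow itself): a time-indexed dataframe grows incrementally as observations arrive, and time-based operations are driven directly by the index.

```python
# Illustrative only: growing a time-indexed dataframe as new observations arrive.
import pandas as pd

# Stream observed so far: one row per timestamp.
idx = pd.date_range("2024-01-01 09:30", periods=3, freq="1min")
stream = pd.DataFrame({"price": [100.0, 100.5, 100.2]}, index=idx)

# A new data point arrives and is appended in timestamp order.
new_row = pd.DataFrame({"price": [100.7]},
                       index=[pd.Timestamp("2024-01-01 09:33")])
stream = pd.concat([stream, new_row]).sort_index()

# Time-based operations (here a 2-minute rolling mean) use the index directly,
# so only the incrementally updated tail needs to be recomputed.
print(stream["price"].rolling("2min").mean().tail())
```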

DataFlow accommodates both Streaming Mode and Batch Mode processing to support a range of application requirements. In Streaming Mode, data is processed continuously as it arrives, offering low latency for real-time analytics and event-driven systems. Conversely, Batch Mode processes a defined, finite dataset, optimizing for throughput and resource utilization in scenarios like historical data analysis or large-scale ETL pipelines. The system dynamically adapts to these modes based on the input data source and configured parameters, enabling a unified framework for both continuous and discrete data processing tasks.
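
The practical payoff of this unification is that the same code path can be exercised over a finite history or fed one observation at a time with identical results. The following sketch demonstrates that property with an ordinary rolling computation in pandas; it is not DataFlow code, and the window length L is an arbitrary choice.

```python
# Sketch: the same computation run in batch (all at once) and in a streaming
# loop (point by point), producing identical results. Plain pandas, not DataFlow.
import numpy as np
import pandas as pd

L = 5  # context window: each output depends only on the last L observations


def compute(df: pd.DataFrame) -> pd.DataFrame:
    return df.rolling(L).mean()


idx = pd.date_range("2024-01-01", periods=100, freq="1min")
data = pd.DataFrame({"x": np.random.default_rng(0).normal(size=100)}, index=idx)

# Batch mode: a single pass over the finite dataset.
batch_out = compute(data)

# Streaming mode: process each arrival, keeping only the last L rows of history.
rows, streamed = [], []
for _, row in data.iterrows():
    rows.append(row)
    history = pd.DataFrame(rows[-L:])
    streamed.append(compute(history).iloc[[-1]])
stream_out = pd.concat(streamed)

# The two execution modes agree wherever the window is full.
assert np.allclose(batch_out["x"].dropna(), stream_out["x"].dropna())
```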

Nodes C1 and C2 in this directed acyclic graph can execute concurrently due to the absence of dependencies between them.

Determinism Through Restriction: A Necessary Evil

DataFlow achieves Point-in-Time Idempotency by restricting computational dependencies to a fixed temporal window, defined as the Context Window of length L. This means that for any given point in the data stream, the result of a computation is solely determined by the L preceding data points. This constraint eliminates the influence of future or external data, ensuring that re-execution with the same input within that window will always yield the same result. The Context Window represents a bounded historical view, critical for predictable and reproducible data processing, and forms the basis for ensuring deterministic behavior within the DataFlow system.
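
A minimal way to see this property (sketched with plain pandas and an arbitrary L): the output at a timestamp computed from the full history must equal the output recomputed from only the L points of its context window.

```python
# Sketch of point-in-time idempotency: the output at time t computed from the
# full history equals the output computed from only its L-point context window.
import numpy as np
import pandas as pd

L = 10  # context window length


def feature(df: pd.DataFrame) -> pd.DataFrame:
    # Any computation restricted to the last L points qualifies.
    return df.rolling(L).mean()


idx = pd.date_range("2024-01-01", periods=500, freq="1min")
data = pd.DataFrame({"x": np.random.default_rng(1).normal(size=500)}, index=idx)

t = idx[250]
full_value = feature(data).loc[t, "x"]

# Re-execute using only the context window ending at t.
window = data.loc[:t].iloc[-L:]
replayed_value = feature(window).loc[t, "x"]

assert np.isclose(full_value, replayed_value)  # identical, by construction
```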

DataFlow utilizes Tiling to divide continuous data streams into discrete, manageable segments called Tiles for processing. This partitioning enables Mini-Batch Execution, increasing throughput and resource utilization. Correct execution necessitates that the Tile size, denoted as τ, is greater than or equal to the Context Window length, L. This constraint, τ ≥ L, ensures that each Tile contains all necessary historical data required for computations defined by the Context Window, preventing data dependencies on future, unavailable information and maintaining deterministic results.
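
The τ ≥ L constraint can be illustrated as follows (again a plain pandas sketch with arbitrary sizes): each tile is executed together with the L − 1 rows that precede it, the warm-up rows are discarded, and the concatenated per-tile outputs match the single-pass batch result exactly.

```python
# Sketch of tiling with tile size tau >= context window L. Each tile is
# processed with the L-1 preceding rows as warm-up, then the warm-up is dropped.
import numpy as np
import pandas as pd

L = 8      # context window length
tau = 32   # tile size; correctness requires tau >= L


def compute(df: pd.DataFrame) -> pd.DataFrame:
    return df.rolling(L).mean()


idx = pd.date_range("2024-01-01", periods=256, freq="1min")
data = pd.DataFrame({"x": np.random.default_rng(2).normal(size=256)}, index=idx)

batch_out = compute(data)

tiled = []
for start in range(0, len(data), tau):
    # Include L-1 rows of history so the first outputs of the tile are correct.
    lo = max(0, start - (L - 1))
    tile_with_context = data.iloc[lo:start + tau]
    out = compute(tile_with_context).iloc[start - lo:]  # drop warm-up rows
    tiled.append(out)
tiled_out = pd.concat(tiled)

# Mini-batch (tiled) execution reproduces the batch result exactly.
assert np.allclose(batch_out["x"], tiled_out["x"], equal_nan=True)
```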

Causality Enforcement within DataFlow prevents computations from utilizing data that has not yet occurred in the stream, thereby ensuring results are strictly dependent on past inputs. This is achieved through the use of Knowledge Time, a mechanism for tracking data availability, and is formalized by the Graph Context Window, denoted as w(G). w(G) specifies the minimum tile size – the amount of stream data processed in a single operation – necessary to guarantee correct execution; tiles smaller than w(G) may lead to inaccurate or inconsistent results due to incomplete information. Consequently, tile size τ must be greater than or equal to w(G) (τ ≥ w(G)) to maintain causal consistency and prevent ‘future peeking’.
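
The distinction matters in practice: a feature built from a trailing window is causal, while one normalized against full-sample statistics silently peeks into the future, so its historical values shift whenever new data arrives. A small pandas illustration (not DataFlow code):

```python
# Sketch contrasting a causal feature with one that peeks into the future.
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=200, freq="1min")
x = pd.Series(np.random.default_rng(3).normal(size=200), index=idx)

t = idx[100]


def causal_feature(s: pd.Series) -> pd.Series:
    return s.rolling(20).mean()          # uses only past observations


def leaky_feature(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / s.std()      # full-sample stats include the future


# Value at time t, computed once on data known at t and once on the full series.
known_at_t = x.loc[:t]
assert np.isclose(causal_feature(known_at_t).loc[t], causal_feature(x).loc[t])

# The leaky feature's value at t typically changes once future data is included.
print("leaky, as of t:      ", leaky_feature(known_at_t).loc[t])
print("leaky, in hindsight: ", leaky_feature(x).loc[t])
```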

The accumulated path context w(p) = 1 + Σ(wᵢ − 1) in this linear chain demonstrates how each node's local context window contributes to a growing, cumulative understanding of the path.
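
The formula can be checked empirically: for a chain of operations with local windows wᵢ, the first fully determined output appears only after w(p) = 1 + Σ(wᵢ − 1) observations. A short pandas sketch with arbitrarily chosen window sizes:

```python
# Empirical check of the accumulated path context w(p) = 1 + sum(w_i - 1)
# for a linear chain of rolling operations (window sizes chosen arbitrarily).
import numpy as np
import pandas as pd

windows = [4, 3, 5]                       # local context windows along the chain
w_path = 1 + sum(w - 1 for w in windows)  # = 10

x = pd.Series(np.arange(50, dtype=float))
out = x
for w in windows:
    out = out.rolling(w).mean()           # each node consumes its predecessor

# The first non-NaN output needs exactly w_path observations of history.
assert out.first_valid_index() == w_path - 1
```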

Beyond Deployment: Proactive System Health

DataFlow’s foundational architecture is designed to accommodate the dynamic nature of real-world data and ensure continuous model performance assessment. The system doesn’t treat model evaluation as a one-time event; instead, it integrates ongoing monitoring and testing directly into the streaming pipeline. This is achieved through a modular design that allows for the seamless introduction of new data distributions, edge cases, and adversarial examples. By continuously comparing model outputs against ground truth – even under shifting conditions – DataFlow identifies performance degradation, concept drift, and potential biases. The architecture supports A/B testing of model versions, shadow deployments for risk-free evaluation, and automated retraining triggers, fostering a proactive approach to maintaining model accuracy and reliability over time. This constant vigilance ensures that models remain effective and trustworthy, even as the underlying data evolves.

To ensure consistently reliable performance, DataFlow employs Historical Simulation, a technique that meticulously recreates past production environments for rigorous benchmarking. This process isn’t simply re-running tests; it involves feeding the system with actual, archived data streams that mirror the volume, velocity, and variety of real-world inputs encountered during prior operational periods. By replicating these conditions, developers and operations teams can proactively identify potential bottlenecks, assess the impact of code changes before deployment, and fine-tune system parameters for optimal efficiency. The ability to accurately simulate production loads allows for a precise measurement of throughput, latency, and resource utilization, providing confidence that the system will perform as expected, even under stress, when facing genuine user demand.
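
In spirit, a historical simulation is a replay: archived inputs are fed through the same computation in timestamp order and the results are compared against reference outputs. The sketch below uses synthetic stand-ins for the archive and the computation; an actual simulation would read recorded production data.

```python
# Sketch of historical simulation: replay an archived stream through the same
# computation and compare against reference outputs. Data and the compute step
# are stand-ins; a real replay would read archived production records.
import numpy as np
import pandas as pd

L = 5


def compute(df: pd.DataFrame) -> pd.DataFrame:
    return df.rolling(L).mean()


# Stand-in for an archived production data stream and its recorded outputs.
idx = pd.date_range("2024-01-01", periods=300, freq="1min")
archived = pd.DataFrame({"x": np.random.default_rng(4).normal(size=300)}, index=idx)
recorded_outputs = compute(archived)

# Replay: feed the archive in timestamp order, exactly as production would see it.
rows, replayed = [], []
for _, row in archived.iterrows():
    rows.append(row)
    replayed.append(compute(pd.DataFrame(rows[-L:])).iloc[[-1]])
replayed_outputs = pd.concat(replayed)

# The replay reproduces the recorded behaviour, so code changes can be
# benchmarked against genuine historical conditions before deployment.
assert np.allclose(recorded_outputs["x"], replayed_outputs["x"], equal_nan=True)
```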

DataFlow incorporates sophisticated debugging tools designed to dramatically accelerate troubleshooting and streamline system maintenance. The platform permits the extraction of internal node values within a data pipeline, offering granular visibility into data transformations at each stage of processing. This capability allows engineers to pinpoint the exact source of errors or performance bottlenecks with unprecedented speed and accuracy, moving beyond simple input-output analysis. By exposing these intermediate data states, complex streaming pipelines become substantially more manageable, facilitating rapid iteration, targeted optimization, and proactive identification of potential issues before they impact production systems. The result is a significant reduction in mean time to resolution and improved overall system reliability.
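
One way to picture this capability (the `probe` wrapper below is a hypothetical illustration, not DataFlow's actual debugging interface): wrap each node so its intermediate output is captured for later inspection.

```python
# Sketch: capturing intermediate node outputs for inspection. The `probe`
# wrapper is a hypothetical illustration, not DataFlow's debugging API.
import numpy as np
import pandas as pd

captured = {}


def probe(name, func):
    # Wrap a node's function so its output is stored before being passed on.
    def wrapped(*dfs):
        out = func(*dfs)
        captured[name] = out.copy()
        return out
    return wrapped


prices = pd.DataFrame(
    {"px": np.linspace(100.0, 101.0, 30)},
    index=pd.date_range("2024-01-01", periods=30, freq="1min"),
)

returns = probe("returns", lambda df: df.pct_change())(prices)
signal = probe("signal", lambda df: df.rolling(5).mean())(returns)

# Inspect the internal "returns" node rather than only the pipeline's output.
print(captured["returns"].tail())
```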

DataFlow’s configuration system streamlines the deployment and ongoing management of intricate streaming data pipelines. By abstracting away much of the underlying complexity, the system allows engineers to rapidly prototype, deploy, and scale applications without being burdened by infrastructure concerns. This simplification translates directly into performance gains; automated resource allocation and optimized execution plans contribute to significantly increased throughput, enabling the processing of larger data volumes. Furthermore, the configuration system actively minimizes latency by intelligently scheduling tasks and reducing overhead, ensuring timely delivery of insights from rapidly moving data streams. The result is a robust and efficient platform capable of handling demanding real-time analytics workloads with ease.

The pursuit of elegant frameworks invariably leads to pragmatic compromises. Causify DataFlow, with its emphasis on point-in-time idempotency and tiling for time series data, attempts to reconcile the ideal of reproducible computation with the messy reality of stream processing. It’s a noble effort, yet one can anticipate the inevitable accumulation of technical debt as production systems encounter edge cases the framework didn’t foresee. As Andrey Kolmogorov observed, “The most important thing in science is not knowing, but knowing what you don’t know.” This framework attempts to define what can be known about a data stream, but the bug tracker will, undoubtedly, document what remains stubbornly unknowable. They don’t deploy – they let go.

What Comes Next?

The promise of unified batch and streaming, particularly for time series, always feels… ambitious. This framework rightly focuses on causality and idempotency – concepts frequently glossed over in the rush to ingest and model everything. Yet, the inevitable questions linger. Point-in-time correctness is a beautiful ideal, but production data rarely cooperates. Expect edge cases involving late-arriving data, network partitions, and the delightful surprises of sensor drift to quickly surface. The tiling strategy, while potentially efficient, introduces its own complexities; managing tile boundaries and ensuring consistent windowing across distributed systems is rarely as clean as the diagrams suggest.

The true test won’t be benchmark performance on synthetic datasets, but rather the system’s behavior under sustained, unpredictable load. A framework can elegantly enforce causality, but it cannot prevent bad data from being ingested in the first place. Future work will undoubtedly involve grappling with data quality monitoring, anomaly detection, and the ongoing cost of maintaining reproducibility as datasets grow.

Ultimately, this is a step toward more robust stream processing. However, it’s worth remembering that every carefully crafted abstraction adds another layer where things can – and will – go wrong. The goal isn’t to eliminate failure, but to contain it. Tests are, after all, a form of faith, not certainty.


Original article: https://arxiv.org/pdf/2512.23977.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
