Unifying AI’s Languages: A Path to Smarter Data Processing

Author: Denis Avetisyan

A new declarative intermediate language aims to bridge the gap between diverse data analytics approaches, paving the way for more efficient and reusable optimization techniques.

Hojabr offers a pathway beyond the walled gardens of contemporary platforms, establishing a language for interoperability that anticipates the inevitable fragmentation of systems and hardware architectures.

Hojabr introduces a unified framework for relational, tensor, and graph processing through bidirectional compilation and a common intermediate representation.

Modern data analytics increasingly integrates diverse computational paradigms-relational, graph, and tensor processing-yet remains hampered by fragmented systems and duplicated optimization efforts. This paper introduces Hojabr: Towards a Theory of Everything for AI and Data Analytics, proposing a unified declarative intermediate language to address this challenge. Hojabr seamlessly integrates these paradigms within a higher-order algebraic framework, enabling bidirectional compilation and constraint-based optimization choices. Could this approach unlock systematic reuse of optimization techniques across the entire data analytics and machine learning stack, fostering a new level of interoperability and performance?

The Inevitable Fragmentation of Data: A System’s Lament

The current landscape of data analytics is characterized by a proliferation of specialized systems, each employing its own distinct language and protocols. This inherent fragmentation poses a substantial challenge to interoperability, as data must constantly be translated and reshaped to move between these disparate environments. Consequently, organizations face significant overhead in data movement, increased complexity in system integration, and limitations in scalability. The inability for these systems to readily communicate restricts the potential for holistic data insights and hinders the development of truly unified analytical workflows. This situation demands a move toward more standardized and universally compatible approaches to data representation and processing.

The proliferation of specialized data systems, while innovative in their individual capacities, introduces substantial inefficiencies due to the necessity of constant data transfer and translation. Each system typically demands a unique data format and communication protocol, compelling analysts to expend considerable resources simply moving and converting information between them. This overhead isn’t merely a matter of wasted time; it fundamentally limits performance, particularly when dealing with large datasets and complex analytical pipelines. As data volumes grow exponentially, the costs associated with these translations – in terms of both computational power and latency – become increasingly prohibitive, hindering scalability and preventing organizations from fully leveraging the potential of their data assets. The constant need for adaptation and reconciliation between disparate systems effectively creates bottlenecks that slow down insights and stifle innovation.

The inherent complexity of modern data analytics stems from a lack of universal translation between its varied systems; a declarative intermediate language proposes a powerful solution by establishing a common representational framework. Instead of forcing direct communication between disparate tools, this approach allows all workloads – be they database queries, machine learning model training, or tensor computations – to be expressed in a single, unified language. This intermediate form then acts as a translator, optimizing and executing tasks across different platforms without the costly overhead of repeated data movement and conversion. By decoupling the what – the desired computation – from the how – the specific implementation on a given system – a declarative language enables greater flexibility, scalability, and ultimately, more efficient data processing.

Hojabr presents a solution to the challenges of fragmented data analytics through the creation of a unified language designed to streamline integration and enhance optimization. This innovative approach transcends the traditional boundaries separating database, machine learning, and tensor systems by providing a common representational framework. The core achievement of Hojabr lies in its ability to bridge these disparate systems, allowing for seamless data exchange and computation without costly translation or movement. Consequently, analytical workflows become more efficient, scalable, and accessible, fostering a more cohesive and powerful data analytics ecosystem. This unification isn’t merely about compatibility; it unlocks opportunities for novel optimizations and hybrid approaches previously hindered by linguistic and structural barriers.

Hojabr utilizes a bidirectional compilation workflow, enabling programs to leverage both its multi-paradigm optimizer and existing optimizers for declarative languages, a capability not available to lower-level languages which serve only as compilation targets.

The Logic of Constraint: Defining the System’s Boundaries

Hojabr employs both logic systems and linear programming techniques to establish and maintain constraints governing data and computational processes. Logic systems, specifically those inspired by Datalog, are used to define declarative constraints on data relationships and transformations. Simultaneously, linear programming provides a mechanism for optimizing computations while adhering to these constraints, effectively solving for values that satisfy specified logical conditions and algebraic equations. This dual approach allows Hojabr to not only express constraints – such as data type limitations or relational dependencies – but also to enforce them through optimization algorithms, ensuring data integrity and computational correctness. The combination facilitates the creation of robust and verifiable data pipelines where computations are demonstrably compliant with predefined rules.

Hojabr employs Higher-Order Relations to represent data structures where relationships can exist between relationships, enabling the construction of complex, nested data. Specifically, tensors and bags are modeled using Semiring-Based Relations, which extend traditional relational algebra by incorporating semiring operations like addition and multiplication. This allows for the representation of data with varying multiplicities and the application of algebraic operations to these complex structures. For example, a tensor can be represented as a relation where elements are associated with indices, and semiring operations enable operations like tensor contraction and element-wise multiplication. Bags, or multisets, are naturally represented as relations where the multiplicity of each element is tracked via semiring addition. $R \subset eq D_1 \times D_2 \times ... \times D_n$ represents a relation over domains $D_i$ .

The Hojabr constraint system is fundamentally based on the principles of Datalog, a declarative logic programming language. This allows data transformations to be expressed as logical rules, defining relationships between input and output data. These rules are then optimized using techniques derived from database query optimization, including rule rewriting, indexing, and the application of algebraic identities. The system utilizes a bottom-up evaluation strategy, iteratively applying rules until a fixed point is reached, effectively transforming the input data into the desired output. This approach enables efficient execution of complex data pipelines and facilitates automated reasoning about data integrity and consistency.

Hojabr’s architecture integrates logical reasoning with algebraic operations to achieve both flexibility and efficiency in data processing. The system employs logic-based constraints to define permissible data states and computational steps, enabling adaptable workflows and data validation. Simultaneously, the use of linear programming and semiring-based relations facilitates efficient numerical computation and manipulation of complex data structures like tensors. This dual approach allows Hojabr to not only express intricate data dependencies and transformations – akin to a declarative programming paradigm – but also to leverage optimized algebraic solvers for performance, resulting in a system capable of handling diverse computational tasks with improved resource utilization.

The Dance of Joins: Adapting to the Shape of Data

Hojabr incorporates a diverse set of join algorithms to address varying data characteristics and workload demands. The system supports standard techniques such as $Hash Join$ , which excels with in-memory datasets, and $Sort-Merge Join$ , effective for large, disk-resident relations. Beyond these, Hojabr implements more specialized approaches including $Free Join$ , designed for scenarios with limited memory, $Diamond Join$ , which leverages data partitioning for parallel execution, and $Worst-Case Optimal Join$ , guaranteeing performance even with unfavorable data distributions. The availability of multiple algorithms is central to Hojabr’s optimization capabilities, enabling it to select the most appropriate method for each specific query.

Within Hojabr, join algorithms – including Hash Join, Sort-Merge Join, and others – are not implemented as standalone procedural code. Instead, each algorithm is defined as a relational algebraic operator. This representation facilitates query optimization because the system can treat join algorithms as interchangeable components within a larger query plan. Consequently, Hojabr’s optimizer can explore various combinations of these operators, applying transformations such as reordering, pushing predicates, and selecting the most cost-effective algorithm based on data statistics and system characteristics. This algebraic representation enables comprehensive, cost-based optimization that extends beyond simple algorithm selection to include algorithm tuning and plan rewriting.

Hojabr’s declarative approach to query processing enables automated join algorithm selection and tuning based on workload characteristics. Rather than relying on pre-defined execution plans or manual optimization, the system analyzes query predicates, data statistics, and available resources to dynamically choose the most efficient join method-including `Hash Join`, `Sort-Merge Join`, and others-at runtime. This automated process considers factors such as data size, distribution, and the presence of relevant indexes to optimize join order and algorithm parameters. Consequently, Hojabr minimizes operator cost and improves overall query performance without requiring explicit user intervention or hand-tuned configurations.

Hojabr’s declarative approach to query optimization yields performance and scalability benefits over traditional, imperative database systems. Imperative systems typically rely on a fixed execution plan determined by database administrators or hardcoded logic, limiting adaptation to varying data distributions and query characteristics. In contrast, Hojabr’s declarative model allows the system to dynamically select and tune join algorithms – including `Hash Join`, `Sort-Merge Join`, and others – based on cost estimation and runtime statistics. This automatic optimization reduces reliance on manual tuning, enables more efficient resource allocation, and facilitates improved performance across diverse workloads and data volumes, resulting in significantly lower query latency and increased throughput.

Beyond the Present: A System Designed for Growth

Hojabr distinguishes itself through its compilation to Differential Data Flow (DDF), a technique that unlocks portability and performance across a diverse range of hardware. This compilation process doesn’t simply translate code; it restructures computations into a graph optimized for parallel execution on CPUs, GPUs, and specialized accelerators. By targeting DDF, Hojabr effectively decouples the program logic from the underlying hardware, enabling it to adapt to emerging technologies without requiring substantial code modifications. The result is a system capable of achieving high performance while maintaining broad compatibility – a critical advantage in the rapidly evolving landscape of machine learning infrastructure, allowing researchers and engineers to deploy models on the most suitable platform without being locked into specific hardware vendors or architectures.

Hojabr’s design incorporates a bidirectional compilation capability, facilitating a fluid exchange of data and logic with established programming languages. This isn’t a one-way conversion; rather, programs can be translated from Hojabr into languages like Python or Java, enabling integration with existing infrastructure and libraries, and conversely, code written in those languages can be compiled into Hojabr to leverage its unified tensor representation and potential performance optimizations. This two-way street promotes interoperability, allowing developers to gradually adopt Hojabr without requiring a complete rewrite of existing systems, and also allows the broader ecosystem to benefit from Hojabr’s unique capabilities in handling both dense and sparse data formats.

Hojabr’s design incorporates Substrait, a language-agnostic intermediate representation for data processing, allowing it to tap into a wealth of pre-existing optimization techniques and tools. This strategic integration bypasses the need to reinvent established methods for query planning, code generation, and execution, accelerating development and enhancing performance. By adopting Substrait, Hojabr isn’t operating in isolation; it’s joining a rapidly expanding ecosystem where innovations in one system can directly benefit others. This collaborative approach ensures Hojabr remains at the forefront of data processing efficiency, continually leveraging advancements from a diverse community of developers and researchers – ultimately fostering a more robust and adaptable computational framework.

Hojabr distinguishes itself by seamlessly handling both dense and sparse tensor computations, a critical advancement for modern machine learning applications. Traditional systems often struggle with the disparate representations of these data types, requiring significant overhead for conversion and limiting performance. Hojabr, however, unifies these representations, allowing computations to flow naturally between dense and sparse data without costly transformations. This core achievement broadens the system’s applicability, enabling efficient execution across a wider range of machine learning workloads-from large language models reliant on dense tensors to recommendation systems that benefit from the memory efficiency of sparse data-and ultimately reducing the need for specialized infrastructure or code modifications.

The pursuit of Hojabr, as detailed in the paper, embodies a recognition that systems aren’t built so much as they evolve. The authors attempt to unify disparate data analytics paradigms – relational, tensor, and graph processing – not through rigid architectural decree, but by creating an intermediate representation that anticipates future needs. This mirrors a fundamental truth: any attempt at ‘perfect architecture’ is ultimately a denial of entropy. As Linus Torvalds observed, “Talk is cheap. Show me the code.” Hojabr, therefore, isn’t merely a theoretical construct; it’s a practical response to the inevitable decay of existing systems, seeking to provide a foundation for reusable optimization techniques and bidirectional compilation that can adapt to changing requirements.

What Lies Ahead?

Hojabr proposes a unification through declarative means, a tempting vision. Every such promise, however, carries the scent of future operational complexity. The ambition to represent relational, tensor, and graph processing within a single intermediate representation is not inherently flawed, but it invites a new class of failures-failures of translation, of optimization, and ultimately, of maintainability. The true test won’t be in demonstrating initial performance gains, but in observing the system’s behavior a decade hence, under the weight of accumulated patches and unforeseen use cases.

The notion of bidirectional compilation is particularly intriguing, suggesting a path toward portable optimization. Yet, one suspects this portability will prove asymptotic, always just beyond reach. Each backend will inevitably demand its own quirks, its own concessions to the iron laws of hardware. The language itself becomes a negotiator, mediating between the ideal of abstraction and the brutal reality of implementation.

Ultimately, Hojabr’s success will hinge not on achieving a ‘theory of everything,’ but on embracing the inevitability of imperfection. Order is merely a temporary cache between failures. The most resilient systems are not those that attempt to eliminate chaos, but those that learn to anticipate, contain, and even harness it. The real innovation may not be the language itself, but the methodologies it inspires for managing the entropy of data analytics at scale.

Original article: https://arxiv.org/pdf/2512.23925.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

The Inevitable Fragmentation of Data: A System’s Lament

The Logic of Constraint: Defining the System’s Boundaries

The Dance of Joins: Adapting to the Shape of Data

Beyond the Present: A System Designed for Growth

What Lies Ahead?

See also: