Beyond Independence: A New Foundation for Discovering Cause and Effect

Author: Denis Avetisyan


A new approach to causal discovery reframes the problem around the principle of exchangeability, offering a potentially more robust alternative to traditional methods.

Normalized Tübingen pairs are juxtaposed with hyperparameter-tuned samples from a synthetic dataset, demonstrating a correspondence between the two and validating the dataset’s ability to replicate complex system behaviors.

This review introduces an exchangeable synthetic dataset and demonstrates a neural network capable of learning causal structures under this relaxed assumption.

Traditional causal discovery methods operate under distinct assumptions for independent and identically distributed (i.i.d.) data versus time series, yet often implicitly rely on stronger, unacknowledged principles. This paper, ‘Rethinking Causal Discovery Through the Lens of Exchangeability’, argues that reframing i.i.d. causal discovery through the more general lens of exchangeability, which requires only symmetry between observations, reveals hidden assumptions and clarifies existing approaches. We demonstrate this by introducing a novel synthetic dataset built solely on exchangeability and show that a neural network trained on it achieves performance comparable to state-of-the-art methods on real-world benchmarks. Could embracing exchangeability as a foundational principle unlock more robust and generalizable causal discovery algorithms?


Reconsidering Causal Foundations: Beyond the Limits of Independence

Causal discovery, the process of inferring cause-and-effect relationships from data, traditionally rests on the assumption that observations are independently and identically distributed (IID): each data point is treated as a random sample from the same probability distribution, unrelated to all others. However, this IID assumption proves limiting in many real-world scenarios, where data often exhibits dependencies – think of time series data, social networks, or even sequential observations from a single individual. The strict requirements of IID can lead to inaccurate causal inferences when applied to non-IID data, hindering the ability to reliably identify true causal relationships. Consequently, researchers are increasingly recognizing the need to move beyond this restrictive assumption and explore more flexible frameworks capable of handling the complexities of dependent data, potentially unlocking more robust and accurate causal insights.

An alternative approach centers on the principle of exchangeability – the idea that the joint probability distribution remains the same regardless of the order in which observations are presented. This isn’t simply a relaxation of the IID assumption; it represents a more fundamental perspective on causality. By focusing on distributional symmetry rather than strict sequential independence, researchers can potentially uncover causal links in scenarios where the IID assumption fails, such as time series data or systems with hidden dependencies. This shift admits a broader class of models and algorithms capable of handling more complex real-world phenomena, offering a more robust and flexible framework for causal inference.
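
Formally, exchangeability requires only that the joint distribution be invariant under any reordering of the observation indices; a minimal statement of the condition:

```latex
% Exchangeability: for every n and every permutation \pi of {1, ..., n},
% the joint density (or mass function) satisfies
p(x_1, x_2, \ldots, x_n) = p\bigl(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)}\bigr)
```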

De Finetti’s theorem establishes a profound connection between the seemingly relaxed assumption of exchangeability and the more restrictive condition of independent and identically distributed (IID) data. The theorem mathematically demonstrates that any infinite sequence of exchangeable random variables can be represented as a mixture of IID sequences. Essentially, an exchangeable sequence can be viewed as first drawing a latent parameter from some mixing distribution and then sampling IID given that parameter. This isn’t merely a mathematical curiosity; it provides a rigorous justification for using IID-based causal discovery algorithms even when strict IID conditions aren’t fully met, as the underlying causal structure is still identifiable within the mixture components. The theorem therefore acts as a critical bridge, allowing researchers to leverage the well-developed tools of IID-based inference within the more flexible framework of exchangeability, broadening the scope of applicable data scenarios and strengthening the theoretical foundations of causal reasoning.
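
In symbols, the representation expresses the joint law of an (infinite) exchangeable sequence as a mixture over IID laws, with $\theta$ the latent parameter and $\mu$ the mixing distribution:

```latex
p(x_1, \ldots, x_n) = \int_{\Theta} \prod_{i=1}^{n} p(x_i \mid \theta) \, \mathrm{d}\mu(\theta)
```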

The pursuit of accurate causal inference relies heavily on the underlying assumptions about the data-generating process, and a critical re-evaluation of these assumptions is paramount to avoid misleading conclusions. Traditional methods often stumble when confronted with non-IID data, where observations are not independent or identically distributed, leading to spurious relationships or a failure to detect genuine causal links. These limitations aren’t merely theoretical concerns; they manifest in real-world scenarios like time series analysis, social networks, and observational studies where data inherently violates IID assumptions. Consequently, a failure to address these foundational issues can severely compromise the robustness and generalizability of causal models, hindering their utility in prediction, intervention, and policy-making. A move towards more flexible frameworks, acknowledging the inherent complexities of observational data, is therefore essential for advancing the field of causal discovery.

A Controlled Environment: Designing for Exchangeability

The SyntheticDataset was developed as a controlled environment for evaluating causal discovery algorithms specifically with respect to the principle of exchangeability. The dataset is generated programmatically: given a specified causal structure, observations are sampled identically and without dependence on one another, so their joint distribution is invariant to reordering and the exchangeability requirement is satisfied by construction. Unlike observational datasets, which may contain confounding variables or violate independence assumptions, the SyntheticDataset allows researchers to isolate the performance of algorithms based solely on their ability to infer the correct CausalStructure from purely exchangeable data. The generation process involves defining a Directed Acyclic Graph (DAG) representing the causal relationships and then sampling data from this graph using a specified noise distribution. This controlled creation enables precise assessment of algorithm accuracy and robustness under ideal conditions where exchangeability is guaranteed.
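
The paper’s exact generator is not reproduced here, but the construction can be sketched in a few lines: fix a causal mechanism, then draw IID samples from it, which makes the resulting observations exchangeable by construction. The mechanisms below (tanh, square, sine) are illustrative placeholders, not the dataset’s actual specification.

```python
import numpy as np

def sample_pair(n=1000, rng=None):
    """Sketch of one exchangeable synthetic cause-effect pair.

    Rows are IID draws from a fixed mechanism X -> Y, so any permutation
    of the rows has the same joint distribution (exchangeability by
    construction). The concrete mechanism is illustrative only.
    """
    rng = rng or np.random.default_rng()
    x = rng.normal(size=n)                        # cause: exogenous noise
    f = rng.choice([np.tanh, np.square, np.sin])  # randomly chosen mechanism
    y = f(x) + 0.3 * rng.normal(size=n)           # effect: f(X) plus noise
    return np.stack([x, y], axis=1)               # shape (n, 2)

pairs = [sample_pair() for _ in range(32)]        # a small batch of pairs
```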

Naturally occurring datasets frequently present challenges for causal discovery due to unobserved confounding variables, selection bias, and feedback loops, all of which introduce dependencies not reflective of the true causal relationships. These hidden dependencies violate the assumptions of many causal inference algorithms, leading to inaccurate or misleading results. Utilizing synthetic data, specifically designed to avoid these pitfalls, allows researchers to isolate the performance of algorithms under controlled conditions. This approach ensures that any observed deficiencies are attributable to the algorithm itself, rather than to pre-existing biases or complexities within the data, providing a more reliable assessment of causal discovery capabilities.

Controlled experimentation with causal discovery algorithms is enabled through manipulation of the data generation process. This allows for the systematic variation of the underlying CausalStructure – the network of causal relationships between variables – while maintaining specific characteristics in the observed data. By generating datasets with known causal relationships, researchers can quantitatively assess an algorithm’s ability to correctly identify those relationships, measured by metrics such as precision and recall of edges in the discovered graph. This facilitates a more rigorous evaluation than relying on observational data, where confounding variables and hidden dependencies can obscure true causal effects and introduce bias into performance assessments.
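
Given a known ground-truth graph, these metrics reduce to set comparisons over directed edges; a minimal sketch using hypothetical binary adjacency matrices (not code from the paper):

```python
import numpy as np

def edge_precision_recall(true_adj, pred_adj):
    """Precision and recall of directed edges; entry (i, j) = 1 means i -> j."""
    true_adj = np.asarray(true_adj, dtype=bool)
    pred_adj = np.asarray(pred_adj, dtype=bool)
    tp = np.sum(true_adj & pred_adj)           # correctly recovered edges
    precision = tp / max(pred_adj.sum(), 1)    # guard against empty graphs
    recall = tp / max(true_adj.sum(), 1)
    return precision, recall

# Example: true graph X -> Y, predicted graph Y -> X
p, r = edge_precision_recall([[0, 1], [0, 0]], [[0, 0], [1, 0]])
print(p, r)  # 0.0 0.0 -- a reversed edge counts as both a miss and a false alarm
```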

Traditional causal inference benchmarks often utilize real-world datasets which inherently possess confounding variables, selection biases, and unmeasured common causes, thereby obscuring algorithm performance and complicating evaluation. The SyntheticDataset circumvents these issues through controlled data generation, allowing researchers to isolate and assess the ability of causal discovery algorithms to accurately infer the underlying CausalStructure under ideal conditions. This controlled environment enables precise measurement of algorithm strengths and weaknesses, facilitating targeted improvements and more reliable comparisons between different approaches, and providing a clean testbed free from the ambiguities present in observational data.

The synthetic dataset comprises 32 randomly sampled examples used for development and evaluation.

SynthNN: A Convolutional Network for Causal Inference

SynthNN is a convolutional neural network designed for causal relationship identification and is exclusively trained on the SyntheticDataset. This dataset consists of observational data generated under known causal mechanisms, allowing for supervised learning of causal inference tasks. The network architecture utilizes convolutional layers to automatically learn relevant features from the data, bypassing the need for manual feature engineering. Training SynthNN on this dataset allows the model to directly associate patterns in the observational data with underlying causal relationships, enabling it to predict causal links given new, unseen data from the same distribution.

SynthNN employs a convolutional neural network architecture to efficiently identify relationships within the SyntheticDataset. Convolutional layers are designed to detect local patterns and dependencies, which are crucial for determining causal links. This approach allows the network to learn hierarchical representations of the data, capturing complex interactions between variables without requiring explicit feature engineering. By processing the data through multiple convolutional filters, SynthNN can effectively extract relevant features and improve the accuracy of CausalDiscovery compared to methods that rely on fully connected networks or other less spatially-aware architectures.
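
SynthNN’s published architecture is not detailed in this review, so the sketch below only illustrates the general pattern, under the assumption that each variable pair is rendered as a 2D histogram of its joint samples – a common input representation for convolutional cause-effect classifiers. The class name and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class PairCNN(nn.Module):
    """Illustrative CNN for cause-effect direction classification.

    Assumes each pair (X, Y) is binned into a 2D histogram, so the
    input is a 1 x bins x bins "image". Not SynthNN's actual design.
    """
    def __init__(self, bins=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * (bins // 4) ** 2, 2)  # X->Y vs Y->X

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

logits = PairCNN()(torch.randn(8, 1, 32, 32))  # a batch of 8 histograms
```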

SynthNN’s design directly incorporates the exchangeability inherent in the SyntheticDataset, meaning the order of observations does not affect the inferred causal relationships. This property allows the network to learn a causal structure based on statistical dependencies without being misled by spurious correlations arising from data ordering. By explicitly leveraging exchangeability, SynthNN avoids making assumptions about data generation processes and provides a more robust and principled approach to causal discovery compared to methods that do not account for this characteristic. The convolutional architecture further facilitates this by operating on data representations that are invariant to permutations of the input variables, reinforcing the exploitation of exchangeability for accurate causal inference.
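
This invariance is easy to check for the histogram representation assumed above: shuffling the rows of a sample matrix leaves the features untouched. The featurize helper is hypothetical.

```python
import numpy as np

def featurize(pair, bins=32):
    """2D histogram of the joint samples; invariant to observation order."""
    h, _, _ = np.histogram2d(pair[:, 0], pair[:, 1], bins=bins, density=True)
    return h

rng = np.random.default_rng(0)
pair = rng.normal(size=(1000, 2))
perm = rng.permutation(len(pair))                 # random reordering of rows
assert np.allclose(featurize(pair), featurize(pair[perm]))  # same features
```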

SynthNN, the convolutional neural network developed for causal discovery, consists of 1,739,777 trainable parameters. When evaluated on the SyntheticDataset, the model achieves a classification accuracy of 67.0% and an Area Under the Receiver Operating Characteristic curve (AUROC) of 71.4%. These performance metrics demonstrate that SynthNN is competitive with existing causal discovery algorithms when applied to datasets with known ground truth, offering a comparable level of accuracy in identifying causal relationships.
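
For reference, both metrics are standard to compute for a binary direction classifier; the labels (0 = X→Y, 1 = Y→X) and scores below are placeholders, not the paper’s results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([0, 1, 1, 0, 1, 0, 0, 1])       # placeholder ground truth
scores = np.array([0.2, 0.8, 0.6, 0.4, 0.7, 0.3, 0.55, 0.9])  # P(Y -> X)

accuracy = np.mean((scores > 0.5) == labels.astype(bool))
auroc = roc_auc_score(labels, scores)             # threshold-free ranking metric
print(f"accuracy={accuracy:.3f}  AUROC={auroc:.3f}")
```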

The neural network trained on synthetic data demonstrates consistently high AUROC and accuracy on both the training and validation sets, and generalizes well to the Tübingen dataset.

From Validation to Real-World Impact: Broadening the Scope of Causal Inference

SynthNN’s capabilities were rigorously tested using the TübingenDataset, a widely recognized benchmark in the field of causal discovery known for its complexity and real-world relevance. This dataset, comprising cause-effect pairs drawn from diverse domains, provided a crucial evaluation ground for assessing the model’s ability to discern true causal relationships from mere correlations. Performance on the TübingenDataset showed that SynthNN identifies causal structure at a level competitive with established methods, despite having been trained only on synthetic data. The use of this benchmark allowed for a direct comparison with existing approaches, positioning SynthNN as a promising new tool for unraveling complex systems and offering insights beyond correlational analysis.

Evaluations using the TübingenDataset reveal that SynthNN achieves performance competitive with existing causal discovery algorithms, but crucially, it also demonstrates a tangible benefit from pre-training on the exchangeable SyntheticDataset. This initial exposure to structurally simpler data appears to enhance SynthNN’s ability to generalize to the complexities of real-world scenarios, suggesting the pre-training phase instills a valuable inductive bias. The algorithm doesn’t simply memorize patterns; instead, it leverages fundamental principles of exchangeability learned from the synthetic data to more effectively discern causal relationships within the TübingenDataset, improving both accuracy and robustness in identifying true dependencies.

Accurate identification of causal relationships forms the bedrock of effective intervention design and robust predictive modeling across numerous disciplines. Establishing which variables directly influence others allows for targeted interventions – for example, in public health, understanding the causal links between lifestyle factors and disease enables the development of more effective preventative strategies. Similarly, in machine learning, causal models move beyond mere correlation to provide explanations, improving the reliability and generalizability of predictions, especially when facing shifts in underlying data distributions. Consequently, a precise understanding of causality isn’t simply an academic pursuit; it’s a practical necessity for optimizing strategies, making informed decisions, and building systems that can effectively navigate complex, real-world scenarios, leading to more successful outcomes in fields ranging from economics and climate science to personalized medicine and artificial intelligence.

This research underscores a critical need to revisit long-held assumptions within the field of causal inference. Traditional methods often rely on principles that, while seemingly intuitive, can limit performance in complex, real-world scenarios. By incorporating the principle of exchangeability – the idea that the order of observations shouldn’t affect conclusions – this work demonstrates a pathway towards more robust and reliable causal discovery. The demonstrated benefits suggest that actively questioning foundational tenets and embracing alternative frameworks is not merely an academic exercise, but a vital step in developing methods capable of accurately unraveling cause-and-effect relationships and ultimately, driving more effective interventions and predictions across diverse disciplines.

The distributions of statistical assumptions are shown alongside example time series data from the Tübingen dataset, illustrating the characteristics of the data used in the analysis.

The pursuit of causal discovery, as detailed in this work, necessitates a careful consideration of underlying assumptions. The shift from i.i.d. to exchangeability isn’t merely a technical adjustment; it’s a restructuring of how one conceptualizes the generative process. This approach, prioritizing the invariance of probability under permutation, reveals a deeper understanding of observational data. As Andrey Kolmogorov stated, “The most important things are the ones we don’t know.” This sentiment aptly describes the challenge of uncovering causal relationships, where acknowledging the limits of current knowledge is the first step towards building more robust and scalable inference systems. The synthetic dataset introduced here serves as a controlled environment to explore these limitations and refine methodologies, ultimately striving for solutions where simplicity – in the form of well-defined exchangeability – scales beyond the constraints of traditional approaches.

The Road Ahead

The shift toward exchangeability, rather than strict i.i.d. assumptions, exposes a fundamental truth: causal discovery isn’t about finding needles in a haystack of randomness, but rather understanding the structure that permits certain patterns to arise. This work, while demonstrating a promising avenue with neural networks and synthetic data, implicitly acknowledges a deeper problem. Every new dependency introduced into a model – be it a neural network layer or a statistical assumption – is the hidden cost of freedom. The elegance of a system lies not in its complexity, but in its minimal sufficient structure.

Future work must confront the limitations of synthetic datasets. While useful for initial validation, these environments, by definition, distill reality. The true test will lie in applying these exchangeability-based methods to observational data – messy, incomplete, and riddled with unobserved confounders. Success will not come from building more sophisticated algorithms, but from developing a more nuanced understanding of how structure dictates behavior – and how readily that structure is obscured by the noise of the world.

Ultimately, the pursuit of causal discovery isn’t merely a technical problem; it is a philosophical one. It demands a continual reevaluation of what constitutes evidence, and a recognition that every model, no matter how sophisticated, is but a pale imitation of the intricate organism it attempts to represent.


Original article: https://arxiv.org/pdf/2512.10152.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
