Decoding Neural Networks: From Weights to Working Code

Author: Denis Avetisyan


Researchers have developed a new architecture that directly extracts human-readable algorithms from the hidden layers of discrete transformers.

A novel framework extracts executable algorithms from Discrete Transformers by leveraging temperature annealing to encourage interpretable discretization, characterizing attention patterns with hypothesis testing, approximating arithmetic transformations via symbolic regression, and integrating these components to synthesize Python code. The approach is demonstrated through the recovery of the parity_last2 algorithm and its implementation of XOR logic.

This work introduces the Discrete Transformer, enabling the automated recovery of symbolic programs from trained neural networks through disentanglement and symbolic regression.

While deep neural networks excel at complex tasks, their internal logic remains largely opaque, hindering our ability to understand how they arrive at solutions. This limitation motivates research into algorithm extraction, and in ‘Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer’, we introduce a novel architecture designed to synthesize human-readable programs directly from a trained network. By enforcing functional disentanglement and employing temperature-annealed sampling, the Discrete Transformer facilitates the extraction of symbolic expressions with performance comparable to recurrent baselines, extending interpretability to continuous variable domains. Could this approach unlock a new era of mechanistic interpretability, allowing us to understand not only what a neural network does, but also why?


Unveiling the Black Box: The Challenge of Deep Network Interpretability

Despite achieving state-of-the-art results in diverse fields, deep neural networks often function as ‘black boxes’, presenting a significant challenge to both trust and continued development. This opacity stems from the complex interplay of millions, even billions, of parameters learned during training, making it difficult to discern why a network arrives at a particular decision. Without understanding the internal logic, verifying the robustness of these systems becomes problematic – a small, adversarial perturbation to the input could trigger unexpected and potentially harmful outputs. Furthermore, refining network performance relies heavily on trial and error, a process greatly hindered by the inability to directly diagnose and address the root causes of errors. Consequently, the lack of interpretability not only limits the deployment of deep learning in critical applications but also impedes the progress of the field itself, hindering the development of more efficient, reliable, and trustworthy artificial intelligence.

While mechanistic interpretability techniques, notably Sparse Autoencoders, have begun to dissect the “black box” of deep neural networks, their capacity to fully explain complex behaviors remains limited. These methods excel at identifying relatively simple features and circuits within a network, offering a window into how individual neurons or small groups contribute to specific tasks. However, as networks scale in size and complexity, emergent behaviors – those not explicitly programmed but arising from the interaction of numerous components – prove exceedingly difficult to trace back to understandable mechanisms. The continuous nature of the internal representations learned by these networks further complicates matters; unlike the discrete steps of a traditional computer program, deep networks operate on gradients and probabilities, making it challenging to pinpoint the exact logic driving a particular decision. Consequently, current interpretability tools often provide fragmented insights, struggling to capture the holistic and dynamic interplay responsible for the network’s overall intelligence.

The very power of deep neural networks arises from their ability to model complex relationships using continuous variables, yet this same characteristic presents a significant hurdle to understanding how they arrive at decisions. Unlike traditional algorithms built on discrete logic – where operations are clearly defined and traceable – the internal representations within deep networks are fluid and distributed. This continuity makes it difficult to pinpoint specific features or concepts that trigger particular outputs; the network doesn’t operate with easily identifiable “if-then” statements. Instead, information is encoded as patterns of activation across numerous interconnected nodes, creating a high-dimensional, analog landscape where the boundaries between concepts are blurred. Consequently, even when a network performs well, the underlying computational logic remains obscured, hindering efforts to debug, refine, or truly trust its reasoning process. Researchers increasingly believe that bridging the gap between continuous representation and discrete logic is essential for achieving genuine interpretability in deep learning.

A Shift in Paradigm: Introducing Discrete Transformers

Traditional transformer models rely on continuous representations for both data and computations, which can lead to inefficiencies and difficulties in interpretability. The Discrete Transformer architecture circumvents these limitations by explicitly enforcing discrete computations. This is achieved through the use of discrete activation functions and routing mechanisms, effectively quantizing internal representations and operations. By restricting computations to a finite set of discrete values, the model reduces computational complexity and facilitates more targeted analysis of information flow. This approach contrasts with the continuous, differentiable nature of standard transformers, allowing for potentially improved scalability and interpretability without sacrificing representational capacity.
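
To make the contrast concrete, the sketch below (an illustration of the general idea, not code from the paper) compares a standard continuous attention read-out with a hard, discrete selection in which exactly one input is routed forward and the decision remains fully traceable.

```python
import numpy as np

def soft_select(scores: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Continuous attention: a softmax-weighted blend of all values."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def hard_select(scores: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Discrete routing: exactly one value is forwarded (one-hot choice)."""
    return values[int(np.argmax(scores))]

scores = np.array([0.1, 2.3, -0.5, 1.9])
values = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0]])

print("soft:", soft_select(scores, values))  # a blend that is hard to attribute
print("hard:", hard_select(scores, values))  # a single, inspectable selection
```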

The Discrete Transformer architecture utilizes functional disentanglement, a design principle that segregates information routing mechanisms from computational arithmetic. This separation is achieved by implementing distinct modules responsible for directing data flow and performing calculations, analogous to the principles observed in Restricted Access Sequence Processing (RASP). In RASP, access to memory and processing elements is controlled by a separate routing network, preventing direct data dependencies between computation and addressing. Similarly, the Discrete Transformer’s disentangled design allows for independent optimization and analysis of routing and arithmetic components, potentially improving efficiency and interpretability by isolating the functions of data manipulation and data flow control.
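
A minimal sketch of this separation, with module names and shapes chosen for illustration rather than taken from the paper, looks roughly like this: the routing function decides only where to read, and the arithmetic function decides only what to compute.

```python
import numpy as np

rng = np.random.default_rng(0)

def route(sequence: np.ndarray, query_pos: int) -> np.ndarray:
    """Routing module: chooses *which* positions to read (here, the last two),
    without transforming the values themselves."""
    return sequence[max(0, query_pos - 1): query_pos + 1]

def arithmetic(selected: np.ndarray, W: np.ndarray, b: np.ndarray) -> float:
    """Arithmetic module: computes *what* to do with the routed values,
    with no knowledge of where in the sequence they came from."""
    return float(np.tanh(selected.flatten() @ W + b).sum())

sequence = rng.normal(size=(8, 1))        # toy length-8 scalar sequence
W = rng.normal(size=(2, 4))
b = np.zeros(4)

selected = route(sequence, query_pos=7)   # data flow, analysable on its own
print(arithmetic(selected, W, b))         # computation, analysable on its own
```

Because neither module references the other's internals, each can be ablated, swapped out, or read off on its own, which is the property the following paragraphs build on.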

The functional disentanglement within Discrete Transformers enables granular control and observation of individual computational elements. By isolating information routing from arithmetic operations, specific modules responsible for particular aspects of processing can be independently analyzed, modified, or bypassed without affecting the entire network. This modularity facilitates targeted interventions – such as ablating specific modules to assess their contribution, or substituting them with alternative implementations – and simplifies debugging processes by narrowing the scope of potential errors. Furthermore, this architecture supports the development of specialized modules tailored to specific tasks, allowing for a more efficient use of computational resources and improved model performance.

Smoothing the Path: Enabling Discrete Computation Through Transition

The architecture utilizes a Smooth Transition Mechanism to address optimization challenges inherent in discrete computation. This mechanism functions by linearly interpolating between the Gumbel-Softmax and Gumbel-Sparsemax distributions during training. Gumbel-Softmax provides a differentiable approximation of a categorical distribution, enabling gradient-based optimization, while Gumbel-Sparsemax encourages sparsity in the generated samples. By smoothly transitioning between these two distributions, the network avoids abrupt shifts in the optimization landscape, promoting stability and allowing for effective learning of discrete representations. The interpolation is parameterized, allowing control over the degree of sparsity and differentiability throughout the training process, and effectively bridging the gap between continuous relaxation and discrete inference.
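
One plausible form of this mechanism, sketched below in NumPy, perturbs the logits with Gumbel noise and then linearly blends the softmax and sparsemax relaxations; the blending weight `alpha` and its schedule are assumptions chosen for illustration, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z: np.ndarray) -> np.ndarray:
    """Euclidean projection onto the probability simplex (Martins & Astudillo, 2016)."""
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum
    k_max = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

def smooth_transition_sample(logits: np.ndarray, temperature: float, alpha: float) -> np.ndarray:
    """Perturb logits with Gumbel noise, then blend the softmax and sparsemax
    relaxations; alpha ramps from 0 (pure Gumbel-Softmax) to 1 (pure Gumbel-Sparsemax)."""
    u = rng.uniform(size=logits.shape) + 1e-12      # avoid log(0)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / temperature
    return (1.0 - alpha) * softmax(z) + alpha * sparsemax(z)

logits = np.array([1.0, 3.0, 0.5, 2.0])
for alpha, temperature in [(0.0, 10.0), (0.5, 5.0), (1.0, 1.0)]:
    print(alpha, temperature, np.round(smooth_transition_sample(logits, temperature, alpha), 3))
```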

Temperature annealing is implemented to progressively increase the sparsity of the network’s representations during training. This is achieved by systematically reducing the temperature parameter within the Gumbel-Softmax/Sparsemax distribution, which encourages the selection of fewer, more dominant features. As the temperature decreases, the probability mass concentrates on a smaller subset of possible discrete choices, effectively driving the network towards a more structurally simplified and interpretable state. This gradual increase in sparsity, guided by temperature annealing, allows the model to refine its computations and ultimately converge on discrete representations without abrupt shifts that might destabilize the optimization process.
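
The schedule itself can be as simple as a monotone decay between the two endpoints reported in the figure below (10.0 down to 1.0); the geometric form here is an assumption, chosen only to illustrate the idea.

```python
def temperature_schedule(step: int, total_steps: int,
                         tau_start: float = 10.0, tau_end: float = 1.0) -> float:
    """Geometric decay from a high (near-uniform) to a low (near-discrete)
    temperature; the 10.0 -> 1.0 range follows the reported experiments,
    the geometric shape is an illustrative choice."""
    ratio = min(step / total_steps, 1.0)
    return tau_start * (tau_end / tau_start) ** ratio

for step in range(0, 1001, 250):
    print(step, round(temperature_schedule(step, total_steps=1000), 3))
```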

The architecture’s smooth transition mechanism enables learning of interpretable computations by avoiding entrapment in local minima during optimization. Training dynamics reveal a distinct phase transition characterized by a decrease in loss prior to a decrease in discrepancy; this indicates that functional convergence – the network learning to perform the desired task – occurs before structural discretization – the stabilization of sparse, discrete representations. This sequence suggests the network initially optimizes for performance, then refines its internal structure to efficiently implement the learned functionality, offering improved stability and interpretability compared to direct discretization methods.
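
One way to quantify this two-stage picture is to compare the relaxed network against its fully discretized counterpart. The metric definitions below are assumptions of mine rather than the paper's: discrepancy measures how far the two forward passes disagree numerically, and agreement measures how often they already make the same discrete choices.

```python
import numpy as np

def discrepancy(soft_out: np.ndarray, hard_out: np.ndarray) -> float:
    """Mean squared gap between the relaxed (soft) forward pass and the
    fully discretized (hard) forward pass of the same network."""
    return float(np.mean((soft_out - hard_out) ** 2))

def agreement(soft_probs: np.ndarray, hard_choices: np.ndarray) -> float:
    """Fraction of routing decisions where the relaxed model already picks
    the same option as its discretized counterpart."""
    return float(np.mean(soft_probs.argmax(axis=-1) == hard_choices))

# Toy illustration: as the relaxation sharpens, both metrics converge.
rng = np.random.default_rng(0)
choices = rng.integers(0, 4, size=64)
for sharpness in (1.0, 3.0, 10.0):
    logits = sharpness * np.eye(4)[choices] + rng.normal(scale=0.5, size=(64, 4))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    print(sharpness,
          round(agreement(probs, choices), 2),
          round(discrepancy(probs, np.eye(4)[choices]), 3))
```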

Training dynamics reveal a phase transition where the initial decrease in loss precedes a more pronounced reduction in discrepancy, both converging as agreement approaches 1.0 during temperature annealing from 10.0 to 1.0 across the spring, sum_last2, diff_last2, parity_last2, and freebody tasks.

Extracting Logic: Recovering Algorithms from Neural Networks

The Discrete Transformer architecture facilitates a novel approach to understanding the inner workings of neural networks by enabling the application of symbolic regression to the numerical Multi-Layer Perceptron (MLP) modules within. Unlike traditional neural network analysis which often yields opaque weights and activations, this technique allows the recovery of explicit, human-readable programs that define the network’s computations. By treating the MLP modules as black boxes and employing symbolic regression, researchers can effectively “dissect” the learned functions, revealing concise mathematical expressions or algorithmic logic. This process not only provides insights into how a network solves a particular task, but also generates interpretable code that can be verified and potentially reused, bridging the gap between complex neural models and understandable algorithmic representations.
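
The paper uses symbolic regression for this step; the sketch below substitutes a much simpler stand-in (a least-squares fit over a small, hand-picked library of candidate terms) to show the workflow of probing an MLP module as a black box and reading off a closed-form expression. The module, the term library, and the threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained numerical MLP module (here it secretly computes x0 - x1).
def mlp_module(x: np.ndarray) -> np.ndarray:
    return x[:, 0] - x[:, 1]

# Candidate symbolic terms to regress against (a tiny hand-picked library).
library = {
    "x0": lambda x: x[:, 0],
    "x1": lambda x: x[:, 1],
    "x0*x1": lambda x: x[:, 0] * x[:, 1],
    "x0^2": lambda x: x[:, 0] ** 2,
}

X = rng.uniform(-2, 2, size=(256, 2))        # probe the module on random inputs
y = mlp_module(X)
basis = np.column_stack([f(X) for f in library.values()])
coeffs, *_ = np.linalg.lstsq(basis, y, rcond=None)

terms = [f"{c:+.2f}*{name}" for name, c in zip(library, coeffs) if abs(c) > 1e-6]
print("recovered expression:", " ".join(terms))   # e.g. "+1.00*x0 -1.00*x1"
```

A full symbolic regression engine additionally searches over expression structures rather than fitting a fixed library, but the probe-then-fit loop is the same.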

The architecture employs Numerical Attention not simply as a weighting of inputs, but as a decisive routing mechanism – a ‘hard’ attention where specific inputs are actively selected based on numerical comparisons. Researchers subjected the attention patterns to rigorous hypothesis testing, moving beyond mere visualization to statistically validate observed behaviors. This analysis consistently revealed that attention focused on identifying either fixed offsets – consistently selecting inputs a certain distance apart – or windowed extrema, pinpointing inputs representing local maxima or minima within a defined range. These predictable patterns suggest the network isn’t engaging in arbitrary association, but instead implementing rudimentary forms of data selection akin to conditional statements in conventional programming, allowing for the subsequent decoding of learned algorithms.
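
In the same spirit, though with plain match rates standing in for the paper's actual statistical tests, one can score a head's hard-attention choices against the two hypothesis families mentioned above: a fixed offset, and an extremum over a trailing window.

```python
import numpy as np

def test_fixed_offset(selected: np.ndarray, offset: int) -> float:
    """Fraction of queries whose hard-attention choice equals (query - offset)."""
    queries = np.arange(len(selected))
    valid = queries - offset >= 0
    return float(np.mean(selected[valid] == (queries - offset)[valid]))

def test_window_max(selected: np.ndarray, values: np.ndarray, window: int) -> float:
    """Fraction of queries whose choice is the argmax of the trailing window."""
    hits = []
    for q, choice in enumerate(selected):
        lo = max(0, q - window + 1)
        hits.append(choice == lo + int(np.argmax(values[lo:q + 1])))
    return float(np.mean(hits))

# Toy check: a head that always looks one position back scores 1.0 on offset=1.
values = np.random.default_rng(0).normal(size=16)
selected = np.maximum(np.arange(16) - 1, 0)
print(test_fixed_offset(selected, offset=1))          # 1.0 for every valid query
print(test_window_max(selected, values, window=3))    # lower: not a windowed-max head
```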

A significant outcome of this research is the demonstrated ability to consistently recover complete algorithmic programs from within the Discrete Transformer architecture. Across a range of tested computational tasks, the extraction process achieved 100% success, indicating that the learned representations are, in fact, encoding explicit and recoverable algorithms. This isn’t merely an approximation of functionality, but a precise reconstruction of the underlying program logic. Further bolstering this claim is the resulting performance of these extracted programs – they consistently achieve near-zero test loss, mirroring the accuracy of the original neural network. This validates the architectural principles guiding the Discrete Transformer and establishes the feasibility of program synthesis directly from learned representations, offering a pathway towards interpretable and verifiable artificial intelligence.
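
For the parity_last2 task highlighted in the abstract, the recovered logic plausibly reduces to the two-line program below; this is an illustrative reconstruction of the routing-plus-XOR structure described in the paper, not the verbatim output of the extraction pipeline.

```python
def parity_last2(sequence):
    """Illustrative form of a recovered parity_last2 program: route the last
    two binary tokens, then combine them with XOR (parity of two bits)."""
    a, b = sequence[-2], sequence[-1]   # routing step: fixed-offset selection
    return a ^ b                        # arithmetic step: parity == XOR

assert parity_last2([0, 1, 1, 0, 1]) == 1   # 0 XOR 1
assert parity_last2([1, 0, 1, 1]) == 0      # 1 XOR 1
```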

Towards Algorithmically Transparent AI: A Vision for the Future

The pursuit of artificial intelligence often presents a trade-off between performance and interpretability; deep learning models, while powerful, frequently operate as ‘black boxes’. However, the Discrete Transformer represents a significant step towards resolving this tension. This novel architecture deliberately integrates principles from symbolic AI – systems that rely on explicit rules and representations – with the learning capabilities of deep neural networks. By forcing the model to reason through discrete, understandable states, rather than continuous, opaque activations, it generates outputs that are inherently more transparent. This allows for direct inspection of the model’s reasoning process, enabling developers and users to understand why a particular decision was reached, rather than simply observing what decision was made. The resulting systems are not only more trustworthy, as their internal logic can be audited and verified, but also more readily debugged and refined, paving the way for increasingly reliable and responsible AI applications.

Algorithmically transparent models, born from the confluence of deep learning and symbolic AI, offer a significant advantage in practical application: they are inherently auditable, debuggable, and readily refined. Unlike the ‘black box’ nature of many contemporary AI systems, these models expose their reasoning processes, allowing developers and stakeholders to trace decisions and identify potential biases or errors. This level of transparency isn’t merely academic; it’s crucial for responsible deployment in sensitive areas like healthcare, finance, and criminal justice. The ability to pinpoint the source of an incorrect prediction facilitates targeted improvements and builds trust in the system’s reliability. Consequently, this fosters a cycle of innovation, where iterative refinement based on clear understanding leads to increasingly robust and trustworthy AI solutions, accelerating progress while mitigating potential risks.

Recent advancements demonstrate that neural networks are no longer limited to static pattern recognition, but possess a growing capacity to model and decipher the underlying principles of complex, dynamic systems. This research reveals how Discrete Transformers, a novel neural network architecture, can effectively represent and predict continuous phenomena – those changing over time – offering a unique lens through which to investigate Continuous Dynamics. By dissecting these systems into discrete, understandable components, the algorithm provides not just predictions, but also interpretable insights into the mechanisms driving change. Consequently, this work transcends traditional machine learning, potentially illuminating fundamental aspects of intelligence itself – how information is processed, patterns are identified, and predictions about the future are made – opening new avenues for both technological innovation and a deeper understanding of cognitive processes.

The pursuit of mechanistic interpretability, as demonstrated by the Discrete Transformer, hinges on revealing the underlying structure governing complex systems. This architecture deliberately constrains the network to express computations in a disentangled manner, enabling the extraction of human-readable algorithms. This approach resonates with the observation of Andrey Kolmogorov: “The shortest explanation of any phenomenon is usually the true one.” The Discrete Transformer aims to achieve precisely that: a concise, transparent representation of the network’s logic. By distilling computations into symbolic code, the system exposes the ‘shortest explanation’ of its behavior, facilitating both understanding and verification of its internal workings. This echoes the core idea of the paper – that a well-structured system reveals its function through elegant simplicity.

Beyond the Code

The Discrete Transformer presents a compelling, if predictably difficult, step towards genuinely understanding the ‘black box’. The architecture’s insistence on disentanglement – forcing a modularity often absent in continuous networks – is less a novel invention than a rediscovery of a fundamental principle: structure dictates behavior. However, achieving true interpretability isn’t merely about extracting code; it’s about verifying that code represents the minimal sufficient explanation. The current approach, while promising, risks uncovering a locally interpretable solution that masks a far more complex underlying process. Modifying one part of the system, even with the goal of clarity, can trigger a cascade of unintended consequences elsewhere.

Future work must grapple with the problem of scale. Extracting algorithms from modestly sized networks is one challenge; scaling this to contemporary, massively parameterized models feels almost Sisyphean. More importantly, the focus should shift from finding algorithms to verifying them. Can formal methods – theorem proving, model checking – be integrated to guarantee the extracted code’s correctness and completeness? The ultimate test isn’t whether a human can read the code, but whether that code accurately reflects the network’s true computational core.

There is an inherent irony in seeking algorithmic explanations for systems designed through gradient descent – a process fundamentally rooted in continuous, non-symbolic optimization. The pursuit of discrete, human-readable code may ultimately reveal not intelligence, but a deep limitation in our own capacity to comprehend systems operating on principles subtly different from our own.


Original article: https://arxiv.org/pdf/2601.05770.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
