Seeing the Parts: How Vision Transformers Build Images From Primitives

Author: Denis Avetisyan


New research reveals that Vision Transformers decompose images into fundamental components, mirroring how humans understand visual scenes.

The research details a compositionality framework applied to Vision Transformers, focusing on learning a composition function during Level 1 Discrete Wavelet Transform decomposition - specifically utilizing coefficients <span class="katex-eq" data-katex-display="false">D_a</span>, <span class="katex-eq" data-katex-display="false">D_b</span>, <span class="katex-eq" data-katex-display="false">D_c</span>, and <span class="katex-eq" data-katex-display="false">D_d</span> - to enable a more nuanced understanding of feature representation.

This study uses Discrete Wavelet Transforms to analyze compositional representations learned by Vision Transformer encoders, demonstrating compositional behavior at the final layer.

While deep neural networks excel at complex pattern recognition, understanding how they represent and combine information remains a key challenge. This work, ‘Exploring Compositionality in Vision Transformers using Wavelet Representations’, investigates the compositional structure of representations learned by the Vision Transformer (ViT) encoder. By leveraging the Discrete Wavelet Transform to decompose images into fundamental primitives, we demonstrate that ViT representations at the final encoder layer exhibit approximate compositionality in latent space. This raises the question of whether explicitly encouraging such compositional structure could further enhance the efficiency and interpretability of vision transformers.


Beyond Pixels: Why Computers Still Struggle to See

Conventional computer vision systems frequently encounter difficulties when interpreting images beyond simple object recognition. These methods often analyze images as collections of pixels, neglecting the crucial relationships between objects and their surrounding context. A system might accurately identify a ‘dog’ and a ‘ball’, but fail to understand if the dog is playing with the ball, ignoring the spatial arrangement and implied interaction. This limitation stems from a reliance on statistical correlations rather than genuine comprehension; the system recognizes patterns but lacks the ability to infer meaning from the scene as a whole, hindering performance in complex, real-world scenarios where contextual understanding is paramount. Consequently, these systems struggle with ambiguities, variations in lighting, and occlusions – factors easily resolved by human vision through a nuanced grasp of the visual narrative.

Conventional computer vision systems excel at identifying objects within an image – labeling a cat, a car, or a tree with remarkable accuracy. However, this capability plateaus when faced with understanding the relationships between those objects and their context. These systems struggle with compositional reasoning – the ability to infer meaning from how elements are arranged and interact. For instance, a system might recognize a “person” and a “bicycle”, but fail to understand that the person is riding the bicycle, or that the bicycle is positioned on a road, limiting its capacity for complex scene interpretation and ultimately hindering performance in real-world applications demanding nuanced understanding.

Structural similarity index (SSIM) maps reveal that, even for composed images, the encoder-layer outputs do not immediately exhibit visual indications of compositional understanding across the red, green, and blue channels.

Vision Transformers: A New Approach, But Is It Really Different?

The Vision Transformer (ViT) fundamentally differs from convolutional neural networks by applying the Transformer architecture, originally designed for natural language processing, to images. This is achieved by dividing an input image into a fixed-size set of non-overlapping patches, which are then linearly embedded into a vector space. These embedded patches are treated as analogous to tokens in a sentence, forming a sequence that can be processed by a standard Transformer encoder. This patch-based approach allows ViT to model global relationships within an image without relying on the locality-based assumptions inherent in convolutional operations, effectively translating the principles of sequence transduction to the visual domain.
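To make the patch-token analogy concrete, here is a minimal PyTorch sketch with illustrative sizes (224x224 RGB input, 16x16 patches, 768-dimensional tokens), not necessarily the configuration used in the study; a strided convolution is the usual shorthand for "split into non-overlapping patches, then linearly project".

```python
import torch
import torch.nn as nn

# Minimal sketch of ViT-style patch embedding (illustrative sizes, not the
# paper's exact configuration).
image = torch.randn(1, 3, 224, 224)          # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# Strided convolution = split into patches + linear projection in one step.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

tokens = to_patches(image)                   # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): a sequence of 196 patch tokens
print(tokens.shape)
```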

The core benefit of self-attention in Vision Transformers (ViT) lies in its ability to model relationships between any two image patches, regardless of their distance. Traditional convolutional neural networks (CNNs) inherently focus on local relationships due to the constrained receptive field of convolutional filters. Self-attention, however, computes a weighted sum of all patches to represent each patch, allowing the model to directly capture long-range dependencies – for example, relating an object in the foreground to its context in the background – without being limited by spatial proximity. This global context awareness improves performance on tasks requiring understanding of image-wide relationships, such as object detection, image segmentation, and image classification, particularly in scenarios with complex scenes or occlusions.
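The sketch below shows single-head scaled dot-product attention over patch tokens, a simplification of the multi-head attention ViT actually uses; it makes explicit that every patch's new representation is a weighted sum over all patches, regardless of spatial distance.

```python
import torch
import torch.nn.functional as F

# Single-head scaled dot-product self-attention over patch tokens (a sketch,
# not ViT's multi-head version). Every token's output is a weighted sum over
# all tokens, so distant patches influence each other directly.
def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # query/key/value projections
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # all-pairs similarities
    weights = F.softmax(scores, dim=-1)                    # attention over every patch
    return weights @ v                                     # global, content-dependent mixing

tokens = torch.randn(1, 196, 768)                          # 196 patch tokens, 768-dim each
w_q, w_k, w_v = (torch.randn(768, 768) / 768 ** 0.5 for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)                # (1, 196, 768)
```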

Vision Transformers utilize positional embeddings to address the inherent lack of spatial awareness in the Transformer architecture, which originally processes sequential data without considering position. These embeddings are added to the flattened image patch embeddings, providing the model with information about the location of each patch within the original image. Furthermore, a dedicated learnable “CLS” token is prepended to the sequence of patch embeddings; the final hidden state corresponding to this CLS token is then used as a global representation of the image for downstream classification tasks. This approach allows the model to aggregate information from all patches and perform image-level predictions based on this learned representation.
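A minimal sketch of attaching the CLS token and positional embeddings follows, again with illustrative dimensions; in a real model both parameters are initialized and learned during training.

```python
import torch
import torch.nn as nn

# Sketch of attaching the learnable CLS token and positional embeddings to the
# patch sequence before the Transformer encoder (dimensions illustrative).
num_patches, embed_dim = 196, 768
patch_tokens = torch.randn(1, num_patches, embed_dim)

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                 # learnable [CLS] token
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))   # one position per token

batch = patch_tokens.shape[0]
x = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)  # prepend CLS -> (1, 197, 768)
x = x + pos_embed                                                      # inject spatial position
# After the encoder, x[:, 0] (the CLS state) serves as the global image representation.
```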

A learned combination of ViT-B encoder representations consistently achieves higher CKA scores across layers than a simple sum, indicating improved feature alignment when averaged over 10K images.

Compositionality and Robust Representation Learning: Breaking Down the Image

Compositionality in representation learning refers to the capacity of a system to construct representations of complex data by combining representations of simpler, constituent parts. This approach aligns with human cognitive processes, where understanding is built upon the assembly of basic concepts. In machine learning, a compositional representation allows for generalization to novel data combinations, as the system can leverage its understanding of individual components. The benefit lies in efficient data representation and the ability to extrapolate knowledge beyond the training set; instead of learning each complex concept independently, the model learns how to combine simpler concepts to form more complex ones, leading to a more flexible and robust understanding of the data.

The Discrete Wavelet Transform (DWT) functions as a compositional representation learning method by decomposing an input image into a set of sub-bands representing different frequency components, commonly referred to as Wavelet Primitives. This decomposition is achieved through successive convolutions with high-pass and low-pass filters, followed by downsampling, creating approximations and details at multiple resolutions. The resulting Wavelet Primitives – typically consisting of coefficients representing horizontal, vertical, and diagonal details, alongside an approximation component – allow for a sparse and efficient representation of the image data. This hierarchical structure facilitates analysis by isolating specific frequency bands, enabling tasks such as feature extraction, compression, and denoising, while mirroring the compositional principle of building complex representations from simpler, fundamental components.
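As a concrete illustration, the following sketch uses PyWavelets to perform a Level 1 decomposition. The Haar wavelet is chosen here purely for illustration; the text above does not fix a particular wavelet family.

```python
import numpy as np
import pywt

# Level 1 2D Discrete Wavelet Transform: one image becomes four half-resolution
# sub-bands, the "wavelet primitives".
image = np.random.rand(224, 224)   # stand-in for a grayscale image or one colour channel

approx, (horiz, vert, diag) = pywt.dwt2(image, 'haar')
print(approx.shape)                # (112, 112) low-frequency approximation
print(horiz.shape)                 # (112, 112) horizontal detail; vert/diag likewise

# Recombining the four sub-bands recovers the image (up to numerical error).
reconstructed = pywt.idwt2((approx, (horiz, vert, diag)), 'haar')
assert np.allclose(image, reconstructed)
```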

Evaluating the compositional quality of learned representations requires quantitative metrics beyond standard accuracy benchmarks. Centered Kernel Alignment (CKA) assesses the similarity between the representational geometries of different layers or models, providing a measure of how well features align with underlying semantic concepts. The Structural Similarity Index (SSIM), originally designed for image quality assessment, can be adapted to compare reconstructed images from decomposed representations – such as those derived from the Discrete Wavelet Transform – with the original images, quantifying the preservation of structural information. Higher SSIM and CKA scores generally indicate better compositional structure, suggesting the model effectively breaks down and recombines concepts in a meaningful way; these metrics therefore provide insight into whether a representation is genuinely compositional, even if it doesn’t directly translate to improved classification performance.
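Linear CKA is straightforward to implement; a minimal NumPy sketch is shown below (for SSIM, an off-the-shelf implementation such as skimage.metrics.structural_similarity can be used instead).

```python
import numpy as np

# Minimal linear CKA between two representation matrices whose rows are the
# same n images and whose columns are the features of two models or layers.
def linear_cka(x, y):
    x = x - x.mean(axis=0)                          # center each feature
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, 'fro') ** 2     # ||Y^T X||_F^2
    return cross / (np.linalg.norm(x.T @ x, 'fro') * np.linalg.norm(y.T @ y, 'fro'))

x = np.random.randn(1000, 768)                      # e.g. final-layer features of 1000 images
q, _ = np.linalg.qr(np.random.randn(768, 768))      # random orthogonal rotation
print(linear_cka(x, x))                             # 1.0: identical representational geometry
print(linear_cka(x, x @ q))                         # ~1.0: linear CKA ignores feature rotations
```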

Experimental results indicate that the representations learned by the Vision Transformer (ViT) encoder exhibit compositional properties. Specifically, images were reconstructed from Discrete Wavelet Transform (DWT) primitives generated at the final layer of the ViT encoder, and classification accuracy was evaluated on the reconstructed images. The study found that classification performance using these reconstructed images was nearly equivalent to that of the original ViT model, suggesting the encoder effectively captures and represents image information in a compositional manner based on DWT primitives. This demonstrates that the ViT encoder’s learned features can be decomposed into fundamental frequency components and recombined with minimal loss of representational power, as measured by downstream classification tasks.
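The toy example below sketches the general idea of learning a composition function in latent space, using synthetic stand-in features and a simple learned weighting of sub-band representations; it illustrates the concept rather than the paper's exact procedure, and all names and values are hypothetical.

```python
import torch
import torch.nn as nn

# Toy sketch of learning a composition function in latent space (synthetic
# features, hypothetical setup): given encoder features for the four wavelet
# sub-bands of each image, learn weights so their weighted sum approximates
# the feature of the original, undecomposed image.
dim, n_images = 768, 1000
z_subbands = torch.randn(4, n_images, dim)                           # stand-in sub-band features
z_full = z_subbands.mean(dim=0) + 0.01 * torch.randn(n_images, dim)  # stand-in full-image features

weights = nn.Parameter(torch.ones(4) / 4)                            # one scalar per sub-band
optimizer = torch.optim.Adam([weights], lr=1e-2)
for _ in range(200):
    composed = (weights.view(4, 1, 1) * z_subbands).sum(dim=0)       # weighted combination
    loss = ((composed - z_full) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# The composed features would then be passed to the classification head and
# compared against using the full-image features directly.
```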

Analysis of image reconstruction using the Discrete Wavelet Transform at Level 2 – a deeper decomposition than Level 1 – demonstrates a slight reduction in classification accuracy compared to the original ViT model. However, measurable performance is still achieved, indicating that a significant degree of compositional information is retained even with increased decomposition. This suggests that while some information is lost during deeper wavelet decomposition, the core structure and relationships necessary for effective image representation are not entirely eliminated, and the resulting representations remain useful for downstream tasks like classification.
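For reference, a Level 2 decomposition with PyWavelets looks like this (Haar again chosen only for illustration); note that the transform itself remains exactly invertible when all coefficients are retained.

```python
import numpy as np
import pywt

# Level 2 decomposition: the approximation band is decomposed again, giving a
# coarser approximation plus detail bands at two scales.
image = np.random.rand(224, 224)

coeffs = pywt.wavedec2(image, 'haar', level=2)
approx2, details_level2, details_level1 = coeffs
print(approx2.shape)            # (56, 56)   level-2 approximation
print(details_level2[0].shape)  # (56, 56)   level-2 horizontal detail
print(details_level1[0].shape)  # (112, 112) level-1 horizontal detail

reconstructed = pywt.waverec2(coeffs, 'haar')
assert np.allclose(image, reconstructed)
```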

Evaluations were conducted to determine the persistence of compositional representation qualities under common image distortions. Results indicate that the learned representations, derived from the Discrete Wavelet Transform, maintain a measurable degree of robustness when subjected to JPEG compression and the addition of Gaussian noise. Specifically, performance metrics remained within acceptable ranges despite these distortions, suggesting that the ability to decompose and reconstruct images based on wavelet primitives is not significantly degraded by these typical forms of image manipulation, and that the compositional information remains largely intact even with the introduction of noise or compression artifacts.
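Below is a sketch of applying the two distortions mentioned above to a single image; the quality level and noise strength are illustrative choices, not those used in the study.

```python
import io
import numpy as np
from PIL import Image

# Illustrative distortions: JPEG re-encoding and additive Gaussian noise.
def jpeg_compress(img_array, quality=50):
    buf = io.BytesIO()
    Image.fromarray(img_array).save(buf, format='JPEG', quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf))

def add_gaussian_noise(img_array, sigma=10.0):
    noisy = img_array.astype(np.float32) + np.random.normal(0, sigma, img_array.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

image = (np.random.rand(224, 224, 3) * 255).astype(np.uint8)   # stand-in RGB image
distorted = add_gaussian_noise(jpeg_compress(image))
# The distorted image would then be decomposed with the DWT and passed through
# the encoder exactly as in the clean-image pipeline to compare performance.
```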

While Discrete Fourier and Cosine Transforms yield subbands representing global frequencies, the Discrete Wavelet Transform uniquely provides spatially localized subbands that facilitate compositionality analysis.

Expanding the Horizon: What Does All This Mean for the Future?

Initially heralded for image classification, Vision Transformers are rapidly demonstrating a remarkable adaptability extending far beyond their original scope. Current research showcases their effectiveness in object detection, where they accurately identify and localize multiple objects within an image, and semantic segmentation, enabling pixel-level understanding of image content. Moreover, these models are now successfully generating descriptive captions for images, bridging the gap between visual data and natural language. This versatility stems from their inherent ability to model long-range dependencies within images, allowing them to capture contextual information crucial for complex visual reasoning tasks and ultimately suggesting a future where a single transformer architecture can underpin a wide range of computer vision applications.

Vision Transformers demonstrate a marked improvement in generalization capabilities due to their inherent capacity to model contextual relationships within images. Unlike convolutional neural networks that focus on local features, these transformers analyze images as a whole, discerning how different parts relate to each other. This holistic approach enables the network to learn robust representations – features that are not easily disrupted by variations in lighting, pose, or viewpoint. Consequently, a Vision Transformer trained on a specific dataset often performs remarkably well on previously unseen images and in challenging real-world scenarios, such as those with occlusion or unusual perspectives. The ability to effectively leverage context allows these models to move beyond simply recognizing patterns to truly understanding the content of an image, fostering adaptability and resilience in diverse applications.

Ongoing investigation into Vision Transformers is heavily geared towards practical implementation and broadened scope. Current efforts prioritize diminishing computational demands, making these models more accessible for deployment on devices with limited resources. Simultaneously, researchers are developing techniques to mitigate the need for massive training datasets, a common limitation of many deep learning approaches. This includes exploring self-supervised learning and data augmentation strategies. Beyond traditional computer vision tasks, the potential of Vision Transformers is being actively investigated in specialized fields like medical imaging, where accurate analysis of complex scans is crucial, and robotics, where real-time visual understanding is essential for autonomous navigation and object manipulation. These advancements promise to unlock even more sophisticated and impactful applications in the years to come.

Applying the learned composition model <span class="katex-eq" data-katex-display="false">g^*</span>'s weights to the original image's sub-bands successfully reconstructs the image.

The pursuit of compositional representations in Vision Transformers, as detailed in the study, feels predictably optimistic. The Discrete Wavelet Transform offers a neat way to dissect images, revealing how ViTs think about primitives, but this is merely shifting the complexity. It’s a temporary reprieve. The encoder layer might exhibit compositional behavior now, but production will invariably find the edge cases-the images that break the carefully constructed hierarchy. Fei-Fei Li once said, “AI is not about replacing humans; it’s about augmenting our capabilities.” This research exemplifies that augmentation – a refined tool, destined to be superseded. The real challenge isn’t building more elegant architectures; it’s accepting that every solution is just a more sophisticated form of technical debt.

What Comes Next?

The demonstration of compositional structure within Vision Transformer representations, even if achieved through the lens of wavelet decomposition, merely shifts the question. It doesn’t solve compositionality, but rather locates it-a difference production systems will inevitably highlight. The current work relies on a specific, mathematically convenient, primitive – the wavelet. Future efforts will undoubtedly explore alternative decompositions, and the inevitable loss of signal that accompanies any such simplification. The real test won’t be whether a network can represent composition, but whether it does so consistently under adversarial perturbations, or after a few months of real-world data drift.

One anticipates a proliferation of metrics attempting to quantify compositional generalization. These metrics, while superficially appealing, will prove brittle. Every benchmark is a closed book, and every achieved score is a temporary truce. The focus should perhaps move beyond finding composition to inducing it – designing architectures or training regimes that actively discourage reliance on spurious correlations.

Ultimately, the persistent challenge remains: building systems that fail gracefully, and predictably. A beautiful decomposition is of little comfort when the entire pipeline goes dark on a Monday. The elegance of the theory will be measured not in citations, but in uptime.


Original article: https://arxiv.org/pdf/2512.24438.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
