Author: Denis Avetisyan
A new approach leverages the power of large language models to dramatically improve the precision of composed image retrieval, moving beyond simple keyword matching.

This work introduces SoFT, a training-free re-ranking module and a novel dataset construction pipeline for zero-shot composed image retrieval using prescriptive and proscriptive textual constraints.
Despite advances in composed image retrieval, existing zero-shot methods struggle to balance desired and undesired attributes when interpreting user intent. This work, ‘Soft Filtering: Guiding Zero-shot Composed Image Retrieval with Prescriptive and Proscriptive Constraints’, introduces SoFT, a training-free module that refines retrieval by leveraging large language models to extract both what should and should not be present in the target image. By re-ranking candidates with these complementary textual constraints, SoFT significantly improves retrieval accuracy on standard benchmarks without requiring labeled data or modifying the base model. Could this approach unlock more nuanced and reliable image search experiences, better reflecting the complexities of human requests?
Deconstructing the Visual Query: Beyond Pixel-Level Matching
Conventional image retrieval systems demonstrate proficiency in identifying visually similar images – a search for “red cars” readily returns images featuring red automobiles. However, these systems falter when presented with more complex queries that describe alterations or relationships, such as “a red car with a spoiler” or “a painting in the style of Van Gogh.” This limitation stems from their reliance on low-level feature matching; they assess pixel-level similarities rather than comprehending the semantic meaning and nuanced modifications expressed in natural language. Consequently, a search for “a cat wearing sunglasses” might return images of cats and sunglasses displayed separately, failing to recognize the specific composite scene requested – highlighting a significant gap in bridging the semantic divide between visual content and textual descriptions.
Composed Image Retrieval represents a significant leap beyond traditional methods, as it necessitates a system’s ability to decipher the relationships articulated between visual content and descriptive language. Simple similarity matching, while effective for finding visually comparable images, falters when faced with requests that involve complex arrangements or modifications – for example, “a red car next to a blue house.” Successfully fulfilling such requests demands more than pixel-level comparisons; it requires an understanding of spatial prepositions, object interactions, and the contextual meaning embedded within the textual query. This shift necessitates models that can reason about the composition of a scene and how elements relate to one another, moving beyond identifying what is present in an image to understanding how it is arranged and described.
Effective composed image retrieval hinges on the development of robust multi-modal representation learning techniques. These approaches aim to create a shared embedding space where both visual features extracted from images and semantic information gleaned from text reside. Instead of treating images and text as separate entities, the goal is to learn correspondences – understanding how textual descriptions relate to specific visual elements and their relationships within an image. This requires models capable of not just recognizing objects, but also comprehending spatial arrangements, attributes, and actions described in language. Successfully bridging this modality gap allows systems to move beyond simply finding visually similar images and towards retrieving images that accurately reflect the meaning conveyed in a composed query, opening possibilities for more intelligent and nuanced image search.

Supervised vs. Zero-Shot: The Evolution of Composed Image Retrieval
Initial approaches to Composed Image Retrieval (CIR) were predicated on supervised methodologies. These methods required extensive, manually annotated datasets consisting of image-text-target triplets, where each triplet links a reference image and a textual modification to the target image to be retrieved. The training process involved learning a mapping function that associates the input image and text with the correct target image, typically using a loss function that minimizes the distance between the predicted and ground-truth target images. The performance of these supervised models was heavily dependent on the size and quality of the labeled training data, and generalization to unseen data distributions or novel concepts was often limited by the biases present in the training set.
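As a concrete illustration, the following minimal sketch shows how such a supervised training step might look, assuming PyTorch and embeddings already produced by separate image and text encoders; the additive fusion, the hypothetical function name `supervised_cir_loss`, and the margin value are illustrative choices, not a specific published recipe.

```python
# Minimal sketch of a supervised CIR training objective, assuming PyTorch and
# precomputed embeddings from hypothetical image/text encoders.
import torch
import torch.nn.functional as F

def supervised_cir_loss(ref_img_emb, mod_text_emb, target_emb, neg_emb, margin=0.2):
    """Triplet-style objective: the fused (reference image + modification text)
    query should lie closer to the annotated target image than to a negative."""
    # Simple additive fusion of the two query modalities (one common, illustrative choice).
    query = F.normalize(ref_img_emb + mod_text_emb, dim=-1)
    target = F.normalize(target_emb, dim=-1)
    negative = F.normalize(neg_emb, dim=-1)

    pos_dist = 1.0 - (query * target).sum(dim=-1)    # cosine distance to the target
    neg_dist = 1.0 - (query * negative).sum(dim=-1)  # cosine distance to the negative
    return F.relu(pos_dist - neg_dist + margin).mean()

# Usage with random stand-ins for encoder outputs (batch of 8, embedding dim 512):
loss = supervised_cir_loss(torch.randn(8, 512), torch.randn(8, 512),
                           torch.randn(8, 512), torch.randn(8, 512))
```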
The creation of large, labeled datasets for supervised Composed Image Retrieval (CIR) presents significant logistical and financial challenges. Data acquisition necessitates manual annotation of image-text-target triplets, a process that is both time-consuming and expensive at scale. Furthermore, models trained on fixed datasets exhibit limited generalization capability when applied to novel image categories or target attributes not present in the training data. This inflexibility motivated the development of Zero-Shot CIR methods, which aim to circumvent the need for task-specific labeled data and improve adaptability to previously unseen scenarios by leveraging pre-trained models.
Zero-Shot Composed Image Retrieval (CIR) circumvents the need for labeled, task-specific datasets by leveraging pre-trained vision-language models, notably CLIP (Contrastive Language-Image Pre-training). These models are trained on extensive collections of image-text pairs, enabling them to establish a shared embedding space where visual and textual representations of concepts are aligned. During retrieval, the composed query – a reference image together with its modification text – is encoded into this embedding space, and candidate images are ranked by cosine similarity or another distance metric. This approach generalizes to novel object categories and retrieval criteria without additional training, providing increased adaptability and reducing the costs associated with data annotation.
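A minimal sketch of this retrieval step is shown below, assuming image and text embeddings have already been extracted with a pretrained CLIP-style encoder; the function name, embedding dimension, and gallery size are arbitrary stand-ins.

```python
# Minimal sketch of zero-shot retrieval in a shared CLIP-style embedding space.
import numpy as np

def zero_shot_retrieve(query_emb, gallery_embs, k=5):
    """Rank gallery images by cosine similarity to the composed-query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                  # cosine similarity of every gallery image to the query
    order = np.argsort(-sims)     # indices sorted by descending similarity
    return order[:k], sims[order[:k]]

# Usage with random stand-ins: a 512-d query against 1,000 gallery embeddings.
top_idx, top_scores = zero_shot_retrieve(np.random.randn(512), np.random.randn(1000, 512))
```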

Refining the Search: LLMs as Constraint Architects
Recent advancements in constraint refinement leverage the capabilities of large language models (LLMs), such as GPT-4o, to enhance both the generation of modification text and the accuracy of the retrieval process. Specifically, LLMs are applied to rephrase and optimize modification instructions, ensuring clarity and relevance to the target data. Simultaneously, these models improve retrieval by refining search queries and evaluating the semantic similarity between queries and candidate items. This dual application of LLMs allows for more precise and effective constraint-based modification, addressing limitations in traditional methods that rely on manually crafted rules or simple keyword matching. The integration of LLMs facilitates a more nuanced understanding of both desired changes and the underlying data, leading to improved performance in tasks requiring targeted modifications.
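To make the LLM's role concrete, the hedged sketch below shows one way a modification instruction could be rephrased into an explicit target description using the OpenAI Python client and the GPT-4o model named above; the prompt wording and the helper name `refine_modification` are illustrative assumptions, not the paper's actual prompt.

```python
# Minimal sketch of LLM-based refinement of a modification instruction,
# assuming the OpenAI Python client (openai>=1.0) and the "gpt-4o" model.
from openai import OpenAI

def refine_modification(caption: str, modification: str) -> str:
    """Ask the LLM to merge a reference-image caption with the requested edit
    into a single, unambiguous description of the target image."""
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    prompt = (
        f"Reference image caption: {caption}\n"
        f"Requested change: {modification}\n"
        "Rewrite these as one concise description of the desired target image."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Usage (requires an API key, so not executed here):
# print(refine_modification("a woman in a long red dress", "make it sleeveless"))
```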
Dual Textual Constraints leverage large language models to enhance constraint specification by incorporating both positive and negative guidance. Traditionally, constraints focused on defining desired attributes – the prescriptive element. However, LLMs now facilitate the inclusion of proscriptive constraints, explicitly outlining undesired characteristics. This dual approach allows for more nuanced control over the modification or retrieval process, enabling systems to not only seek specific features but also actively avoid others. The combination of prescriptive and proscriptive guidance improves precision and reduces unintended outcomes, particularly in applications like image editing or product search where avoiding certain attributes is as important as including others.
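A minimal sketch of how such dual constraints could be applied at re-ranking time is given below, assuming precomputed CLIP-style embeddings for the candidate images and for the LLM-generated prescriptive and proscriptive texts; the linear weighting (alpha, beta) and the function name are illustrative assumptions rather than the paper's exact scoring rule.

```python
# Minimal sketch of dual-constraint re-ranking in the spirit of soft filtering,
# assuming precomputed embeddings; the weighting scheme is an assumption.
import numpy as np

def rerank_with_constraints(cand_embs, base_scores, pos_text_emb, neg_text_emb,
                            alpha=0.5, beta=0.5):
    """Re-rank candidates: reward similarity to the prescriptive ('should be present')
    text and penalize similarity to the proscriptive ('should not be present') text."""
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    p = pos_text_emb / np.linalg.norm(pos_text_emb)
    n = neg_text_emb / np.linalg.norm(neg_text_emb)
    score = base_scores + alpha * (c @ p) - beta * (c @ n)
    return np.argsort(-score)     # candidate indices in the new order

# Usage with random stand-ins: re-rank 100 candidates of embedding dim 512.
new_order = rerank_with_constraints(np.random.randn(100, 512), np.random.rand(100),
                                    np.random.randn(512), np.random.randn(512))
```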
Single-Target Rewriting, a technique for iteratively refining text prompts to achieve desired outputs, and the development of comprehensive datasets for evaluation are both enhanced through the application of large language models (LLMs). LLMs facilitate the automated refinement of prompts used in Single-Target Rewriting, improving the efficiency and quality of the iterative process. Furthermore, datasets such as FashionIQ and CIRR, designed to rigorously test retrieval and modification capabilities, benefit from LLM-assisted data augmentation and error analysis. Specifically, LLMs can generate more diverse and challenging test cases, identify subtle errors in system outputs, and assist in the creation of more robust evaluation metrics, leading to more reliable performance assessments.

Models in Concert: Augmentation and the Pursuit of Robustness
CIReVL, OSrCIR, and LDRE represent a class of Composed Image Retrieval (CIR) models that utilize Large Language Models (LLMs) to enhance retrieval performance. These models move beyond traditional methods by employing LLM-driven text generation to rephrase queries, effectively expanding semantic coverage and improving matching with relevant images. CIReVL, for example, leverages an LLM to generate diverse and contextually relevant queries from the initial text input. OSrCIR and LDRE further refine this approach with specific architectural choices and training strategies, also centered on LLM-based query reformulation. This technique proves particularly valuable in CIR tasks where lexical mismatch between queries and image captions is prevalent, as the generated queries can capture a broader range of semantically similar expressions.
Data augmentation techniques, particularly Multi-Target Triplet Construction, address the difficulty of training robust Composed Image Retrieval (CIR) models, especially when utilizing datasets with inherent challenges like those present in CIRCO. This strategy involves creating synthetic training samples by generating multiple positive and negative pairings based on existing image-text data. The construction of triplets – consisting of an anchor image, a positive matching text description, and a negative non-matching description – allows the model to learn more discriminative features. By increasing the diversity and volume of training data through this process, the model becomes less susceptible to overfitting and improves its generalization performance on unseen data, leading to enhanced retrieval accuracy in complex scenarios.
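Following the description above, a schematic sketch of how such triplets might be assembled from existing image-caption pairs is shown below; the actual construction pipeline is likely more involved (e.g. LLM-generated modification texts and multiple valid targets), and the helper name and negative-sampling count are illustrative assumptions.

```python
# Minimal sketch of triplet construction from image-caption pairs, with
# negatives drawn from the captions of other images.
import random

def build_triplets(pairs, negatives_per_anchor=3, seed=0):
    """pairs: list of (image_id, caption). Returns (anchor, positive, negative)
    triplets, several per anchor, to diversify the training signal."""
    rng = random.Random(seed)
    triplets = []
    for img_id, caption in pairs:
        other_captions = [c for i, c in pairs if i != img_id]
        for neg in rng.sample(other_captions, min(negatives_per_anchor, len(other_captions))):
            triplets.append((img_id, caption, neg))
    return triplets

# Usage with a toy set of three image-caption pairs.
toy = [("img1", "a red dress"), ("img2", "a blue jacket"), ("img3", "white sneakers")]
print(build_triplets(toy)[:4])
```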
The integration of the CIReVL model with the SoFT re-ranking module resulted in a mean Average Precision at 50 (mAP@50) score of 27.93 on the CIRCO dataset. This represents a quantifiable 6.13 point improvement over the baseline CIReVL model, which utilized a ViT-L/14 backbone. The mAP@50 metric averages precision over the top 50 retrieved candidates, accounting for the multiple valid target images annotated per query in CIRCO, and provides a standardized measure of performance on this composed image retrieval benchmark.
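For reference, a minimal sketch of how mAP@k can be computed for queries with multiple valid targets is given below; normalizing by min(#ground truths, k) follows one common definition and may differ in detail from the official CIRCO evaluation script.

```python
# Minimal sketch of mAP@k for benchmarks with multiple ground-truth targets per query.
import numpy as np

def average_precision_at_k(ranked_ids, relevant_ids, k=50):
    """Average precision over the top-k retrieved items for a single query."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / rank)          # precision at each relevant hit
    denom = min(len(relevant), k)
    return sum(precisions) / denom if denom else 0.0

def mean_ap_at_k(all_rankings, all_relevant, k=50):
    """Mean of the per-query average precisions."""
    return float(np.mean([average_precision_at_k(r, g, k)
                          for r, g in zip(all_rankings, all_relevant)]))

# Usage: two toy queries, each with a ranked candidate list and a ground-truth set.
print(mean_ap_at_k([[3, 1, 7, 2], [5, 9, 4]], [[1, 2], [9]], k=4))   # -> 0.5
```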
Fusion-based models and inversion-based models both capitalize on the capabilities of Contrastive Language-Image Pre-training (CLIP) to improve multi-modal understanding. Fusion-based approaches typically combine CLIP’s image and text embeddings through concatenation or other fusion layers, allowing the model to jointly reason about both modalities. Inversion-based models, conversely, map the reference image into a pseudo-word token in CLIP’s text embedding space, so that the composed query can be expressed as a single textual prompt and compared directly against candidate images. Both strategies benefit from CLIP’s pre-trained alignment of visual and textual representations, improving cross-modal retrieval by bridging the semantic gap between image and text data.
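As a schematic of the fusion-based route, the sketch below composes CLIP image and text embeddings with a small learned projection; the class name, layer sizes, and the concatenation choice are illustrative assumptions rather than a specific published architecture.

```python
# Minimal sketch of a fusion-style query composer on top of CLIP embeddings, assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionComposer(nn.Module):
    """Concatenate CLIP image and text embeddings and project the result back
    into the shared retrieval space."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, img_emb, txt_emb):
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)

# Usage with random stand-ins for CLIP embeddings (batch of 4, dim 512).
composer = FusionComposer()
query_emb = composer(torch.randn(4, 512), torch.randn(4, 512))
```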

Beyond the Horizon: The Evolving Landscape of Composed Image Retrieval
Composed Image Retrieval (CIR) is undergoing a significant evolution, driven by the synergistic interplay of cutting-edge technologies. Powerful Large Language Models (LLMs) now provide the capacity to understand and articulate complex visual requests, while refined data augmentation techniques generate diverse and representative training datasets. This combination is further enhanced by advancements in model architectures, enabling systems to more effectively bridge the semantic gap between textual queries and visual content. The result is a dramatic expansion of CIR capabilities, moving beyond simple keyword matching to nuanced understanding and precise image selection – ultimately unlocking possibilities for more intelligent and expressive visual search.
Recent advancements in composed image retrieval leverage a novel soft filtering module, SoFT, which utilizes large language models to generate nuanced constraints for refining search results. This approach moves beyond simple keyword matching by interpreting the intent behind a composed query and applying it as a filter during the retrieval process. Evaluations demonstrate SoFT’s efficacy; the module achieves a remarkable Recall at 1 (R@1) score of 70.31, indicating a high probability of the most relevant image appearing at the very top of the results. Furthermore, SoFT provides a substantial 4.27 point improvement in mean Average Precision at 5 (mAP@5) compared to existing baseline models, suggesting a significant enhancement in the overall quality and relevance of the retrieved image set. This re-ranking capability represents a pivotal step towards more intelligent and accurate visual search systems.
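For completeness, the Recall@k metric cited above can be computed as sketched below; this is a generic illustration that assumes each query's ground-truth targets are known, and the helper name is hypothetical.

```python
# Minimal sketch of Recall@k: the fraction of queries whose ground-truth target
# appears among the top-k ranked candidates.
def recall_at_k(all_rankings, all_targets, k=1):
    hits = sum(1 for ranked, targets in zip(all_rankings, all_targets)
               if set(ranked[:k]) & set(targets))
    return hits / len(all_rankings)

# Usage: two toy queries; the first target is ranked 1st, the second 3rd.
print(recall_at_k([[7, 2, 5], [4, 8, 6]], [[7], [6]], k=1))   # -> 0.5
```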
Recent advances in composed image retrieval have culminated in a particularly strong performance on the challenging Multi-Target FashionIQ dataset. By combining the SEARLE model – a textual-inversion approach that represents the reference image as a pseudo-word in CLIP’s text embedding space – with the SoFT module, which refines results using language-based constraints, researchers achieved a recall at 10 (R@10) score of 45.50. This represents a significant leap forward in the field, demonstrating the power of integrating large language models with visual search technologies to more accurately identify images matching complex, multi-faceted descriptions. The success on this dataset highlights the potential for creating retrieval systems capable of understanding nuanced requests and delivering highly relevant visual content, particularly within the dynamic landscape of fashion e-commerce and personalized visual discovery.
Continued development in composed image retrieval (CIR) is anticipated to prioritize three key areas: bolstering system robustness against ambiguous or noisy input, increasing computational efficiency for real-time applications, and enhancing user controllability over the creative process. These advancements promise to unlock CIR’s potential across diverse fields; e-commerce platforms could offer highly specific product searches based on complex descriptions, content creators could rapidly generate visual assets tailored to precise artistic visions, and visual search engines could deliver results that truly reflect the user’s intent. Further refinement will likely involve exploring novel model architectures, advanced training methodologies, and innovative approaches to constraint satisfaction, ultimately enabling CIR systems to seamlessly integrate into everyday workflows and empower a new generation of visual applications.

The pursuit of refined retrieval systems necessitates a willingness to challenge established boundaries. This work, introducing SoFT and its dual-constraint approach, embodies that spirit. It doesn’t simply accept the limitations of zero-shot learning; instead, it actively probes for ways to guide the process with both prescriptive and proscriptive constraints. This resonates deeply with the sentiment expressed by Edsger W. Dijkstra: “It’s not enough to get it right; you have to understand why it’s right.” SoFT, through its careful construction of multi-target datasets and re-ranking module, seeks not merely improved precision in composed image retrieval, but a deeper understanding of how large language models can be effectively harnessed for multimodal tasks. The system intentionally tests the edges of what’s possible, viewing potential ‘bugs’ – in this case, retrieval failures – as signals to refine the approach.
Beyond the Filter
The current work, while demonstrating gains in composed image retrieval, merely scratches the surface of a far more fundamental challenge: defining visual semantics through language. SoFT rightly identifies the power of both positive and negative constraints, yet it treats these as static signals. A more rigorous exploration would involve actively testing the boundaries of these constraints, probing where language fails to adequately capture visual nuance, and where apparent contradictions reveal deeper, more complex relationships. The system functions, in essence, as a sophisticated echo chamber – it confirms what the language model already ‘knows.’
The newly constructed dataset, while a step forward, remains a curated artifact. True progress demands a shift towards self-generating datasets – systems capable of identifying and resolving ambiguities in visual-textual pairings, and creating challenging examples that expose the limitations of current retrieval models. This isn’t simply about increasing dataset size, but about constructing a benchmark that actively resists easy solutions, forcing the field to confront the inherent messiness of real-world visual data.
Ultimately, the architecture of retrieval systems mirrors the architecture of understanding itself. The most fruitful lines of inquiry will likely lie not in refining the filters, but in dismantling the very notion of a ‘correct’ answer, embracing the inherent indeterminacy of visual meaning. Chaos is not an enemy, but a mirror of architecture reflecting unseen connections.
Original article: https://arxiv.org/pdf/2512.20781.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/