Uncovering How Vision Transformers Understand Object Relations: A Two-Stage Approach to Visual Reasoning

Despite the success of Vision Transformers (ViTs) in tasks like image classification and generation, they face significant challenges in handling abstract tasks involving relationships between objects. One key limitation is their difficulty in accurately performing visual relational tasks, such as determining if two objects are the same or different. Relational reasoning, which requires understanding spatial or comparative relationships between entities, is a natural strength of human vision but remains challenging for artificial vision systems. While ViTs excel at pixel-level semantic tasks, they struggle with the abstract operations required for relational reasoning, often relying on memorization rather than genuinely understanding relations. This limitation affects the development of AI models capable of advanced visual reasoning tasks such as visual question answering and complex object comparisons.

To address these challenges, a team of researchers from Brown University, New York University, and Stanford University employs methods from mechanistic interpretability to examine how ViTs process and represent visual relations. The researchers present a case study focusing on a fundamental yet challenging relational reasoning task: determining whether two visual entities are identical or different. By fine-tuning pretrained ViTs on these “same-different” tasks, they observed that the models exhibit two distinct stages of processing, despite having no specific inductive biases to guide them. The first, referred to as the perceptual stage, extracts local object features and stores them in a disentangled representation. It is followed by a relational stage, in which these object representations are compared to determine relational properties.

These findings suggest that ViTs can learn to represent abstract relations to some extent, indicating the potential for more generalized and flexible AI models. However, failures in either the perceptual or relational stages can prevent the model from learning a generalizable solution to visual tasks, highlighting the need for models that can effectively handle both perceptual and relational complexities.

Technical Insights

The study provides insights into how ViTs process visual relationships through a two-stage mechanism. In the perceptual stage, the model disentangles object representations by attending to features like color and shape. In experiments using two “same-different” tasks—a discrimination task and a relational match-to-sample (RMTS) task—the authors show that ViTs trained on these tasks successfully disentangle object attributes, encoding them separately in their intermediate representations. This disentanglement makes it easier for the models to perform relational operations in the later stages. The relational stage then uses these encoded features to determine abstract relations between objects, such as assessing sameness or difference based on color or shape.
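To make the two tasks concrete, here is a minimal sketch of how their labels could be defined. It assumes each object is fully described by a color and a shape, that two objects count as "same" only when both attributes match, and that an RMTS trial is "same" when the relation within the sample pair matches the relation within the display pair. The names `Obj`, `discrimination_label`, and `rmts_label` are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple, Tuple


class Obj(NamedTuple):
    """A toy object described only by the two attributes discussed above."""
    color: str
    shape: str


def is_same(a: Obj, b: Obj) -> bool:
    # Assumption: "same" means both color and shape match.
    return a.color == b.color and a.shape == b.shape


def discrimination_label(a: Obj, b: Obj) -> str:
    """Label for the discrimination task: compare two objects directly."""
    return "same" if is_same(a, b) else "different"


def rmts_label(display: Tuple[Obj, Obj], sample: Tuple[Obj, Obj]) -> str:
    """Label for the RMTS task: compare the *relations* of two pairs.

    The trial is "same" when both pairs share a relation (both pairs hold
    identical objects, or both pairs hold differing objects).
    """
    display_relation = is_same(*display)
    sample_relation = is_same(*sample)
    return "same" if display_relation == sample_relation else "different"


red_circle = Obj("red", "circle")
red_square = Obj("red", "square")
blue_star = Obj("blue", "star")

print(discrimination_label(red_circle, red_circle))
print(discrimination_label(red_circle, red_square))
print(rmts_label((red_circle, red_circle), (blue_star, blue_star)))
print(rmts_label((red_circle, red_square), (blue_star, blue_star)))
```

The RMTS label depends only on whether the two relations agree, not on the objects themselves, which is why it is the more abstract of the two tasks.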

The benefit of this two-stage mechanism is that it allows ViTs to achieve a more structured approach to relational reasoning, enabling better generalization beyond the training data. By employing attention pattern analysis, the authors demonstrate that these models use distinct attention heads for local and global operations, shifting from object-level processing to inter-object comparisons in later layers. This division of labor within the model reveals a processing strategy that mirrors how biological systems operate, moving from feature extraction to relational analysis in a hierarchical manner.
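One simple way to picture the local versus global distinction is to measure how much of a head's attention stays inside the object a token belongs to. The sketch below is an illustrative metric under stated assumptions, not the authors' exact analysis: `attn[i][j]` is taken to be the attention weight from token `i` to token `j`, `object_ids[i]` is the index of the object that token `i` covers (or `None` for background), and the function name `within_object_share` is hypothetical.

```python
from typing import List, Optional


def within_object_share(attn: List[List[float]],
                        object_ids: List[Optional[int]]) -> float:
    """Fraction of object-to-object attention that stays within one object.

    Values near 1.0 suggest a "local" head (tokens attend to their own
    object); values near 0.0 suggest a "global" head (tokens attend to the
    other object). Background tokens (id None) are ignored on both sides.
    """
    within = 0.0
    between = 0.0
    for i, src in enumerate(object_ids):
        if src is None:
            continue
        for j, dst in enumerate(object_ids):
            if dst is None:
                continue
            if src == dst:
                within += attn[i][j]
            else:
                between += attn[i][j]
    total = within + between
    return within / total if total > 0 else 0.0


# Toy example: tokens 0-1 cover object 0, tokens 2-3 cover object 1.
object_ids = [0, 0, 1, 1]

# A head whose tokens only attend within their own object.
local_head = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]

# A head whose tokens only attend to the other object.
global_head = [
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

print(within_object_share(local_head, object_ids))
print(within_object_share(global_head, object_ids))
```

Computing a score like this per head and per layer would show whether within-object attention dominates early layers and between-object attention dominates later ones, which is the pattern the article describes.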

This work is significant because it addresses the gap between abstract visual relational reasoning and transformer-based architectures, which have traditionally been limited in handling such tasks. The paper provides evidence that pretrained ViTs, such as those pretrained with CLIP and DINOv2, can achieve high accuracy on relational reasoning tasks when fine-tuned appropriately. Specifically, the authors note that CLIP- and DINOv2-pretrained ViTs achieved nearly 97% accuracy on a test set after fine-tuning, demonstrating their capacity for abstract reasoning when guided effectively.

Another key finding is that the ability of ViTs to succeed in relational reasoning depends heavily on whether the perceptual and relational processing stages are well-developed. For instance, models with a clear two-stage process showed better generalization to out-of-distribution stimuli, suggesting that effective perceptual representations are foundational to accurate relational reasoning. This observation aligns with the authors’ conclusion that enhancing both the perceptual and relational components of ViTs can lead to more robust and generalized visual intelligence.

Conclusion

The findings of this paper shed light on the limitations and potential of Vision Transformers when faced with relational reasoning tasks. By identifying distinct processing stages within ViTs, the authors provide a framework for understanding and improving how these models handle abstract visual relations. The two-stage model—comprising a perceptual stage and a relational stage—offers a promising approach to bridging the gap between low-level feature extraction and high-level relational reasoning, which is crucial for applications like visual question answering and image-text matching.

The research underscores the importance of addressing both perceptual and relational deficiencies in ViTs to ensure they can generalize their learning to new contexts effectively. This work paves the way for future studies aimed at enhancing the relational capabilities of ViTs, potentially transforming them into models capable of more sophisticated visual understanding.


All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
