MMSearch Engine: AI Search with Advanced Multimodal Capabilities to Accurately Process and Integrate Text and Visual Queries for Enhanced Search Results

Traditional search engines have relied predominantly on text-based queries, which limits their ability to interpret the increasingly complex information found online. Many modern websites combine text and images, yet conventional search engines remain weak at handling multimodal queries, that is, queries that require an understanding of both visual and textual content. Large Language Models (LLMs) have shown great promise in improving the accuracy of textual search results, but they still fall short on queries that involve images, videos, or other non-textual media.

One of the major challenges in search technology is bridging the gap between how search engines process textual data and the growing need to interpret visual information. Users today often seek answers that require more than text; they may upload images or screenshots, expecting AI to retrieve relevant content based on these inputs. However, current AI search engines remain text-centric and struggle to grasp the image-text relationships that could improve the quality and relevance of search results. This limitation constrains the effectiveness of such engines, particularly in scenarios where visual context is as important as textual content.

Current methods for multimodal search integration remain fragmented. While tools like Google Lens can perform rudimentary image searches, they do not efficiently combine image recognition with comprehensive web data searches. The gap between interpreting visual inputs and connecting them with relevant text-based results limits the overall capability of AI-powered search engines. Moreover, the performance of these tools is further constrained by the need to process multimodal queries in real time. Despite the rapid evolution of LLMs, there is still a need for a search engine that can process both text and images in a unified manner.

A research team from CUHK MMLab, ByteDance, CUHK MiuLar Lab, Shanghai AI Laboratory, Peking University, Stanford University, and SenseTime Research introduced the MMSearch Engine. The tool is designed to let any LLM handle multimodal search queries. Unlike traditional engines, MMSearch uses a structured pipeline that processes text and visual inputs together. The researchers developed this system to improve how LLMs handle the complexities of multimodal data, and with it the accuracy of search results. The MMSearch Engine is built to reformulate user queries, analyze relevant websites, and summarize the most informative responses based on text and images.

The MMSearch Engine is based on a three-step process designed to address the shortcomings of existing tools. First, the engine reformulates the query into a format better suited to search engines. For example, if a query includes an image, MMSearch translates the visual data into a meaningful text query that a text-based search engine can handle. Second, it reranks the websites that the search engine retrieves, prioritizing those that offer the most relevant information. Finally, the system summarizes the content by integrating visual and textual data so that the response covers all aspects of the query. This multi-stage interaction is intended to give a robust search experience to users who need both image-based and text-based results.
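The three steps described above can be sketched as a small pipeline. This is a minimal illustration only, not the researchers' implementation: the `Page` type, the `llm` and `search` callables, the prompts, and every function name here are hypothetical placeholders for whatever multimodal model and web search backend are actually used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Page:
    """A retrieved web page (hypothetical structure for this sketch)."""
    url: str
    text: str


# Assumed interfaces, not a real API:
#   llm(prompt, image) -> text reply from a multimodal model
#   search(query)      -> candidate pages from a text search engine
LLM = Callable[[str, Optional[bytes]], str]
Search = Callable[[str], List[Page]]


def requery(llm: LLM, question: str, image: Optional[bytes]) -> str:
    """Step 1: turn the (possibly image-based) question into a text query."""
    prompt = f"Rewrite as a web search query: {question}"
    return llm(prompt, image).strip()


def rerank(llm: LLM, question: str, pages: List[Page],
           image: Optional[bytes]) -> Page:
    """Step 2: ask the model which retrieved page is most useful."""
    listing = "\n".join(
        f"{i}: {p.url} | {p.text[:80]}" for i, p in enumerate(pages)
    )
    prompt = (
        f"Pick the index of the most useful page for: {question}\n{listing}"
    )
    reply = llm(prompt, image)
    try:
        index = int(reply.strip())
    except ValueError:
        index = 0  # fall back to the first result on an unparsable reply
    if not 0 <= index < len(pages):
        index = 0
    return pages[index]


def summarize(llm: LLM, question: str, page: Page,
              image: Optional[bytes]) -> str:
    """Step 3: answer the question from the selected page and the image."""
    return llm(f"Answer '{question}' using: {page.text}", image)


def mm_search(llm: LLM, search: Search, question: str,
              image: Optional[bytes] = None) -> str:
    """Run requery -> search -> rerank -> summarize in sequence."""
    query = requery(llm, question, image)
    pages = search(query)
    if not pages:
        return "No results found."
    best = rerank(llm, question, pages, image)
    return summarize(llm, question, best, image)
```

Keeping the model and the search backend as plain callables reflects the article's point that the pipeline is meant to wrap any LLM; swapping models would only mean passing a different `llm` function.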

In terms of performance, the MMSearch Engine demonstrates considerable improvements over existing search tools. The researchers evaluated the system on 300 queries spanning 14 subfields, including technology, sports, and finance. MMSearch performed significantly better than Perplexity Pro, a leading commercial AI search engine. For instance, the MMSearch-enhanced version of GPT-4o achieved the highest overall score in multimodal search tasks. It surpassed Perplexity Pro in an end-to-end evaluation, particularly in its ability to handle complex image-based queries. Across the 14 subfields, MMSearch handled over 2,900 unique images, with the retrieved data matched to the query.

The detailed results of the study show that GPT-4o equipped with MMSearch achieved a notable 62.3% overall score in handling multimodal queries. This performance included querying, reranking, and summarizing content based on text and images. The comprehensive dataset, collected from various sources, was designed to exclude any information that could overlap with the LLM’s pre-existing knowledge, ensuring that the evaluation focused purely on the engine’s ability to process new, real-time data. Furthermore, MMSearch outperformed Perplexity Pro in reranking tasks, demonstrating its superior capacity to rank websites based on multimodal content.
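The article does not say how the per-task results (querying, reranking, summarizing, and the end-to-end evaluation) are combined into one overall score. One common way to do this is a weighted average, sketched below. The function name, task names, weights, and scores are all hypothetical and chosen for illustration; they are not figures from the study.

```python
from typing import Dict


def overall_score(task_scores: Dict[str, float],
                  weights: Dict[str, float]) -> float:
    """Weighted average of per-task scores (assumed aggregation scheme)."""
    if set(task_scores) != set(weights):
        raise ValueError("task_scores and weights must cover the same tasks")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(task_scores[t] * weights[t] for t in task_scores)
    return weighted_sum / total_weight


# Illustrative placeholder values only, not results reported by the researchers.
example_scores = {"requery": 40.0, "rerank": 80.0, "summarize": 60.0}
example_weights = {"requery": 1.0, "rerank": 1.0, "summarize": 2.0}
print(overall_score(example_scores, example_weights))
```

Normalizing by the total weight means the weights do not need to sum to one, which makes it easy to experiment with how much each stage should count.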

In conclusion, the MMSearch Engine represents a significant advancement in multimodal search technology. By addressing the limitations of text-only queries and introducing a robust system for handling both textual and visual data, the researchers have provided a tool that could reshape how AI search engines operate. The system’s success in processing over 2,900 images and generating accurate search results across 300 unique queries showcases its potential in academic and commercial settings. Combining image data with advanced LLM capabilities has led to significant performance improvements, positioning MMSearch as a leading solution for the next generation of AI search engines.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
