Computer Vision Category - MarkTechPost
https://www.marktechpost.com/category/technology/artificial-intelligence/computer-vision/
An Artificial Intelligence News Platform

Collective Monte Carlo Tree Search (CoMCTS): A New Learning-to-Reason Method for Multimodal Large Language Models
https://www.marktechpost.com/2024/12/27/collective-monte-carlo-tree-search-comcts-a-new-learning-to-reason-method-for-multimodal-large-language-models/


Multimodal large language models (MLLMs) are advanced systems that process and understand multiple input forms, such as text and images, and reason over these diverse inputs to generate accurate outputs. However, MLLMs often fail at complex tasks because they lack structured processes to break problems into smaller steps, instead producing direct answers without clear intermediate reasoning. These limitations reduce the success and efficiency of MLLMs on intricate problems.

Traditional reasoning methods for multimodal large language models (MLLMs) have many problems. Prompt-based methods, like Chain-of-Thought, use fixed step templates to imitate human reasoning but struggle with difficult tasks. Plan-based methods, like Tree-of-Thought or Graph-of-Thought, try to search for reasoning paths but are neither flexible nor reliable. Learning-based methods, like Monte Carlo Tree Search (MCTS), are slow and do little to encourage deep reasoning. Most MLLMs rely on “direct prediction,” giving short answers without clear steps. Although MCTS works well in games and robotics, it is ill-suited to MLLMs as-is, and collective learning alone does not build strong step-by-step reasoning. These issues make it hard for MLLMs to solve complex problems.

To mitigate these issues, a team of researchers from Nanyang Technological University, Tsinghua University, Baidu, and Sun Yat-sen University proposed CoMCTS, a framework that improves reasoning-path search in tree-search tasks. Instead of relying on one model, it combines multiple pre-trained models to expand and evaluate candidate paths. Several models work together during the search, which improves performance and reduces errors in the reasoning process.

CoMCTS consists of four key steps: Expansion, Simulation, Backpropagation, and Selection. In the Expansion step, several models search for different solutions simultaneously, increasing the variety of candidate answers. In the Simulation step, incorrect or less effective paths are pruned, simplifying the search. During the Backpropagation step, the models improve by learning from their past mistakes and using that knowledge to make better predictions. The final step uses a statistical rule to choose the best action for the model to take. Reflective reasoning in this process helps the model learn from previous errors and make better decisions on similar tasks.
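
A minimal sketch of how these four steps might fit together is given below, assuming a generic model.propose_step(state) interface for each pre-trained model in the pool and a score_fn that prunes weak paths; the node structure, UCB selection rule, and pruning threshold are illustrative choices, not the paper's exact formulation.

    import math

    class Node:
        """One reasoning step in the collective search tree."""
        def __init__(self, state, parent=None):
            self.state = state          # partial reasoning trace so far
            self.parent = parent
            self.children = []
            self.visits = 0
            self.value = 0.0

    def ucb(node, c=1.41):
        # Standard UCB1 score; unvisited nodes are explored first.
        if node.visits == 0:
            return float("inf")
        return node.value / node.visits + c * math.sqrt(
            math.log(node.parent.visits) / node.visits)

    def comcts_round(root, models, score_fn):
        # Selection: walk down the tree by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: every model in the pool proposes a next step.
        for model in models:
            step = model.propose_step(node.state)
            node.children.append(Node(node.state + [step], parent=node))
        # Simulation + pruning: score candidates, drop ineffective paths.
        scored = [(score_fn(child.state), child) for child in node.children]
        node.children = [child for s, child in scored if s > 0.0]
        # Backpropagation: push the best child score back toward the root.
        best = max((s for s, _ in scored), default=0.0)
        while node is not None:
            node.visits += 1
            node.value += best
            node = node.parent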

The researchers created the Mulberry-260K dataset, which comprises 260K multimodal questions combining text instructions and images from various domains, including general multimodal understanding, mathematics, science, and medical image understanding. The dataset was constructed using CoMCTS, with training limited to 15K samples to keep the data from becoming redundant. The reasoning tasks required an average of 7.5 steps, with most tasks falling within the 6 to 8-step range. CoMCTS was implemented using four models: GPT-4o, Qwen2-VL-7B, LLaMA-3.2-11B-Vision-Instruct, and Qwen2-VL-72B. Training used a batch size of 128 and a learning rate of 1e-5 for two epochs.

The results demonstrated significant performance improvements over the baseline models, with gains of +4.2% for Qwen2-VL-7B and +7.5% for LLaMA-3.2-11B-Vision-Instruct. The Mulberry models also outperformed reasoning models such as LLaVA-Reasoner-8B and Insight-V-8B across various benchmarks. In evaluation, CoMCTS improved performance by 63.8%, and including reflective reasoning data led to further slight gains. These results show the effect of Mulberry-260K and CoMCTS on the accuracy and flexibility of reasoning.

In conclusion, CoMCTS improves reasoning in multimodal large language models (MLLMs) by incorporating collective learning into tree-search methods. The framework makes the search for reasoning paths more efficient, as demonstrated by the Mulberry-260K dataset and the Mulberry model, which surpass traditional models on complex reasoning tasks. The proposed methods provide valuable insights for future research and can serve as a basis and baseline for developing more efficient models capable of handling increasingly complex tasks.


Check out the Paper and GitHub Page.

Microsoft and Tsinghua University Researchers Introduce Distilled Decoding: A New Method for Accelerating Image Generation in Autoregressive Models without Quality Loss
https://www.marktechpost.com/2024/12/26/microsoft-and-tsinghua-university-researchers-introduce-distilled-decoding-a-new-method-for-accelerating-image-generation-in-autoregressive-models-without-quality-loss/


Autoregressive (AR) models have changed the field of image generation, setting new benchmarks in producing high-quality visuals. These models break down the image creation process into sequential steps, each token generated based on prior tokens, creating outputs with exceptional realism and coherence. Researchers have widely adopted AR techniques for computer vision, gaming, and digital content creation applications. However, the potential of AR models is often constrained by their inherent inefficiencies, particularly their slow generation process, which remains a significant hurdle in real-time applications.

A critical concern for AR models is speed. The token-by-token generation process is inherently sequential: each new token must wait for its predecessor to complete. This limits scalability and results in high latency during image generation. For instance, generating a 256×256 image with a traditional AR model like LlamaGen requires 256 steps, translating to approximately five seconds on modern GPUs. Such delays hinder deployment in applications that demand instantaneous results. And while AR models excel at output fidelity, they struggle to meet the growing demand for both speed and quality in large-scale implementations.
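
To see where that latency comes from, here is a schematic decoding loop. The model and token layout are stand-ins, but the structure (256 strictly sequential forward passes for a 16×16 grid of image tokens) is exactly what makes AR image generation slow:

    import torch

    @torch.no_grad()
    def autoregressive_generate(model, prompt_tokens, num_tokens=256):
        """Generate image tokens one at a time; each step waits on the last."""
        tokens = prompt_tokens
        for _ in range(num_tokens):            # 256 sequential forward passes
            logits = model(tokens)             # full forward pass per token
            probs = torch.softmax(logits[:, -1], dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            tokens = torch.cat([tokens, next_token], dim=1)
        return tokens[:, prompt_tokens.shape[1]:]   # 16x16 = 256 image tokens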

Efforts to accelerate AR models have yielded various methods, such as predicting multiple tokens simultaneously or adopting masking strategies during generation. These approaches aim to reduce the required steps but often compromise the quality of the generated images. For example, in multi-token generation techniques, the assumption of conditional independence among tokens introduces artifacts, undermining the cohesiveness of the output. Similarly, masking-based methods allow for faster generation by training models to predict specific tokens based on others, but their effectiveness diminishes when generation steps are drastically reduced. These limitations highlight the need for a new approach to enhance AR model efficiency.

Tsinghua University and Microsoft Research researchers have introduced a solution to these challenges: Distilled Decoding (DD). This method builds on flow matching, a deterministic mapping that connects Gaussian noise to the output distribution of pre-trained AR models. Unlike conventional methods, DD does not require access to the original training data of the AR models, making it more practical for deployment. The research demonstrated that DD can transform the generation process from hundreds of steps to as few as one or two while preserving the quality of the output. For example, on ImageNet-256, DD achieved a speed-up of 6.3x for VAR models and an impressive 217.8x for LlamaGen, reducing generation steps from 256 to just one.

The technical foundation of DD is based on its ability to create a deterministic trajectory for token generation. Using flow matching, DD maps noisy inputs to tokens to align their distribution with the pre-trained AR model. During training, the mapping is distilled into a lightweight network that can directly predict the final data sequence from a noise input. This process ensures faster generation and provides flexibility in balancing speed and quality by allowing intermediate steps when needed. Unlike existing methods, DD eliminates the trade-off between speed and fidelity, enabling scalable implementations across diverse tasks.
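
The distillation objective can be pictured as a simple regression: sample noise, roll out the teacher's deterministic flow to its endpoint, and train the student to reach that endpoint in a single forward pass. The sketch below assumes a teacher_rollout function and continuous token embeddings; both are illustrative stand-ins, not the paper's exact training recipe.

    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher_rollout, optimizer,
                          batch=32, seq_len=256, dim=768):
        """One training step: regress the student's one-shot prediction
        onto the endpoint of the teacher's deterministic flow."""
        noise = torch.randn(batch, seq_len, dim)     # Gaussian starting point
        with torch.no_grad():
            target = teacher_rollout(noise)          # slow multi-step mapping
        pred = student(noise)                        # single forward pass
        loss = F.mse_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()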

In experiments, DD demonstrated its advantages over traditional methods. Using VAR-d16 models, DD achieved one-step generation with the FID score rising from 4.19 to 9.96, a modest quality cost for a 6.3x speed-up. For LlamaGen models, reducing steps from 256 to one resulted in an FID of 11.35, compared to 4.11 for the original model, with a remarkable 217.8x speed improvement. DD showed similar efficiency in text-to-image tasks, reducing generation steps from 256 to two while maintaining a comparable FID of 28.95 against 25.70. The results underline DD's ability to drastically increase speed without a significant loss in image quality, a feat unmatched by baseline methods.

Several key takeaways from the research on DD include:

  1. DD reduces generation steps by orders of magnitude, achieving up to 217.8x faster generation than traditional AR models.
  2. Despite the accelerated process, DD maintains acceptable quality levels, with FID score increases remaining within manageable ranges.
  3. DD demonstrated consistent performance across different AR models, including VAR and LlamaGen, regardless of their token sequence definitions or model sizes.
  4. The approach allows users to balance quality and speed by choosing one-step, two-step, or multi-step generation paths based on their requirements.
  5. The method eliminates the need for the original AR model training data, making it feasible for practical applications in scenarios where such data is unavailable.
  6. Due to its efficient distillation approach, DD can potentially impact other domains, such as text-to-image synthesis, language modeling, and image generation.

In conclusion, with the introduction of Distilled Decoding, researchers have successfully addressed the longstanding speed-quality trade-off that has plagued AR generation processes by leveraging flow matching and deterministic mappings. The method accelerates image synthesis by reducing steps drastically and preserves the outputs’ fidelity and scalability. With its robust performance, adaptability, and practical deployment advantages, Distilled Decoding opens new frontiers in real-time applications of AR models. It sets the stage for further innovation in generative modeling.


Check out the Paper and GitHub Page.

CoordTok: A Scalable Video Tokenizer that Learns a Mapping from Co-ordinate-based Representations to the Corresponding Patches of Input Videos
https://www.marktechpost.com/2024/12/25/coordtok-a-scalable-video-tokenizer-that-learns-a-mapping-from-co-ordinate-based-representations-to-the-corresponding-patches-of-input-videos/


Breaking down videos into smaller, meaningful parts for vision models remains challenging, particularly for long videos. Vision models rely on these smaller parts, called tokens, to process and understand video data, but creating these tokens efficiently is difficult. While recent tools achieve better video compression than older methods, they struggle to handle large video datasets effectively. A key issue is their inability to fully exploit temporal coherence, the natural pattern whereby video frames are often similar over short periods, which video codecs exploit for efficient compression. These tools are also computationally expensive to train and limited to short clips, leaving them ineffective at capturing long-range patterns in longer videos.

Current video tokenization methods have high computational costs and struggle to handle long video sequences efficiently. Early approaches used image tokenizers to compress videos frame by frame but ignored the natural continuity between frames, reducing their effectiveness. Later methods introduced spatiotemporal layers, reduced redundancy, and used adaptive encoding, but they still required rebuilding entire video frames during training, which limited them to short clips. Video generation models like autoregressive methods, masked generative transformers, and diffusion models are also limited to short sequences. 

To solve this, researchers from KAIST and UC Berkeley proposed CoordTok, which learns a mapping from coordinate-based representations to the corresponding patches of input videos. Motivated by recent advances in 3D generative models, CoordTok encodes a video into factorized triplane representations and reconstructs patches corresponding to randomly sampled (x, y, t) coordinates. This approach allows large tokenizer models to be trained directly on long videos without requiring excessive resources. The video is divided into space-time patches and processed using transformer layers, with the decoder mapping sampled (x, y, t) coordinates to corresponding pixels. This reduces both memory and computational costs while preserving video quality.
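
The core idea, decoding a patch directly from a queried (x, y, t) coordinate and factorized triplane features, can be sketched as follows. The plane names, nearest-neighbor lookup, and MLP decoder are simplifications for illustration; the actual model processes space-time patches with transformer layers and learned interpolation.

    import torch
    import torch.nn as nn

    def sample_plane(plane, u, v):
        # Nearest-neighbor lookup into one [C, H, W] feature plane;
        # the real model would interpolate.
        C, H, W = plane.shape
        i = (v * (H - 1)).long().clamp(0, H - 1)
        j = (u * (W - 1)).long().clamp(0, W - 1)
        return plane[:, i, j].T                      # [N, C]

    class CoordinatePatchDecoder(nn.Module):
        """Map normalized (x, y, t) coordinates plus factorized triplane
        features to RGB patches. Dimensions are illustrative."""
        def __init__(self, feat_dim=64, patch=8):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(3 * feat_dim, 256), nn.GELU(),
                nn.Linear(256, patch * patch * 3))

        def forward(self, planes, coords):
            # planes: dict of [C, H, W] maps for the xy, xt, yt planes
            # coords: [N, 3] coordinates in [0, 1]
            x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
            feats = torch.cat([
                sample_plane(planes["xy"], x, y),
                sample_plane(planes["xt"], x, t),
                sample_plane(planes["yt"], y, t)], dim=-1)
            return self.mlp(feats)                   # one patch per coordinate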

Building on this, the researchers introduced a hierarchical architecture that captures both local and global features of a video. The hierarchy processes space-time patches with transformer layers to produce factorized triplane representations, making long-duration videos tractable without excessive computational resources.

As a result, CoordTok handles longer videos with far fewer tokens: it encodes a 128-frame video at 128×128 resolution into 1,280 tokens, while baselines require 6,144 or 8,192 tokens for similar reconstruction quality. Reconstruction quality was further improved by fine-tuning with both ℓ2 loss and LPIPS loss, enhancing the accuracy of the reconstructed frames. This combination of strategies reduced memory usage by up to 50% and cut computational costs while maintaining high-quality video reconstruction, with models like CoordTok-L achieving a PSNR of 26.9.
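
The fine-tuning objective pairs a pixel-wise ℓ2 term with a perceptual LPIPS term. A minimal sketch, assuming reconstructed and target frames already scaled to [-1, 1] and an illustrative weight of 0.1 for the perceptual term (the paper's weighting may differ):

    import torch
    import lpips   # pip install lpips; LPIPS perceptual metric of Zhang et al.

    perceptual = lpips.LPIPS(net="vgg")   # frozen VGG feature extractor

    def reconstruction_loss(pred, target, w_lpips=0.1):
        """Combined pixel-wise l2 + perceptual LPIPS objective."""
        l2 = torch.mean((pred - target) ** 2)
        perc = perceptual(pred, target).mean()   # expects [N, 3, H, W] in [-1, 1]
        return l2 + w_lpips * perc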

In conclusion, CoordTok proves to be an efficient video tokenizer that uses coordinate-based representations to reduce computational costs and memory requirements while encoding long videos.

It allows memory-efficient training of video generation models, making it possible to handle long videos with fewer tokens. However, it is less effective on highly dynamic videos, which suggests further improvements such as multiple content planes or adaptive methods. This work can serve as a starting point for future research on scalable video tokenizers and generation, benefiting the comprehension and generation of long videos.


Check out the Paper and Project.

Deep Learning and Vocal Fold Analysis: The Role of the GIRAFE Dataset
https://www.marktechpost.com/2024/12/25/deep-learning-and-vocal-fold-analysis-the-role-of-the-girafe-dataset/


Semantic segmentation of the glottal area from high-speed videoendoscopic (HSV) sequences presents a critical challenge in laryngeal imaging. The field faces a significant shortage of high-quality, annotated datasets for training robust segmentation models. This limitation hinders the development of automatic segmentation technologies and of diagnostic tools such as Facilitative Playbacks (FPs), which are crucial for assessing vibratory dynamics in the vocal folds. The scarcity of extensive datasets makes accurate diagnosis and proper treatment of voice disorders harder for clinicians, leaving a significant gap in both research and clinical practice.

Current techniques for glottal segmentation include classical image processing methods such as active contours and watershed transformations. These generally require considerable manual input and cannot cope with varying illumination conditions or complex glottis-closure scenarios. Deep learning models, although promising, are limited by the need for large, high-quality annotated datasets. Publicly available datasets like BAGLS provide grayscale recordings, but they lack diversity and granularity, which reduces their generalization to complex segmentation tasks. These factors underline the urgent need for a dataset that offers better versatility, richer features, and broader clinical relevance.

Researchers from the University of Brest, University of Patras, and Universidad Politécnica de Madrid introduce the GIRAFE dataset to address the limitations of existing resources. GIRAFE is a robust, comprehensive repository comprising 65 HSV recordings from 50 patients, each meticulously annotated with segmentation masks. Unlike other datasets, GIRAFE offers color HSV recordings, making subtle anatomical and pathological features visually detectable. The resource supports high-resolution assessment with both classical segmentation approaches, such as InP and Loh, and recent deep neural architectures, such as UNet and SwinUnetV2. Beyond segmentation, it also facilitates Facilitative Playbacks, including GAW, GVG, and PVG, the key visualizations through which vibratory patterns of the vocal folds can be studied to better understand phonatory dynamics.

The GIRAFE dataset offers extensive features suitable for a wide variety of research. It comprises 760 expert-validated, annotated frames, enabling proper training and evaluation against accurate segmentation masks, and it supports both traditional image processing techniques, such as InP and Loh, and advanced deep learning architectures. HSV recordings are captured at a high temporal resolution of 4,000 frames per second with a spatial resolution of 256×256 pixels, ensuring detailed analysis of vocal fold dynamics. The dataset is organized into structured directories, including Raw_Data, Seg_FP-Results, and Training, facilitating access and integration into research pipelines. This systematic arrangement, combined with color recordings, makes glottal characteristics easier to inspect and enables the exploration of complex vibratory patterns across a wide range of clinical conditions.
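
A small indexing script illustrates how such a layout might be consumed. The top-level directory names come from the article, but the per-recording folder structure, PNG file format, and mask naming below are assumptions made purely for illustration:

    from pathlib import Path

    def index_girafe(root):
        """Pair raw HSV frames with their annotation masks, keeping only
        the expert-annotated frames (hypothetical file layout)."""
        root = Path(root)
        samples = []
        for recording in sorted((root / "Raw_Data").iterdir()):
            mask_dir = root / "Training" / recording.name
            for frame in sorted(recording.glob("*.png")):
                mask = mask_dir / frame.name
                if mask.exists():
                    samples.append((frame, mask))
        return samples

    pairs = index_girafe("GIRAFE")
    print(f"{len(pairs)} annotated frame/mask pairs")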

The GIRAFE dataset demonstrated its value for advancing segmentation techniques through validation with both traditional approaches and deep learning. Traditional techniques such as the InP method performed well across challenging cases, indicating robustness. Deep learning models like UNet and SwinUnetV2 also performed strongly, with UNet achieving the best segmentation accuracy in simpler conditions. The dataset's diversity, spanning various pathologies, illumination conditions, and anatomical variations, makes it a benchmark resource. These results confirm that the dataset can drive improved development and assessment of segmentation methods and support innovation in clinical laryngeal imaging applications.

The GIRAFE dataset represents an important milestone in the landscape of laryngeal imaging research. With its inclusion of color HSV recordings, diverse annotations, and the integration of both traditional and deep learning methodologies, this dataset addresses the limitations inherent in the current datasets and sets a new benchmark within the domain. This dataset helps further bridge traditional and modern approaches while providing a dependable basis for the advancement of sophisticated segmentation methods and diagnostic instruments. Its contributions can potentially change the examination and management of voice disorders, and thus, it would be a great source for clinicians and researchers alike looking to advance the field of vocal fold dynamics and related diagnostics.


Check out the Paper.

Evaluation Agent: A Multi-Agent AI Framework for Efficient, Dynamic, Multi-Round Evaluation, While Offering Detailed, User-Tailored Analyses
https://www.marktechpost.com/2024/12/23/evaluation-agent-a-multi-agent-ai-framework-for-efficient-dynamic-multi-round-evaluation-while-offering-detailed-user-tailored-analyses/


Visual generative models have advanced significantly in terms of the ability to create high-quality images and videos. These developments, powered by AI, enable applications ranging from content creation to design. However, the capability of these models depends on the evaluation frameworks used to measure their performance, making efficient and accurate assessments a crucial area of focus.

Existing evaluation frameworks for visual generative models are often inefficient, requiring significant computational resources and rigid benchmarking processes. To measure performance, traditional tools rely heavily on large datasets and fixed metrics, such as FID and FVD. These methods lack flexibility and adaptability, often producing simple numerical scores without deeper interpretive insights. This creates a gap between the evaluation process and user-specific requirements, limiting their practicality in real-world applications.
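
For context, FID reduces two large sets of Inception features, one from real images and one from generated images, to Gaussian statistics and reports a single distance between them, which is exactly the kind of bare numerical score the authors criticize. The closed-form distance is standard and can be computed as follows:

    import numpy as np
    from scipy import linalg

    def fid(mu1, sigma1, mu2, sigma2):
        """Frechet Inception Distance between two Gaussians fitted to
        Inception features of real and generated images."""
        diff = mu1 - mu2
        covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
        if np.iscomplexobj(covmean):
            covmean = covmean.real   # strip numerical-noise imaginary parts
        return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))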

Traditional benchmarks like VBench and EvalCrafter focus on specific dimensions such as subject consistency, aesthetic quality, and motion smoothness. However, these methods demand thousands of samples for evaluation, leading to high time costs. For instance, benchmarks like VBench require up to 4,355 samples per evaluation, consuming over 4,000 minutes of computation time. Despite their comprehensiveness, these frameworks struggle to adapt to user-defined criteria, leaving room for improvement in efficiency and flexibility.

Researchers from the Shanghai Artificial Intelligence Laboratory and Nanyang Technological University introduced the Evaluation Agent framework to address these limitations. This innovative solution mimics human-like strategies by conducting dynamic, multi-round evaluations tailored to user-defined criteria. Unlike rigid benchmarks, this approach integrates customizable evaluation tools, making it adaptable and efficient. The Evaluation Agent leverages large language models (LLMs) to power its intelligent planning and dynamic evaluation process.

The Evaluation Agent operates through two stages. The system identifies evaluation dimensions based on user input in the Proposal Stage and dynamically selects test cases. Prompts are generated by the PromptGen Agent, which designs tasks aligned with the user’s query. The Execution Stage involves generating visuals based on these prompts and evaluating them using an extensible toolkit. The framework eliminates redundant test cases and uncovers nuanced model behaviors by dynamically refining its focus. This dual-stage process allows for efficient evaluations while maintaining high accuracy.
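
Schematically, the two stages form a loop in which each round's findings shape the next round's test cases. The sketch below invents minimal interfaces for the planner, the PromptGen agent, the model under test, and the evaluation toolkit; it conveys the control flow, not the framework's actual API:

    def evaluate(user_query, model, planner, prompt_gen, toolkit, max_rounds=5):
        # Proposal Stage: the planner derives evaluation dimensions.
        dimensions = planner.propose_dimensions(user_query)
        findings = []
        for _ in range(max_rounds):
            # PromptGen designs test cases aligned with the query so far.
            prompts = prompt_gen.design(user_query, dimensions, findings)
            # Execution Stage: generate visuals and score them.
            outputs = [model.generate(p) for p in prompts]
            findings.append(toolkit.score(outputs, dimensions))
            # Refine the focus; stop early once conclusions are stable.
            dimensions = planner.refine(dimensions, findings)
            if planner.is_confident(findings):
                break
        return planner.summarize(user_query, findings)   # textual analysis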

The framework significantly outperforms traditional methods in terms of efficiency and adaptability. While benchmarks like VBench require thousands of samples and over 4,000 minutes to complete evaluations, the Evaluation Agent achieves similar accuracy using only 23 samples and 24 minutes per model dimension. Across various dimensions, such as aesthetic quality, spatial relationships, and motion smoothness, the Evaluation Agent demonstrated prediction accuracy comparable to established benchmarks while reducing computational costs by over 90%. For instance, the system evaluated models like VideoCrafter-2.0 with a consistency of up to 100% in multiple dimensions.

The Evaluation Agent achieved remarkable results in its experiments. It adapted to user-specific queries, providing detailed, interpretable results beyond numerical scores. It also supported evaluations across text-to-image (T2I) and text-to-video (T2V) models, highlighting its scalability and versatility. Considerable reductions in evaluation time were observed, from 563 minutes with T2I-CompBench to just 5 minutes for the same task using the Evaluation Agent. This efficiency positions the framework as a superior alternative for evaluating generative models in academic and industrial contexts.

The Evaluation Agent offers a transformative approach to visual generative model evaluation, overcoming the inefficiencies of traditional methods. By combining dynamic, human-like evaluation processes with advanced AI technologies, the framework provides a flexible and accurate solution for assessing diverse model capabilities. The substantial reduction in computational resources and time costs highlights its potential for broad adoption, paving the way for more effective evaluations in generative AI.


Check out the Paper and GitHub Page.

NOVA: A Novel Video Autoregressive Model Without Vector Quantization
https://www.marktechpost.com/2024/12/22/nova-a-novel-video-autoregressive-model-without-vector-quantization/


Autoregressive LLMs are complex neural networks that generate coherent, contextually relevant text through sequential prediction. These LLMs excel at handling large datasets and are very strong at translation, summarization, and conversational AI. However, achieving high quality in vision generation often comes at the cost of increased computational demands, especially for higher resolutions or longer videos. And despite efficient learning in compressed latent spaces, video diffusion models are limited to fixed-length outputs and lack the contextual adaptability of autoregressive models like GPT.

Current autoregressive video generation models face many limitations. Diffusion models excel at text-to-image and text-to-video tasks but rely on fixed-length token sequences, which limits their versatility and scalability in video generation. Autoregressive models typically suffer from vector quantization issues because they transform visual data into discrete token spaces: higher reconstruction quality requires more tokens, and more tokens increase the computational cost. While advancements like VAR and MAR improve image quality and generative modeling, their application to video generation remains constrained by inefficient modeling and difficulty adapting to multi-context scenarios.

To address these issues, researchers from BUPT, ICT-CAS, DLUT, and BAAI proposed NOVA, a non-quantized autoregressive model for video generation. NOVA approaches video generation by predicting frames sequentially over time and spatial token sets within each frame in a flexible order. This model combines time-based and space-based prediction by separating how frames and spatial sets are generated. It uses a pre-trained language model to process text prompts and optical flow to track motion. For time-based prediction, the model applies a block-wise causal masking method, while for space-based prediction, it uses a bidirectional approach to predict sets of tokens. The model introduces scaling and shifting layers to improve stability and uses sine-cosine embeddings for better positioning. It also adds diffusion loss to help predict token probabilities in a continuous space, making training and inference more efficient and improving video quality and scalability.
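
The temporal side of this scheme, bidirectional attention within a frame but causal attention across frames, reduces to a simple block-structured attention mask. A minimal sketch (True marks allowed attention; frame and token counts are arbitrary):

    import torch

    def blockwise_causal_mask(num_frames, tokens_per_frame):
        """Tokens attend bidirectionally inside their own frame but only
        causally to earlier frames."""
        n = num_frames * tokens_per_frame
        frame_id = torch.arange(n) // tokens_per_frame
        # query frame >= key frame: the same frame is fully visible,
        # later frames are hidden.
        return frame_id[:, None] >= frame_id[None, :]

    mask = blockwise_causal_mask(num_frames=4, tokens_per_frame=16)
    print(mask.shape)   # torch.Size([64, 64])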

The researchers trained NOVA on high-quality datasets, starting with 16 million image-text pairs from sources like DataComp, COYO, Unsplash, and JourneyDB, later expanded to 600 million pairs from LAION, DataComp, and COYO. For text-to-video, they used 19 million video-text pairs from Panda70M and other internal datasets, plus 1 million pairs from Pexels; a caption engine based on Emu2-17B generated the descriptions. NOVA's architecture included a spatial AR layer, a denoising MLP block, and a 16-layer encoder-decoder structure for handling spatial and temporal components. The temporal encoder-decoder dimensions ranged from 768 to 1536, and the denoising MLP had three blocks of 1280 dimensions. A pre-trained VAE model captured image features using masking and diffusion schedulers. NOVA was trained on sixteen A100 nodes with the AdamW optimizer, first on text-to-image tasks and then on text-to-video tasks.

Results from evaluations on T2I-CompBench, GenEval, and DPG-Bench showed that NOVA outperformed models like PixArt-α and SD v1/v2 in text-to-image and text-to-video generation tasks. NOVA generated higher-quality images and videos with clearer, more detailed visuals. It also provided more accurate results and better matched the text inputs and the generated outputs. 

In summary, the proposed NOVA model significantly advances text-to-image and text-to-video generation. The method reduces computational complexity and improves efficiency by integrating temporal frame-by-frame and spatial set-by-set predictions with good-quality outputs. Its performance exceeds existing models, with near-commercial image quality and video fidelity. This work provides a foundation for future research, offering a baseline for developing scalable models and real-time video generation and opening up new possibilities for advancements in the field.


Check out the Paper and GitHub Page.

This AI Paper from Microsoft and Oxford Introduce Olympus: A Universal Task Router for Computer Vision Tasks
https://www.marktechpost.com/2024/12/21/this-ai-paper-from-microsoft-and-oxford-introduce-olympus-a-universal-task-router-for-computer-vision-tasks/


Computer vision models have made significant strides in solving individual tasks such as object detection, segmentation, and classification. Complex real-world applications such as autonomous vehicles, security and surveillance, and healthcare and medical imaging require multiple vision tasks. However, each task has its own model architecture and requirements, making efficient management within a unified framework a significant challenge. Current approaches rely on training individual models, making it difficult to scale them to real-world applications that require a combination of those tasks. Researchers at the University of Oxford and Microsoft have devised a novel framework, Olympus, which aims to simplify the handling of diverse vision tasks while enabling more complex workflows and efficient resource utilization.

Traditionally, computer vision approaches rely on task-specific models, each focused on accomplishing one task efficiently. The requirement of a separate model for each task, however, increases the computational burden. Multitask learning models exist but often suffer from poor task balancing, resource inefficiency, and performance degradation on complex or underrepresented tasks. There is therefore a need for a method that resolves these scalability issues, adapts dynamically to new scenarios, and uses resources effectively.

At its heart, the proposed framework, Olympus, is built around a controller, a Multimodal Large Language Model (MLLM), responsible for understanding user instructions and routing them to appropriate specialized modules (a minimal routing sketch follows the feature list below). The key features of Olympus include:

  1. Task-Aware Routing: The controller MLLM analyzes incoming tasks and routes them to the most suitable specialized model, optimizing computational resources. 
  2. Scalable Framework: It can handle up to 20 tasks simultaneously without requiring separate systems and integrate with the existing MLLMs efficiently.
  3. Knowledge Sharing: Different components of Olympus share whatever they have learned with each other, maximizing the output efficiency.  
  4. Chain-of-Action Capability: Olympus can handle multiple vision tasks and is highly adaptable to complex real-world applications. 
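
The minimal routing sketch promised above: the controller is reduced to a function that names a plan of tasks, and a registry maps task names to specialist modules. All names and interfaces here are invented for illustration; Olympus's actual routing tokens and module set differ.

    # Illustrative registry of specialist models keyed by task name.
    ROUTES = {
        "detect": "object_detector",
        "segment": "segmentation_model",
        "caption": "captioning_model",
    }

    def route(instruction, controller, registry):
        """Ask the controller which specialists should handle the
        instruction, then dispatch them in order (chain-of-action)."""
        plan = controller.plan(instruction)      # e.g. ["detect", "segment"]
        result = instruction
        for task in plan:
            module = registry[ROUTES[task]]
            result = module.run(result)          # output feeds the next step
        return result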

Olympus demonstrated impressive performance across various benchmarks. It achieved an average routing efficiency of 94.75% across 20 individual tasks and attained a precision of 91.82% in scenarios requiring multiple tasks to complete an instruction. The modular routing approach enabled the addition of new tasks with minimal retraining, showcasing its scalability and adaptability.

Olympus: A Universal Task Router for Computer Vision Tasks marks a significant leap in computer vision. Its task-aware routing mechanism and modular knowledge-sharing framework address inefficiency and scalability challenges in multitask learning systems. By achieving high routing accuracy, precision in chained-action scenarios, and scalability across diverse vision tasks, Olympus establishes itself as a versatile and efficient tool for a range of applications. While edge-case tasks, latency trade-offs, and real-world validation still need exploration, Olympus paves the way for more integrated and adaptable systems, challenging the traditional task-specific model paradigm. With further development, it could change how complex vision problems are handled across domains, offering a solid base for future computer vision and artificial intelligence work.


Check out the Paper and GitHub Page.

Meta AI Releases Apollo: A New Family of Video-LMMs Large Multimodal Models for Video Understanding
https://www.marktechpost.com/2024/12/16/meta-ai-releases-apollo-a-new-family-of-video-lmms-large-multimodal-models-for-video-understanding/


While multimodal models (LMMs) have advanced significantly for text and image tasks, video-based models remain underdeveloped. Videos are inherently complex, combining spatial and temporal dimensions that demand more from computational resources. Existing methods often adapt image-based approaches directly or rely on uniform frame sampling, which poorly captures motion and temporal patterns. Moreover, training large-scale video models is computationally expensive, making it difficult to explore design choices efficiently.

To tackle these issues, researchers from Meta AI and Stanford developed Apollo, a family of video-focused LMMs designed to push the boundaries of video understanding. Apollo addresses these challenges through thoughtful design decisions, improving efficiency, and setting a new benchmark for tasks like temporal reasoning and video-based question answering.

Meta AI Introduces Apollo: A Family of Scalable Video-LMMs

Meta AI’s Apollo models are designed to process videos up to an hour long while achieving strong performance across key video-language tasks. Apollo comes in three sizes – 1.5B, 3B, and 7B parameters – offering flexibility to accommodate various computational constraints and real-world needs.

Key innovations include:

  • Scaling Consistency: Design choices made on smaller models are shown to transfer effectively to larger ones, reducing the need for large-scale experiments.
  • Frame-Per-Second (fps) Sampling: A more efficient video sampling technique compared to uniform frame sampling, ensuring better temporal consistency.
  • Dual Vision Encoders: Combining SigLIP for spatial understanding with InternVideo2 for temporal reasoning enables a balanced representation of video data.
  • ApolloBench: A curated benchmark suite that reduces redundancy in evaluation while providing detailed insights into model performance.

Technical Highlights and Advantages

The Apollo models are built on a series of well-researched design choices aimed at overcoming the challenges of video-based LMMs:

  1. Frame-Per-Second Sampling: Unlike uniform frame sampling, fps sampling maintains a consistent temporal flow, allowing Apollo to better understand motion, speed, and the sequence of events in videos (see the sketch after this list).
  2. Scaling Consistency: Experiments show that model design choices made on moderately sized models (2B-4B parameters) generalize well to larger models. This approach reduces computational costs while maintaining performance gains.
  3. Dual Vision Encoders: Apollo uses two complementary encoders: SigLIP, which excels at spatial understanding, and InternVideo2, which enhances temporal reasoning. Their combined strengths produce more accurate video representations.
  4. Token Resampling: By using a Perceiver Resampler, Apollo efficiently reduces video tokens without losing information. This allows the models to process long videos without excessive computational overhead.
  5. Optimized Training: Apollo employs a three-stage training process where video encoders are initially fine-tuned on video data before integrating with text and image datasets. This staged approach ensures stable and effective learning.
  6. Multi-Turn Conversations: Apollo models can support interactive, multi-turn conversations grounded in video content, making them ideal for applications like video-based chat systems or content analysis.
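
The difference between the two sampling schemes is easy to state in code. A minimal sketch of fps-based index selection; the function name and defaults are illustrative, not Apollo's implementation:

    def fps_sample_indices(total_frames, native_fps, target_fps):
        """Pick frame indices at a fixed temporal rate, so sampling density
        does not depend on clip length -- unlike uniform sampling, which
        spreads a fixed frame budget over the whole video."""
        step = native_fps / target_fps
        return [round(i * step) for i in range(int(total_frames / step))]

    # A 60 s clip at 30 fps sampled at 2 fps -> 120 frames, 0.5 s apart.
    idx = fps_sample_indices(total_frames=1800, native_fps=30, target_fps=2)
    print(len(idx), idx[:4])   # 120 [0, 15, 30, 45]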

Performance Insights

Apollo’s capabilities are validated through strong results on multiple benchmarks, often outperforming larger models:

  1. Apollo-1.5B:
    • Surpasses models like Phi-3.5-Vision (4.2B) and LongVA-7B.
    • Scores: 60.8 on Video-MME, 63.3 on MLVU, 57.0 on ApolloBench.
  2. Apollo-3B:
    • Competes with and outperforms many 7B models.
    • Scores: 58.4 on Video-MME, 68.7 on MLVU, 62.7 on ApolloBench.
    • Achieves 55.1 on LongVideoBench.
  3. Apollo-7B:
    • Matches and even surpasses models with over 30B parameters, such as Oryx-34B and VILA1.5-40B.
    • Scores: 61.2 on Video-MME, 70.9 on MLVU, 66.3 on ApolloBench.

Benchmark Summary:

  Model        Video-MME  MLVU  ApolloBench
  Apollo-1.5B  60.8       63.3  57.0
  Apollo-3B    58.4       68.7  62.7
  Apollo-7B    61.2       70.9  66.3

(Apollo-3B additionally scores 55.1 on LongVideoBench.)

Conclusion

Apollo marks a significant step forward in video-LMM development. By addressing key challenges such as efficient video sampling and model scalability, Apollo provides a practical and powerful solution for understanding video content. Its ability to outperform larger models highlights the importance of well-researched design and training strategies.

The Apollo family offers practical solutions for real-world applications, from video-based question answering to content analysis and interactive systems. Importantly, Meta AI’s introduction of ApolloBench provides a more streamlined and effective benchmark for evaluating video-LMMs, paving the way for future research.


Check out the Paper, Website, Demo, Code, and Models.

Gaze-LLE: A New AI Model for Gaze Target Estimation Built on Top of a Frozen Visual Foundation Model
https://www.marktechpost.com/2024/12/16/gaze-lle-a-new-ai-model-for-gaze-target-estimation-built-on-top-of-a-frozen-visual-foundation-model/


Accurately predicting where a person is looking in a scene, known as gaze target estimation, represents a significant challenge in AI research. Complex cues such as head orientation and scene context must be integrated to infer gaze direction. Traditional methods for this problem use multi-branch architectures, processing the scene and head features separately before combining them with auxiliary inputs such as depth and pose. However, these methods are computationally intensive, hard to train, and often fail to generalize well across datasets. Overcoming these issues is necessary for progress in applications spanning human behavior understanding, robotics, and assistive technologies.

Existing gaze estimation methods depend heavily on multi-branch pipelines, where separate encoders handle scene and head features, followed by fusion modules that combine these inputs. Many of these models additionally consume signals such as pose and depth obtained from dedicated auxiliary modules. These approaches have several limitations. First, their high computational cost makes real-time use impractical. Second, they generally require large amounts of labeled training data, which is labor-intensive to produce and hard to scale. These constraints limit how well learned representations transfer across environments and datasets when models rely on specialized encoders and supplementary inputs.

To address these issues, researchers from the Georgia Institute of Technology and the University of Illinois Urbana-Champaign introduced Gaze-LLE, a streamlined and efficient framework for gaze target estimation. Gaze-LLE eliminates complex multi-branch architectures by pairing a frozen DINOv2 visual encoder with a minimalist decoder module. The framework uses a unified backbone for feature extraction and an innovative head positional prompting mechanism that conditions the gaze estimate on a specific individual in the scene. Key contributions include a major reduction in trainable parameters, translating into 95% fewer computations compared with traditional methods, and a demonstration that large transformer-based encoders can be repurposed for gaze estimation without complex auxiliary models. The simple, scalable architecture maintains superior performance with minimal adjustment across a range of datasets and tasks.

The architecture of Gaze-LLE comprises two main components. First, a frozen DINOv2 visual encoder extracts robust features from the input image, which are projected into a lower-dimensional space by a linear layer for efficient processing. Second, a lightweight gaze decoder integrates these scene features with a head position embedding that encodes the location of the individual being observed, letting the model focus on the specific source of gaze. The gaze decoder consists of three transformer layers for feature enhancement and produces a gaze heatmap indicating likely gaze targets, plus an in-frame classification that determines whether the gaze falls within the observable frame. Scaling and shifting layers improve stability, and sine-cosine embeddings provide positional information. Training uses a straightforward objective, a pixel-wise binary cross-entropy loss, avoiding complex multitask formulations. Evaluation covered the GazeFollow, VideoAttentionTarget, and ChildPlay benchmarks.
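
An illustrative skeleton of this decoder is shown below; the token count, feature width, and the way the head prompt is injected are guesses for the sketch, while the frozen backbone features, three transformer layers, heatmap head, and in-frame head follow the description above.

    import torch
    import torch.nn as nn

    class GazeDecoderSketch(nn.Module):
        """Frozen-backbone features + head-position prompt -> three
        transformer layers -> gaze heatmap and in-frame logit."""
        def __init__(self, d=256):
            super().__init__()
            self.head_embed = nn.Parameter(torch.zeros(1, 1, d))
            layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, num_layers=3)
            self.to_heatmap = nn.Linear(d, 1)
            self.in_frame = nn.Linear(d, 1)

        def forward(self, scene_tokens, head_mask):
            # scene_tokens: [B, N, d] projected (frozen) DINOv2 features
            # head_mask:    [B, N, 1] soft indicator of the head location
            x = scene_tokens + head_mask * self.head_embed   # positional prompt
            x = self.blocks(x)
            heatmap = self.to_heatmap(x).squeeze(-1)   # [B, N] gaze logits
            in_frame = self.in_frame(x.mean(dim=1))    # [B, 1] in-frame logit
            return heatmap, in_frame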

Gaze-LLE achieves state-of-the-art performance across multiple benchmarks with significantly fewer parameters and faster training. On GazeFollow, it yields an AUC of 0.958 and an average L2 error of 0.099, besting prior methods in both precision and computational efficiency. Training is notably efficient: the model converges in under 1.5 GPU hours, significantly outperforming traditional multi-branch architectures. Gaze-LLE also generalizes strongly, retaining high performance on datasets such as ChildPlay and GOO-Real even without fine-tuning. These results show that frozen foundation models in an optimized architecture can support accurate and flexible gaze estimation.

In summary, Gaze-LLE redefines gaze target estimation with a streamlined framework built on a frozen foundational visual encoder and an innovative head positional prompting mechanism. By dispensing with the intricacies of multi-branch architectures, it achieves higher accuracy, better efficiency, and scalability. Its ability to generalize across various datasets makes it promising for research on human behavior and related fields, and it sets a new benchmark for the advancement of gaze estimation research.


Check out the Paper. All credit for this research goes to the researchers of this project.


The post Gaze-LLE: A New AI Model for Gaze Target Estimation Built on Top of a Frozen Visual Foundation Model appeared first on MarkTechPost.

Microsoft AI Research Introduces OLA-VLM: A Vision-Centric Approach to Optimizing Multimodal Large Language Models https://www.marktechpost.com/2024/12/16/microsoft-ai-research-introduces-ola-vlm-a-vision-centric-approach-to-optimizing-multimodal-large-language-models/ https://www.marktechpost.com/2024/12/16/microsoft-ai-research-introduces-ola-vlm-a-vision-centric-approach-to-optimizing-multimodal-large-language-models/#respond Mon, 16 Dec 2024 18:43:05 +0000 https://www.marktechpost.com/?p=66426

Multimodal large language models (MLLMs) are advancing rapidly, enabling machines to interpret and reason about textual and visual data simultaneously. These models have transformative applications in image analysis, visual question answering, and multimodal reasoning. By bridging the gap between vision and language, they play a crucial role in improving artificial intelligence’s ability to understand and interact with the world holistically.

Despite their promise, these systems face significant challenges. A core limitation is their reliance on natural-language supervision during training, which often results in suboptimal visual representation quality. While increasing dataset size and computational scale has yielded modest improvements, these models require more targeted optimization of visual understanding to perform well on vision-based tasks, and current methods frequently struggle to balance computational efficiency with improved performance.

Existing techniques for training MLLMs typically use visual encoders to extract features from images and feed them into the language model alongside natural-language data. Some methods employ multiple visual encoders or cross-attention mechanisms to enhance understanding, but these additions come at the cost of significantly higher data and computation requirements, limiting scalability and practicality. This inefficiency underscores the need for a more effective way to optimize MLLMs for visual comprehension.
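As a reference point, the standard recipe described above can be sketched in a few lines of PyTorch. The dimensions and the simple prepend-the-tokens design are illustrative assumptions rather than any particular model's code.

```python
import torch
import torch.nn as nn

class VisualTokenProjector(nn.Module):
    """Sketch of the standard MLLM recipe: visual encoder features are
    projected into the language model's embedding space and prepended to
    the text embeddings. Dimensions are illustrative."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats, text_embeds):
        # vis_feats: (B, N_img, vis_dim); text_embeds: (B, N_txt, llm_dim)
        vis_tokens = self.proj(vis_feats)
        # the combined sequence is consumed by the LLM for next-token prediction
        return torch.cat([vis_tokens, text_embeds], dim=1)
```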

Researchers at SHI Labs at Georgia Tech and Microsoft Research introduced a novel approach called OLA-VLM to address these challenges. The method aims to improve MLLMs by distilling auxiliary visual information into their hidden layers during pretraining. Instead of increasing visual encoder complexity, OLA-VLM leverages embedding optimization to enhance the alignment of visual and textual data. Introducing this optimization into intermediate layers of the language model ensures better visual reasoning without additional computational overhead during inference.

OLA-VLM works by adding embedding losses that distill representations from specialized visual encoders, trained for image segmentation, depth estimation, and image generation, into the language model. The distilled features are mapped to specific layers of the language model through predictive embedding optimization. In addition, special task-specific tokens are appended to the input sequence, allowing the model to incorporate auxiliary visual information seamlessly. This design integrates visual features into the MLLM’s representations without disrupting the primary training objective of next-token prediction. The result is a model that learns more robust, vision-centric representations.
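A rough sketch of how such an auxiliary embedding loss could be wired up is shown below; it is a hedged illustration of the idea, with hypothetical names and dimensions, not OLA-VLM's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingDistillLoss(nn.Module):
    """Hypothetical sketch of predictive embedding optimization: a small
    projector regresses an intermediate LLM hidden state onto features from
    a frozen task-specific encoder (e.g., a depth or segmentation model)."""
    def __init__(self, hidden_dim=4096, target_dim=1024):
        super().__init__()
        self.probe = nn.Sequential(
            nn.Linear(hidden_dim, target_dim),
            nn.GELU(),
            nn.Linear(target_dim, target_dim),
        )

    def forward(self, hidden_states, target_feats):
        # hidden_states: (B, T, hidden_dim) from a chosen intermediate layer
        # target_feats:  (B, T, target_dim) from the frozen specialist encoder
        pred = self.probe(hidden_states)
        # cosine-style embedding loss, added to the next-token objective
        return 1.0 - F.cosine_similarity(pred, target_feats, dim=-1).mean()
```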

The performance of OLA-VLM was rigorously tested on various benchmarks, showing substantial improvements over existing single- and multi-encoder models. On CV-Bench, a vision-centric benchmark suite, OLA-VLM outperformed the LLaVA-1.5 baseline by up to 8.7% on depth estimation tasks, achieving an accuracy of 77.8%. For segmentation tasks, it achieved a mean Intersection over Union (mIoU) score of 45.4%, a significant improvement over the baseline’s 39.3%. The model also demonstrated consistent gains across 2D and 3D vision tasks, with an average improvement of up to 2.5% on benchmarks covering distance and relation reasoning. OLA-VLM achieved these results using only a single visual encoder during inference, making it far more efficient than multi-encoder systems.
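For readers unfamiliar with the segmentation metric, mIoU averages the per-class overlap between predicted and ground-truth masks. A simplified illustrative implementation (real benchmarks handle ignore labels and class-presence rules that this sketch skips):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection over Union across classes for segmentation maps.
    pred, gt: integer class-label arrays of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```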

To further validate its effectiveness, the researchers analyzed the representations learned by OLA-VLM. Probing experiments revealed superior visual feature alignment in the model’s intermediate layers, and this alignment translated into better downstream performance across tasks. The researchers noted, for instance, that integrating special task-specific tokens during training helped optimize features for depth, segmentation, and image generation. These results underscored the efficiency of the predictive embedding optimization approach, demonstrating that it can balance high-quality visual understanding with computational efficiency.

OLA-VLM establishes a new standard for integrating visual information into MLLMs by focusing on embedding optimization during pretraining. This research addresses the gap in current training methods by introducing a vision-centric perspective to improve the quality of visual representations. The proposed approach enhances performance on vision-language tasks and achieves this with fewer computational resources compared to existing methods. OLA-VLM exemplifies how targeted optimization during pretraining can substantially improve multimodal model performance.

In conclusion, the research conducted by SHI Labs and Microsoft Research highlights a groundbreaking advancement in multimodal AI. By optimizing visual representations within MLLMs, OLA-VLM bridges a critical gap in performance and efficiency. This method demonstrates how embedding optimization can effectively address challenges in vision-language alignment, paving the way for more robust and scalable multimodal systems in the future.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


The post Microsoft AI Research Introduces OLA-VLM: A Vision-Centric Approach to Optimizing Multimodal Large Language Models appeared first on MarkTechPost.
