Divyesh Vitthal Jawkhede, Author at MarkTechPost (An Artificial Intelligence News Platform)

Collective Monte Carlo Tree Search (CoMCTS): A New Learning-to-Reason Method for Multimodal Large Language Models

Multimodal large language models (MLLMs) are advanced systems that process and understand multiple input forms, such as text and images. By interpreting these diverse inputs, they aim to reason through tasks and generate accurate outputs. However, MLLMs often fail at complex tasks because they lack structured processes to break problems into smaller steps and instead provide direct answers without clear intermediate reasoning. These limitations reduce the success and efficiency of MLLMs in solving intricate problems.

Traditional methods for reasoning in multimodal large language models (MLLMs) have many problems. Prompt-based methods, like Chain-of-Thought, use fixed steps to mimic human reasoning but struggle with difficult tasks. Plan-based methods, like Tree-of-Thought or Graph-of-Thought, search for reasoning paths but are not flexible or reliable. Learning-based methods, like Monte Carlo Tree Search (MCTS), are slow and do not encourage deep reasoning. Most MLLMs rely on “direct prediction,” giving short answers without clear intermediate steps. Although MCTS works well in games and robotics, it is not well suited to MLLMs, and existing approaches do not use collective learning to build strong step-by-step reasoning. These issues make it hard for MLLMs to solve complex problems.

To mitigate these issues, a team of researchers from Nanyang Technological University, Tsinghua University, Baidu, and Sun Yat-sen University proposed CoMCTS, a framework that improves reasoning-path search in tree search tasks. Instead of relying on a single model, it combines multiple pre-trained models to expand and evaluate candidate paths. This approach differs from traditional methods because it uses a more efficient strategy: several models work together, allowing for better performance and fewer errors during the reasoning process.

CoMCTS consists of four key steps: Expansion, Simulation, Backpropagation, and Selection. In the Expansion step, several models search for different solutions simultaneously, increasing the variety of candidate answers. In the Simulation step, incorrect or less effective paths are removed, simplifying the search. During the Backpropagation step, the models improve by learning from their past mistakes and using that knowledge to make better predictions. The final step uses a statistical method to choose the best action for the model to take. Reflective reasoning in this process helps the model learn from previous errors and make better decisions in similar tasks.
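
To make the collective search concrete, the sketch below implements a stripped-down version of this loop in Python. The node structure, UCB selection rule, and the placeholder policy models and scoring function are illustrative assumptions for exposition only; they are not the authors' implementation, which uses several pre-trained MLLMs to propose and score reasoning steps.

```python
# Illustrative sketch of a collective tree search over reasoning steps.
# The "models" here are placeholder callables; CoMCTS itself uses several
# pre-trained MLLMs to propose and evaluate candidate steps.
import math
import random

class Node:
    def __init__(self, step, parent=None):
        self.step = step            # reasoning text accumulated so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # accumulated score from simulations

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def collective_search(question, policy_models, score_fn, iterations=50):
    root = Node(step=question)
    for _ in range(iterations):
        # Selection: walk down by UCB until a leaf is reached.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: every model in the collective proposes a next step.
        for model in policy_models:
            node.children.append(Node(model(node.step), parent=node))
        # Simulation / filtering: score candidates and prune weak ones.
        node.children = [c for c in node.children if score_fn(c.step) > 0.0]
        # Backpropagation: push the best child's score up the tree.
        if node.children:
            best = max(node.children, key=lambda c: score_fn(c.step))
            reward, cur = score_fn(best.step), best
            while cur is not None:
                cur.visits += 1
                cur.value += reward
                cur = cur.parent
    return root

# Toy usage with dummy "models" and a random scorer.
models = [lambda s: s + " -> step A", lambda s: s + " -> step B"]
root = collective_search("Q: 2+2?", models, lambda s: random.random(), iterations=10)
print(len(root.children))
```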

The researchers created the Mulberry-260K dataset, which comprises 260K multimodal questions combining text instructions and images from various domains, including general multimodal understanding, mathematics, science, and medical image understanding. The dataset was constructed using CoMCTS, with training limited to 15K samples to avoid overabundance. The reasoning tasks required an average of 7.5 steps, with most tasks falling within the 6 to 8-step range. CoMCTS was implemented using four models: GPT-4o, Qwen2-VL-7B, LLaMA-3.2-11B-Vision-Instruct, and Qwen2-VL-72B. Training used a batch size of 128 and a learning rate of 1e-5 for two epochs.

The results demonstrated significant performance improvements over the baseline models, with gains of +4.2% and +7.5% for Qwen2-VL-7B and LLaMA-3.2-11B-Vision-Instruct, respectively. Additionally, the Mulberry models, trained on Mulberry-260K, outperformed reasoning models like LLaVA-Reasoner-8B and Insight-V-8B on various benchmarks. Upon evaluation, CoMCTS improved performance by 63.8%. Including reflective reasoning data led to further, smaller improvements in model performance. These results show the effectiveness of Mulberry-260K and CoMCTS in improving the accuracy and flexibility of reasoning.

In conclusion, CoMCTS improves reasoning in multimodal large language models (MLLMs) by incorporating collective learning into tree search methods. The framework makes the search for reasoning paths more efficient, as demonstrated by the Mulberry-260K dataset and the Mulberry model, which surpasses traditional models on complex reasoning tasks. The proposed methods provide valuable insights for future research, can serve as a basis for advancing MLLMs, and can act as a baseline for developing more efficient models capable of handling increasingly complex tasks.


CoordTok: A Scalable Video Tokenizer that Learns a Mapping from Co-ordinate-based Representations to the Corresponding Patches of Input Videos

Breaking down videos into smaller, meaningful parts for vision models remains challenging, particularly for long videos. Vision models rely on these smaller parts, called tokens, to process and understand video data, but creating these tokens efficiently is difficult. While recent tools achieve better video compression than older methods, they struggle to handle large video datasets effectively. A key issue is their inability to fully exploit temporal coherence, the natural pattern in which video frames are often similar over short periods, which video codecs rely on for efficient compression. These tools are also computationally expensive to train and are limited to short clips, making them ineffective at capturing long-range patterns and processing longer videos.

Current video tokenization methods have high computational costs and struggle to handle long video sequences efficiently. Early approaches used image tokenizers to compress videos frame by frame but ignored the natural continuity between frames, reducing their effectiveness. Later methods introduced spatiotemporal layers, reduced redundancy, and used adaptive encoding, but they still required rebuilding entire video frames during training, which limited them to short clips. Video generation models like autoregressive methods, masked generative transformers, and diffusion models are also limited to short sequences. 

To solve this, researchers from KAIST and UC Berkeley proposed CoordTok, which learns a mapping from coordinate-based representations to the corresponding patches of input videos. Motivated by recent advances in 3D generative models, CoordTok encodes a video into factorized triplane representations and reconstructs patches corresponding to randomly sampled (x, y, t) coordinates. This approach allows large tokenizer models to be trained directly on long videos without requiring excessive resources. The video is divided into space-time patches and processed using transformer layers, with the decoder mapping sampled (x, y, t) coordinates to corresponding pixels. This reduces both memory and computational costs while preserving video quality.
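
The core idea of querying a factorized triplane at sampled coordinates can be illustrated with a toy decoder. Everything below (the plane resolution, feature dimension, patch size, and the MLP head) is an assumed, simplified stand-in for the paper's architecture; it only shows how random (x, y, t) samples can be decoded into patches without reconstructing whole frames.

```python
# Illustrative sketch of coordinate-based decoding from factorized triplane
# features: sample random (x, y, t) coordinates and reconstruct only the
# patches at those coordinates. Shapes and layers are toy placeholders.
import torch
import torch.nn as nn

class TriplaneDecoder(nn.Module):
    def __init__(self, feat_dim=64, patch_pixels=8 * 8 * 3):
        super().__init__()
        # Three learned planes: (x, y), (x, t), (y, t); toy resolution 16.
        self.xy = nn.Parameter(torch.randn(feat_dim, 16, 16))
        self.xt = nn.Parameter(torch.randn(feat_dim, 16, 16))
        self.yt = nn.Parameter(torch.randn(feat_dim, 16, 16))
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, 128), nn.ReLU(), nn.Linear(128, patch_pixels)
        )

    def forward(self, coords):
        # coords: (N, 3) integer grid indices in [0, 16).
        x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
        feats = torch.cat(
            [self.xy[:, x, y].T, self.xt[:, x, t].T, self.yt[:, y, t].T], dim=-1
        )
        return self.mlp(feats)  # (N, patch_pixels) reconstructed patches

decoder = TriplaneDecoder()
coords = torch.randint(0, 16, (32, 3))   # 32 random (x, y, t) samples
patches = decoder(coords)
print(patches.shape)                      # torch.Size([32, 192])
```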

Building on this, the researchers equipped CoordTok with a hierarchical architecture that captured both local and global features of the video. The architecture encoded space-time patches into a factorized triplane representation, making long-duration video processing easier without excessive use of computational resources. This greatly reduced memory and computation requirements while maintaining high video quality.

This hierarchical structure allowed the model to process space-time patches more efficiently with transformer layers, which produced the factorized triplane representations. As a result, CoordTok handled longer videos without demanding excessive computational resources. For example, CoordTok encoded a 128-frame video at 128×128 resolution into 1,280 tokens, while baselines required 6,144 or 8,192 tokens to achieve similar reconstruction quality. Reconstruction quality was further improved by fine-tuning with both an ℓ2 loss and an LPIPS loss, enhancing the accuracy of the reconstructed frames. This combination of strategies reduced memory usage by up to 50% and lowered computational costs while maintaining high-quality video reconstruction, with models like CoordTok-L achieving a PSNR of 26.9.

In conclusion, CoordTok proves to be an efficient video tokenizer that uses coordinate-based representations to reduce computational costs and memory requirements while encoding long videos. It enables memory-efficient training of video generation models, making it possible to handle long videos with far fewer tokens. However, it struggles with highly dynamic videos, and the authors suggest further improvements, such as using multiple content planes or adaptive methods. This work can serve as a starting point for future research on scalable video tokenizers and generation, benefiting both the understanding and the generation of long videos.


NOVA: A Novel Video Autoregressive Model Without Vector Quantization

Autoregressive LLMs are complex neural networks that generate coherent and contextually relevant text through sequential prediction. These LLMs excel at handling large datasets and are very strong at translation, summarization, and conversational AI. However, achieving high quality in vision generation often comes at the cost of increased computational demands, especially for higher resolutions or longer videos. Despite learning efficiently in compressed latent spaces, video diffusion models are limited to fixed-length outputs and lack the contextual adaptability of autoregressive models like GPT.

Current autoregressive video generation models face many limitations. Diffusion models excel at text-to-image and text-to-video tasks but rely on fixed-length token sequences, which limits their versatility and scalability in video generation. Autoregressive models typically suffer from vector quantization issues because they transform visual data into discrete token spaces: higher reconstruction quality requires more tokens, and using more tokens increases the computational cost. While advancements like VAR and MAR improve image quality and generative modeling, their application to video generation remains constrained by inefficiencies in modeling and challenges in adapting to multi-context scenarios.

To address these issues, researchers from BUPT, ICT-CAS, DLUT, and BAAI proposed NOVA, a non-quantized autoregressive model for video generation. NOVA approaches video generation by predicting frames sequentially over time and spatial token sets within each frame in a flexible order. This model combines time-based and space-based prediction by separating how frames and spatial sets are generated. It uses a pre-trained language model to process text prompts and optical flow to track motion. For time-based prediction, the model applies a block-wise causal masking method, while for space-based prediction, it uses a bidirectional approach to predict sets of tokens. The model introduces scaling and shifting layers to improve stability and uses sine-cosine embeddings for better positioning. It also adds diffusion loss to help predict token probabilities in a continuous space, making training and inference more efficient and improving video quality and scalability.
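
The block-wise causal masking idea can be sketched as an attention mask in which tokens see every token of their own frame but only tokens of earlier frames. The frame and token counts below are toy values, and the real model's masking details may differ.

```python
# Illustrative sketch of a block-wise causal attention mask: tokens attend
# bidirectionally to all tokens within their own frame, but only causally to
# tokens of earlier frames. Frame/token counts are arbitrary toy values.
import torch

def blockwise_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame   # frame index per token
    # mask[i, j] is True where token i may attend to token j.
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

mask = blockwise_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.int())
# Within a frame the 2x2 block is fully visible (bidirectional); later frames
# are hidden from earlier ones (causal across frames).
```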

The researchers trained NOVA on high-quality datasets, starting with 16 million image-text pairs from sources like DataComp, COYO, Unsplash, and JourneyDB, later expanded to 600 million pairs from LAION, DataComp, and COYO. For text-to-video, they used 19 million video-text pairs from Panda70M and other internal datasets, plus 1 million pairs from Pexels; a caption engine based on Emu2-17B generated the descriptions. NOVA’s architecture included a spatial AR layer, a denoising MLP block, and a 16-layer encoder-decoder structure for handling spatial and temporal components. The temporal encoder-decoder dimensions ranged from 768 to 1536, and the denoising MLP had three blocks with 1280 dimensions. A pre-trained VAE model captured image features using masking and diffusion schedulers. NOVA was trained on sixteen A100 nodes with the AdamW optimizer, first on text-to-image tasks and then on text-to-video tasks.

Results from evaluations on T2I-CompBench, GenEval, and DPG-Bench showed that NOVA outperformed models like PixArt-α and SD v1/v2 on text-to-image and text-to-video generation tasks. NOVA generated higher-quality images and videos with clearer, more detailed visuals, and its outputs matched the text prompts more accurately.

In summary, the proposed NOVA model significantly advances text-to-image and text-to-video generation. The method reduces computational complexity and improves efficiency by integrating temporal frame-by-frame and spatial set-by-set prediction while maintaining high-quality outputs. Its performance exceeds existing models, with near-commercial image quality and video fidelity. This work provides a foundation for future research, offering a baseline for developing scalable models and real-time video generation and opening up new possibilities for advancements in the field.


Mix-LN: A Hybrid Normalization Technique that Combines the Strengths of both Pre-Layer Normalization and Post-Layer Normalization

Large Language Models (LLMs) are highly promising in Artificial Intelligence. However, despite training on large datasets covering various languages and topics, their ability to understand and generate text is sometimes overstated. LLM applications across multiple domains have shown limited impact on improving human-computer interactions or creating innovative solutions. One reason is that the deep layers of LLMs contribute little and, if removed, do not affect performance. This underutilization of deep layers indicates inefficiency within the models.

Prior analyses showed that the deeper layers of LLMs contribute little to their performance. Although layer normalization techniques like Pre-LN and Post-LN are used to stabilize training, both have significant limitations: Pre-LN reduces the magnitude of gradients in deeper layers, limiting their effectiveness, while Post-LN causes gradients to vanish in earlier layers. Despite efforts to address these issues through dynamic linear combinations and Adaptive Model Initialization, these techniques do not fully optimize LLM performance.

To address this issue, researchers from the Dalian University of Technology, the University of Surrey, the Eindhoven University of Technology, and the University of Oxford proposed Mix-LN, a normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers, producing more uniform gradients so that both shallow and deep layers contribute effectively to training. The researchers first tested the hypothesis that deeper layers in LLMs are inefficient because of Pre-LN. The main difference between the Post-LN and Pre-LN architectures is where layer normalization (LN) is placed: in Post-LN, LN is applied after the residual addition, while in Pre-LN, it is applied before.
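
A minimal sketch of this placement rule is shown below: the first fraction of blocks uses Post-LN and the rest use Pre-LN. The sublayer contents, dimensions, and the α = 0.25 cutoff (taken from the experiments reported below) are illustrative simplifications, not the paper's actual architecture.

```python
# Illustrative sketch of the Mix-LN placement rule: the first alpha fraction of
# transformer blocks uses Post-LN (normalize after the residual addition), the
# rest use Pre-LN (normalize before the sublayer). Sublayer internals are toy.
import torch
import torch.nn as nn

class MixLNBlock(nn.Module):
    def __init__(self, dim: int, post_ln: bool):
        super().__init__()
        self.post_ln = post_ln
        self.norm = nn.LayerNorm(dim)
        self.sublayer = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        if self.post_ln:
            return self.norm(x + self.sublayer(x))   # Post-LN
        return x + self.sublayer(self.norm(x))        # Pre-LN

def build_mix_ln_stack(num_layers=12, dim=64, alpha=0.25):
    cutoff = int(alpha * num_layers)                  # e.g. 3 of 12 layers
    return nn.Sequential(
        *[MixLNBlock(dim, post_ln=(i < cutoff)) for i in range(num_layers)]
    )

model = build_mix_ln_stack()
out = model(torch.randn(2, 16, 64))                   # (batch, seq, dim)
print(out.shape)
```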

The researchers compared Pre-LN and Post-LN models across large-scale open-weight and small-scale in-house LLMs, using metrics such as angular distance and performance drop after pruning to assess layer effectiveness. In BERT-Large (Post-LN), early layers were less effective than deeper layers. In LLaMA2-7B (Pre-LN), deeper layers were less effective, and pruning them had minimal impact on performance. Similar trends appeared in LLaMA-130M, where Pre-LN layers were less effective at deeper levels while Post-LN maintained better performance in deeper layers. These results suggested that Pre-LN causes the inefficiency of deeper layers.

The optimal Post-LN ratio α for Mix-LN was determined through experiments with LLaMA-1B on the C4 dataset. The best performance occurred at α = 0.25, where perplexity was lowest; at other ratios, performance declined but generally remained above that of pure Pre-LN. Mix-LN also supported a broader range of representations and maintained a healthier gradient norm, allowing deeper layers to contribute effectively. Mix-LN achieved significantly lower perplexity scores, outperforming other normalization methods.

In conclusion, the researchers identified inefficiencies caused by Pre-LN in deep layers of large language models (LLMs) and proposed Mix-LN as a solution. Experiments showed that Mix-LN outperformed both Pre-LN and Post-LN, improving model performance during pre-training and fine-tuning without increasing model size. This approach can act as a baseline for future research, offering a foundation for further enhancements in training deep models and advancing model efficiency and capacity.


Apple Researchers Introduce ARMADA: An AI System for Augmenting Apple Vision Pro with Real-Time Virtual Robot Feedback

Imitation learning (IL) is a method in robotics where robots are trained to mimic human actions from expert demonstrations. It relies on supervised machine learning and requires significant human-generated data to guide the robot’s behavior. Although effective for complex tasks, imitation learning is limited by the lack of large-scale datasets and by the difficulty of scaling data collection, unlike language and vision models. Learning from human video demonstrations faces significant challenges because robots cannot match the sensitivity and flexibility of human hands. These differences make it hard for imitation learning to work effectively or scale up to general robot tasks.

Traditional imitation learning relied on human-operated robots, which were effective but had significant limitations. These systems are based on teleoperation via gloves, motion capture, and VR devices, and they depend on complex setups and low-latency control loops. They also require physical robots and special-purpose hardware, which is difficult to scale. Although robots could perform tasks such as inserting batteries or tying shoelaces using expert data collected by these approaches, the need for special equipment makes them impractical for large-scale or more general use.

To solve this, a group of researchers from Apple and the University of Colorado Boulder proposed the ARMADA system, which integrates the Apple Vision Pro headset with external robot control through a combination of ROS and WebSockets. This setup enables communication between the devices, making the system plug-and-play and adaptable to many robot platforms, such as Franka and UR5, by only replacing 3D model files and data formatting for the headset. The ARMADA app handles robot visualization, data storage, and the user interface, receiving transformation frames for robot links, capturing image frames from cameras, and tracking human skeleton data for processing. The robot node manages control, data storage, and constraint calculation, transforming skeletal data into robot commands and detecting workspace violations, singularities, and speed issues for real-time feedback.

The robot’s movements were aligned with human wrist and finger positions, tracked through ARKit on visionOS 2.0, using inverse kinematics to calculate joint positions and controlling the gripper based on finger spacing. Constraints such as singularities, workspace limits, and speed violations were visualized through color changes, virtual boundaries, or on-screen text. The researchers used ARMADA to perform three tasks: picking a tissue from a box, placing a toy into a cardboard box, and wiping a table with both hands. Each task had five starting states, and success was based on specific criteria. Wearing Apple Vision Pro running the ARMADA software on visionOS 2.0, participants provided 45 demonstrations under three feedback conditions: No Feedback, Feedback, and Post Feedback. Wrist and finger movements were tracked in real time using ARKit, and robot movements were controlled via inverse kinematics, with joint trajectories recorded for replay.
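
The sketch below illustrates the kind of feedback logic described above: a tracked wrist pose and finger gap are mapped to a robot command, and workspace or speed violations are flagged for visualization. The IK solver, thresholds, and robot interface here are placeholders, not Apple's ARMADA code or the ARKit/ROS APIs.

```python
# Illustrative sketch: map a tracked wrist position and finger spacing to a
# robot command, then flag workspace and speed violations so they can be
# visualized to the user. All thresholds and the fake IK solver are assumed
# values for illustration only.
import numpy as np

WORKSPACE_MIN = np.array([-0.5, -0.5, 0.0])   # assumed reachable box (meters)
WORKSPACE_MAX = np.array([0.5, 0.5, 0.8])
MAX_SPEED = 1.0                               # assumed speed limit (m/s)

def fake_inverse_kinematics(wrist_pos):
    # Placeholder: a real system solves IK for the arm's joint angles.
    return np.tanh(wrist_pos)                 # pretend these are joint angles

def step(wrist_pos, prev_pos, finger_gap, dt=1 / 30):
    joints = fake_inverse_kinematics(wrist_pos)
    gripper_closed = finger_gap < 0.03        # close gripper when fingers pinch
    violations = []
    if np.any(wrist_pos < WORKSPACE_MIN) or np.any(wrist_pos > WORKSPACE_MAX):
        violations.append("workspace")        # would be shown as a red boundary
    if np.linalg.norm(wrist_pos - prev_pos) / dt > MAX_SPEED:
        violations.append("speed")            # would be shown as on-screen text
    return joints, gripper_closed, violations

joints, grip, viol = step(np.array([0.7, 0.1, 0.4]), np.array([0.1, 0.1, 0.4]), 0.02)
print(grip, viol)   # True ['workspace', 'speed']
```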

Upon evaluation, the results showed that feedback visualization significantly improved replay success rates for tasks like Pick Tissue, Declutter, and Bimanual Wipe, with gains of up to 85% compared to no feedback. Post-feedback demonstrations also showed improvements but were less effective than real-time feedback. Participants found the feedback intuitive and useful for understanding robot motion, and the system worked well for users with varying experience levels. Common failure modes without feedback included imprecise robot poses and gripper issues. Participants adjusted their behavior during demonstrations, slowing down and changing hand positions, and could visualize feedback after removing it.

In summary, the proposed ARMADA system addressed the challenge of scalable data collection for robot imitation learning by using augmented reality for real-time feedback to improve data quality and compatibility with physical robots. The results showed the importance of feedback for aligning robot-free demonstrations with real robot kinematics. While the study focused on simpler tasks, future research can explore more complex ones and refine techniques. This system can serve as a baseline for future robotics research, particularly in training robot control policies through imitation learning with visual observations.


Slow Thinking with LLMs: Lessons from Imitation, Exploration, and Self-Improvement

Reasoning systems such as OpenAI’s o1 were recently introduced to solve complex tasks using slow-thinking processes. However, large language models still show clear limitations: they cannot reliably plan, break down problems, improve ideas, summarize, or rethink their answers, owing to how they are trained. While these systems try to enhance reasoning, they depend on structured guidance and extra processing time, raising doubts about their ability to handle complex tasks without regular human help.

Current reasoning systems are mostly based on fast-thinking approaches, providing quick responses but with less depth and accuracy. The industry has largely developed and maintained these systems, but their core techniques are not publicly disclosed. They usually fail at extended thinking, which considerably limits their ability to solve complex problems. Methods like tree search and reward models have been used in some systems, but they either generalize poorly across domains or are too slow for real-world use. Newer systems use test-time scaling to allow more processing time, generating detailed reasoning steps, called thoughts, to find solutions. Fine-tuning large language models on long chains of thought has also improved performance on complex tasks.

To solve this, researchers from the Gaoling School of Artificial Intelligence, Renmin University of China, and BAAI proposed a three-phase framework, "imitate, explore, and self-improve," for training reasoning models similar to OpenAI’s o1 system.

In the imitation phase, the model was trained to follow specific output formats, using minimal data to generate reasoning and solutions. During the exploration phase, the model focused on difficult problems, generating multiple solutions and improving them against the correct answers, especially for tasks requiring slow thinking. In the self-improvement phase, high-quality data and techniques such as supervised fine-tuning (SFT) and direct preference optimization (DPO) were used to boost the model’s reasoning skills, with metrics like length and perplexity used to filter out low-quality data. However, there were not enough challenging problems, and reinforcement learning was not used due to limited resources. The approach focused on improving the model’s reasoning abilities through continuous refinement.
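
For reference, the DPO objective, as it is commonly defined, can be computed from sequence log-probabilities under the current policy and a frozen reference model, as sketched below. The β value and the toy inputs are illustrative; this is not the authors' training code.

```python
# Minimal sketch of the direct preference optimization (DPO) loss: it pushes
# the policy to prefer the "chosen" reasoning trace over the "rejected" one,
# measured relative to a frozen reference model. Inputs are sequence
# log-probabilities; the beta value and the numbers below are made up.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratio of how much the policy prefers chosen vs. rejected,
    # relative to the reference model.
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0, -10.2]),
                torch.tensor([-13.1, -10.0, -12.5, -10.1]),
                torch.tensor([-12.5, -9.7, -11.2, -10.0]),
                torch.tensor([-12.8, -9.9, -12.0, -10.3]))
print(float(loss))
```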

The researchers evaluated the framework on three challenging benchmarks: MATH-OAI, AIME2024, and GPQA. MATH-OAI included 500 competition mathematics problems, AIME2024 featured 30 problems aimed at high school students, and GPQA had 198 multiple-choice questions in biology, physics, and chemistry. The focus was on mathematics, with Qwen2.5-32B-Instruct as the backbone model, compared against models like o1-preview, DeepSeek-R1-Lite-Preview, and QwQ-32B. The experiments used greedy search with up to 32k tokens.

Results showed that slow-thinking systems like o1-preview performed well, particularly on AIME, while distillation- and exploration-based training also yielded competitive outcomes. Models trained on 3.9k distilled instances achieved 90.2% accuracy on MATH-OAI and 46.7% on AIME. Iterative SFT and exploration training improved performance on benchmarks like AIME and MATH-OAI, with variants trained on 1.1k instances showing consistent gains. However, performance fluctuated due to limited exploration capacity, especially on AIME, which had fewer test samples. The analysis indicated that excluding hard problems reduced performance, while mixing mathematical data with data from other domains enhanced reasoning abilities. Further DPO analysis showed that aligning only the thought process with SFT led to stable optimization, although more experiments were needed to refine the strategies. Overall, the approach maintained a good balance of iterative training, distillation, and exploration to support improvement across all the benchmarks.

In summary, the researchers presented a slow-thinking framework for enhancing reasoning systems and demonstrated its effectiveness in solving complex problems across domains. Trained on high-quality, long-form thought data, the approach enables models to generalize and handle difficult tasks, particularly in mathematics, and benefits from self-improvement through exploration and flexible thought processes. The research is still in its early stages, however, and a performance gap remains compared to industry-level systems. The framework can serve as a baseline for future work in this area.


CMU Researchers Propose miniCodeProps: A Minimal AI Benchmark for Proving Code Properties

Recently, AI agents have shown promising progress in automating mathematical theorem proving and code correctness verification using tools like Lean. Such tools pair code with specifications and proofs to ensure it meets its intended requirements, offering strong safeguards in safety-critical applications. Large language models have demonstrated that they can support the fundamental steps of solution development, namely coding, specifying, and proving. While these advances are promising, fully automating program verification remains challenging.

Traditionally, machine learning for theorem proving in Lean has trained models on datasets such as Mathlib to solve mathematical problems using its definitions and tactics. However, these approaches have struggled to transfer to program verification, which requires different methods. While machine learning has improved automation in systems like Coq and Isabelle, similar advancements for program verification in Lean are still missing. Other tools like Dafny and Verus, as well as benchmarks like miniF2F and CoqGym, offer alternatives, but they do not fully address the challenge of adapting mathematical theorem-proving methods to the needs of program verification.

To solve this, researchers from Carnegie Mellon University proposed miniCodeProps, a benchmark of 201 program specifications in the Lean proof assistant, targeting the challenge of automatically generating proofs for programs and their specifications. miniCodeProps contains simple, self-contained programs over lists, natural numbers, and binary trees, with proof difficulty varying across properties. The 201 theorem statements are divided into three categories: intuitive properties of lists, trees, and numbers (medley), termination lemmas for recursive functions (termination), and properties of nonstandard sorting algorithms (sorting). The functions primarily operate on linked lists, with some involving natural numbers and binary trees, and the properties are categorized by difficulty: easy (medley), medium (termination), and hard (sorting). Termination lemmas require proving that recursion terminates, which Lean 4 requires for function definitions. The dataset, distributed in jsonlines format, includes essential details such as the proof state and dependencies for each theorem. Examples like the zip-over-concatenation property and the sorting properties highlight the difficulty of proving these statements, especially for the more complex sorting algorithms.

The evaluation of miniCodeProps focuses on two main tasks: full-proof generation and tactic-by-tactic generation. In full-proof generation, models are tested on their ability to produce a complete proof for a given specification. In tactic-by-tactic generation, models are evaluated on their ability to suggest the next appropriate tactic from the current proof state, testing incremental reasoning. The evaluation also considers the difficulty levels of the proofs, ranging from simple properties of lists and numbers to complex termination and sorting-algorithm properties, measuring both efficiency and correctness in proof generation and tactic application.
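
A hypothetical evaluation harness for these two modes might look like the sketch below. The jsonlines field names, the completion signal, and the prover callbacks are assumptions made for illustration; the benchmark's actual schema and tooling may differ.

```python
# Illustrative harness for the two evaluation modes described above. The field
# names ("theorem_statement", "proof_state") and the prover interfaces are
# assumed for this sketch, not the benchmark's actual schema.
import json

def load_benchmark(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def eval_full_proof(items, generate_proof, check_proof):
    # Mode 1: the model emits a complete Lean proof for each theorem statement.
    solved = sum(
        check_proof(it["theorem_statement"], generate_proof(it)) for it in items
    )
    return solved / len(items)

def eval_tactic_by_tactic(items, suggest_tactic, apply_tactic, max_steps=32):
    # Mode 2: the model proposes one tactic at a time from the current state.
    solved = 0
    for it in items:
        state = it["proof_state"]
        for _ in range(max_steps):
            state = apply_tactic(state, suggest_tactic(state))
            if state == "no goals":          # assumed completion signal
                solved += 1
                break
    return solved / len(items)
```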

The results indicated that neural theorem provers such as GPT-4o perform well on the simpler tasks, achieving a 75.6% success rate on the medley properties, but much worse on the harder ones, at 4.34% on termination and 6.96% on sorting. The Mathlib-trained model ntp-ctx-1.3B showed efficiency similar to GPT-4o, suggesting that domain-specific provers hold further promise. miniCodeProps thus provides a framework for improving automated theorem-proving agents for code verification, supporting human engineers and offering additional guarantees through diverse reasoning approaches.

In the end, miniCodeProps is a valuable benchmark for advancing automated ITP-based code verification. It contains problems drawn from a range of inductive-problem datasets, enabling stepwise progress in checking program properties. Current provers, however, remain limited and cannot yet solve the more complicated properties effectively. miniCodeProps can potentially drive advancements in verification agents and serve as a baseline for evaluating new approaches to automated code verification.


Researchers from UCLA and Apple Introduce STIV: A Scalable AI Framework for Text and Image Conditioned Video Generation

Video generation has improved with models like Sora, which uses the Diffusion Transformer (DiT) architecture. While text-to-video (T2V) models have advanced, they often struggle to create clear and consistent videos without additional references. Text-image-to-video (TI2V) models address this limitation by using an initial image frame as grounding to improve clarity. Reaching Sora-level performance is still difficult, as it is hard to integrate image-based inputs into the model effectively, and higher-quality datasets are needed to improve the model’s outputs.

Current methods explored integrating image conditions into U-Net architectures, but applying these techniques to DiT models remained unresolved. While diffusion-based approaches dominated text-to-video generation by using LDMs, scaling models, and shifting to transformer-based architectures, many studies focused on isolated aspects, overlooking their combined impact on performance. Techniques like cross-attention in PixArt-α, self-attention in SD3, and stability tricks such as QKnorm showed some improvements but became less effective as models scaled. Despite advancements, no unified model successfully combined T2V and TI2V capabilities, limiting progress toward more efficient and versatile video generation.

To solve this, researchers from Apple and the University of California, Los Angeles, developed a comprehensive framework that systematically examines the interaction between model architectures, training methods, and data curation strategies. The resulting STIV method is a simple and scalable text-image-conditioned video generation approach. Using frame replacement, it incorporates the image condition into a Diffusion Transformer (DiT) and applies text conditioning through joint image-text conditional classifier-free guidance. This design enables STIV to perform text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to applications like video prediction, frame interpolation, multi-view generation, and long video generation.
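
One common way to combine two conditions under classifier-free guidance is to blend an unconditional prediction, a text-only prediction, and a text-plus-image prediction with separate guidance scales, as sketched below. This is a generic formulation with made-up scales, not necessarily the exact weighting used in STIV.

```python
# Illustrative sketch of joint image-text classifier-free guidance: combine a
# fully unconditional prediction, a text-only prediction, and a text+image
# prediction with two guidance scales. Scales and shapes are toy values.
import torch

def joint_cfg(eps_uncond, eps_text, eps_text_image, s_text=7.5, s_image=1.5):
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_image * (eps_text_image - eps_text))

# Toy usage with random "noise predictions" of shape (batch, channels, T, H, W).
shape = (1, 4, 8, 32, 32)
eps = joint_cfg(torch.randn(shape), torch.randn(shape), torch.randn(shape))
print(eps.shape)
```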

The researchers detailed the setup, training, and evaluation process for the text-to-video (T2V) and text-to-image (T2I) models. The models used the AdaFactor optimizer with a specific learning rate and gradient clipping and were trained for 400k steps. Data preparation relied on a video data engine that analyzed video frames, performed scene segmentation, and extracted features like motion and clarity scores. Training utilized curated datasets, including over 90 million high-quality video-caption pairs. Key evaluation metrics, including temporal quality, semantic alignment, and video-image alignment, were assessed using VBench, VBench-I2V, and MSRVTT. The study also explored ablations of architectural designs and training strategies, including Flow Matching, CFG renormalization, and the AdaFactor optimizer. Experiments on model initialization showed that joint initialization from lower- and higher-resolution models improved performance, and using more frames during training enhanced metrics, particularly motion smoothness and dynamic range.

The T2V and STIV models improved significantly after scaling from 600M to 8.7B parameters. In T2V, the VBench-Semantic score increased from 72.5 to 74.8 with larger model sizes and improved to 77.0 when the resolution was raised from 256 to 512. Fine-tuning with high-quality data boosted the VBench-Quality score from 82.2 to 83.9, with the best model achieving a VBench-Semantic score of 79.5. Similarly, the STIV model showed advancements, with the STIV-M-512 model achieving a VBench-I2V score of 90.1. In video prediction, the STIV-V2V model outperformed T2V with an FVD score of 183.7 compared to 536.2. The STIV-TUP model delivered strong results in frame interpolation, with FID scores of 2.0 and 5.9 on the MSRVTT and MovieGen datasets. In multi-view generation, the proposed model maintained 3D coherence and achieved performance comparable to Zero123++, with a PSNR of 21.64 and an LPIPS of 0.156. In long video generation, it generated 380 frames, showing its capability and potential for further progress.

In the end, the proposed framework provides a scalable and flexible solution for video generation by integrating text and image conditioning within a unified model. It demonstrated strong performance on public benchmarks and adaptability across various applications, including controllable video generation, video prediction, frame interpolation, long video generation, and multi-view generation. This approach highlights its potential to support future advancements in video generation and contribute to the broader research community.


This AI Paper Introduces A Maximum Entropy Inverse Reinforcement Learning (IRL) Approach for Improving the Sample Quality of Diffusion Generative Models

Diffusion models are closely linked to imitation learning because they generate samples by gradually refining random noise into meaningful data. This process is guided by behavioral cloning, a common imitation learning approach in which the model learns to copy an expert’s actions step by step. For diffusion models, a predefined process transforms noise into a final sample, and following this process ensures high-quality results across various tasks. However, behavioral cloning also leads to slow generation: the model is trained to follow a detailed path with many small steps, often requiring hundreds or thousands of expensive network evaluations, and taking fewer steps to speed up generation reduces sample quality.

Current methods optimize the sampling process without changing the model, for example by tuning noise schedules, improving differential equation solvers, and using non-Markovian methods. Others enhance sample quality by training neural networks for short-run sampling. Distillation techniques show promise but usually perform below their teacher models, whereas adversarial or reinforcement learning methods may surpass them. RL updates diffusion models based on reward signals using policy gradients or various value functions.

To solve this, researchers from the Korea Institute for Advanced Study, Seoul National University, University of Seoul, Hanyang University, and Saige Research proposed two advancements for diffusion models. The first, Diffusion by Maximum Entropy Inverse Reinforcement Learning (DxMI), combines diffusion models with energy-based models (EBMs): the EBM’s energy acts as a reward that measures how good the generated results are, and the method balances reward and entropy (uncertainty) in the diffusion model to make training stable and ensure that both models perform well on the data. The second, Diffusion by Dynamic Programming (DxDP), is a reinforcement learning algorithm that simplifies entropy estimation by optimizing an upper bound of the objective and eliminates the need for back-propagation through time by framing the task as an optimal control problem, applying dynamic programming for faster, more efficient convergence.
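
The alternating structure of this kind of training can be illustrated with a toy 2D example: an energy network is trained to assign low energy to data and high energy to generated samples, while a short-run Gaussian sampler is trained to produce low-energy samples with an entropy bonus. The networks, the placeholder data, the single-step sampler, and the closed-form entropy surrogate below are all simplifications for exposition, not the paper's estimator or architecture.

```python
# Toy sketch of alternating DxMI-style updates on 2D data: the energy network
# learns to give real samples low energy and generated samples high energy
# (energy acts as a learned reward), while a one-step Gaussian sampler learns
# to produce low-energy samples with an entropy bonus that discourages
# collapse. The entropy term is the sampler's Gaussian entropy (up to a
# constant), a crude stand-in for the paper's entropy estimation.
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))
mean_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 2))
log_std = nn.Parameter(torch.zeros(2))
opt_e = torch.optim.Adam(energy.parameters(), lr=1e-3)
opt_s = torch.optim.Adam(list(mean_net.parameters()) + [log_std], lr=1e-3)

def sample(batch):
    z = torch.randn(batch, 2)                 # start from noise
    mu = mean_net(z)                          # a single refinement step (toy)
    return mu + log_std.exp() * torch.randn_like(mu)

for step in range(200):
    real = torch.randn(128, 2) * 0.5 + 2.0    # placeholder "data" distribution
    fake = sample(128)
    # EBM update: real samples get low energy, generated samples high energy.
    loss_e = energy(real).mean() - energy(fake.detach()).mean()
    opt_e.zero_grad()
    loss_e.backward()
    opt_e.step()
    # Sampler update: minimize energy while maximizing (Gaussian) entropy.
    fake = sample(128)
    entropy = log_std.sum()                   # up to an additive constant
    loss_s = energy(fake).mean() - 0.1 * entropy
    opt_s.zero_grad()
    loss_s.backward()
    opt_s.step()
```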

The experiments demonstrated DxMI’s effectiveness in training diffusion and energy-based models (EBMs) for tasks like image generation and anomaly detection. For 2D synthetic data, DxMI improved sample quality and energy function accuracy when the entropy regularization parameter was set properly. Pre-training with DDPM proved useful but not necessary for DxMI to function. For image generation, DxMI fine-tuned models such as DDPM and EDM to generate with fewer steps while remaining competitive in quality. In anomaly detection, the energy function learned by DxMI performed better at detecting and localizing anomalies on the MVTec-AD dataset. Entropy maximization improved performance by promoting exploration and increasing model diversity.

In summary, the proposed DxMI approach greatly advances the efficiency and quality of diffusion generative models, addressing the slow generation speeds and degraded sample quality of previous methods. DxMI is not directly suitable for training single-step generators, although a diffusion model fine-tuned by DxMI can be converted into one, and it lacks the flexibility to vary the number of generation steps at test time. The method can serve as a baseline for upcoming research in this domain.


From Scale to Density: A New AI Framework for Evaluating Large Language Models

Large language models (LLMs) have made important advances in artificial intelligence, with performance on various tasks improving as their parameters and training data grow. GPT-3, PaLM, and Llama-3.1 perform well in many applications with billions of parameters. However, scaling LLMs poses severe difficulties for training and for serving inference queries, especially on low-power platforms. While scaling has proven effective at producing more capable models over time, continuing to scale is becoming increasingly unsustainable. It is also necessary to make LLMs usable on devices with limited computational power and to address more fundamental aspects of reasoning, which requires producing more tokens at inference time.

Current methods for optimizing large language models comprise scaling, pruning, distillation, and quantization. Scaling enhances performance by increasing parameters but demands higher resources. Pruning removes less critical model components to reduce size but often sacrifices performance. Distillation trains smaller models to replicate larger ones but typically results in lower density. Quantization reduces numerical precision for efficiency but may degrade results. These methods fail to balance efficiency and performance well, so there is a shift toward optimizing “density” as a more sustainable metric for developing large language models.

To solve this, researchers from Tsinghua University and ModelBest Inc. proposed "capability density" as a new metric to evaluate the quality of LLMs across different scales and to describe their trends in effectiveness and efficiency. The density of a large language model is the ratio of its effective parameter size to its actual parameter size, where the effective parameter size is the number of parameters a reference model would need to match the given model's performance. This is estimated using the Scaling Law in two steps: (1) fitting a function between parameter size and language-model loss, and (2) predicting downstream task performance from loss using a sigmoid function. The effective parameter size is then computed from these fitted functions, and density is the ratio of effective to actual size, with higher density indicating better performance per parameter. The concept is especially useful for optimizing models for deployment on resource-limited devices.
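
The two-step estimate can be made concrete with a small numerical sketch. The fitted scaling-law and sigmoid constants below are invented placeholders, not the paper's values; the point is only to show how a measured benchmark score is inverted into an effective parameter size and then into a density.

```python
# Sketch of the two-step density estimate: (1) an assumed fitted scaling law
# maps parameter count N to language-model loss, (2) an assumed fitted sigmoid
# maps loss to downstream performance; inverting the composition gives the
# "effective" parameter size a reference model would need to reach a measured
# score. All constants are placeholders for illustration.
import numpy as np

def loss_from_params(n, a=8.0, alpha=0.08, c=1.7):
    return a * n ** (-alpha) + c                  # assumed fitted scaling law

def perf_from_loss(loss, k=5.0, l0=3.2):
    return 1.0 / (1.0 + np.exp(k * (loss - l0)))  # assumed fitted sigmoid

def effective_params(measured_perf, lo=1e6, hi=1e13, iters=100):
    # Bisection: performance increases monotonically with parameter count.
    for _ in range(iters):
        mid = (lo * hi) ** 0.5                    # bisect in log space
        if perf_from_loss(loss_from_params(mid)) < measured_perf:
            lo = mid
        else:
            hi = mid
    return mid

actual_params = 4e9                               # e.g. a 4B-parameter model
measured_perf = 0.74                              # its benchmark score (toy)
density = effective_params(measured_perf) / actual_params
print(f"density ~= {density:.2f}")                # ~2: performs like an 8B reference
```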

The researchers analyzed 29 open-source pre-trained models and evaluated large language models (LLMs) on datasets including MMLU, BBH, MATH, HumanEval, and MBPP, under few-shot settings such as 5-shot, 3-shot, and 0-shot, using open-source benchmarking tools. The models were trained with varying parameter sizes, token counts, and data scales, applying techniques such as chain-of-thought prompting and different learning-rate schedulers. Performance scaling curves were obtained by training models on different token budgets, with models like Llama, Falcon, MPT, Phi, Mistral, and MiniCPM tested across various configurations. Over time, the density of these models increased significantly, with newer models, such as MiniCPM-3-4B, achieving higher densities than older ones. A linear regression fit indicated that LLM density doubles approximately every 95 days, meaning that smaller, cheaper models will soon match the capabilities of today's larger, more complex ones as design and training techniques continue to improve.
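
The doubling time itself falls out of a linear fit of log-density against release date, as in the sketch below; the (date, density) points here are invented for illustration, and the reciprocal of the fitted slope gives the doubling time in days.

```python
# Sketch of estimating a doubling time from a linear fit of log2(density)
# against release date: the slope is in doublings per day, so its reciprocal
# is the doubling time. The data points below are hypothetical.
import numpy as np

days = np.array([0, 120, 240, 360, 480])          # days since a reference date
density = np.array([1.0, 2.4, 5.5, 13.0, 32.0])   # hypothetical density values

slope, intercept = np.polyfit(days, np.log2(density), 1)
print(f"doubling time ~= {1 / slope:.0f} days")   # ~96 days for this toy data
```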

In conclusion, the proposed framework highlights the exponentially increasing capability density of LLMs, showing rapid development and efficiency improvements. Evaluation results on widely used LLM benchmarks indicate that LLM density doubles roughly every three months. The researchers also propose shifting toward inference FLOPs when evaluating density, to account for deeper reasoning at test time. This framework can guide upcoming research and may mark a turning point in how LLMs are evaluated.

