Mohammad Asjad, Author at MarkTechPost
https://www.marktechpost.com/author/mohammad_asjad/

Why Do Task Vectors Exist in Pretrained LLMs? This AI Research from MIT and Improbable AI Uncovers How Transformers Form Internal Abstractions and the Mechanisms Behind In-Context Learning (ICL)
Tue, 24 Dec 2024


Large Language Models (LLMs) have demonstrated remarkable similarities to human cognition in their ability to form abstractions and adapt to new situations. Just as humans have historically made sense of complex experiences through fundamental concepts like physics and mathematics, autoregressive transformers now show comparable capabilities through in-context learning (ICL). Recent research has highlighted how these models can adapt to complex tasks without parameter updates, suggesting the formation of internal abstractions similar to human mental models. Studies have begun exploring the mechanistic aspects of how pretrained LLMs represent latent concepts as vectors in their representations. However, questions remain about why these task vectors exist and why their effectiveness varies across tasks.

Researchers have proposed several theoretical frameworks to understand the mechanisms behind in-context learning in LLMs. One significant approach views ICL through a Bayesian framework, suggesting a two-stage algorithm that estimates posterior probability and likelihood. Parallel to this, studies have identified task-specific vectors in LLMs that can trigger desired ICL behaviors. At the same time, other research has revealed how these models encode concepts like truthfulness, time, and space as linearly separable representations. Through mechanistic interpretability techniques such as causal mediation analysis and activation patching, researchers have begun to uncover how these concepts emerge in LLM representations and influence downstream ICL task performance, demonstrating that transformers implement different algorithms based on inferred concepts.

Researchers from the Massachusetts Institute of Technology and Improbable AI introduce the concept encoding-decoding mechanism, providing a compelling explanation for how transformers develop internal abstractions. Research on a small transformer trained on sparse linear regression tasks reveals that concept encoding emerges as the model learns to map different latent concepts into distinct, separable representation spaces. This process operates in tandem with the development of concept-specific ICL algorithms through concept decoding. Testing across various pretrained model families, including Llama-3.1 and Gemma-2 at different sizes, demonstrates that larger language models exhibit this concept encoding-decoding behavior when processing natural ICL tasks. The research introduces Concept Decodability as a geometric measure of internal abstraction formation, showing that earlier layers encode latent concepts while later layers condition algorithms on these inferred concepts, with both processes developing interdependently.

The theoretical framework for understanding in-context learning draws heavily from a Bayesian perspective, which proposes that transformers implicitly infer latent variables from demonstrations before generating answers. This process operates in two distinct stages: latent concept inference and selective algorithm application. Experimental evidence from synthetic tasks, particularly using sparse linear regression, demonstrates how this mechanism emerges during model training. When trained on multiple tasks with different underlying bases, models develop distinct representational spaces for different concepts while simultaneously learning to apply concept-specific algorithms. The research reveals that concepts sharing overlaps or correlations tend to share representational subspaces, suggesting potential limitations in how models distinguish between related tasks in natural language processing.
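To make the synthetic setup concrete, the sketch below generates sparse-linear-regression prompts of the kind described, where each latent concept corresponds to a different sparse basis of active coordinates. The dimensions, noise level, and helper name are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def make_sparse_regression_prompt(basis, n_examples=16, dim=8, noise=0.01, rng=None):
    """Build one in-context prompt for a sparse linear regression task.

    `basis` lists the active coordinate indices defining the latent concept;
    the true weight vector is supported only on those coordinates. Returns
    (X, y, w): example inputs, noisy targets, and ground-truth weights. A
    transformer would see X, y interleaved as (x_1, y_1, ..., x_n, y_n).
    """
    rng = rng or np.random.default_rng(0)
    w = np.zeros(dim)
    w[list(basis)] = rng.normal(size=len(basis))  # sparse ground-truth weights
    X = rng.normal(size=(n_examples, dim))
    y = X @ w + noise * rng.normal(size=n_examples)
    return X, y, w

# Two tasks with disjoint bases correspond to two distinct latent concepts;
# the claim is that a trained model maps them to separable representation
# subspaces, while overlapping bases tend to share a subspace.
X1, y1, w1 = make_sparse_regression_prompt(basis=[0, 1], rng=np.random.default_rng(1))
X2, y2, w2 = make_sparse_regression_prompt(basis=[5, 6], rng=np.random.default_rng(2))
```

Training on many prompts drawn from several such bases is what lets the model form, and then condition on, the corresponding latent concepts.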

The research provides compelling empirical validation of the concept encoding-decoding mechanism in pretrained Large Language Models across different families and scales, including Llama-3.1 and Gemma-2. Through experiments with part-of-speech tagging and bitwise arithmetic tasks, researchers demonstrated that models develop more distinct representational spaces for different concepts as the number of in-context examples increases. The study introduces Concept Decodability (CD) as a metric to quantify how well latent concepts can be inferred from representations, showing that higher CD scores correlate strongly with better task performance. Notably, concepts frequently encountered during pretraining, such as nouns and basic arithmetic operations, show clearer separation in representational space compared to more complex concepts. The research further demonstrates through finetuning experiments that early layers play a crucial role in concept encoding, with modifications to these layers yielding significantly better performance improvements than changes to later layers.
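The paper's exact Concept Decodability definition is not reproduced here; the sketch below illustrates one plausible geometric measure in the same spirit: held-out accuracy of a nearest-centroid classifier predicting the latent concept from layer representations. The synthetic "hidden states" are stand-ins for real model activations.

```python
import numpy as np

def concept_decodability(reps, labels, rng=None):
    """Rough proxy for Concept Decodability: how accurately can the latent
    concept be read off from representations? Higher means concepts occupy
    more separable regions of representation space."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(reps))
    split = len(reps) // 2
    train, test = idx[:split], idx[split:]
    classes = np.unique(labels)
    # One centroid per concept, estimated on the training half.
    centroids = np.stack([reps[train][labels[train] == c].mean(axis=0)
                          for c in classes])
    dists = np.linalg.norm(reps[test][:, None, :] - centroids[None, :, :], axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return (preds == labels[test]).mean()

# Synthetic stand-in for hidden states: two concepts offset in a 16-d space.
rng = np.random.default_rng(0)
reps = np.concatenate([rng.normal(0, 1, (50, 16)), rng.normal(3, 1, (50, 16))])
labels = np.array([0] * 50 + [1] * 50)
cd = concept_decodability(reps, labels)
```

Under this proxy, well-separated concept clusters score near 1.0, matching the paper's observation that higher CD correlates with better downstream ICL performance.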

The concept encoding-decoding mechanism provides valuable insights into several key questions about Large Language Models’ behavior and capabilities. The research addresses the varying success rates of LLMs across different in-context learning tasks, suggesting that performance bottlenecks can occur at both the concept inference and algorithm decoding stages. Models show stronger performance with concepts frequently encountered during pretraining, such as basic logical operators, but may struggle even with known algorithms if concept distinction remains unclear. The mechanism also explains why explicit modeling of latent variables doesn’t necessarily outperform implicit learning in transformers, as standard transformers naturally develop effective concept encoding capabilities. This framework also offers a theoretical foundation for understanding activation-based interventions in LLMs, suggesting that such methods work by directly influencing the encoded representations that guide the model’s generation process.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 60k+ ML SubReddit.

Scaling Language Model Evaluation: From Thousands to Millions of Tokens with BABILong
Fri, 20 Dec 2024


Large Language Models (LLMs) and neural architectures have advanced significantly in capability, particularly in processing longer contexts, with profound implications for various applications. Enhanced context handling enables models to generate more accurate and contextually relevant responses by drawing on comprehensive information. The expanded context capacity has also strengthened in-context learning, allowing models to use more examples and follow complex instructions effectively. Despite these technological leaps, evaluation benchmarks have not evolved correspondingly: current assessment tools like LongBench and L-Eval remain limited to around 40,000 tokens, while modern models can process hundreds of thousands or even millions of tokens, creating a significant gap between model capabilities and evaluation methods.

The evolution of long-context evaluation benchmarks began with Long Range Arena (LRA), which handled sequences up to 16,000 tokens but focused primarily on specialized tasks like ListOps and byte-level operations. This limitation prompted the development of more comprehensive evaluation frameworks. Notable among these are LongBench, SCROLLS, and L-Eval, which incorporate diverse tasks ranging from summarization to code completion, with token lengths varying from 3,000 to 60,000. Recent developments have produced more specialized benchmarks focusing on in-context learning and instruction following, such as LongAlign and LongICLBench. Additional datasets like InfinityBench, NovelQA, and ChapterBreak have pushed boundaries further, handling up to 636,000 tokens and covering domains from Wikipedia articles to movie scripts.

Researchers from AIRI (Moscow, Russia), the Neural Networks and Deep Learning Lab at MIPT (Dolgoprudny, Russia), and the London Institute for Mathematical Sciences (London, UK) introduce BABILong, an innovative benchmark designed to evaluate language models’ reasoning capabilities across extremely long documents. This comprehensive evaluation framework encompasses 20 distinct reasoning tasks, including fact chaining, induction, deduction, and list handling, utilizing books from the PG19 corpus as source material. The benchmark’s flexibility allows for testing sequences of up to 50 million tokens, making it uniquely suited for evaluating next-generation models. Initial testing reveals significant limitations in current models, with popular LLMs effectively utilizing only 10-20% of available context. While Retrieval-Augmented Generation methods achieve 60% accuracy on single-fact questions, architectural innovations like Mamba and Recurrent Memory Transformers demonstrate superior performance, with ARMT notably processing sequences up to 50 million tokens.

The BABILong benchmark employs a distinctive methodology to evaluate language models’ capabilities in handling extended contexts. By embedding task-relevant sentences within irrelevant text drawn from the PG19 dataset, the benchmark creates a challenging environment that mirrors real-world scenarios where crucial information is dispersed throughout lengthy documents. This approach allows for unlimited scaling of context length, enabling the evaluation of models with context windows of millions of tokens. The benchmark builds upon the original bAbI tasks, which assess fundamental reasoning capabilities through simulated interactions between characters and objects. These tasks, labeled QA1 through QA20, test various cognitive abilities, including spatial reasoning, temporal understanding, and deduction. Notably, this synthetic approach ensures immunity to training data contamination, a common vulnerability in traditional NLP benchmarks.
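The construction can be sketched as follows. The helper name and filler sentences are illustrative stand-ins for PG19 book text, not the benchmark's actual code; the key idea is that context length scales arbitrarily while the needle facts stay fixed.

```python
import random

def make_babilong_sample(facts, question, filler_sentences, target_len, seed=0):
    """Build a BABILong-style example: hide the task-relevant `facts` at
    random positions inside irrelevant filler text, then append the
    question. `target_len` counts filler sentences, so the distractor
    context can be scaled to any length."""
    rng = random.Random(seed)
    context = [rng.choice(filler_sentences) for _ in range(target_len)]
    # Choose distinct insertion points; pairing ascending positions with
    # facts in order preserves the facts' relative order in the output.
    positions = sorted(rng.sample(range(target_len + 1), len(facts)))
    # Insert from the highest position first so earlier indices stay valid.
    for pos, fact in sorted(zip(positions, facts), key=lambda t: -t[0]):
        context.insert(pos, fact)
    return " ".join(context + [question])

facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
question = "Where is the apple?"
filler = ["The sea was calm that night.", "He turned the page slowly.",
          "A distant bell rang twice."]
sample = make_babilong_sample(facts, question, filler, target_len=20)
```

Raising `target_len` from dozens to millions of sentences is what turns the same two-fact bAbI task into a multimillion-token stress test.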

A comprehensive analysis of language models’ context utilization reveals significant limitations in their ability to process long sequences effectively. Testing across various question-answering tasks demonstrates that most current LLMs efficiently use only 10-20% of their advertised context window. Among 34 tested models, only 23 achieved the benchmark threshold of 85% accuracy on basic tasks without distractor text. Performance varies significantly across different architectures: while models like GPT-4 and Llama-3.1-70b maintain effectiveness up to 16K tokens, most models struggle beyond 4K tokens. Recent developments show promising improvements, with Qwen-2.5 models leading among open LLMs. The evaluation also explored alternative approaches, including Retrieval-Augmented Generation (RAG) and fine-tuned models. While RAG demonstrates limited success, fine-tuned recurrent memory models, particularly ARMT, show remarkable capabilities, processing sequences up to 50 million tokens with consistent performance.

BABILong represents a significant advancement in evaluating language models’ long-context capabilities through its unique combination of scalability and diverse reasoning tasks. The benchmark’s adaptable design allows for testing sequences from 0 to 10 million tokens while maintaining algorithmic control over document length and fact placement. Testing revealed that current models, including advanced systems like GPT-4 and Gemini 1.5 Pro, utilize only 5-25% of their input context effectively. While newer models like Llama-3.1 and Qwen-2.5 demonstrate improved performance, they still face limitations. Fine-tuning experiments proved particularly revealing, showing that even relatively small models like RMT and ARMT (137M parameters) can effectively handle BABILong tasks, with ARMT notably processing sequences up to 50 million tokens, far surpassing Mamba’s practical limit of 128K tokens.


The Role of Specifications in Modularizing Large Language Models
Wed, 18 Dec 2024


Software has been a critical catalyst for economic growth over the past several decades, a phenomenon prominently articulated by Andreessen in his influential blog post, “Why software is eating the world.” The technological landscape is now witnessing another transformative wave with Artificial Intelligence, particularly Large Language Models (LLMs), poised to revolutionize the existing software ecosystem. Researchers argue that realizing the full potential of this technological advancement requires developing LLM-based systems with the same engineering rigor and reliability found in established disciplines like control theory, mechanical engineering, and software engineering. Specifications emerge as a fundamental tool that can facilitate this systematic development, enabling complex system decomposition, component reusability, and comprehensive system verification.

Generative AI has experienced remarkable progress over the past two decades, with an unprecedented acceleration since ChatGPT’s introduction. However, this advancement primarily stems from developing increasingly larger models, which demand extensive computational resources and substantial financial investments. Current state-of-the-art model development costs hundreds of millions of dollars, with projections suggesting future expenses could reach billions. This model development paradigm presents two significant challenges: first, the prohibitive costs limit model development to a few privileged companies, and second, the monolithic nature of these models complicates identifying and addressing output inaccuracies. Hallucinations remain the most prominent drawback, highlighting the complexity of debugging and refining these sophisticated AI systems. These constraints potentially impede the broader growth and democratization of artificial intelligence technologies.

Researchers from UC Berkeley, UC San Diego, Stanford University, and Microsoft Research distinguish between two types of specifications: statement specifications and solution specifications. Statement specifications define the fundamental objectives of a task, answering the critical question, “What should the task accomplish?” Conversely, solution specifications provide mechanisms to verify task outputs, addressing the query, “How can one validate that the solution meets the original specification?” Different domains illustrate this distinction uniquely: in traditional software development, statement specifications manifest as Product Requirements Documents, while solution specifications emerge through input-output tests. Formal frameworks like Coq/Gallina represent statement specifications through rigorous formal specifications and solution specifications via proofs demonstrating code correctness. In some instances, such as mathematical problem-solving, the statement and solution specifications can seamlessly converge, providing a unified approach to task definition and verification.
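The distinction can be made concrete in code. In the sketch below, a docstring plays the role of a statement specification (what the task should accomplish) and property checks play the role of a solution specification (how to validate an output); the task itself is an illustrative toy, not an example from the paper.

```python
def sort_unique(xs):
    """Statement specification: return the distinct elements of `xs` in
    ascending order. (A requirements document plays this role in
    traditional software; here a docstring stands in.)"""
    return sorted(set(xs))

def check_sort_unique(fn, xs):
    """Solution specification: properties any correct implementation must
    satisfy, checkable without re-deriving the algorithm itself."""
    out = fn(xs)
    assert out == sorted(out), "output must be ascending"
    assert len(out) == len(set(out)), "output must contain no duplicates"
    assert set(out) == set(xs), "output must preserve the input's elements"
    return True

result = sort_unique([3, 1, 2, 3])
ok = check_sort_unique(sort_unique, [3, 1, 2, 3])
```

Note that the solution specification never names the sorting algorithm: any implementation, human- or LLM-written, can be validated against the same checks.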

LLMs encounter a fundamental challenge in task specification: balancing the accessibility of natural language with its inherent ambiguity. This tension arises from the ability to specify tasks using prompts that can be simultaneously flexible and unclear. Some prompts are inherently ambiguous, rendering precise interpretation impossible, such as “Write a poem about a white horse in Shakespeare’s style.” Other prompts contain partially resolvable ambiguities that can be clarified through additional context or specification. For instance, a prompt like “How long does it take to go from Venice to Paris?” can be disambiguated by providing specific details about locations and transportation methods. Researchers propose various approaches to address these specification challenges, drawing inspiration from human communication strategies to develop more precise and effective LLM task definitions.

LLMs face significant challenges in verifiability and debuggability, fundamental engineering properties critical to system reliability. Verifiability involves assessing whether a task’s implementation adheres to its original specification, often complicated by ambiguous solution specifications and potential hallucinations. Researchers propose multiple approaches to enhance system verification, including proof-carrying-outputs, step-by-step verification, execute-then-verify techniques, and statistical verification methods. Debuggability presents an additional complex challenge, as LLMs function essentially as black boxes where traditional debugging techniques prove ineffective. Emerging strategies include generating multiple outputs, employing self-consistency checks, using mixture of outputs, and implementing process supervision to iteratively improve system performance. These techniques aim to transform LLM development from a trial-and-error approach to a more systematic, engineered methodology.
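A minimal sketch of the execute-then-verify idea combined with a self-consistency vote, assuming candidate outputs have already been sampled from a model; the toy task and verifier here are illustrative stand-ins. The verifier checks properties of the output (the solution specification) rather than asking the model for a second opinion.

```python
from collections import Counter

def execute_then_verify(candidates, verifier):
    """Keep only candidates that pass the solution specification, then
    pick the most common survivor (a simple self-consistency vote)."""
    verified = [c for c in candidates if verifier(c)]
    if not verified:
        return None  # signal that no candidate met the spec
    return Counter(verified).most_common(1)[0][0]

# Toy task: produce a sorted permutation of [1, 2, 3]. The spec verifies
# sortedness and element preservation without hard-coding the algorithm.
candidates = [(1, 2, 3), (3, 1, 2), (1, 2, 3), (1, 3, 2)]
spec = lambda c: list(c) == sorted(c) and sorted(c) == [1, 2, 3]
answer = execute_then_verify(candidates, spec)
```

Returning `None` when nothing verifies is a deliberate design choice: surfacing failure explicitly is preferable to silently emitting an unverified (possibly hallucinated) output.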

Engineering disciplines have historically driven remarkable economic progress through five critical properties: verifiability, debuggability, modularity, reusability, and automatic decision-making. These properties collectively enable developers to construct complex systems efficiently, build reliable infrastructures, and create autonomous solutions. The foundation of these engineering properties lies in clear, precise specifications that definitively describe task objectives and provide comprehensive verification mechanisms. Artificial Intelligence, particularly LLMs, stands at the threshold of another potential economic and social transformation. However, the prevalent ambiguity in LLM task specifications, primarily arising from natural language’s inherent complexity, presents a significant barrier to systematic development. Researchers argue that developing techniques to generate unambiguous statement and solution specifications is crucial for accelerating LLM technological advancement and expanding its practical applications.


Google Released State of the Art ‘Veo 2’ for Video Generation and ‘Improved Imagen 3’ for Image Creation: Setting New Standards with 4K Video and Several Minutes Long Video Generation
Wed, 18 Dec 2024


Innovations in video and image generation are improving the quality of visuals while making AI models more responsive to detailed prompts. AI tools have opened new possibilities for artists, filmmakers, businesses, and creative professionals by achieving more accurate representations of real-world physics and human movement. AI-generated visuals are no longer limited to generic images and videos; they now allow for high-quality, cinematic outputs that closely mimic human creativity. This progress reflects the immense demand for technology that efficiently produces professional-grade results, offering opportunities across industries from entertainment to advertising.

The challenge in AI-based video and image generation has always been achieving realism and precision. Earlier models often struggled with inconsistencies in video content, such as hallucinated objects, distorted human movements, and unnatural lighting. Similarly, image generation tools sometimes failed to follow user prompts accurately or rendered textures and details poorly. These shortcomings undermined their usability in professional settings where flawless execution is critical. Models need a better grasp of physics-based interactions, lighting effects, and intricate artistic detail, all of which are fundamental to visually appealing and accurate outputs.

Existing tools like Veo and Imagen provided considerable improvements but had limitations. Veo allowed creators to generate video content with custom backgrounds and cinematic effects, while Imagen produced high-quality images in various art styles. YouTube creators, enterprise customers on Vertex AI, and artists using VideoFX and ImageFX relied on these tools extensively, but they often ran into technical constraints such as inconsistent detail rendering, limited resolution, and an inability to adapt seamlessly to complex user prompts. As a result, creators required tools that combined precision, realism, and flexibility to meet professional standards.

Google Labs and Google DeepMind introduced Veo 2 and an upgraded Imagen 3 to address these problems. These models represent the next generation of AI-driven tools to achieve state-of-the-art video and image generation results. Veo 2 focuses on video production with improved realism, supporting resolutions up to 4K and extending video lengths to several minutes. It incorporates a deep understanding of cinematographic language, enabling users to specify lenses, cinematic effects, and camera angles. For instance, prompts like “18mm lens” or “low-angle tracking shot” allow the model to create wide-angle shots or immersive cinematic effects. Imagen 3 enhances image generation by producing richer textures, brighter visuals, and precise compositions across various art styles. These tools are now accessible through platforms like VideoFX, ImageFX, and Whisk, Google’s new experiment that combines AI-generated visuals with creative remixing capabilities.

Veo 2 brings several upgrades to video generation. The central one is its improved understanding of real-world physics and human expression. Unlike earlier models, Veo 2 accurately renders complex movements, natural lighting, and detailed backgrounds while minimizing hallucinated artifacts like extra fingers or floating objects. Users can create videos with genre-specific effects, motion dynamics, and storytelling elements. For example, the tool allows prompts to include phrases such as “shallow depth of field” or “smooth panning shot,” resulting in videos that mirror professional filmmaking techniques. Imagen 3 similarly delivers exceptional improvements by following prompts with greater fidelity. It generates photorealistic textures, detailed compositions, and art styles ranging from anime to impressionism. These models offer professional-grade visual content creation that adapts to user requirements.

In head-to-head comparisons judged by human raters, Veo 2 outperformed leading video models in realism, quality, and prompt adherence. Imagen 3 achieved state-of-the-art results in image generation, excelling in texture precision, composition accuracy, and color grading. The upgraded models also carry SynthID watermarks that identify outputs as AI-generated, supporting ethical usage and mitigating misinformation risks.

Alongside Veo 2 and the improved Imagen 3, the team introduced Whisk, a new experimental tool that pairs Imagen 3 with Google’s Gemini model for image-based visualization. Whisk lets users upload or create images and remix their subjects, scenes, and styles into new visuals: Gemini automatically writes a detailed caption of each input image, and those descriptions are fed into Imagen 3 to generate the result. For instance, the tool can transform a hand-drawn concept into a polished digital output by analyzing and enhancing the image through this caption-and-generate pipeline.
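Whisk's caption-then-generate flow can be sketched as below. Both `describe_image` and the prompt format are hypothetical stand-ins for illustration only, not the real Gemini or Imagen 3 APIs; a production system would replace each stub with an actual model call.

```python
def describe_image(image):
    """Hypothetical stand-in for Gemini's captioning step. A real system
    would send the image to the Gemini API and get back a caption."""
    return f"an image of {image['subject']} in {image['scene']}"

def remix(subject_img, scene_img, style_hint):
    """Sketch of Whisk's flow: caption the input images, merge the
    descriptions with the user's style hint into a single text prompt,
    then hand that prompt to the image generator (stubbed out here)."""
    prompt = (f"{describe_image(subject_img)}, placed in "
              f"{scene_img['scene']}, rendered as {style_hint}")
    return prompt  # a real system would pass this prompt to Imagen 3

prompt = remix({"subject": "a corgi", "scene": "a park"},
               {"subject": "a castle", "scene": "a snowy mountain"},
               "pixel art")
```

The design choice worth noting is that the remix happens in text space: because captions are ordinary prompts, subjects, scenes, and styles from different images can be recombined freely before any pixels are generated.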

Some of the highlights of ‘Veo 2’:

  • Veo 2 creates videos at up to 4K resolution with extended lengths of several minutes.
  • It reduces hallucinated artifacts such as extra objects or distorted human movements.
  • Also, it accurately interprets cinematographic language (lens type, camera angles, and motion effects).
  • Veo 2 improves understanding of real-world physics and human expressions for greater realism.
  • It allows cinematic prompts, such as “low-angle tracking shots” and “shallow depth of field,” to produce professional outputs.
  • It integrates with Google Labs’ VideoFX platform for widespread usability.

Some of the highlights of ‘Improved Imagen 3’:

  • Now, Imagen 3 produces brighter, more detailed images with improved textures and compositions.
  • It accurately follows prompts across diverse art styles, including photorealism, anime, and impressionism.
  • Imagen 3 enhances color grading and detail rendering for sharper, richer visuals.
  • It minimizes inconsistencies in generated outputs, achieving state-of-the-art image quality.
  • Accessible through Google Labs’ ImageFX platform and supports creative applications.

In conclusion, Google Labs and DeepMind introduce parallel upgrades in AI-driven video and image generation. Veo 2 and Imagen 3 set new benchmarks for professional-grade content creation by addressing long-standing challenges in visual realism and user control. These tools improve video and image fidelity, enabling creators to specify intricate details and achieve cinematic outputs. With innovations like Whisk, users gain access to creative workflows that were previously unattainable. The combination of precision, ethical safeguards, and creative flexibility positions Veo 2 and Imagen 3 to have a lasting positive impact on AI-generated visuals.


Meta FAIR Releases Meta Motivo: A New Behavioral Foundation Model for Controlling Virtual Physics-based Humanoid Agents for a Wide Range of Complex Whole-Body Tasks https://www.marktechpost.com/2024/12/16/meta-fair-releases-meta-motivo-a-new-behavioral-foundation-model-for-controlling-virtual-physics-based-humanoid-agents-for-a-wide-range-of-complex-whole-body-tasks/ https://www.marktechpost.com/2024/12/16/meta-fair-releases-meta-motivo-a-new-behavioral-foundation-model-for-controlling-virtual-physics-based-humanoid-agents-for-a-wide-range-of-complex-whole-body-tasks/#respond Mon, 16 Dec 2024 18:39:12 +0000 https://www.marktechpost.com/?p=66423 Foundation models, pre-trained on extensive unlabeled data, have emerged as a cutting-edge approach for developing versatile AI systems capable of solving complex tasks through targeted prompts. Researchers are now exploring the potential of extending this paradigm beyond language and visual domains, focusing on behavioral foundation models (BFMs) for agents interacting with dynamic environments. Specifically, the […]

The post Meta FAIR Releases Meta Motivo: A New Behavioral Foundation Model for Controlling Virtual Physics-based Humanoid Agents for a Wide Range of Complex Whole-Body Tasks appeared first on MarkTechPost.

]]>

Foundation models, pre-trained on extensive unlabeled data, have emerged as a cutting-edge approach for developing versatile AI systems capable of solving complex tasks through targeted prompts. Researchers are now exploring the potential of extending this paradigm beyond language and visual domains, focusing on behavioral foundation models (BFMs) for agents interacting with dynamic environments. Specifically, the research aims to develop BFMs for humanoid agents, targeting whole-body control through proprioceptive observations. This approach addresses a long-standing challenge in robotics and AI, characterized by the high-dimensionality and intrinsic instability of humanoid control systems. The ultimate goal is to create generalized models that can express diverse behaviors in response to various prompts, including imitation, goal achievement, and reward optimization.

Meta researchers introduce FB-CPR (Forward-Backward representations with Conditional Policy Regularization), an innovative online unsupervised reinforcement learning algorithm designed to ground policy learning through observation-only unlabeled behaviors. The algorithm’s key technical innovation involves utilizing forward-backward representations to embed unlabeled trajectories into a shared latent space, with a latent-conditional discriminator encouraging policies to comprehensively “cover” dataset states. Demonstrating the method’s effectiveness, the team developed META MOTIVO, a behavioral foundation model for whole-body humanoid control that can be prompted to solve diverse tasks such as motion tracking, goal reaching, and reward optimization in a zero-shot learning scenario. The model utilizes the SMPL skeleton and AMASS motion capture dataset to achieve remarkable behavioral expressiveness.

Researchers introduce a robust approach to forward-backward (FB) representation learning with conditional policy regularization. At the pre-training stage, the agent has access to an unlabeled behavior dataset containing observation-only trajectories. The method focuses on developing a continuous set of latent-conditioned policies where latent variables are drawn from a distribution defined over a latent space. By representing behaviors through the joint space of states and latent variables, the researchers aim to capture diverse motion patterns. The key innovation lies in inferring latent variables for each trajectory using the ERFB method, which allows encoding trajectories into a shared representational space. The ultimate goal is to regularize the unsupervised training of the behavioral foundation model by minimizing the discrepancy between the induced policy distribution and the dataset distribution.
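The trajectory-encoding step above can be sketched in a few lines. The averaging-and-rescaling form below is our assumption about how an ERFB-style encoding works, for illustration only; `backward_embed` stands in for the learned backward representation, and the sqrt(d)-sphere rescaling is a common convention rather than a confirmed detail of the paper.

```python
import math

def infer_latent(states, backward_embed):
    # Sketch (assumed form): embed each state with the backward map,
    # average over the trajectory, and rescale the mean to norm sqrt(d).
    d = len(backward_embed(states[0]))
    mean = [sum(backward_embed(s)[j] for s in states) / len(states) for j in range(d)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0
    return [v * math.sqrt(d) / norm for v in mean]
```

Under this sketch, every unlabeled trajectory lands in the same latent space as the policies, which is what lets the discriminator compare policy-visited states against dataset states conditioned on the same latent variable.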

The research presents a comprehensive performance evaluation of the FB-CPR algorithm across multiple task categories. FB-CPR demonstrates remarkable zero-shot capabilities, achieving 73.4% of top-line algorithm performance without explicit task-specific training. In reward-maximization tasks, the method outperforms unsupervised baselines, notably achieving 177% of DIFFUSER’s performance while maintaining significantly lower computational complexity. For goal-reaching tasks, FB-CPR performs comparably to specialized baselines, outperforming zero-shot alternatives by 48% and 118% in proximity and success metrics respectively. A human evaluation study further revealed that while task-specific algorithms might achieve higher numerical performance, FB-CPR was consistently perceived as more “human-like”, with participants rating its behaviors as more natural in 83% of reward-based tasks and 69% of goal-reaching scenarios.

This research introduced FB-CPR, a unique algorithm that combines zero-shot properties of forward-backward models with innovative regularization techniques for policy learning using unlabeled behavior datasets. By training the first behavioral foundation model for complex humanoid agent control, the method demonstrated state-of-the-art performance across diverse tasks. Despite its significant achievements, the approach has notable limitations. FB-CPR struggles with tasks far removed from motion-capture datasets and occasionally produces imperfect movements, particularly in scenarios involving falling or standing. The current model is restricted to proprioceptive observations and cannot navigate environments or interact with objects. Future research directions include integrating additional state variables, exploring complex perception methods, utilizing video-based human activity datasets, and developing more direct language-policy alignment techniques to expand the model’s capabilities and generalizability.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 60k+ ML SubReddit.


The post Meta FAIR Releases Meta Motivo: A New Behavioral Foundation Model for Controlling Virtual Physics-based Humanoid Agents for a Wide Range of Complex Whole-Body Tasks appeared first on MarkTechPost.

]]>
https://www.marktechpost.com/2024/12/16/meta-fair-releases-meta-motivo-a-new-behavioral-foundation-model-for-controlling-virtual-physics-based-humanoid-agents-for-a-wide-range-of-complex-whole-body-tasks/feed/ 0 66423
Beyond the Mask: A Comprehensive Study of Discrete Diffusion Models https://www.marktechpost.com/2024/12/15/beyond-the-mask-a-comprehensive-study-of-discrete-diffusion-models/ https://www.marktechpost.com/2024/12/15/beyond-the-mask-a-comprehensive-study-of-discrete-diffusion-models/#respond Sun, 15 Dec 2024 15:57:32 +0000 https://www.marktechpost.com/?p=66398 Masked diffusion has emerged as a promising alternative to autoregressive models for the generative modeling of discrete data. Despite its potential, existing research has been constrained by overly complex model formulations and ambiguous relationships between different theoretical perspectives. These limitations have resulted in suboptimal parameterization and training objectives, often requiring ad hoc adjustments to address […]

The post Beyond the Mask: A Comprehensive Study of Discrete Diffusion Models appeared first on MarkTechPost.

]]>

Masked diffusion has emerged as a promising alternative to autoregressive models for the generative modeling of discrete data. Despite its potential, existing research has been constrained by overly complex model formulations and ambiguous relationships between different theoretical perspectives. These limitations have resulted in suboptimal parameterization and training objectives, often requiring ad hoc adjustments to address inherent challenges. Diffusion models have rapidly evolved since their inception, becoming a dominant approach for generative media and achieving state-of-the-art performance across various domains. Significant breakthroughs have been particularly notable in image synthesis, audio generation, and video production, demonstrating the transformative potential of this innovative modeling technique.

The researchers from Google DeepMind focus on masked (or absorbing) diffusions, a discrete diffusion framework introduced in “Structured Denoising Diffusion Models in Discrete State-Spaces”, and subsequently explored from multiple perspectives. By adopting a continuous-time approach that has been instrumental in advancing continuous state space diffusions, the study aims to enhance the understanding and performance of discrete data generation models. The research presents several key technical contributions designed to simplify model training and significantly improve performance. The primary objectives include establishing robust properties of the forward process, developing a simplified Evidence Lower Bound (ELBO) expression, and creating a unified theoretical framework that critically examines existing continuous-time discrete diffusion models.

The researchers introduce a unique approach to masked diffusion within a finite discrete state space. By augmenting the original state space with an additional mask state, they define a forward “masking” process that transforms data points into a mask state at random times. The discrete-time framework divides the interval [0, 1] into discrete segments, with a transition matrix governing state changes. Each transition probability determines whether a state remains unchanged or jumps to the mask state. By taking the limit of this discrete process, the researchers develop a continuous-time forward process that enables more sophisticated modeling of data evolution. This approach provides a flexible and mathematically rigorous method for the generative modeling of discrete data.
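As a minimal sketch of the forward "masking" process in its continuous-time limit: each token independently remains unmasked at time t with probability given by a schedule α(t) and otherwise has jumped to the absorbing mask state. A linear schedule is assumed here for illustration.

```python
import random

MASK = "<mask>"

def alpha(t):
    # masking schedule: probability a token is still unmasked at time t in [0, 1]
    # (linear schedule, assumed for illustration)
    return 1.0 - t

def forward_mask(tokens, t, rng):
    # each token independently stays put with probability alpha(t);
    # otherwise it has jumped to the absorbing mask state by time t
    return [tok if rng.random() < alpha(t) else MASK for tok in tokens]

rng = random.Random(0)
print(forward_mask(["the", "cat", "sat"], 0.5, rng))
```

At t = 0 nothing is masked and at t = 1 everything is, matching the endpoints of the forward process described above.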

The researchers develop a generative model by defining a reverse process that approximately reverses the forward transitions. They introduce a mean-parameterization approach where a neural network predicts the probability distribution of the original data point. The model uses a softmax-applied neural network to generate probability vectors, with a unique constraint that the mask state cannot be predicted as the clean data. The objective function is derived as an ELBO, which provides a lower bound of the log marginal likelihood. By taking a continuous-time limit, the researchers demonstrate that the objective can be expressed as an integral of cross-entropy losses. Importantly, they show that the objective exhibits invariance properties similar to continuous state-space diffusion models, with the signal-to-noise ratio playing a crucial role in the formulation.
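A hedged Monte Carlo sketch of that objective: sample a time, mask the sequence according to the schedule, and accumulate a schedule-derived weight times the cross-entropy at the masked positions. The linear schedule and the weight −α′(t)/(1 − α(t)) are our paraphrase for illustration, not the paper's verbatim loss.

```python
import math
import random

MASK = -1
rng = random.Random(0)

def alpha(t):            # linear masking schedule (illustrative assumption)
    return 1.0 - t

def alpha_prime(t):      # d(alpha)/dt for the linear schedule
    return -1.0

def mc_loss(x, probs, n_samples=2000):
    # Monte Carlo estimate of an integral over t of weighted cross-entropy
    # at the masked positions; probs is the model's (here: fixed) prediction.
    total = 0.0
    for _ in range(n_samples):
        t = 0.01 + 0.99 * rng.random()   # keep t away from 0 to tame variance
        weight = -alpha_prime(t) / (1.0 - alpha(t))
        z = [tok if rng.random() < alpha(t) else MASK for tok in x]
        ce = sum(-math.log(probs[tok]) for tok, zt in zip(x, z) if zt == MASK)
        total += weight * ce
    return total / n_samples

# a uniform predictor over a 4-token vocabulary
est = mc_loss([0, 1, 2], [0.25, 0.25, 0.25, 0.25])
```

For this toy setup the per-sample expectation works out to len(x)·log 4 ≈ 4.16 independently of t, so the estimate concentrates there; that t-independence echoes the invariance property of the objective noted above.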

Researchers explore sampling strategies for their discrete-time reverse process, focusing on generation and conditional generation techniques. They discover that ancestral sampling yields slightly higher sample quality compared to alternative methods like Euler discretization. For conditional generation tasks such as infilling, they recommend keeping conditioning tokens unmasked throughout the generation process. A critical finding involves the impact of time discretization on sample quality, particularly when using different masking schedules. By switching from a linear to a cosine schedule, they dramatically improved the Fréchet Inception Distance (FID) score on ImageNet 64×64 from 70 to 17 using 256 steps. The researchers hypothesize that the cosine schedule’s success stems from its ability to utilize information redundancy, making remaining tokens more predictable and reducing unmasking conflicts during generation.
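The schedule swap is easy to make concrete. The cosine form below is one common parameterization, assumed here for illustration; the point is how differently it spreads token unmasking across the 256 reverse steps compared to a linear schedule.

```python
import math

def linear_alpha(t):
    # fraction of tokens still unmasked at time t
    return 1.0 - t

def cosine_alpha(t):
    # a common cosine parameterization (assumed form, for illustration)
    return math.cos(0.5 * math.pi * t)

def unmasked_per_step(alpha, steps=256):
    # expected fraction of tokens revealed on each reverse step;
    # entry i covers t in [i/steps, (i+1)/steps], so small i is late
    # in the reverse process (which runs t: 1 -> 0)
    ts = [i / steps for i in range(steps + 1)]
    return [alpha(ts[i]) - alpha(ts[i + 1]) for i in range(steps)]

lin = unmasked_per_step(linear_alpha)
cos_ = unmasked_per_step(cosine_alpha)
```

Both schedules reveal all tokens in total, but under this parameterization the cosine schedule concentrates unmasking early in the reverse process and reveals very few tokens in the final low-t steps, changing which tokens must be predicted together.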

The researchers conducted comprehensive experiments on text and image modeling to validate their masked diffusion approach. For text experiments, they utilized two datasets: text8 (character-level text from Wikipedia) and OpenWebText. They introduced two model variants: MD4 (Masked Discrete Diffusion for Discrete Data) and GenMD4 (generalized state-dependent model). On OpenWebText, their GPT-2 small and medium models outperformed previous discrete diffusion models across five benchmark datasets, demonstrating superior zero-shot perplexity performance. The models consistently achieved better results than GPT-2, with particularly strong performance across tasks like WikiText2, Penn Treebank, and One Billion Words. Notably, the researchers observed faster model convergence and more stable training compared to previous approaches.

To sum up, this study emphasizes the key contributions of the masked diffusion approach proposed by the researchers. They address the complexity and accessibility challenges in existing masked diffusion models by developing a flexible continuous-time formulation with a remarkably simple Evidence Lower Bound expression. By presenting a weighted integral of cross-entropy losses, they simplify the optimization process that previously hindered model performance. The researchers introduced two model variants: MD4 and GenMD4, with the latter offering a state-dependent masking schedule. Their experimental results demonstrate significant improvements across different domains. On text data, MD4 outperformed existing discrete and continuous diffusion models, while in pixel-level image modeling, the approach achieved competitive likelihoods comparable to continuous diffusion models and surpassed similar-sized autoregressive models. The generalized model, GenMD4, further enhanced likelihood performance, showcasing the potential of state-dependent diffusion techniques.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 60k+ ML SubReddit.


The post Beyond the Mask: A Comprehensive Study of Discrete Diffusion Models appeared first on MarkTechPost.

]]>
https://www.marktechpost.com/2024/12/15/beyond-the-mask-a-comprehensive-study-of-discrete-diffusion-models/feed/ 0 66398
Alibaba Qwen Researchers Introduced ProcessBench: A New AI Benchmark for Measuring the Ability to Identify Process Errors in Mathematical Reasoning https://www.marktechpost.com/2024/12/14/alibaba-qwen-researchers-introduced-processbench-a-new-ai-benchmark-for-measuring-the-ability-to-identify-process-errors-in-mathematical-reasoning/ https://www.marktechpost.com/2024/12/14/alibaba-qwen-researchers-introduced-processbench-a-new-ai-benchmark-for-measuring-the-ability-to-identify-process-errors-in-mathematical-reasoning/#respond Sat, 14 Dec 2024 19:47:26 +0000 https://www.marktechpost.com/?p=66368 According to recent research by multiple scholars, language models have demonstrated remarkable advancements in complex reasoning tasks, including mathematics and programming. Despite these significant improvements, these models continue to encounter challenges when addressing particularly difficult problems. The emerging field of scalable oversight seeks to develop effective supervision methods for artificial intelligence systems that approach or […]

The post Alibaba Qwen Researchers Introduced ProcessBench: A New AI Benchmark for Measuring the Ability to Identify Process Errors in Mathematical Reasoning appeared first on MarkTechPost.

]]>

According to recent research by multiple scholars, language models have demonstrated remarkable advancements in complex reasoning tasks, including mathematics and programming. Despite these significant improvements, these models continue to encounter challenges when addressing particularly difficult problems. The emerging field of scalable oversight seeks to develop effective supervision methods for artificial intelligence systems that approach or surpass human-level performance. Researchers anticipate that language models can potentially identify errors within their own reasoning processes automatically. However, existing evaluation benchmarks face critical limitations, with some problem sets becoming less challenging for advanced models and others providing only binary correctness assessments without detailed error annotations. This gap highlights the need for more nuanced and comprehensive evaluation frameworks that can thoroughly examine the reasoning mechanisms of sophisticated language models.

Several benchmark datasets have emerged to assess language models’ reasoning processes, each contributing unique insights into error identification and solution critique. CriticBench focuses on evaluating language models’ capabilities to critique solutions and rectify mistakes across various reasoning domains. MathCheck utilizes the GSM8K dataset to synthesize solutions with intentional errors, challenging models to identify incorrect reasoning steps and final answers. The PRM800K benchmark, built upon MATH problems, provides comprehensive annotations for reasoning step correctness and soundness, generating significant research interest in process reward models. These benchmarks represent critical advances in understanding and improving the error-detection capabilities of language models, offering increasingly sophisticated methods to evaluate their reasoning mechanisms.

Qwen Team and Alibaba Inc. researchers introduce PROCESSBENCH, a robust benchmark designed to measure language models’ capabilities in identifying erroneous steps within mathematical reasoning. This benchmark distinguishes itself through three key design principles: problem difficulty, solution diversity, and comprehensive evaluation. PROCESSBENCH specifically targets competition and Olympiad-level mathematical problems, utilizing multiple open-source language models to generate solutions that demonstrate varied solving approaches. The benchmark comprises 3,400 test cases, each meticulously annotated by multiple human experts to ensure high data quality and evaluation reliability. Unlike previous benchmarks, PROCESSBENCH adopts a straightforward evaluation protocol that requires models to pinpoint the earliest erroneous step in a solution, making it adaptable for different model types, including process reward models and critic models. This approach provides a robust framework for assessing reasoning error detection capabilities.
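The protocol reduces to exact matching on the earliest erroneous step index. A tiny scoring sketch follows; the −1 encoding of "solution is fully correct" is our convention, not the benchmark's literal format.

```python
def earliest_error_correct(predicted_step, labeled_step):
    # a critique counts only if it pinpoints the earliest erroneous step;
    # -1 encodes "no error in the solution" (assumed convention)
    return predicted_step == labeled_step

def accuracy(predictions, labels):
    hits = sum(earliest_error_correct(p, l) for p, l in zip(predictions, labels))
    return hits / len(labels)
```

Because only the earliest error counts, a model that flags a later (even genuinely flawed) step still scores zero on that case, which is what makes the protocol stricter than binary answer checking.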

The researchers developed PROCESSBENCH through a meticulous process of problem curation, solution generation, and expert annotation. They collected mathematical problems from four established datasets: GSM8K, MATH, OlympiadBench, and Omni-MATH, ensuring a comprehensive range of problem difficulties from grade school to competition level. Solutions were generated using open-source models from the Qwen and LLaMA series, creating twelve distinct solution generators to maximize solution diversity. To address inconsistencies in solution step formatting, the team implemented a reformatting method using Qwen2.5-72B-Instruct to standardize step granularity, ensuring logically complete and progressive reasoning steps. This approach helped maintain solution content integrity while creating a more uniform annotation framework for subsequent expert evaluation.

The evaluation results of PROCESSBENCH revealed several critical insights into the performance of process reward models (PRMs) and critic models across different mathematical problem difficulties. As problem complexity increased from GSM8K and MATH to OlympiadBench and Omni-MATH, a consistent performance decline was observed across all models, highlighting significant generalization challenges. Existing PRMs demonstrated notably weaker performance compared to top prompt-driven critic models, particularly on simpler problem sets. The research uncovered fundamental limitations in current PRM development methodologies, which often rely on estimating step correctness based on final answer probabilities. These approaches inherently struggle with the nuanced nature of mathematical reasoning, especially when models can reach correct answers through flawed intermediate steps. The study emphasized the critical need for more robust error identification strategies to accurately assess the reasoning process beyond the correctness of the final answer.

This research introduces PROCESSBENCH as a pioneering benchmark for assessing language models’ capabilities in identifying mathematical reasoning errors. By integrating high-difficulty problems, diverse solution generation, and rigorous human expert annotation, the benchmark provides a comprehensive framework for evaluating error detection mechanisms. The study’s key findings highlight significant challenges in current process reward models, particularly their limited ability to generalize across varying problem complexities. Also, the research reveals an emerging landscape of open-source language models that are progressively approaching the performance of proprietary models in critical reasoning and error identification tasks. These insights underscore the importance of developing more sophisticated methodologies for understanding and improving artificial intelligence’s reasoning processes.


Check out the Paper, GitHub Page, and Data on Hugging Face. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 60k+ ML SubReddit.


The post Alibaba Qwen Researchers Introduced ProcessBench: A New AI Benchmark for Measuring the Ability to Identify Process Errors in Mathematical Reasoning appeared first on MarkTechPost.

]]>
https://www.marktechpost.com/2024/12/14/alibaba-qwen-researchers-introduced-processbench-a-new-ai-benchmark-for-measuring-the-ability-to-identify-process-errors-in-mathematical-reasoning/feed/ 0 66368
Best-of-N Jailbreaking: A Multi-Modal AI Approach to Identifying Vulnerabilities in Large Language Models https://www.marktechpost.com/2024/12/13/best-of-n-jailbreaking-a-multi-modal-ai-approach-to-identifying-vulnerabilities-in-large-language-models/ https://www.marktechpost.com/2024/12/13/best-of-n-jailbreaking-a-multi-modal-ai-approach-to-identifying-vulnerabilities-in-large-language-models/#respond Fri, 13 Dec 2024 12:00:00 +0000 https://www.marktechpost.com/?p=66318 The advancement of AI model capabilities raises significant concerns about potential misuse and security risks. As artificial intelligence systems become more sophisticated and support diverse input modalities, the need for robust safeguards has become paramount. Researchers have identified critical threats, including the potential for cybercrime, biological weapon development, and the spread of harmful misinformation. Multiple […]

The post Best-of-N Jailbreaking: A Multi-Modal AI Approach to Identifying Vulnerabilities in Large Language Models appeared first on MarkTechPost.

]]>

The advancement of AI model capabilities raises significant concerns about potential misuse and security risks. As artificial intelligence systems become more sophisticated and support diverse input modalities, the need for robust safeguards has become paramount. Researchers have identified critical threats, including the potential for cybercrime, biological weapon development, and the spread of harmful misinformation. Multiple studies from leading AI research organizations highlight the substantial risks associated with inadequately protected AI systems. Jailbreaks, maliciously designed inputs aimed at circumventing safety measures, pose particularly serious challenges. Consequently, the academic and technological communities are exploring automated red-teaming methods to comprehensively evaluate and enhance model safety across different input modalities.

Research on LLM jailbreaks has revealed diverse methodological approaches to identifying and exploiting system vulnerabilities. Various studies have explored different strategies for eliciting jailbreaks, including decoding variations, fuzzing techniques, and optimization of target log probabilities. Researchers have developed methods that range from gradient-dependent approaches to modality-specific augmentations, each addressing unique challenges in AI system security. Recent investigations have demonstrated the versatility of LLM-assisted attacks, utilizing language models themselves to craft sophisticated breach strategies. The research landscape encompasses a wide range of techniques, from manual red-teaming to genetic algorithms, highlighting the complex nature of identifying and mitigating potential security risks in advanced AI systems.

Researchers from Speechmatics, MATS, UCL, Stanford University, University of Oxford, Tangentic, and Anthropic introduce Best-of-N (BoN) Jailbreaking, a sophisticated black-box automated red-teaming method capable of supporting multiple input modalities. This innovative approach repeatedly samples augmentations to prompts, seeking to trigger harmful responses across different AI systems. Experiments demonstrated remarkable effectiveness, with BoN achieving an attack success rate of 78% on Claude 3.5 Sonnet using 10,000 augmented samples, and surprisingly, 41% success with just 100 augmentations. The method’s versatility extends beyond text, successfully jailbreaking six state-of-the-art vision language models by manipulating image characteristics and four audio language models by altering audio parameters. Importantly, the research uncovered a power-law-like scaling behavior, suggesting that computational resources can be strategically utilized to increase the likelihood of identifying system vulnerabilities.

BoN Jailbreaking emerges as a sophisticated black-box algorithm designed to exploit AI model vulnerabilities through strategic input manipulation. The method systematically applies modality-specific augmentations to harmful requests, ensuring the original intent remains recognizable. Augmentation techniques include random capitalization for text inputs, background modifications for images, and audio pitch alterations. The algorithm generates multiple variations of each request, evaluates the model’s response using GPT-4o and the HarmBench grader prompt, and classifies outputs for potential harmfulness. To assess effectiveness, researchers employed the Attack Success Rate (ASR) across 159 direct requests from the HarmBench test dataset, carefully scrutinizing potential jailbreaks through manual review. The methodology ensures comprehensive evaluation by considering even partially harmful responses as potential security breaches.
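The sampling loop itself is simple; below is a sketch with random capitalization as the text augmentation. `query_model` and `grade_harmful` are hypothetical stand-ins for the target model and a HarmBench-style grader, not real APIs.

```python
import random

def augment(prompt, rng):
    # random capitalization: one of the text augmentations described above
    return "".join(c.upper() if rng.random() < 0.5 else c.lower() for c in prompt)

def best_of_n(prompt, query_model, grade_harmful, n, seed=0):
    # repeatedly sample augmented prompts, stopping at the first variant
    # whose response the grader marks harmful
    rng = random.Random(seed)
    for i in range(1, n + 1):
        variant = augment(prompt, rng)
        if grade_harmful(query_model(variant)):
            return i, variant          # samples used, successful variant
    return None, None                  # no jailbreak within the budget

# toy demo: the "model" echoes the prompt and the "grader" flags a capital X
n_used, variant = best_of_n("ax", lambda p: p, lambda r: "X" in r, n=50)
```

In the toy demo the model and grader are trivial, but the loop structure is the same for a real target and grader, and the returned sample count is what underlies the power-law analysis of attack success.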

The research comprehensively evaluated BoN Jailbreaking across text, vision, and audio domains, achieving an impressive 70% ASR averaged across multiple models and modalities. In text language models, BoN demonstrated remarkable effectiveness, successfully breaching safeguards of leading AI models including Claude 3.5 Sonnet, GPT-4o, and Gemini models. Notably, the method achieved ASRs over 50% on all eight tested models, with Claude Sonnet experiencing a staggering 78% breach rate. Vision language model tests revealed lower but still significant success rates, ranging from 25% to 88% across different models. Audio language model experiments were particularly striking, with BoN achieving high ASRs between 59% and 87% across Gemini, GPT-4o, and DiVA models, highlighting the vulnerability of AI systems across diverse input modalities.

This research introduces Best-of-N Jailbreaking as an innovative algorithm capable of bypassing safeguards in frontier Large Language Models across multiple input modalities. By employing repeated sampling of augmented prompts, BoN successfully achieves high Attack Success Rates on leading AI models such as Claude 3.5 Sonnet, Gemini Pro, and GPT-4o. The method demonstrates a power-law scaling behavior that can predict attack success rates over an order of magnitude, and its effectiveness can be further amplified by combining it with techniques like Modality-Specific Jailbreaking (MSJ). Fundamentally, the study underscores the significant challenges in securing AI models with stochastic outputs and continuous input spaces, presenting a simple yet scalable black-box approach to identifying and exploiting vulnerabilities in state-of-the-art language models.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 60k+ ML SubReddit.


The post Best-of-N Jailbreaking: A Multi-Modal AI Approach to Identifying Vulnerabilities in Large Language Models appeared first on MarkTechPost.

]]>
https://www.marktechpost.com/2024/12/13/best-of-n-jailbreaking-a-multi-modal-ai-approach-to-identifying-vulnerabilities-in-large-language-models/feed/ 0 66318
Latent Functional Maps: A Robust Machine Learning Framework for Analyzing Neural Network Representations https://www.marktechpost.com/2024/12/10/latent-functional-maps-a-robust-machine-learning-framework-for-analyzing-neural-network-representations/ https://www.marktechpost.com/2024/12/10/latent-functional-maps-a-robust-machine-learning-framework-for-analyzing-neural-network-representations/#respond Tue, 10 Dec 2024 18:16:06 +0000 https://www.marktechpost.com/?p=66219 Neural networks (NNs) remarkably transform high-dimensional data into compact, lower-dimensional latent spaces. While researchers traditionally focus on model outputs like classification or generation, understanding the internal representation geometry has emerged as a critical area of investigation. These internal representations offer profound insights into neural network functionality, enabling researchers to repurpose learned features for downstream tasks […]

The post Latent Functional Maps: A Robust Machine Learning Framework for Analyzing Neural Network Representations appeared first on MarkTechPost.

]]>

Neural networks (NNs) remarkably transform high-dimensional data into compact, lower-dimensional latent spaces. While researchers traditionally focus on model outputs like classification or generation, understanding the internal representation geometry has emerged as a critical area of investigation. These internal representations offer profound insights into neural network functionality, enabling researchers to repurpose learned features for downstream tasks and compare different models’ structural properties. The exploration of these representations provides a deeper understanding of how neural networks process and encode information, revealing underlying patterns that transcend individual model architectures.

Comparing representations learned by neural models is crucial across various research domains, from representation analysis to latent space alignment. Researchers have developed multiple methodologies to measure similarity between different spaces, ranging from functional performance matching to representational space comparisons. Canonical Correlation Analysis (CCA) and its adaptations, such as Singular Vector Canonical Correlation Analysis (SVCCA) and Projection-Weighted Canonical Correlation Analysis (PWCCA), have emerged as classical statistical methods for this purpose. Centered Kernel Alignment (CKA) offers another approach to measure latent space similarities, though recent studies have highlighted its sensitivity to local shifts, indicating the need for more robust analytical techniques.
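Linear CKA, the baseline these similarity analyses are usually compared against, is simple to compute. The sketch below is a generic reference implementation of the standard linear CKA formula (a Frobenius-norm ratio over centered feature matrices), not code from the paper:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X (n x d1) and Y (n x d2)
    of the same n samples. Returns a similarity in [0, 1]."""
    # Center each feature dimension across samples
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") *
                    np.linalg.norm(Y.T @ Y, "fro"))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either space; its sensitivity to local shifts, noted above, is what motivates more robust alternatives.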

Researchers from IST Austria and Sapienza University of Rome have pioneered a robust approach to understanding neural network representations by shifting from sample-level relationships to modeling mappings between function spaces. The proposed method, Latent Functional Map (LFM), utilizes spectral geometry principles to provide a comprehensive framework for representational alignment. By applying functional map techniques originally developed for 3D geometry processing and graph applications, LFM offers a flexible tool for comparing and finding correspondences across distinct representational spaces. This approach enables unsupervised and weakly supervised transfer of information between different neural network representations, a significant advance in understanding the intrinsic structures of learned latent spaces.

The LFM construction involves three critical steps: constructing a graph representation of the latent space, encoding preserved quantities through descriptor functions, and optimizing the functional map between different representational spaces. By building a symmetric k-nearest-neighbor graph, the method captures the underlying manifold geometry, allowing for a nuanced exploration of neural network representations. The technique can handle latent spaces of arbitrary dimensions and provides a flexible tool for comparing and transferring information across different neural network models.
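The three steps above can be sketched in a few dozen lines. Everything below — the plain k-NN graph, the unnormalized graph Laplacian, and a least-squares solve for the map — is an illustrative simplification under assumed design choices (values of `k` and `m`, function names), not the authors' implementation:

```python
import numpy as np

def knn_graph(Z, k=5):
    """Step 1: symmetric k-nearest-neighbor adjacency over latent samples Z (n x d)."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # exclude self-edges
    idx = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros_like(d2)
    rows = np.repeat(np.arange(len(Z)), k)
    A[rows, idx.ravel()] = 1.0
    return np.maximum(A, A.T)               # symmetrize

def laplacian_basis(A, m=10):
    """Spectral basis: first m eigenvectors of the (unnormalized) graph Laplacian."""
    L = np.diag(A.sum(1)) - A
    _, V = np.linalg.eigh(L)                # eigenvalues in ascending order
    return V[:, :m]

def functional_map(Phi_src, Phi_tgt, F_src, F_tgt):
    """Steps 2-3: express descriptor functions F_* (n x q), assumed to
    correspond across spaces, in each spectral basis, then solve
    least-squares for the m x m map C with C @ A ~= B."""
    A = Phi_src.T @ F_src                   # source descriptors in basis (m x q)
    B = Phi_tgt.T @ F_tgt
    C, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return C.T
```

When the two spaces (and descriptors) coincide, the recovered map is the identity, which is a quick sanity check on the pipeline.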

The LFM similarity measure demonstrates remarkable robustness compared to the widely used CKA method. While CKA is sensitive to local transformations that preserve linear separability, the LFM approach maintains stability across various perturbations. Experimental results reveal that LFM similarity remains consistently high even as input spaces undergo significant changes, in contrast to CKA's performance degradation. Visualization techniques, including t-SNE projections, highlight the method's ability to localize distortions and maintain semantic integrity, particularly in challenging classification tasks involving complex data representations.

The research introduces Latent Functional Maps as an innovative approach to understanding and analyzing neural network representations. The method provides a comprehensive framework for comparing and aligning latent spaces across different models by applying spectral geometry principles. The approach demonstrates significant potential in addressing critical challenges in representation learning, offering a robust methodology for finding correspondences and transferring information with minimal anchor points. This innovative technique extends the functional map framework to high-dimensional spaces, presenting a versatile tool for exploring the intrinsic structures and relationships between neural network representations.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 60k+ ML SubReddit.



Voyage AI Introduces voyage-code-3: A New Next-Generation Embedding Model Optimized for Code Retrieval https://www.marktechpost.com/2024/12/09/voyage-ai-introduces-voyage-code-3-a-new-next-generation-embedding-model-optimized-for-code-retrieval/ Mon, 09 Dec 2024 15:47:11 +0000

The post Voyage AI Introduces voyage-code-3: A New Next-Generation Embedding Model Optimized for Code Retrieval appeared first on MarkTechPost.


Research in code embedding models has witnessed a significant breakthrough with the introduction of voyage-code-3, an advanced embedding model from Voyage AI designed specifically for code retrieval tasks. The model demonstrates remarkable performance, substantially outperforming existing state-of-the-art solutions like OpenAI-v3-large and CodeSage-large. Empirical evaluations across a comprehensive suite of 238 code retrieval datasets reveal that voyage-code-3 achieves impressive average performance improvements of 13.80% and 16.81% over these competing models, respectively, highlighting its potential to revolutionize code search and retrieval technologies.

The development of voyage-code-3 introduces innovative approaches to address the computational challenges in vector-based search, particularly for extensive code repositories. Matryoshka embeddings and advanced quantization techniques emerge as critical strategies to mitigate storage and search costs. The model tackles the linear scalability challenge by supporting lower-dimensional embeddings and implementing binary and int8 quantization methods. These technological advancements enable significant cost reductions while maintaining robust retrieval performance, presenting a transformative solution for large-scale code search and management systems.
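The storage arithmetic behind these savings is easy to verify: a 3072-dimensional float32 vector occupies 12,288 bytes, while a 256-dimensional binary code fits in 32 bytes — a 384x reduction. Voyage AI has not published implementation details, so the sketch below is only a generic illustration of Matryoshka-style truncation plus binary and int8 quantization:

```python
import numpy as np

def truncate_and_normalize(emb, dim):
    """Matryoshka-style shortening: keep the first `dim` coordinates,
    then re-normalize so cosine similarity remains meaningful."""
    v = emb[..., :dim]
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def to_binary(emb):
    """1 bit per dimension: the sign of each coordinate, stored as 0/1."""
    return (emb > 0).astype(np.uint8)

def to_int8(emb, scale=127.0):
    """int8 quantization of unit-norm embeddings (coordinates in [-1, 1])."""
    return np.clip(np.round(emb * scale), -127, 127).astype(np.int8)

def hamming_score(q_bits, doc_bits):
    """Similarity for binary codes: number of matching bits (higher = closer)."""
    return (q_bits == doc_bits).sum(axis=-1)
```

In practice the 0/1 codes would be bit-packed (e.g. `np.packbits`) to realize the full 32x storage saving over float32.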

The landscape of code retrieval represents a complex domain with multifaceted challenges that extend beyond traditional text search methodologies. Unique computational demands arise from the intricate nature of programming languages, requiring sophisticated algorithmic reasoning and a nuanced understanding of syntax structures. Code retrieval encompasses diverse subtasks, including text-to-code, code-to-code, and docstring-to-code retrievals, each demanding precise semantic comprehension and advanced matching capabilities. These sophisticated retrieval scenarios necessitate advanced embedding models capable of capturing intricate programmatic relationships and context-specific nuances.

The evaluation of voyage-code-3 represents a rigorous and methodical approach to assessing code embedding model performance, addressing critical limitations in existing benchmarking practices. Researchers developed a comprehensive evaluation framework that goes beyond traditional assessment methods, recognizing the inherent challenges in existing datasets. By identifying and mitigating issues such as noisy labels and potential data contamination, the study aimed to create a more robust and realistic assessment of code retrieval capabilities. The evaluation strategy incorporated diverse tasks, including text-to-code and code-to-code retrievals, and utilized repurposed question-answer datasets to provide a more nuanced and comprehensive understanding of the model’s capabilities.

The experimental results of voyage-code-3 demonstrate substantial performance gains across various dimensional configurations and storage cost scenarios. At 1024 and 256 dimensions, the model outperforms OpenAI-v3-large by 14.64% and 17.66%, respectively, showcasing impressive retrieval capabilities. Moreover, the model achieves a 13.80% performance improvement while utilizing only one-third of the original storage costs, comparing 1024 and 3072 dimensions. In an even more remarkable achievement, voyage-code-3 maintains a 4.81% performance advantage at an extraordinary storage cost reduction of 1/384, comparing binary 256-dimensional embeddings with float 3072-dimensional embeddings. The introduction of binary rescoring techniques further enhances retrieval quality, potentially yielding up to a 4.25% improvement when applied to standard binary retrieval methods.
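Binary rescoring of the kind described above follows a common two-stage pattern: retrieve a shortlist cheaply with Hamming similarity over binary codes, then re-rank that shortlist with full-precision scores. The code below is an illustrative pipeline under that assumption, not Voyage AI's implementation:

```python
import numpy as np

def binary_retrieve_then_rescore(query_f, docs_f, k=10, shortlist=100):
    """Two-stage retrieval: a cheap binary Hamming pass over all documents,
    then exact float rescoring of a small candidate shortlist.
    query_f: (d,) float query embedding; docs_f: (n, d) float doc embeddings."""
    q_b = (query_f > 0)
    d_b = (docs_f > 0)
    # Stage 1: Hamming similarity (matching sign bits) against every document
    matches = (d_b == q_b).sum(axis=1)
    cand = np.argsort(-matches)[:shortlist]
    # Stage 2: exact float dot products, computed only on the shortlist
    scores = docs_f[cand] @ query_f
    order = np.argsort(-scores)[:k]
    return cand[order]
```

The expensive float arithmetic touches only `shortlist` documents instead of all `n`, which is how a few percent of retrieval quality can be bought back at a tiny fraction of full-precision search cost.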

Voyage-code-3 emerges as an innovative embedding model that sets new benchmarks in code retrieval technology. The model demonstrates exceptional performance, significantly surpassing existing solutions like OpenAI-v3-large and CodeSage-large across a comprehensive suite of 238 code retrieval datasets. With impressive average performance improvements of 13.80% and 16.81%, respectively, voyage-code-3 represents a significant leap forward in embedding model capabilities. Its versatile design supports multiple embedding dimensions ranging from 256 to 2048, providing users with unprecedented flexibility in balancing retrieval quality and computational efficiency.




