Camel-AI Open Sourced OASIS: A Next Generation Simulator for Realistic Social Media Dynamics with One Million Agents

Social media platforms have revolutionized human interaction, creating dynamic environments where millions of users exchange information, form communities, and influence one another. These platforms, including X and Reddit, are not just tools for communication but have become critical ecosystems for understanding modern societal behaviors. Simulating such intricate interactions is vital for studying misinformation, group polarization, and herd behavior. Computational models provide researchers with a cost-effective and scalable way to analyze these interactions without conducting resource-intensive real-world experiments. However, creating models that replicate the scale and complexity of social networks remains a significant challenge.

The primary issue in modeling social media is capturing millions of users’ diverse behaviors and interactions in a dynamic network. Traditional agent-based models (ABMs) fall short of representing complex behaviors like context-driven decision-making or the influence of dynamic recommendation algorithms. Also, existing models are often limited to small-scale simulations, typically involving only hundreds or thousands of agents, which restricts their ability to mimic large-scale social systems. Such constraints hinder researchers from fully exploring phenomena like how misinformation spreads or how group dynamics evolve in online environments. These limitations highlight the need for more advanced and scalable simulation tools.

Existing methods for simulating social media interactions often lack essential features like dynamic user networks, detailed recommendation systems, and real-time updates. For instance, most ABMs rely on pre-programmed agent behaviors, which fail to reflect the nuanced decision-making seen in real-world users. Also, current simulators are typically platform-specific, designed to study isolated phenomena, making them impractical for broader applications. They often cannot scale beyond a few thousand agents, leaving researchers unable to examine the behaviors of millions of users interacting simultaneously. The absence of scalable, versatile models has been a major bottleneck in advancing social media research.

Researchers from Camel-AI, Shanghai Artificial Intelligence Laboratory, Dalian University of Technology, Oxford, KAUST, Fudan University, Xi’an Jiaotong University, Imperial College London, Max Planck Institute, and The University of Sydney developed OASIS, a next-generation social media simulator designed for scalability and adaptability, to address these challenges. OASIS is built upon modular components, including an Environment Server, Recommendation System (RecSys), Time Engine, and Agent Module. It supports up to one million agents, making it one of the most comprehensive simulators of its kind. This system incorporates dynamically updated networks, diverse action spaces, and advanced algorithms to replicate real-world social media dynamics. By integrating data-driven methods and open-source frameworks, OASIS provides a flexible platform for studying phenomena across platforms like X and Reddit, enabling researchers to explore topics ranging from information propagation to herd behavior.

The architecture of OASIS emphasizes both scale and functionality. Its key components function as follows:

  • Its Environment Server is the backbone, storing detailed user profiles, historical interactions, and social connections.
  • The Recommendation System customizes content visibility using advanced algorithms such as TwHIN-BERT, which processes user interests and recent activities to rank posts. 
  • The Time Engine governs user activation based on hourly probabilities, simulating realistic online behavior patterns. 

These components work together to create a simulation environment that can adapt to different platforms and scenarios. Switching from X to Reddit requires minimal module adjustments, making OASIS a versatile tool for social media research. Its distributed computing infrastructure ensures efficient handling of large-scale simulations, even with up to one million agents.
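
To make the Time Engine concrete, here is a minimal Python sketch of hourly-probability activation. The agent count, probability ranges, and function names are illustrative assumptions, not taken from the OASIS codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_AGENTS = 1_000   # scaled down from OASIS's one million agents
HOURS = 24

# Hypothetical per-agent hourly activation probabilities; the real
# simulator would derive these from platform activity statistics.
hourly_profile = rng.uniform(0.01, 0.10, size=(NUM_AGENTS, HOURS))

def agents_active_at(hour: int) -> np.ndarray:
    """Sample which agents wake up in a given simulated hour."""
    p = hourly_profile[:, hour % HOURS]
    return np.nonzero(rng.random(NUM_AGENTS) < p)[0]

for hour in range(3):
    print(f"hour {hour}: {agents_active_at(hour).size} agents activated")
```

The same loop structure scales to the full simulator: in each simulated hour, only the sampled agents query the recommendation system and act, keeping per-step cost proportional to the active population rather than the total one.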

In experiments modeling information propagation on X, OASIS achieved a normalized RMSE of approximately 30%, demonstrating its ability to align with actual dissemination trends. The simulator also replicated group polarization, showing that agents tend to adopt more extreme opinions during interactions. This effect was particularly pronounced in uncensored models, where agents used more extreme language. Moreover, OASIS revealed unique insights, such as the herd effect being more evident in agents than in humans. Agents consistently followed negative trends when exposed to down-treated comments, while humans displayed a stronger critical approach. These findings underscore the simulator’s potential to uncover both expected and novel patterns in social behavior.
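
As a reference for the headline metric, one common definition of normalized RMSE divides the raw RMSE by the range of the observed series; the paper's exact normalization may differ, so treat this as a plausible reading rather than the authors' code:

```python
import numpy as np

def normalized_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """RMSE divided by the range of the observed values."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())

# Toy propagation curves: cumulative repost counts over time.
observed  = np.array([0, 40, 120, 260, 400, 480, 510], dtype=float)
simulated = np.array([0, 55, 150, 230, 370, 500, 540], dtype=float)
print(f"normalized RMSE: {normalized_rmse(observed, simulated):.1%}")
```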

With OASIS, larger agent groups lead to richer and more diverse interactions. For example, when the number of agents increased from 196 to 10,196, the diversity and helpfulness of user responses improved significantly, with a 76.5% increase in perceived helpfulness. At an even larger scale of 100,196 agents, user interactions became more varied and meaningful, illustrating the importance of scalability in studying group behavior. Also, OASIS demonstrated that misinformation spreads more effectively than truthful information, particularly when rumors are emotionally provocative. The simulator also showed how isolated user groups form over time, providing valuable insights into the dynamics of online communities.

Key takeaways from the OASIS research include:

  1. OASIS can simulate up to one million agents, far surpassing the capabilities of existing models.
  2. It supports multiple platforms, including X and Reddit, with modular components that are easily adjustable.
  3. The simulator replicates phenomena like group polarization and herd behavior, providing a deeper understanding of these dynamics.
  4. OASIS achieved a normalized RMSE of 30% in information propagation experiments, closely aligning with real-world trends.
  5. It demonstrated that rumors spread faster and more widely than truthful information in large-scale simulations.
  6. Larger agent groups enhance the diversity and helpfulness of responses, emphasizing the importance of scale in social media studies.
  7. OASIS’s distributed computing allows for efficient handling of simulations, even with millions of agents.

In conclusion, OASIS is a breakthrough in simulating social media dynamics, offering scalability and adaptability. OASIS addresses the limitations of existing models and provides a robust framework for studying complex, large-scale interactions. By integrating LLMs with rule-based agents, it accurately mimics the behaviors of up to one million users across platforms like X and Reddit. Its ability to replicate complex phenomena, such as information propagation, group polarization, and herd effects, provides researchers with valuable insights into modern social ecosystems.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Collective Monte Carlo Tree Search (CoMCTS): A New Learning-to-Reason Method for Multimodal Large Language Models

Multimodal large language models (MLLMs) are advanced systems that process and understand multiple input forms, such as text and images. By interpreting these diverse inputs, they aim to reason through tasks and generate accurate outputs. However, MLLMs often fail at complex tasks because they lack structured processes to break problems into smaller steps and instead provide direct answers without clear intermediate reasoning. These limitations reduce the success and efficiency of MLLMs in solving intricate problems.

Traditional methods for reasoning in multimodal large language models (MLLMs) have many problems. Prompt-based methods, like Chain-of-Thought, use set steps to copy human reasoning but struggle with difficult tasks. Plan-based methods, like Tree-of-Thought or Graph-of-Thought, try to find reasoning paths but are not flexible or reliable. Learning-based methods, like Monte Carlo Tree Search (MCTS), are slow and do not encourage deep thinking. Most MLLMs rely on “direct prediction,” giving short answers without clear steps. Although MCTS works well in games and robotics, conventional MCTS is ill-suited to MLLMs, and prior methods do not use collective learning to build strong step-by-step reasoning. These issues make it hard for MLLMs to solve complex problems.

To mitigate these issues, a team of researchers from Nanyang Technological University, Tsinghua University, Baidu, and Sun Yat-sen University proposed CoMCTS, a framework to improve reasoning-path search in tree-search tasks. Instead of relying on one model, it combines multiple pre-trained models to expand and evaluate candidate paths. This approach differs from traditional methods because it uses a more efficient strategy: several models work together, allowing for better performance and reducing errors during the reasoning process.

CoMCTS consists of four key steps: Expansion, Simulation, Backpropagation, and Selection. In the Expansion step, several models search for different solutions simultaneously, increasing the variety of possible answers. In the Simulation step, incorrect or less effective paths are removed, narrowing the search. During the Backpropagation step, the models improve by learning from their past mistakes and using that knowledge to make better predictions. The last step uses a statistical method to choose the best action for the model to take. Reflective reasoning in this process helps the model learn from previous errors and make better decisions in similar tasks.
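
The four steps can be sketched as a single search iteration. Everything below is a toy stand-in: `propose` and `value` replace real MLLM calls, and the state is a plain string rather than a multimodal reasoning trace:

```python
import math
import random
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    state: str
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)
    visits: int = 1
    value_sum: float = 0.0

def propose(state: str, model_tag: str) -> str:
    # Stand-in for one model in the pool proposing the next reasoning step.
    return f"{state} -> {model_tag}.step{random.randint(0, 9)}"

def value(state: str) -> float:
    # Stand-in for the collective estimate of a partial reasoning path.
    return random.random()

def comcts_iteration(node: Node, model_tags: list, c: float = 1.0) -> Node:
    # 1) Expansion: every model in the pool proposes a candidate step.
    candidates = [propose(node.state, tag) for tag in model_tags]
    # 2) Simulation: prune candidates the value estimate scores poorly.
    for s in candidates:
        if value(s) > 0.5:
            node.children.append(Node(s, parent=node))
    # 3) Backpropagation: push each surviving child's value to the root.
    for child in node.children:
        v, n = value(child.state), child
        while n is not None:
            n.visits += 1
            n.value_sum += v
            n = n.parent
    # 4) Selection: UCB picks the child to expand next.
    def ucb(ch: Node) -> float:
        return (ch.value_sum / ch.visits
                + c * math.sqrt(math.log(node.visits) / ch.visits))
    return max(node.children, key=ucb) if node.children else node

root = Node("question")
leaf = comcts_iteration(root, ["gpt-4o", "qwen2-vl", "llama-3.2"])
print(leaf.state)
```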

The researchers created the Mulberry-260K dataset, which comprised 260K multimodal input questions, combining text instructions and images from various domains, including general multimodal understanding, mathematics, science, and medical image understanding. The dataset was constructed using CoMCTS, with training limited to 15K samples to avoid overabundance. The reasoning tasks required an average of 7.5 steps, with most tasks falling within the 6 to 8-step range. CoMCTS was implemented using four models: GPT-4o, Qwen2-VL-7B, LLaMA-3.2-11B-Vision-Instruct, and Qwen2-VL-72B. The training process involved a batch size of 128 and a learning rate of 1e-5 for two epochs.

The results demonstrated significant performance improvements over the baseline models, with gains of +4.2% and +7.5% for Qwen2-VL-7B and LLaMA-3.2-11B-Vision-Instruct, respectively. Additionally, the Mulberry models outperformed reasoning models like LLaVA-Reasoner-8B and Insight-V-8B, showing superior performance on various benchmarks. Upon evaluation, CoMCTS improved performance by 63.8%. The inclusion of reflective reasoning data led to slight further improvements in model performance. These results demonstrate the effectiveness of Mulberry-260K and CoMCTS in improving the accuracy and flexibility of reasoning.

In conclusion, CoMCTS proves to be an effective approach for improving reasoning in multimodal large language models (MLLMs) by incorporating collective learning into tree-search methods. The framework improves the efficiency of searching for a reasoning path, as demonstrated by the Mulberry-260K dataset and the Mulberry model, which surpasses traditional models in complex reasoning tasks. The proposed methods provide valuable insights for future research, can serve as a basis for advancing MLLMs, and can act as a baseline for developing more efficient models capable of handling increasingly complex tasks.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

YuLan-Mini: A 2.42B Parameter Open Data-efficient Language Model with Long-Context Capabilities and Advanced Training Techniques

Large language models (LLMs) built using transformer architectures heavily depend on pre-training with large-scale data to predict sequential tokens. This complex and resource-intensive process requires enormous computational infrastructure and well-constructed data pipelines. The growing demand for efficient and accessible LLMs has led researchers to explore techniques that balance resource use and performance, emphasizing achieving competitive results without relying on industry-scale resources.

Developing LLMs is filled with challenges, especially regarding computation and data efficiency. Pre-training models with billions of parameters demand advanced techniques and substantial infrastructure. High-quality data and robust training methods are crucial, as models face gradient instability and performance degradation during training. Open-source LLMs often struggle to match proprietary counterparts because of limited access to computational power and high-caliber datasets. Therefore, the challenge lies in creating efficient and high-performing models, enabling smaller research groups to participate actively in advancing AI technology. Solving this problem necessitates innovation in data handling, training stabilization, and architectural design.

Existing research in LLM training emphasizes structured data pipelines, using techniques like data cleaning, dynamic scheduling, and curriculum learning to improve learning outcomes. However, stability remains a persistent issue. Large-scale training is susceptible to gradient explosions, loss spikes, and other technical difficulties, requiring careful optimization. Training long-context models introduces additional complexity, as attention mechanisms’ computational demands grow quadratically with sequence length. Existing approaches like advanced optimizers, initialization strategies, and synthetic data generation help alleviate these issues but often fall short when scaled to full-sized models. The need for scalable, stable, and efficient methods in LLM training is more urgent than ever.

Researchers at the Gaoling School of Artificial Intelligence, Renmin University of China, developed YuLan-Mini. With 2.42 billion parameters, this language model improves computational efficiency and performance with data-efficient methods. By leveraging publicly available data and focusing on data-efficient training techniques, YuLan-Mini achieves remarkable performance comparable to larger industry models.

YuLan-Mini’s architecture incorporates several innovative elements to enhance training efficiency. Its decoder-only transformer design employs embedding tying to reduce parameter size and improve training stability. The model uses Rotary Positional Embedding (RoPE) to handle long contexts effectively, extending its context length to 28,672 tokens, an advancement over typical models. Other key features include SwiGLU activation functions for better data representation and a carefully designed annealing strategy that stabilizes training while maximizing learning efficiency. Synthetic data was critical, supplementing the 1.08 trillion tokens of training data sourced from open web pages, code repositories, and mathematical datasets. These features enable YuLan-Mini to deliver robust performance with a limited computing budget.
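
Rotary embeddings themselves are well documented, so a compact NumPy reference can show how position enters the representation. This sketch uses the half-split (GPT-NeoX-style) pairing convention, which may differ from YuLan-Mini's exact implementation:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotary positional embedding for x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)               # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Each (x1_i, x2_i) pair is rotated by a position-dependent angle,
    # so relative offsets between tokens are encoded in dot products.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(128, 64)  # a short stand-in for a 28,672-token context
print(rope(q).shape)          # (128, 64)
```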

YuLan-Mini achieved scores of 64.00 on HumanEval in zero-shot scenarios, 37.80 on MATH-500 in four-shot settings, and 49.10 on MMLU in five-shot tasks. These results underscore its competitive edge, as the model’s performance is comparable to much larger and resource-intensive counterparts. The innovative context length extension to 28K tokens allowed YuLan-Mini to excel in long-text scenarios while still maintaining high accuracy in short-text tasks. This dual capability sets it apart from many existing models, which often sacrifice one for the other.

Key takeaways from the research include:

  • Using a meticulously designed data pipeline, YuLan-Mini reduces reliance on massive datasets while ensuring high-quality learning.
  • Techniques like systematic optimization and annealing prevent common issues like loss spikes and gradient explosions.
  • Extending the context length to 28,672 tokens enhances the model’s applicability to complex, long-text tasks.
  • Despite its modest computational requirements, YuLan-Mini achieves results comparable to those of much larger models, demonstrating the effectiveness of its design.
  • The integration of synthetic data improves training outcomes and reduces the need for proprietary datasets.

In conclusion, YuLan-Mini is a notable new addition to the growing family of efficient LLMs. Its ability to deliver high performance with limited resources addresses critical barriers to AI accessibility. The research team’s focus on innovative techniques, from data efficiency to training stability, highlights the potential for smaller-scale research to contribute significantly to the field. With just 1.08T tokens, YuLan-Mini sets a benchmark for resource-efficient LLMs.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Quasar-1: A Rigorous Mathematical Framework for Temperature-Guided Reasoning in Language Models

Large language models (LLMs) encounter significant difficulties in performing efficient and logically consistent reasoning. Existing methods, such as chain-of-thought (CoT) prompting, are extremely computationally intensive, not scalable, and unsuitable for real-time applications or limited resources. These limitations restrict their applicability in financial analysis and decision-making, which require speed and accuracy.

State-of-the-art reasoning approaches, like CoT, build structured paths for reasoning to improve the accuracy of logic. However, they are computationally demanding and not feasible for applications requiring responses within a short time or where resources are limited. They also do not scale well for handling multiple complex queries at the same time, which limits their application in production environments, especially in organizations with limited computing resources.

Researchers from SILX AI introduced Quasar-1, a groundbreaking framework based on temperature-guided reasoning, to address these challenges. The two main components are the Token Temperature Mechanism (TTM), which dynamically changes the importance of tokens during reasoning, and the Guided Sequence of Thought (GSoT), which computes the optimal reasoning paths. This architecture reduces unnecessary computation and maintains logical consistency by using token temperatures to focus on contextually relevant information. The architecture delivers considerable advancements, including improved scalability, efficiency, and adaptability in practical applications.

The framework is constructed upon a transformer-based design, supplemented by temperature-modulated attention mechanisms. The TTM computes temperatures specific to each token to steer reasoning throughout the layers, dynamically modifying token significance as the reasoning evolves. GSoT uses this temperature information to formulate reasoning pathways that are both efficient and precise. Quasar-1 has 24 transformer layers with 12 attention heads, balancing efficiency and effectiveness. Theoretical guarantees of convergence to an optimal solution are supported by empirical verification across a range of reasoning tasks.
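
Quasar-1's precise TTM formulation is not reproduced here, but the core idea — scaling attention logits with learned per-token temperatures — can be sketched as follows. The shapes and temperature values are illustrative assumptions:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temp_modulated_attention(q, k, v, token_temps):
    """Attention in which each key token's logits are scaled by a
    per-token temperature, so 'hot' tokens draw sharper focus."""
    logits = q @ k.T / np.sqrt(q.shape[-1])   # (Lq, Lk)
    logits = logits * token_temps[None, :]    # per-key-token modulation
    return softmax(logits) @ v

rng = np.random.default_rng(0)
Lq, Lk, d = 4, 6, 8
q = rng.standard_normal((Lq, d))
k = rng.standard_normal((Lk, d))
v = rng.standard_normal((Lk, d))
temps = np.array([1.5, 0.5, 1.0, 2.0, 0.8, 1.2])  # hypothetical TTM outputs
print(temp_modulated_attention(q, k, v, temps).shape)  # (4, 8)
```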

Quasar-1 performs well, reaching 89.3% accuracy, beating models like GPT-3 and T5-Large. It reduces computational costs by up to 70% and ensures faster and more resource-efficient reasoning capabilities. The framework dynamically prioritizes critical tokens, allowing adaptive error recovery and logical consistency, which makes it fit for complex real-world tasks. These results underline its potential as a practical and scalable solution for environments where both efficiency and accuracy are vital.

By employing temperature-guided reasoning and optimized decision pathways, Quasar-1 overcomes fundamental flaws in existing models, providing a scalable and practical approach to logical reasoning. Dynamic token prioritization and adaptive error recovery drive the AI domain forward with practical applications in diverse and resource-constrained environments. This represents a significant milestone in the quest for AI systems that are highly efficient, accurate, and flexible.


Check out the Paper. All credit for this research goes to the researchers of this project.

Unveiling Privacy Risks in Machine Unlearning: Reconstruction Attacks on Deleted Data

Machine unlearning is driven by the need for data autonomy, allowing individuals to request the removal of their data’s influence on machine learning models. This field complements data privacy efforts, which focus on preventing models from revealing sensitive information about the training data through attacks like membership inference or reconstruction. While differential privacy methods limit these risks, unlearning enables the deletion of data from a trained model, ensuring it behaves as if the data were never included in the first place. Achieving this efficiently, without retraining the entire model, has been a key focus, particularly for complex models like deep neural networks.

However, unlearning introduces new privacy risks. When adversaries compare a model’s parameters before and after data deletion, they can exploit the differences to reconstruct the deleted data, even for simple models like linear regression. This process leverages the gradient of the deleted sample and the expected Hessian derived from public data to approximate the changes caused by unlearning. The approach highlights a unique vulnerability where unlearning unintentionally exposes sensitive data. By extending existing techniques for gradient-based reconstruction attacks, this research reveals how unlearning can facilitate exact data reconstruction, emphasizing the importance of safeguards like differential privacy to mitigate these risks.

Researchers from AWS AI, the University of Pennsylvania, the University of Washington, Carnegie Mellon University, and Jump Trading reveal that data deletion in machine learning models, even simple ones, exposes individuals to high-accuracy reconstruction attacks. These attacks recover deleted data by exploiting differences in model parameters before and after deletion. The study demonstrates effective attacks on linear regression models using closed-form training algorithms and extends these methods to models with pre-trained embeddings and generic architectures via Newton’s method. Experiments on tabular and image datasets highlight significant privacy risks in retraining for unlearning without safeguards like differential privacy.

The researchers present an attack to reconstruct deleted user data from regularized linear regression models by analyzing parameter changes before and after deletion. The method leverages the relationship between model parameters and the removed sample, approximating key statistics using public data. The approach generalizes to models with fixed embeddings and extends to non-linear architectures using Newton’s approximation method. Experiments demonstrate its applicability to multiclass classification and label inference by estimating gradients and reconstructing deleted data. This highlights the vulnerability of models to privacy breaches, especially without safeguards, as the attack remains effective across various architectures and loss functions.
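
For the linear-regression case, the recipe can be illustrated end to end in NumPy: estimate the Hessian from public data, then map the parameter difference back to the deleted sample's direction. This simplifies scaling and regularization details, so treat it as a sketch of the idea rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 200, 10, 1e-2

X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge regression estimator."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_before = ridge(X, y, lam)           # model with the sample included
w_after  = ridge(X[1:], y[1:], lam)   # model after deleting sample 0

# Hessian estimated from "public" data drawn from the same distribution.
X_pub = rng.standard_normal((500, d))
H_pub = X_pub.T @ X_pub / 500

# For squared loss, the gradient at the deleted point is (x.w - y) * x,
# a scalar multiple of x, so H @ (w_after - w_before) points (up to sign
# and scale) along the deleted sample itself.
x_hat = H_pub @ (w_after - w_before)
cos = abs(x_hat @ X[0]) / (np.linalg.norm(x_hat) * np.linalg.norm(X[0]))
print(f"cosine similarity with deleted sample: {cos:.3f}")
```

Because the squared-loss gradient at the deleted point is a scalar multiple of the point itself, the printed cosine similarity is typically close to 1.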

The study evaluates the attack across diverse datasets for classification and regression tasks, including tabular and image data. Using full retraining, the researchers compare model parameters before and after a single sample’s deletion. The method leverages public data from the same distribution without needing knowledge of the deleted sample. Against baselines like “Avg” (the average of public samples) and “MaxDiff” (maximizing parameter change), the attack consistently outperforms, achieving higher cosine similarity with deleted samples. Tested on MNIST, CIFAR10, and ACS income data, the approach reconstructs deleted samples effectively across various models, emphasizing vulnerabilities in machine learning systems and the need for privacy safeguards.

In conclusion, the work introduces a reconstruction attack capable of recovering deleted data from simple machine-learning models with high accuracy. The attack achieves near-perfect results for linear regression and performs effectively on models using embeddings or optimizing different loss functions. Highlighting privacy risks in data deletion or machine unlearning, the findings emphasize the need for techniques like differential privacy. Counterintuitively, data deletion updates can increase vulnerability to reconstruction attacks, even in basic models, exposing sensitive data. Through extensive experiments on diverse datasets, this study underscores the significant privacy risks posed by data deletion requests, even in seemingly low-risk model settings.


Check out the Paper. All credit for this research goes to the researchers of this project.

Meet SemiKong: The World’s First Open-Source Semiconductor-Focused LLM

The semiconductor industry enables advancements in consumer electronics, automotive systems, and cutting-edge computing technologies. The production of semiconductors involves sophisticated processes that demand unparalleled precision and expertise. These processes include chip design, manufacturing, testing, and optimization, each stage requiring deep domain knowledge. The field has traditionally depended on seasoned engineers whose experience has been built over decades. However, the industry faces a significant challenge: the rapid retirement of veteran experts, creating a knowledge gap that threatens innovation and efficiency. This growing concern has prompted companies to explore AI as a viable solution for capturing, scaling, and leveraging expert knowledge. Also, the cost and time associated with chip design and manufacturing must be minimized to meet market demands. These challenges highlight the limitations of traditional methods and emphasize the necessity of tailored AI solutions.

Existing approaches to these challenges include generalized AI models and basic automation tools. While these methods have been beneficial in analyzing data and improving decision-making, they often fall short in addressing the unique complexities of the semiconductor industry. General-purpose AI tools, for instance, lack the domain-specific understanding required to analyze intricate manufacturing processes effectively. As a result, companies cannot fully bridge the gap between theoretical AI capabilities and practical industry needs, leaving room for specialized solutions to transform the field.

Researchers from Meta, AITOMATIC, and other collaborators under the Foundation Models workgroup of the AI Alliance have introduced SemiKong. SemiKong represents the world’s first semiconductor-focused large language model (LLM), designed using the Llama 3.1 platform. This model was fine-tuned with extensive semiconductor-specific datasets, including industry documents, research papers, and anonymized operational data. Unlike generic AI systems, SemiKong is tailored to understand semiconductor processes’ unique terminology and requirements. By integrating this model with the AITOMATIC Domain-Expert Agents (DXAs), companies can effectively leverage AI tools to address specific industry challenges. These innovations aim to reduce costs, accelerate development timelines, and promote collaboration across the semiconductor sector.

The technology behind SemiKong is built on advanced AI and neurosymbolic architectures. AITOMATIC’s DXAs operate through a structured three-phase lifecycle: 

  1. Capturing domain expertise
  2. Training the model with synthetic and structured data
  3. Applying the resulting system in real-world scenarios 

SemiKong plays a central role in this ecosystem, acting as the “brain” for complex reasoning and decision-making tasks. Lightweight model versions, such as Llama 3.2, complement the main system by enabling faster data access and analysis in resource-constrained environments. These models integrate seamlessly with manufacturing systems and IoT platforms, allowing companies to optimize workflows, predict maintenance needs, and improve decision-making.

SemiKong has outperformed several closed-source language models in generating semiconductor-specific content and understanding complex processes. This has led to tangible benefits, including a 20-30% reduction in time to market for new chip designs and a 15-25% improvement in first-time-right manufacturing outcomes. These tools have also improved the onboarding process for new engineers, accelerating their learning curve by 40-50%. In one example, SemiKong-enabled DXAs reduced the time required for etching recipe formulation from hours to minutes.

The key takeaways from the research underscore the significance of SemiKong and DXAs in the semiconductor field:

  1. DXAs effectively capture and structure the knowledge of veteran engineers, ensuring that critical expertise is preserved and scaled for future use.  
  2. SemiKong reduces chip design time-to-market by up to 30%, significantly cutting costs and improving operational efficiency.  
  3. By simplifying and expediting the onboarding process, DXAs help new engineers become productive faster, reducing the industry’s reliance on seasoned experts.  
  4. Integrating IoT platforms enables real-time parameter calibration and predictive maintenance, enhancing equipment performance and reliability.

In conclusion, the research highlights a pioneering solution to one of the semiconductor industry’s most pressing challenges: the loss of critical domain expertise. By introducing SemiKong and DXAs, the researchers have provided a comprehensive framework that preserves knowledge and enhances productivity and innovation. These advancements can potentially reshape semiconductor manufacturing, offering scalable, cost-effective solutions to address the field’s complexities. Integrating AI tools like SemiKong is crucial for a more efficient and resilient semiconductor industry.


Check out the Details and GitHub Page. All credit for this research goes to the researchers of this project.

Google DeepMind Introduces Differentiable Cache Augmentation: A Coprocessor-Enhanced Approach to Boost LLM Reasoning and Efficiency

Large language models (LLMs) are integral to solving complex problems across language processing, mathematics, and reasoning domains. Enhancements in computational techniques focus on enabling LLMs to process data more effectively, generating more accurate and contextually relevant responses. As these models grow more complex, researchers strive to develop methods that operate within fixed computational budgets without sacrificing performance.

One major challenge in optimizing LLMs is their inability to effectively reason across multiple tasks or perform computations beyond their pre-trained architecture. Current methods for improving model performance involve generating intermediate steps during task processing, often at the cost of increased latency and computational inefficiency. This limitation hampers their ability to perform complex reasoning tasks, particularly those requiring longer dependencies or higher accuracy in predictions.

Researchers have explored methods like Chain-of-Thought (CoT) prompting, which guides LLMs to reason step by step. While effective in some cases, CoT relies on sequential processing of intermediate reasoning steps, leading to slower computation times. KV-cache compression has also been proposed to reduce memory usage but does little to improve reasoning capabilities. These approaches, though valuable, underscore the need for a method that combines efficiency with enhanced reasoning ability.

Researchers from Google DeepMind have introduced a method called Differentiable Cache Augmentation. This technique uses a trained coprocessor to augment the LLM’s key-value (kv) cache with latent embeddings, enriching the model’s internal memory. The key innovation lies in keeping the base LLM frozen while training the coprocessor, which operates asynchronously. The researchers designed this method to enhance reasoning capabilities without increasing the computational burden during task execution.

The methodology revolves around a three-stage process. First, the frozen LLM generates a kv-cache from an input sequence, encapsulating its internal representation. This kv-cache is passed to the coprocessor, which processes it with additional trainable soft tokens. Not tied to specific words, these tokens act as abstract prompts for generating latent embeddings. Once processed, the augmented kv-cache is fed back into the LLM, enabling it to generate contextually enriched outputs. This asynchronous operation ensures the coprocessor’s enhancements are applied efficiently without delaying the LLM’s primary functions. Training the coprocessor is conducted using a language modeling loss, focusing solely on its parameters while preserving the integrity of the frozen LLM. This targeted approach allows for scalable and effective optimization.
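
Here is a toy trace of the three-stage flow, with random projections standing in for both the frozen LLM and the coprocessor. All shapes and functions are illustrative; in the actual method the coprocessor is trained with a language-modeling loss while the LLM stays frozen:

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, N_SOFT = 16, 8, 4   # hidden size, prompt length, # soft tokens

def frozen_llm_kv(n_tokens: int):
    """Stage 1 stand-in: the frozen LLM turns a prompt into a kv-cache."""
    return (rng.standard_normal((n_tokens, D)),
            rng.standard_normal((n_tokens, D)))

def coprocessor(k, v, soft_tokens):
    """Stage 2 stand-in: soft tokens attend over the kv-cache and emit
    latent embeddings; only this part would be trained."""
    logits = soft_tokens @ k.T / np.sqrt(D)          # (N_SOFT, T)
    logits -= logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v                                   # (N_SOFT, D)

soft_tokens = rng.standard_normal((N_SOFT, D))        # learned abstract prompts
k, v = frozen_llm_kv(T)
latents = coprocessor(k, v, soft_tokens)

# Stage 3: the latents are appended to the cache, so the frozen LLM now
# attends over T + N_SOFT entries when generating the next tokens.
k_aug, v_aug = np.vstack([k, latents]), np.vstack([v, latents])
print(k_aug.shape, v_aug.shape)   # (12, 16) (12, 16)
```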

Performance evaluations demonstrated significant improvements. The method was tested on the Gemma-2 2B model, achieving considerable results across various benchmarks. For instance, on the reasoning-intensive GSM8K dataset, accuracy improved by 10.05% when 64 latent embeddings were used. Similarly, MMLU performance increased by 4.70% under the same configuration. These enhancements underscore the model’s ability to perform better on complex reasoning tasks. Further, perplexity reductions were observed at multiple token positions. For example, perplexity decreased by 3.94% at position one and 1.20% at position 32 when 64 latent embeddings were applied, showcasing the model’s improved prediction capabilities over longer sequences.

Further analysis showed that the augmentation’s effectiveness scales with the number of latent embeddings. For GSM8K, accuracy rose incrementally with additional embeddings, from 1.29% with four embeddings to the peak improvement of 10.05% with 64 embeddings. Similar trends were observed in other benchmarks like ARC and MATH, indicating the broader applicability of this method. The researchers confirmed that their approach consistently outperformed baseline models without task-specific fine-tuning, demonstrating its robustness and adaptability.

This work represents a significant step forward in enhancing LLMs’ reasoning capabilities. By introducing an external coprocessor to augment the kv-cache, the researchers from Google DeepMind have created a method that improves performance while maintaining computational efficiency. The results highlight the potential for LLMs to tackle more complex tasks, paving the way for further exploration into modular enhancements and scalable reasoning systems. This breakthrough underscores the importance of continual innovation in AI to meet the growing demands of reasoning-intensive applications.


Check out the Paper. All credit for this research goes to the researchers of this project.

AWS Researchers Propose LEDEX: A Machine Learning Training Framework that Significantly Improves the Self-Debugging Capability of LLMs

Code generation using Large Language Models (LLMs) has emerged as a critical research area, but generating accurate code for complex problems in a single attempt remains a significant challenge. Even skilled human developers often require multiple iterations of trial-and-error debugging to solve difficult programming problems. While LLMs have demonstrated impressive code generation capabilities, their self-debugging ability to analyze incorrect code and make necessary corrections is still limited. This limitation is evident in open-source models like StarCoder and CodeLlama, which show significantly lower self-refinement performance compared to models like GPT-3.5-Turbo.

Existing approaches to improve code generation and debugging capabilities in LLMs have followed several distinct paths. LLMs have shown significant success across various code-related tasks, including code generation, bug fixing, program testing, and fuzzing. These models use extensive pre-training on vast datasets to understand patterns and generate contextually relevant code. However, most existing work has primarily focused on single-round generation rather than iterative improvement. Other methods like ILF, CYCLE, and Self-Edit have explored supervised fine-tuning approaches while solutions like OpenCodeInterpreter and EURUS have attempted to create high-quality multi-turn interaction datasets using advanced models for fine-tuning purposes.

Researchers from Purdue University, AWS AI Labs, and the University of Virginia have proposed LEDEX (learning to self-debug and explain code), a novel training framework designed to enhance LLMs’ self-debugging capabilities. The framework builds on the observation that a sequential process of explaining incorrect code followed by refinement enables LLMs to analyze and improve faulty code more effectively. LEDEX implements an automated pipeline to collect high-quality datasets for code explanation and refinement. Moreover, it combines supervised fine-tuning (SFT) and reinforcement learning (RL) approaches, utilizing successful and failed trajectories with a specialized reward system that evaluates code explanation and refinement quality.

LEDEX employs a comprehensive architecture containing data collection, verification, and multi-stage training processes. The framework begins by collecting code explanation and refinement datasets through queries to pre-trained or instruction-tuned models. These responses undergo rigorous execution-based verification to filter and maintain only high-quality explanation and refinement data. The collected dataset then serves as input for supervised fine-tuning, which significantly enhances the model’s capabilities in bug explanation and code refinement. LEDEX draws its training data from programming problems in MBPP, APPS, and CodeContests. To expand the dataset of incorrect solutions, the framework prompts pre-trained LLMs like StarCoder and CodeLlama with 3-shot examples to generate 20 solutions per problem.
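
The execution-based verification step can be approximated by a filter that keeps a candidate refinement only if it passes its unit tests. A production pipeline would sandbox execution and enforce per-test timeouts, which this sketch omits:

```python
def passes_tests(code: str, tests: list[str]) -> bool:
    """Execution-based filter: keep an explanation/refinement pair only
    if the refined code passes every unit test. Real pipelines sandbox
    this call and enforce timeouts; omitted here for brevity."""
    env: dict = {}
    try:
        exec(code, env)          # define the candidate solution
        for test in tests:
            exec(test, env)      # each test raises on failure
        return True
    except Exception:
        return False

refinement = "def add(a, b):\n    return a + b"
print(passes_tests(refinement, ["assert add(2, 3) == 5"]))  # True
```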

LEDEX is evaluated using three model backbones: StarCoder-15B, CodeLlama-7B, and CodeLlama-13B, with initial training data collected from GPT-3.5-Turbo. The SFT phase shows significant improvements, achieving up to a 15.92% increase in pass@1 and 9.30% in pass@10 metrics across four benchmark datasets. The subsequent RL phase further enhances performance with additional improvements of up to 3.54% in pass@1 and 2.55% in pass@10. Notably, LEDEX’s model-agnostic nature is shown through experiments with CodeLlama-7B, which achieve substantial improvements (8.25% in pass@1 and 2.14% in pass@10) even when trained on data collected from CodeLlama-34B or itself, proving its effectiveness independent of GPT-3.5-Turbo.

In conclusion, researchers introduced LEDEX, a comprehensive and scalable framework that combines automated data collection, verification processes, SFT, and RL with innovative reward designs to significantly improve LLMs’ ability to identify and correct code errors. The framework’s model-agnostic nature is evidenced by its successful implementation with GPT-3.5-Turbo and CodeLlama, while its rigorous data verification process ensures the quality of code explanations and refinements. Human evaluations further validate the framework’s effectiveness, confirming that LEDEX-trained models produce superior code explanations that effectively assist developers in understanding and resolving code issues.


Check out the Paper. All credit for this research goes to the researchers of this project.

Meet AIArena: A Blockchain-Based Decentralized AI Training Platform

The monopolization of any industry into the hands of a few giant companies has always been a matter of concern. Now, even artificial intelligence (AI) has fallen prey to these circumstances. Such monopolization of AI raises concerns like the concentration of power and resources, data monopoly and privacy, lack of transparency, and accountability. Furthermore, biases from those limited groups of developers could lead to discrimination. To address these critical issues, researchers from Imperial College London, Newcastle University, FLock.io, and the University of Hong Kong have developed an innovative solution, AIArena, a blockchain-based platform that can decentralize AI training.

Traditionally, AI training has relied on centralized approaches. Large companies possess the means and resources to collect data, making it easy for them to monopolize AI. This limits the innovative development of AI because of restricted access to data and resources. This centralized nature also creates single points of failure, posing a massive security risk. Hence, there is a need for a new kind of method that can decentralize AI training in a fair and transparent manner and invite diverse, innovative contributions.

The proposed solution, AIArena, where people worldwide can work together to create and improve AI models, uses blockchain technology to ensure transparency and legitimacy. The methodology includes the following key components:

  • Blockchain Infrastructure: All activities on the platform are recorded on the blockchain to ensure transparency. Interactions between participants are governed by smart contracts, which self-execute based on predefined rules.
  • Federated Learning Framework: Contributors use their own data to improve the model’s performance. The platform stores only the updated model configurations, never the raw data. Updates are aggregated iteratively, enhancing the model’s global performance (see the aggregation sketch after this list).
  • Incentive Mechanism: Contributors earn tokens for their participation, whether they provide data, computational resources, or valuable model updates. These tokens are then used for token-based participation in certain tasks like becoming a validator. 
  • Consensus Protocols for Model Updates: Before the platform accepts the upgraded model, it needs to be validated to ensure no malicious content is uploaded. This helps maintain the model’s integrity as it gets updated globally. 
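
As referenced in the federated-learning item above, a minimal sketch of weighted update aggregation (FedAvg-style) shows how a global model can improve without the platform ever seeing raw data. The stake-weighting is a hypothetical tie-in to the token mechanism, not AIArena's documented rule:

```python
import numpy as np

def fedavg(updates: list, weights=None) -> np.ndarray:
    """Aggregate contributors' parameter updates without raw data."""
    w = np.ones(len(updates)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    return sum(wi * ui for wi, ui in zip(w, updates))

# Three contributors train locally and submit only parameter deltas.
rng = np.random.default_rng(0)
local_updates = [rng.standard_normal(5) for _ in range(3)]
stake = [10, 5, 1]   # hypothetical token-weighted influence
global_delta = fedavg(local_updates, stake)
print(global_delta)
```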

AIArena was tested and validated on a public blockchain testnet across several AI tasks. The results showed that AIArena is feasible in real-world applications, supporting the viability of decentralized AI training as a response to the challenges of centralized AI development.

In conclusion, AIArena proposes a transformative solution to the challenges of centralized AI training, pairing blockchain-based transparency with federated learning for privacy-preserving collaboration. It is well poised to create an equitable, decentralized ecosystem where stakeholders can share data and computational resources securely, so that data silos, security risks, and a lack of transparency do not become bottlenecks for progress. Its novel incentive mechanism and robust architecture show strong potential for scalable, secure, and inclusive AI development, and the platform offers a promising foundation for democratizing AI training and broad collaboration across industries that demand fairness, security, and transparency.


Check out the Paper. All credit for this research goes to the researchers of this project.

The post Meet AIArena: A Blockchain-Based Decentralized AI Training Platform appeared first on MarkTechPost.

DeepSeek-AI Just Released DeepSeek-V3: A Strong Mixture-of-Experts (MoE) Language Model with 671B Total Parameters with 37B Activated for Each Token https://www.marktechpost.com/2024/12/26/deepseek-ai-just-released-deepseek-v3-a-strong-mixture-of-experts-moe-language-model-with-671b-total-parameters-with-37b-activated-for-each-token/ Fri, 27 Dec 2024 04:32:12 +0000

The field of Natural Language Processing (NLP) has made significant strides with the development of large-scale language models (LLMs). However, this progress has brought its own set of challenges. Training and inference require substantial computational resources, the availability of diverse, high-quality datasets is critical, and achieving balanced utilization in Mixture-of-Experts (MoE) architectures remains complex. These factors contribute to inefficiencies and increased costs, posing obstacles to scaling open-source models to match proprietary counterparts. Moreover, ensuring robustness and stability during training is an ongoing issue, as even minor instabilities can disrupt performance and necessitate costly interventions.

DeepSeek-AI just gave a Christmas present to the AI world by releasing DeepSeek-V3, a Mixture-of-Experts (MoE) language model featuring 671 billion parameters, with 37 billion activated per token. The model builds on proven architectures such as Multi-Head Latent Attention (MLA) and DeepSeekMoE, which were refined in earlier versions. DeepSeek-V3 has been trained on an extensive dataset of 14.8 trillion high-quality tokens, ensuring a broad and diverse knowledge base. Importantly, the model is fully open-source, with accessible models, papers, and training frameworks for the research community to explore.

Technical Details and Benefits

DeepSeek-V3 incorporates several innovations aimed at addressing long-standing challenges in the field. Its auxiliary-loss-free load balancing strategy ensures efficient distribution of computational loads across experts while maintaining model performance. The adoption of a multi-token prediction training objective enhances data efficiency and facilitates faster inference through speculative decoding. Additionally, FP8 mixed precision training improves computational efficiency by reducing GPU memory usage without sacrificing accuracy. The DualPipe algorithm further minimizes pipeline bubbles by overlapping computation and communication phases, reducing all-to-all communication overhead. These advancements enable DeepSeek-V3 to process 60 tokens per second during inference—a significant improvement over its predecessor.
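
For readers unfamiliar with sparse MoE routing, the sketch below illustrates the core idea behind activating only 37 billion of 671 billion parameters per token: a lightweight router scores all experts, only the top-k run, and their outputs are gated together. The sizes are toy values and the code is a generic illustration, not DeepSeek-V3’s implementation (which layers MLA, auxiliary-loss-free balancing, and FP8 kernels on top of this pattern).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2  # toy sizes, nothing like DeepSeek-V3's
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route one token vector x through its top-k experts only."""
    logits = x @ router_w              # router score for every expert
    top = np.argsort(logits)[-top_k:]  # indices of the k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()               # softmax over the selected experts only
    # Only the chosen experts run; all other expert parameters stay inactive.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)        # (64,)
```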

Performance Insights and Results

DeepSeek-V3 has been rigorously evaluated across multiple benchmarks, demonstrating strong performance. On educational datasets like MMLU and MMLU-Pro, it achieved scores of 88.5 and 75.9, respectively, outperforming other open-source models. In mathematical reasoning tasks, it set new standards with a score of 90.2 on MATH-500. The model also performed exceptionally well in coding benchmarks such as LiveCodeBench. Despite these achievements, the training cost was kept relatively low at $5.576 million, requiring only 2.788 million H800 GPU hours. These results highlight DeepSeek-V3’s efficiency and its potential to make high-performance LLMs more accessible.
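
As a quick sanity check, the two cost numbers are consistent under the roughly $2-per-H800-GPU-hour rental rate commonly assumed in such estimates (the rate is an assumption here, not a figure from this article):

```python
gpu_hours = 2.788e6  # H800 GPU hours reported for training
rate_usd = 2.0       # assumed rental price per GPU hour (not stated above)
print(f"${gpu_hours * rate_usd / 1e6:.3f}M")  # -> $5.576M, matching the reported cost
```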

Conclusion

DeepSeek-V3 represents a meaningful advancement in open-source NLP research. By tackling the computational and architectural challenges associated with large-scale language models, it establishes a new benchmark for efficiency and performance. Its innovative training methods, scalable architecture, and strong evaluation results make it a competitive alternative to proprietary models. DeepSeek-AI’s commitment to open-source development ensures that the broader research community can benefit from its advancements.


Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.

The post DeepSeek-AI Just Released DeepSeek-V3: A Strong Mixture-of-Experts (MoE) Language Model with 671B Total Parameters with 37B Activated for Each Token appeared first on MarkTechPost.
