Natural Language Processing Category - MarkTechPost

aiOla Releases Whisper-NER: An Open Source AI Model for Joint Speech Transcription and Entity Recognition

Speech recognition technology has made significant progress, with advancements in AI improving accessibility and accuracy. However, it still faces challenges, particularly in understanding spoken entities like names, places, and specific terminology. The issue is not only about converting speech to text accurately but also about extracting meaningful context in real-time. Current systems often require separate tools for transcription and entity recognition, leading to delays, inefficiencies, and inconsistencies. Additionally, privacy concerns regarding the handling of sensitive information during speech transcription present significant challenges for industries dealing with confidential data.

aiOla has released Whisper-NER: an open-source AI model that allows joint speech transcription and entity recognition. This model combines speech-to-text transcription with Named Entity Recognition (NER) to deliver a solution that can recognize important entities while transcribing spoken content. This integration allows for a more immediate understanding of context, making it suitable for industries requiring accurate and privacy-conscious transcription services, such as healthcare, customer service, and legal domains. Whisper-NER effectively combines transcription accuracy with the ability to identify and manage sensitive information.

Technical Details

Whisper-NER is based on the Whisper architecture developed by OpenAI, which is enhanced to perform real-time entity recognition while transcribing. By leveraging transformers, Whisper-NER can recognize entities like names, dates, locations, and specialized terminology directly from the audio input. The model is designed to work in real-time, which is valuable for applications that need instant transcription and comprehension, such as live customer support. Additionally, Whisper-NER incorporates privacy measures to obscure sensitive data, thereby enhancing user trust. The open-source nature of Whisper-NER also makes it accessible to developers and researchers, encouraging further innovation and customization.
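
To make the joint setup concrete, here is a minimal usage sketch assuming the released checkpoint loads with the standard Hugging Face transformers Whisper classes. The repository id, the entity-type prompt format, and the tagged output style shown below are illustrative assumptions rather than the confirmed Whisper-NER interface.

```python
# Minimal sketch: joint transcription + entity tagging with a Whisper-style checkpoint.
# Assumptions (not confirmed by the article): the model loads with the standard
# transformers Whisper classes, the repo id "aiola/whisper-ner-v1" is a placeholder,
# and entity types are passed to the decoder as a text prompt.
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "aiola/whisper-ner-v1"  # placeholder repo id
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load 16 kHz mono audio (Whisper's expected sample rate).
audio, sr = librosa.load("call_recording.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Hypothetical: entity types the model should tag while transcribing.
entity_prompt = "person, organization, medication"
prompt_ids = processor.get_prompt_ids(entity_prompt, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(input_features=inputs.input_features, prompt_ids=prompt_ids)

# Expected output style (assumption): entities marked inline, e.g.
# "<person>Jane Doe</person> was prescribed <medication>ibuprofen</medication>."
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```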

The importance of Whisper-NER lies in its capability to deliver both accuracy and privacy. In tests, the model has shown a reduction in error rates compared to separate transcription and entity recognition models. According to aiOla, Whisper-NER provides a nearly 20% improvement in entity recognition accuracy and offers automatic redaction capabilities for sensitive data in real-time. This feature is particularly relevant for sectors like healthcare, where patient privacy must be protected, or for business settings, where confidential client information is discussed. The combination of transcription and entity recognition reduces the need for multiple steps in the workflow, providing a more streamlined and efficient process. It addresses a gap in speech recognition by enabling real-time comprehension without compromising security.

Conclusion

aiOla’s Whisper-NER represents an important step forward for speech recognition technology. By integrating transcription and entity recognition into one model, aiOla addresses the inefficiencies of current systems and provides a practical solution to privacy concerns. Its open-source availability means that the model is not only a tool but also a platform for future innovation, allowing others to build upon its capabilities. Whisper-NER’s contributions to enhancing transcription accuracy, protecting sensitive data, and improving workflow efficiencies make it a notable advancement in AI-powered speech solutions. For industries seeking an effective, accurate, and privacy-conscious solution, Whisper-NER sets a solid standard.


Check out the Paper, Model on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.

NVIDIA Introduces Hymba 1.5B: A Hybrid Small Language Model Outperforming Llama 3.2 and SmolLM v2

Large language models (LLMs) like GPT-4 and Llama-2 are powerful but require significant computational resources, making them impractical for smaller devices. Attention-based transformer models, in particular, have high memory demands and quadratic computational complexity, which limits their efficiency. State Space Models (SSMs), such as Mamba, offer an alternative with lower complexity, but their limited memory recall hampers performance on complex tasks. Existing hybrid models that sequentially combine transformer and SSM layers often lack the synergy needed for optimal performance.

NVIDIA Releases Hymba: A Hybrid-Head Parallel Architecture

NVIDIA has introduced Hymba, a new family of small language models featuring a hybrid architecture that combines Mamba and attention heads running in parallel. The 1.5-billion-parameter model, trained on 1.5 trillion tokens, aims to address the efficiency and performance challenges faced by smaller NLP models.

NVIDIA’s Hymba models feature a hybrid-head parallel architecture that integrates transformer attention mechanisms with SSMs to enhance efficiency. This architecture allows attention heads and SSM heads to process input data in parallel, combining the strengths of both approaches. Attention heads provide high-resolution memory recall, while SSM heads enable efficient context summarization.

Hymba also introduces learnable meta tokens, which are prepended to every input prompt to help store critical information and reduce the burden on attention mechanisms. The model’s architecture is further optimized with cross-layer key-value (KV) sharing and partial sliding window attention to maintain a compact cache size, addressing memory constraints effectively.
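
The parallel-head idea can be illustrated with a small PyTorch sketch. This is a conceptual toy rather than NVIDIA's implementation: the SSM branch is reduced to a per-channel decaying scan, the fusion rule is a simple average, and all dimensions are made up for the example.

```python
# Conceptual sketch (not NVIDIA's code): a block whose attention branch and a
# simplified SSM branch process the same input in parallel, with learnable
# meta tokens prepended to the sequence.
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_meta_tokens=16):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.randn(1, n_meta_tokens, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Toy diagonal state-space branch: per-channel decay plus an input projection.
        self.ssm_in = nn.Linear(d_model, d_model)
        self.ssm_decay = nn.Parameter(torch.rand(d_model))
        self.out_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def ssm_branch(self, x):
        # Sequential scan h_t = a * h_{t-1} + u_t, a stand-in for a real SSM kernel.
        a = torch.sigmoid(self.ssm_decay)
        u = self.ssm_in(x)
        h, outs = torch.zeros_like(u[:, 0]), []
        for t in range(u.size(1)):
            h = a * h + u[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x):
        b = x.size(0)
        x = torch.cat([self.meta_tokens.expand(b, -1, -1), x], dim=1)
        x = self.norm(x)
        attn_out, _ = self.attn(x, x, x)   # high-resolution recall branch
        ssm_out = self.ssm_branch(x)       # efficient context-summarization branch
        fused = self.out_proj((attn_out + ssm_out) / 2)
        return fused[:, self.meta_tokens.size(1):]  # drop meta-token positions

x = torch.randn(2, 32, 512)
print(HybridHeadBlock()(x).shape)  # torch.Size([2, 32, 512])
```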

Technical Details

The Hymba-1.5B model combines Mamba and attention heads running in parallel with meta tokens to enhance efficiency. This setup reduces the computational load of transformers without compromising memory recall. Hymba includes 16 SSM states and 3 full attention layers, while the rest use sliding window attention to balance efficiency with memory resolution. It also features FlexAttention from PyTorch 2.5, adding flexibility to the model’s training and inference.

A key feature of Hymba is the ability to share the KV cache between multiple layers and between heads within the same layer, significantly reducing memory usage. The combination of sliding window attention and shared KV caches minimizes computational complexity, making Hymba more efficient compared to other models of similar size.
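
Cross-layer KV sharing can likewise be sketched conceptually. In the toy below, layers in the same group reuse one set of key/value projections, so the cache holds one K/V entry per group instead of one per layer; the group size, cache layout, and attention details are assumptions for illustration, not Hymba's actual design.

```python
# Conceptual sketch (not NVIDIA's code): layers in the same group share one set of
# key/value projections, so a decoding cache stores K/V once per group rather than
# once per layer.
import torch
import torch.nn as nn

d_model, n_layers, group_size = 512, 8, 2

class SharedKV(nn.Module):
    def __init__(self):
        super().__init__()
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

shared_kv = nn.ModuleList(SharedKV() for _ in range(n_layers // group_size))
queries = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))

def forward_with_cache(x, cache):
    for layer in range(n_layers):
        group = layer // group_size
        if group not in cache:                 # compute K/V once per group...
            cache[group] = (shared_kv[group].k(x), shared_kv[group].v(x))
        k, v = cache[group]                    # ...and reuse them in later layers
        q = queries[layer](x)
        attn = torch.softmax(q @ k.transpose(-1, -2) / d_model**0.5, dim=-1) @ v
        x = x + attn
    return x, cache

x = torch.randn(1, 16, d_model)
out, cache = forward_with_cache(x, {})
print(len(cache), "KV entries cached for", n_layers, "layers")  # 4 entries for 8 layers
```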

Efficiency, Performance, and Versatility

Hymba demonstrates that small language models can achieve competitive performance while being computationally efficient. In benchmarks, the Hymba-1.5B-Base model outperformed all sub-2B public models and surpassed Llama-3.2-3B with 1.32% higher average accuracy, an 11.67× reduction in cache size, and 3.49× higher throughput. This makes Hymba suitable for deployment on smaller, less capable hardware.

Hymba’s hybrid attention and SSM setup improves performance across a range of tasks, including both general benchmarks and recall-intensive tasks. Its throughput is around 664 tokens per second, significantly higher than that of comparable models such as SmolLM2 and Llama-3.2-3B, which ran into out-of-memory issues under the same testing conditions. These metrics highlight Hymba’s suitability for practical deployment scenarios where both speed and memory efficiency are essential.

Conclusion

NVIDIA’s Hymba family of small language models represents a notable advancement in the efficiency and versatility of NLP technologies. By combining transformer attention and state space models through its hybrid-head parallel architecture, Hymba provides a pathway for deploying effective NLP capabilities on devices with limited resources. The model’s reduced memory requirements, increased throughput, and innovative use of meta tokens and cross-layer KV sharing make it a promising choice for future language model applications where efficiency and accuracy are both critical.


Check out the Paper. For those interested in exploring the Hymba models further, NVIDIA has made them available on Hugging Face: Hymba-1.5B-Base and Hymba-1.5B-Instruct. All credit for this research goes to the researchers of this project.

Fixie AI Introduces Ultravox v0.4.1: A Family of Open Speech Models Trained Specifically for Enabling Real-Time Conversation with LLMs and An Open-Weight Alternative to GPT-4o Realtime

Interacting seamlessly with artificial intelligence in real time has always been a complex endeavor for developers and researchers. A significant challenge lies in integrating multi-modal information—such as text, images, and audio—into a cohesive conversational system. Despite advancements in large language models like GPT-4, many AI systems still encounter difficulties in achieving real-time conversational fluency, contextual awareness, and multi-modal understanding, which limits their effectiveness for practical applications. Additionally, the computational demands of these models make real-time deployment challenging without considerable infrastructure.

Introducing Fixie AI’s Ultravox v0.4.1

Fixie AI introduces Ultravox v0.4.1, a family of multi-modal, open-source models trained specifically for enabling real-time conversations with AI. Designed to overcome some of the most pressing challenges in real-time AI interaction, Ultravox v0.4.1 incorporates the ability to handle multiple input formats, such as text, images, and other sensory data. This latest release aims to provide an alternative to closed-source models like GPT-4, focusing not only on language proficiency but also on enabling fluid, context-aware dialogues across different types of media. By being open-source, Fixie AI also aims to democratize access to state-of-the-art conversation technologies, allowing developers and researchers worldwide to adapt and fine-tune Ultravox for diverse applications—from customer support to entertainment.

Technical Details and Key Benefits

The Ultravox v0.4.1 models are built using a transformer-based architecture optimized to process multiple types of data in parallel. Leveraging a technique called cross-modal attention, these models can integrate and interpret information from various sources simultaneously. This means users can present an image to the AI, type in a question about it, and receive an informed response in real time. The open-source models are hosted on Hugging Face under the Fixie AI organization, making it convenient for developers to access and experiment with them. Fixie AI has also provided a well-documented API to facilitate seamless integration into real-world applications. The models boast impressive latency reduction, allowing interactions to take place almost instantly, making them suitable for real-time scenarios like live customer interactions and educational assistance.
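
As a rough illustration of the cross-modal attention idea, the sketch below lets text-token states attend over projected audio-frame embeddings. It is not Fixie AI's implementation; the dimensions and the residual fusion are illustrative choices.

```python
# Conceptual sketch (not Fixie AI's implementation): cross-modal attention in which
# text-token states attend over projected audio-frame embeddings, so the decoder can
# condition its next-token predictions on speech input. Shapes are illustrative.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, d_text=768, d_audio=512, n_heads=8):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_text)  # map audio frames into text space
        self.cross_attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_text)

    def forward(self, text_states, audio_frames):
        audio = self.audio_proj(audio_frames)
        # Text queries attend over audio keys/values; the residual keeps text information.
        attended, _ = self.cross_attn(text_states, audio, audio)
        return self.norm(text_states + attended)

text = torch.randn(1, 20, 768)    # e.g. embedded user prompt tokens
audio = torch.randn(1, 300, 512)  # e.g. encoder outputs for a few seconds of speech
print(CrossModalAttention()(text, audio).shape)  # torch.Size([1, 20, 768])
```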

Ultravox v0.4.1 represents a notable advancement in conversational AI systems. Unlike proprietary models, which often operate as opaque black boxes, Ultravox offers an open-weight alternative with performance comparable to GPT-4 while also being highly adaptable. Analysis based on Figure 1 from recent evaluations shows that Ultravox v0.4.1 achieves significantly lower response latency—approximately 30% faster than leading commercial models—while maintaining equivalent accuracy and contextual understanding. The model’s cross-modal capabilities make it effective for complex use cases, such as integrating images with text for comprehensive analysis in healthcare or delivering enriched interactive educational content. The open nature of Ultravox facilitates continuous community-driven development, enhancing flexibility and fostering transparency. By mitigating the computational overhead associated with deploying such models, Ultravox makes advanced conversational AI more accessible to smaller entities and independent developers, bridging the gap previously imposed by resource constraints.

Conclusion

Ultravox v0.4.1 by Fixie AI marks a significant milestone for the AI community by addressing critical issues in real-time conversational AI. With its multi-modal capabilities, open-source model weights, and a focus on reducing response latency, Ultravox paves the way for more engaging and accessible AI experiences. As more developers and researchers start experimenting with Ultravox, it has the potential to foster innovative applications across industries that demand real-time, context-rich, and multi-modal conversations.


Check out the Details here, Models on Hugging Face, and GitHub Page. All credit for this research goes to the researchers of this project.

OpenAI Announces SearchGPT Prototype: An AI-Powered Search Engine Transforming Web Searches with Real-time Information and Enhanced Conversational AI Capabilities

OpenAI has recently announced the development of SearchGPT, a groundbreaking prototype that revolutionizes how users search for information online. This new AI-driven search feature combines the strengths of OpenAI’s conversational models with real-time web data, promising to deliver fast, accurate, and contextually relevant answers.

SearchGPT is currently in a testing phase and is available to a limited group of users and publishers. The objective is to gather feedback and refine the feature before its potential integration into ChatGPT. OpenAI envisions SearchGPT as a tool that enhances the search experience and facilitates easier and faster access to information. By joining the waitlist, interested users can participate in this innovative journey.

Traditional web searches often require multiple attempts and considerable effort to find precise results. SearchGPT aims to alleviate this by leveraging AI to understand and respond to user queries conversationally. The system is designed to provide immediate, up-to-date information from the web, complete with clear links to relevant sources. This approach ensures that users receive concise and accurate answers, reducing the time and effort typically associated with online searches.

One of SearchGPT’s standout features is its ability to handle follow-up questions like a human conversation. This interactive capability allows the AI to build context with each query, providing more personalized and relevant responses. This conversational interface is expected to make searching more intuitive and user-friendly.

OpenAI is committed to maintaining a thriving ecosystem for publishers and content creators. By integrating high-quality content in a conversational search interface, SearchGPT aims to enhance user engagement and discovery of publisher sites. This collaboration ensures that AI searches respect and promote reliable journalism and content creation.

In addition to enhancing the search experience, OpenAI has introduced mechanisms for publishers to manage their appearance in SearchGPT. This includes options for publishers to control their participation in generative AI training. Importantly, even if sites opt out of generative AI training, they can still appear in search results, ensuring broader inclusion and flexibility for content creators.

OpenAI is also committed to transparency and continuous improvement. Feedback from publishers and users will play a crucial role in refining SearchGPT. OpenAI has opened channels for feedback and is dedicated to sharing insights and performance metrics with publishers, helping them understand and engage effectively with AI search products.

As SearchGPT evolves, OpenAI plans to enhance its capabilities in local information and commerce areas. The ongoing feedback from users and publishers will be instrumental in shaping the final product. OpenAI aims to incorporate the best aspects of the SearchGPT prototype into ChatGPT, further enriching the user experience.

In conclusion, by combining conversational AI with real-time web information in SearchGPT, OpenAI is poised to transform how users access and interact with online content. The commitment to partnering with publishers and ensuring high-quality, reliable information underscores OpenAI’s dedication to creating a balanced and effective AI search ecosystem.


Check out the Details. All credit for this research goes to the researchers of this project.

Factory AI Introduces ‘Code Droid’ Designed to Automate and Enhance Coding with Advanced Autonomous Capabilities: Achieving 19.27% on SWE-bench Full and 31.67% on SWE-bench Lite

Factory AI has released its latest innovation, Code Droid, a groundbreaking AI tool designed to automate and accelerate software development processes. This release signifies a significant advancement in artificial intelligence and software engineering.

Introduction to Code Droid

Code Droid is an autonomous system engineered to execute various coding tasks based on natural language instructions. Its primary function is to automate tedious programming activities, thereby enhancing the productivity and efficiency of software development teams. This innovation stems from Factory AI’s mission to integrate autonomy into software engineering, a vision that necessitates a multidisciplinary approach incorporating insights from robotics, machine learning, and cognitive science.

Core Functionalities of Code Droid

The core functionalities of Code Droid are meticulously designed to address various aspects of software development. Key among these functionalities are:

  1. Planning and Task Decomposition: Code Droid can decompose high-level problems into smaller, manageable subtasks. This capability is crucial for handling complex software development tasks efficiently. By simulating decisions and performing self-criticism, Code Droid can optimize its task execution trajectories.
  2. Tool Integration and Environmental Grounding: Code Droid has access to essential software development tools, including version control systems, editors, linters, and debuggers. This integration ensures that Code Droid operates within the same feedback loops as human developers, facilitating seamless collaboration and iteration.
  3. HyperCode and ByteRank: These systems enable Code Droid to construct a deep understanding of codebases. HyperCode builds multi-resolution representations of engineering systems, while ByteRank retrieves relevant information for specific tasks, ensuring that Code Droid can navigate and manipulate large codebases effectively.
  4. Multi-Model Sampling: Leveraging state-of-the-art large language models, Code Droid can generate multiple solutions for a given task, validate them through testing, and select the optimal solution. This approach enhances the robustness and diversity of Code Droid’s solutions. A minimal sketch of this generate-validate-select loop appears after this list.
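
The following sketch illustrates the generate-validate-select pattern behind multi-model sampling. It is not Factory AI's code: `call_llm` and `run_test_suite` are hypothetical stand-ins for an LLM client and a sandboxed test runner, and the model identifiers are placeholders.

```python
# Illustrative sketch of a generate-validate-select loop (not Factory AI's code).
# `call_llm` and `run_test_suite` are hypothetical stubs; the model names and the
# scoring rule (test pass rate) are assumptions.
CANDIDATE_MODELS = ["model-a", "model-b"]  # placeholder identifiers
SAMPLES_PER_MODEL = 3

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical LLM call; a real version would return a patch for the requested change."""
    return f"--- placeholder patch from {model} ---"

def run_test_suite(patch: str) -> float:
    """Hypothetical: apply the patch in a sandbox, run the tests, return the pass rate."""
    return 0.0  # a real implementation would execute the project's test suite

def best_patch(task_description: str) -> str:
    prompt = f"Produce a patch that resolves:\n{task_description}"
    scored = []
    for model in CANDIDATE_MODELS:
        for _ in range(SAMPLES_PER_MODEL):
            patch = call_llm(model, prompt)
            scored.append((run_test_suite(patch), patch))  # validate by testing
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]                                    # select the top-scoring patch

print(best_patch("Fix the off-by-one error in pagination"))
```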

Performance on SWE-Bench

Factory AI has rigorously tested Code Droid using SWE-Bench, a benchmark designed to evaluate AI systems’ capabilities in solving real-world software engineering tasks. Code Droid demonstrated exceptional performance, scoring 19.27% on SWE-Bench Full and 31.67% on SWE-Bench Lite. These results highlight Code Droid’s ability to complete complex software development tasks autonomously with high accuracy.

Factory’s Code Droid Capabilities

Code Droid is capable of performing several tasks without human intervention, including:

  • Codebase Modernization: Updating and refactoring legacy codebases to align with modern coding standards and practices.
  • Feature Development: Implementing new features based on detailed specifications and natural language descriptions.
  • Proof-of-Concept Creation: Rapidly developing prototypes to validate ideas and concepts.
  • Building Integrations: Creating and managing integrations between different software systems and APIs.
  • Automated Code Review: Reviewing code for errors, vulnerabilities, and compliance with coding standards.
  • End-to-End Software Development: Managing entire software development projects from inception to deployment.

Factory AI envisions a future where software development is more efficient, accessible, and creative. The ongoing development of Code Droid focuses on enhancing its cognitive architectures, integrating more sophisticated tools, and fine-tuning its capabilities for specialized domains such as AI development, embedded systems, and financial services. Factory AI’s commitment to innovation extends to continuously calibrating its benchmarking approaches, ensuring that Code Droid remains versatile and effective across various real-world conditions. 

In conclusion, Factory AI’s release of Code Droid marks a pivotal moment in the evolution of software engineering. With its advanced capabilities and autonomous functionalities, Code Droid is set to transform software development, bringing unprecedented efficiency and innovation to the industry.


Check out the Details. All credit for this research goes to the researchers of this project.

Unveiling the Shortcuts: How Retrieval Augmented Generation (RAG) Influences Language Model Behavior and Memory Utilization

Researchers from Microsoft, the University of Massachusetts Amherst, and the University of Maryland, College Park, address the challenge of understanding how Retrieval Augmented Generation (RAG) impacts the reasoning and factual accuracy of language models (LMs). The study focuses on whether LMs rely more on the external context provided by RAG than on their parametric memory when generating responses to factual queries.

Current methods for improving the factual accuracy of LMs often involve either enhancing the internal parameters of the models or using external retrieval systems to provide additional context during inference. Techniques like ROME and MEMIT focus on editing the model’s internal parameters to update knowledge. However, there has been limited exploration into how these models balance the use of internal (parametric) knowledge and external (non-parametric) context in RAG.

The researchers propose a mechanistic examination of RAG pipelines to determine how much LMs depend on external context versus their internal memory when answering factual queries. They use two advanced LMs, LLaMa-2 and Phi-2, to conduct their analysis, employing methods like Causal Mediation Analysis, Attention Contributions, and Attention Knockouts.

The researchers utilized three key techniques to examine the inner workings of LMs under RAG:

1. Causal tracing identifies which hidden states in the model are crucial for factual predictions. By comparing a corrupted run (where part of the input is deliberately altered) with a clean run and a restoration run (where clean activations are reintroduced into the corrupted run), the researchers measure the Indirect Effect (IE) to determine the importance of specific hidden states.

2. Attention contributions examine the attention weights between the subject token and the last token in the output. By analyzing how much attention each token receives, the researchers can see whether the model relies more on the external context provided by RAG or on its internal knowledge.

3. Attention knockouts involve setting critical attention weights to negative infinity to block information flow between specific tokens. By observing the drop in prediction quality when these attention weights are knocked out, the researchers can identify which connections are essential for accurate predictions.
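
A toy version of the knockout intervention looks like the following. This is not the authors' released code; in the actual study the intervention is applied inside specific layers of LLaMa-2 and Phi-2, whereas here a single hand-rolled attention layer and a random vocabulary head stand in for the model.

```python
# Conceptual sketch of an attention "knockout": block information flow from chosen
# source positions to the final position by setting those attention logits to -inf,
# then compare the final-position prediction with and without the knockout.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, vocab = 8, 64, 100
hidden = torch.randn(1, seq_len, d_model)           # pretend hidden states
W_q = torch.randn(d_model, d_model) / d_model**0.5
W_k = torch.randn(d_model, d_model) / d_model**0.5
W_v = torch.randn(d_model, d_model) / d_model**0.5
lm_head = torch.randn(d_model, vocab) / d_model**0.5

def attend(hidden, knocked_out_sources=()):
    q, k, v = hidden @ W_q, hidden @ W_k, hidden @ W_v
    logits = q @ k.transpose(-1, -2) / d_model**0.5
    for src in knocked_out_sources:
        logits[:, -1, src] = float("-inf")           # last token may not read from src
    return F.softmax(logits, dim=-1) @ v

subject_positions = [2, 3]                           # e.g. the query's subject tokens
clean = attend(hidden)[:, -1] @ lm_head
knocked = attend(hidden, subject_positions)[:, -1] @ lm_head

answer_id = clean.argmax(-1)
drop = F.softmax(clean, -1)[0, answer_id] - F.softmax(knocked, -1)[0, answer_id]
print(f"probability drop for the predicted token after knockout: {drop.item():.4f}")
```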

The results revealed that in the presence of RAG context, both LLaMa-2 and Phi-2 showed a significant decrease in reliance on their internal parametric memory. The Average Indirect Effect of subject tokens in the query was notably lower when RAG context was present. Additionally, the last-token residual stream derived more enriched information from the attribute tokens in the context than from the subject tokens in the query. Attention contributions and knockouts further confirmed that the models prioritized external context over internal memory for factual predictions. However, the precise mechanism behind this behavior is not yet fully understood.

In conclusion, the proposed method demonstrates that language models present a “shortcut” behavior, heavily relying on the external context provided by RAG over their internal parametric memory for factual queries. By mechanistically analyzing how LMs process and prioritize information, the researchers provide valuable insights into the interplay between parametric and non-parametric knowledge in retrieval-augmented generation. The study highlights the need for understanding these dynamics to improve model performance and reliability in practical applications.


Check out the Paper. All credit for this research goes to the researchers of this project.

Spotify’s Newest Feature: Using AI to Clone and Translate Podcast Voices Across Languages

In the ever-evolving world of podcasting, language barriers have long stood as a formidable obstacle to the global reach of audio content. However, recent developments signal a promising solution to this challenge. Spotify, the streaming giant, has partnered with OpenAI to introduce a groundbreaking AI-powered voice translation tool that has the potential to revolutionize the way podcast episodes are consumed around the world.

Traditionally, podcasts have faced linguistic limitations, with content primarily accessible to audiences fluent in the language of the podcast. While subtitles and dubbing have been employed to bridge this gap, they often fail to deliver an authentic experience. This longstanding problem has prompted content creators and platforms to seek innovative solutions.

Spotify’s voice translation technology is a remarkable development that leverages OpenAI’s cutting-edge voice technology. This tool transcends conventional translation methods by crafting synthetic voices that mimic the podcast hosts’ cadence, tone, and inflection. It promises to maintain the essence of the original content while breaking down language barriers and expanding the global audience for podcasts.

This technology uses just a few seconds of a host’s real speech to create translated podcast episodes that sound remarkably authentic and personalized. This innovation, tested with prominent podcasters, aims to offer listeners the same unique voice experience in Spanish, French, and German. As the pilot program progresses, more shows and languages will undoubtedly be added, marking a significant stride toward making podcasts accessible to a broader global audience.

Spotify’s commitment to democratizing podcast content is evident in its decision to offer these translated episodes to free and Premium users. This inclusivity underscores the company’s dedication to enhancing creator expression and building connections between talent and fans worldwide. The success and user reception of these AI-powered episodes will shape the direction of future refinements, promising even more innovative solutions for the podcasting landscape.

In conclusion, Spotify’s introduction of AI-powered voice translation technology signifies a monumental step in overcoming the longstanding barriers to storytelling imposed by language differences. By preserving the authenticity of podcast hosts’ voices in translated content, Spotify aims to bring global listeners closer to their favorite podcasters. As Spotify continues to expand its podcast catalog, innovations like voice translation could make this captivating medium more accessible and inclusive globally, marking a promising new chapter in the world of podcasting.


Check out the Spotify article. All credit for this research goes to the researchers of this project.

Meet Brain2Music: An AI Method for Reconstructing Music from Brain Activity Captured Using Functional Magnetic Resonance Imaging (fMRI)

Who doesn’t love music? Have you ever remembered the rhythm of a song but not the lyrics and been unable to figure out the song’s name? Researchers at Google and Osaka University have together found a way to reconstruct music from brain activity captured using functional magnetic resonance imaging (fMRI). The generated music reflects the original stimulus in attributes such as genre, instrumentation, and mood.

The researchers use deep neural networks to generate music from fMRI scans by predicting high-level, semantically structured music features. Different components of the music can be predicted from activity in the human auditory cortex. They also experimented with JukeBox, which generated music with high temporal coherence but with predictable artifacts. To produce high-quality audio, they use a compressed neural audio codec that achieves high-quality reconstruction at low bitrates.

Generating music from fMRI involves intermediate stages, including choosing a music embedding to represent the stimulus. In the researchers’ architecture, this music embedding acts as a bottleneck for subsequent music generation: if the embedding predicted from fMRI is close to the embedding of the original stimulus heard by the subject, the music generation model MusicLM can produce music similar to that stimulus.

The music generation model MusicLM relies on audio-derived embeddings named MuLan and w2v-BERT-avg. Of the two, MuLan tends to have higher prediction performance than w2v-BERT-avg in the lateral prefrontal cortex, as it captures high-level music information processing in the human brain. Abstract information about music is represented differently in the auditory cortex than in the audio-derived embeddings.

MuLan embeddings are converted into music using generation models, which fill in the information that is not contained in the embedding. In the retrieval technique, the reconstruction is also guaranteed to be musical, as it is pulled directly from a dataset of music, ensuring a higher level of reconstruction quality. The researchers predict the embeddings with linear regression on fMRI response data. This approach has limitations, including uncertainty about how much information linear regression can actually extract from the fMRI data.
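
The decoding idea described above can be sketched as a two-step pipeline: a regularized linear map from voxel responses to the embedding space, followed by nearest-neighbor retrieval from a music library. The array shapes, the ridge penalty, and the candidate library below are illustrative assumptions, not the paper's actual configuration.

```python
# Conceptual sketch of the decoding pipeline (not the authors' code): fit a regularized
# linear map from fMRI responses to music-embedding space, then retrieve the candidate
# clip whose embedding is most similar to the prediction.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_train, n_voxels, emb_dim, n_candidates = 400, 5000, 128, 1000

fmri_train = rng.standard_normal((n_train, n_voxels))          # responses to training clips
mulan_train = rng.standard_normal((n_train, emb_dim))          # embeddings of those clips
fmri_test = rng.standard_normal((1, n_voxels))                  # response to a held-out clip
candidate_embs = rng.standard_normal((n_candidates, emb_dim))   # music-library embeddings

# 1) Linear regression (ridge-regularized) from voxels to embedding space.
decoder = Ridge(alpha=100.0).fit(fmri_train, mulan_train)
predicted_emb = decoder.predict(fmri_test)

# 2) Retrieval: pick the library clip whose embedding best matches the prediction;
#    alternatively, the predicted embedding could condition a generator such as MusicLM.
best = cosine_similarity(predicted_emb, candidate_embs)[0].argmax()
print(f"retrieved candidate clip index: {best}")
```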

The researchers said that their future work includes reconstructing music from an individual’s imagination: when a subject imagines a music clip, the decoding analysis would examine how faithfully that imagination can be reconstructed, which would qualify as actual mind reading. Because subjects differ in musical expertise, comparing reconstruction quality across subjects, including professional musicians, could provide useful insights into differences in how they perceive and understand music.

Their research work is just the first step in bringing your pure, imaginative thoughts into existence. This would also lead to generating holograms from just pure imagination in the mind of the subject. Advancement in this field will also provide a quantitative interpretation from a biological perspective. 


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.

Breaking Down AutoGPT: What It Is, Its Features, Limitations, Artificial General Intelligence (AGI) And Impact of Autonomous Agents on Generative AI

Introduction 

Generative AI is evolving and getting popular. Since its introduction, new models and research papers are getting released almost every other day. The major reason for the exponentially increasing popularity is the development of Large Language Models. LLMs, the Artificial Intelligence models that are designed to process natural language and generate human-like responses, are trending. The best example is OpenAI’s ChatGPT, the well-known chatbot that does everything from content generation and code completion to question answering, just like a human. Even OpenAI’s DALL-E and Google’s BERT have contributed to making significant advances in recent times.

What is AutoGPT?

Recently, a new AI tool has been released that has even more potential than ChatGPT. Called AutoGPT, this tool performs human-level tasks and uses the capabilities of GPT-4 to create an AI agent that can function independently without user interference. GPT-4, the latest addition to OpenAI’s family of deep learning models, is multimodal in nature. Unlike the previous version, GPT-3.5, which only lets ChatGPT take textual inputs, GPT-4 accepts both text and images as input. Auto-GPT, a free and open-source Python application, builds on GPT-4 technology.

AutoGPT uses the concept of stacking to recursively call itself. Stacking is an approach that lets AI models use other models as tools or mediums to accomplish a task. Using this method, with the help of both GPT-3.5 and GPT-4, AutoGPT creates full projects by iterating on its own prompts.

Artificial General Intelligence (AGI) in AutoGPT

AutoGPT’s abilities make it a promising application and an early example of “Artificial General Intelligence,” or AGI. This type of technology represents a significant breakthrough in the field of AI, as it has the potential to develop machines that can understand and learn intellectual tasks like humans. AGI can perform a wide range of tasks and find solutions when faced with unfamiliar problems. It is designed to learn and adapt to new situations and environments without the need for specific prompts or instructions for each new task.

Features of AutoGPT

AutoGPT’s access to GPT-4 makes it a great tool for high-quality text generation. It also has access to popular websites and platforms, which improves its ability to interact with them and perform a variety of tasks. AutoGPT manages both short-term and long-term memory and has internet connectivity for searching the web and gathering information. Moreover, thanks to GPT-3.5, AutoGPT has file storage and summarization capabilities and can even use DALL-E for image generation.

Some examples of AutoGPT’s capabilities have been shared on Twitter, which include creating a “Do anything machine” that spawns a GPT-4 agent to complete any task added to the task list. It can also read recent events and prepare a podcast outline. AutoGPT even enables the creation of an “AgentGPT,” where an AI agent is given a goal, comes up with an execution plan, and takes action. It even created a website using React and Tailwind CSS in under three minutes.

What is BabyAGI?

BabyAGI combines OpenAI’s GPT-4 with LangChain, a coding framework, and Pinecone, a vector database, to spawn new agents that can complete complex tasks while considering the original objective. Inspired by Artificial General Intelligence, BabyAGI imitates humans and uses its long-term memory to store and retrieve information quickly. BabyAGI basically trains and evaluates various AI agents in a simulated environment and tests their ability to learn and perform tough tasks.

How are autonomous agents introducing generative AI to the masses?

AI agents are computer programs that interact with their environment to make decisions, operate autonomously, or interact with humans or other agents using natural language. Used in a wide range of applications, such as customer service, personal assistants, gaming, and robotics, AI agents are classified based on several criteria, such as autonomy, reactivity, proactiveness, environment, and flexibility. Designing and implementing an AI agent involves identifying the problem domain, choosing an appropriate architecture, defining goals and actions, implementing the agent’s logic, and testing and debugging.

AutoGPT is an example of an AI agent that uses generative AI to solve problems. It operates autonomously and has the potential to revolutionize many industries. It even raises concerns about the impact of autonomous AI agents on human jobs, privacy, and security. It is important to carefully consider these implications and ensure that AI agents are developed and used responsibly.

Limitations of AutoGPT

Auto-GPT is a powerful tool but comes with a significant obstacle. Its adoption in production environments is difficult due to its high cost. Each step requires a call to the GPT-4 model, which is an expensive process that often maxes out tokens to provide better reasoning. The cost of GPT-4 tokens is not cheap, and according to OpenAI, the GPT-4 model with an 8K context window charges $0.03 per 1,000 tokens for prompts and $0.06 per 1,000 tokens for results.
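
A quick back-of-the-envelope calculation with the prices quoted above shows how these per-step calls add up. The number of steps and the token counts per step are assumptions for illustration, not measured figures.

```python
# Back-of-the-envelope cost estimate using the GPT-4 (8K context) prices quoted above:
# $0.03 per 1,000 prompt tokens and $0.06 per 1,000 completion tokens. The number of
# agent steps and tokens per step are illustrative assumptions.
PROMPT_PRICE = 0.03 / 1000       # USD per prompt token
COMPLETION_PRICE = 0.06 / 1000   # USD per completion token

steps = 50                       # assumed reasoning/tool steps in one Auto-GPT run
prompt_tokens_per_step = 6000    # assumed: context often grows toward the 8K window
completion_tokens_per_step = 800

cost = steps * (prompt_tokens_per_step * PROMPT_PRICE
                + completion_tokens_per_step * COMPLETION_PRICE)
print(f"estimated cost of one run: ${cost:.2f}")  # 50 * (0.18 + 0.048) = $11.40
```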

Auto-GPT uses GPT-4 and a simple programming language to perform tasks. The range of functions provided by Auto-GPT is limited. The functions include searching the web, managing memory, interacting with files, executing code, and generating images, but they narrow down the range of tasks Auto-GPT can solve effectively. Also, the decomposition and reasoning abilities of GPT-4 are still constrained, which further limits Auto-GPT’s problem-solving capabilities.

Conclusion

AutoGPT’s ability to perform a wide range of tasks and generate creative ideas makes it a promising tool in the field of AI. Its performance may be limited in complex real-world business scenarios, but if the tool continues to develop and improve, it has the potential to become even more powerful and versatile.




References:

  • https://www.fastcompany.com/90880294/auto-gpt-and-babyagi-how-autonomous-agents-are-bringing-generative-ai-to-the-masses
  • https://www.livemint.com/technology/tech-news/meet-autogpt-the-autonomous-gpt-4-tool-revolutionizing-ai-11681358612615.html
  • https://dataconomy.com/2023/04/what-is-autogpt-and-how-to-use-ai-agents/
  • https://jina.ai/news/auto-gpt-unmasked-hype-hard-truths-production-pitfalls/
  • https://mpost.io/what-makes-autogpt-so-special/

Meet XTREME-UP: A Benchmark for Evaluating Multilingual Models with Scarce Data Evaluation, Focusing on Under-Represented Languages

The fields of Artificial Intelligence and Machine Learning are heavily dependent upon data. Everyone is deluged with data from different sources like social media, healthcare, finance, etc., and this data is of great use to applications involving Natural Language Processing. But even with so much data, readily usable data for training an NLP model on a particular task is scarce. Finding high-quality, useful data that passes quality filters is a difficult task. Specifically, when developing NLP models for different languages, the lack of data for most languages stands as a limitation that hinders progress in NLP for under-represented languages (ULs).

Emerging tasks like news summarization, sentiment analysis, question answering, and the development of virtual assistants all rely heavily on data availability in high-resource languages. These tasks depend on technologies like language identification, automatic speech recognition (ASR), and optical character recognition (OCR), which are mostly unavailable for under-represented languages. Overcoming this gap requires building datasets and evaluating models on tasks that would be beneficial for UL speakers.

Recently, a team of researchers from Google AI proposed a benchmark called XTREME-UP (Under-Represented and User-Centric with Paucal Data) that evaluates multilingual models on user-centric tasks in a few-shot learning setting. It primarily focuses on activities that technology users often perform in their day-to-day lives, such as information access and input/output activities that enable other technologies. The three main features that distinguish XTREME-UP are its use of scarce data, its user-centric design, and its focus on under-represented languages.

With XTREME-UP, the researchers have introduced a standardized multilingual in-language fine-tuning setting in place of the conventional cross-lingual zero-shot option. This method considers the amount of data that can be generated or annotated in an 8-hour period for a particular language, thus aiming to give the ULs a more useful evaluation setup. 
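
A minimal sketch of this budget-limited, in-language fine-tuning setting might look like the following. It is not the official XTREME-UP code: the annotation rate, the task data, and the choice of mT5-small are illustrative assumptions.

```python
# Illustrative sketch (not the official XTREME-UP code): cap the training set at the
# number of examples one annotator could produce in 8 hours, then fine-tune a small
# multilingual seq2seq model in-language. The annotation rate and data are made up.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

EXAMPLES_PER_HOUR = 40                 # assumed annotation speed for one annotator
budget = 8 * EXAMPLES_PER_HOUR         # examples producible within an 8-hour budget

# Hypothetical in-language pairs (e.g. a transliteration task) for one under-represented language.
raw = {"source": [f"source text {i}" for i in range(budget)],
       "target": [f"target text {i}" for i in range(budget)]}

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def preprocess(batch):
    enc = tokenizer(batch["source"], truncation=True, max_length=64)
    enc["labels"] = tokenizer(text_target=batch["target"], truncation=True,
                              max_length=64)["input_ids"]
    return enc

train = Dataset.from_dict(raw).map(preprocess, batched=True,
                                   remove_columns=["source", "target"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="xtremeup-ft", num_train_epochs=3,
                                  per_device_train_batch_size=8, learning_rate=3e-4),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```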

XTREME-UP assesses the performance of language models across 88 under-represented languages in 9 significant user-centric technologies, some of which include Automatic Speech Recognition (ASR), Optical Character Recognition (OCR), Machine Translation (MT), and information access tasks that have general utility. The researchers have developed new datasets specifically for operations like OCR, autocomplete, semantic parsing, and transliteration in order to evaluate the capabilities of the language models. They have also improved and polished the currently existing datasets for other tasks in the same benchmark.

One of XTREME-UP’s key abilities is assessing various modeling scenarios, including both text-only and multi-modal settings with visual, audio, and text inputs. It also offers methods for supervised parameter tuning and in-context learning, allowing for a thorough assessment of various modeling approaches. The tasks in XTREME-UP cover enabling access to language technology, enabling information access as part of a larger system (such as question answering, information extraction, and virtual assistants), and making information accessible in the speaker’s language.

Consequently, XTREME-UP is a valuable benchmark that addresses the data scarcity challenge in highly multilingual NLP systems. It provides a standardized evaluation framework for under-represented languages and should prove useful for future NLP research and development.


Check out the Paper and GitHub.
