ServiceNow Releases AgentLab: A New Open-Source Python Package for Developing and Evaluating Web Agents

Developing web agents is a challenging area of AI research that has attracted significant attention in recent years. As the web becomes more dynamic and complex, it demands advanced capabilities from agents that interact autonomously with online platforms. One of the major challenges in building web agents is effectively testing, benchmarking, and evaluating their behavior in diverse and realistic online environments. Many existing frameworks for agent development have limitations such as poor scalability, difficulty in conducting reproducible experiments, and challenges in integrating with various language models and benchmark environments. Additionally, running large-scale, parallel experiments has often been cumbersome, especially for teams with limited computational resources or fragmented tools.

ServiceNow addresses these challenges by releasing AgentLab, an open-source package designed to simplify the development and evaluation of web agents. AgentLab offers a range of tools to streamline the process of creating web agents capable of navigating and interacting with various web platforms. Built on top of BrowserGym, another recent development from ServiceNow, AgentLab provides an environment for training and testing agents across a variety of web benchmarks, including the popular WebArena. With AgentLab, developers can run large-scale experiments in parallel, allowing them to evaluate and improve their agents’ performance across different tasks more efficiently. The package aims to make the agent development process more accessible for both individual researchers and enterprise teams.

Technical Details

AgentLab is designed to address common pain points in web agent development by offering a unified and flexible framework. One of its standout features is the integration with Ray, a library for parallel and distributed computing, which simplifies running large-scale parallel experiments. This feature is particularly useful for researchers who want to test multiple agent configurations or train agents across different environments simultaneously.
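The fan-out pattern described above can be sketched in a few lines. This is a minimal illustration, not AgentLab's implementation: it uses Python's standard concurrent.futures as a stand-in for Ray so that it runs without extra dependencies, and the agent names, task names, and the run_episode function are all hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import random

# Hypothetical agent configurations and task names, for illustration only.
AGENT_CONFIGS = ["agent-a", "agent-b"]
TASKS = ["task-1", "task-2", "task-3"]


def run_episode(agent: str, task: str, seed: int = 0) -> dict:
    """Stand-in for one agent/task episode; returns a fake, deterministic reward."""
    rng = random.Random(f"{agent}|{task}|{seed}")  # seeded so reruns match
    return {"agent": agent, "task": task, "reward": round(rng.random(), 3)}


def run_study(agents, tasks, max_workers: int = 4) -> list:
    """Fan every (agent, task) pair out to a pool of workers."""
    jobs = list(product(agents, tasks))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `jobs`, so results line up with inputs.
        return list(pool.map(lambda job: run_episode(*job), jobs))


results = run_study(AGENT_CONFIGS, TASKS)
```

With Ray, the same shape would typically be expressed with remote tasks (functions decorated with ray.remote and collected with ray.get), which lets the jobs spread across processes or machines rather than threads. How AgentLab wires this up internally is not described here.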

AgentLab also provides essential building blocks for creating agents using BrowserGym, which supports ten different benchmarks. These benchmarks serve as standardized environments to test agent capabilities, including WebArena, which evaluates agents’ performance on web-based tasks that require human-like interaction.

Another key advantage is the Unified LLM API offered by AgentLab. This API integrates with popular LLM providers such as OpenAI, Azure, and OpenRouter, and it also supports self-hosted models served through Text Generation Inference (TGI). This flexibility lets developers choose and switch between large language models (LLMs) with little additional configuration, which speeds up agent development. The unified leaderboard adds further value by providing a consistent way to compare agents’ performance across multiple tasks. AgentLab also emphasizes reproducibility, offering built-in tools that help developers recreate experiments accurately, which is crucial for validating results and improving agent robustness.
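To make the idea of a unified model interface concrete, here is a minimal sketch under stated assumptions. The ChatModel protocol, the two stub backends, and the BACKENDS registry are hypothetical names invented for this example; they are not AgentLab's actual classes, and the stubs do not call any real provider.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """Hypothetical common interface; not AgentLab's real class."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Offline stub standing in for a hosted provider."""
    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


@dataclass
class UppercaseModel:
    """Offline stub standing in for a self-hosted model."""
    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt.upper()}"


# Registry keyed by a config string, so switching backends is a one-line change.
BACKENDS: dict[str, ChatModel] = {
    "hosted-stub": EchoModel("hosted-stub"),
    "self-hosted-stub": UppercaseModel("self-hosted-stub"),
}


def choose_action(model: ChatModel, observation: str) -> str:
    """Agent logic depends only on the interface, not on a specific provider."""
    return model.complete(f"Next action for: {observation}")
```

Because choose_action only relies on complete, swapping BACKENDS["hosted-stub"] for BACKENDS["self-hosted-stub"] changes the model without touching the agent code, which is the property a unified API is meant to provide.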

Since its release, AgentLab has proven effective in helping developers scale up the process of creating and evaluating web agents. By leveraging Ray, users have been able to conduct large-scale parallel experiments that would otherwise have required extensive manual setup and substantial computational resources. BrowserGym, the foundation for AgentLab, supports experimentation across its ten benchmarks, including WebArena, which tests agent performance in dynamic web environments that mimic real-world websites.

Developers using AgentLab have reported improvements in both the efficiency and effectiveness of their experiments, especially when leveraging the Unified LLM API to switch between different language models seamlessly. These features not only accelerate development but also provide meaningful comparisons through a unified leaderboard, offering insights into the strengths and weaknesses of different web agent architectures.
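A leaderboard of this kind boils down to aggregating per-episode outcomes into one comparable number per agent and benchmark. The sketch below shows one simple way to do that, using mean success rate. The episode data is made up for illustration and the build_leaderboard function is a hypothetical helper, not part of AgentLab.

```python
from collections import defaultdict
from statistics import mean

# Made-up episode outcomes (1 = success, 0 = failure); not real results.
EPISODES = [
    ("agent-a", "bench-x", 1), ("agent-a", "bench-x", 0),
    ("agent-a", "bench-y", 1), ("agent-b", "bench-x", 1),
    ("agent-b", "bench-x", 1), ("agent-b", "bench-y", 0),
]


def build_leaderboard(episodes):
    """Return (agent, benchmark, success_rate) rows, best agent first per benchmark."""
    scores = defaultdict(list)
    for agent, bench, success in episodes:
        scores[(agent, bench)].append(success)
    rows = [(agent, bench, mean(vals)) for (agent, bench), vals in scores.items()]
    # Sort by benchmark name, then by descending success rate, then by agent name.
    return sorted(rows, key=lambda r: (r[1], -r[2], r[0]))


leaderboard = build_leaderboard(EPISODES)
for agent, bench, rate in leaderboard:
    print(f"{bench:8} {agent:8} {rate:.2f}")
```

Scoring every agent with the same metric on the same tasks is what makes the rows comparable; a real leaderboard would likely also report the number of episodes and some measure of variance.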

Conclusion

ServiceNow’s AgentLab is a thoughtful open-source package for developing and evaluating web agents, addressing key challenges in this field. By integrating BrowserGym, Ray, and a Unified LLM API, AgentLab simplifies large-scale experimentation and benchmarking while ensuring consistency and reproducibility. The flexibility to switch between different language models and the ability to run extensive experiments in parallel make AgentLab a valuable tool for both individual developers and larger research teams.

Features like the unified leaderboard help standardize agent evaluation and foster a community-driven approach to agent benchmarking. As web automation and interaction become increasingly important, AgentLab offers a solid foundation for developing capable, efficient, and adaptable web agents.


Check out the GitHub Page. All credit for this research goes to the researchers of this project.


