šŸš€ Big News! ServiceNow Research has just dropped DRBench, a game-changer in the world of AI research agents. This isn’t your average benchmark; it’s a realistic, runnable environment designed to put ā€œdeep researchā€ agents through their paces on complex, open-ended enterprise tasks. šŸ¢šŸ”

So, what’s DRBench all about?

DRBench is here to evaluate AI agents on tasks that matter to businesses. It’s about synthesizing facts from both the public web and private organizational data into well-cited reports. But here’s the twist: agents have to navigate through a maze of enterprise-style workflows, dealing with files, emails, chat logs, and cloud storage. No more easy web-only tests! šŸŒšŸ“„

What’s inside DRBench?

The initial release comes packed with:
– 15 deep research tasks across 10 enterprise domains (from Sales to Cybersecurity).
– Each task has a deep research question, a task context (company and persona), and ground-truth insights hidden within realistic enterprise files and apps (a rough task layout is sketched below).
– A total of 114 ground-truth insights across tasks, verified by humans.
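To make the task structure concrete, here is a minimal sketch of what one DRBench-style task record might look like. The field names, class names, and example values are illustrative assumptions, not the benchmark's actual schema; check the repo for the real task format.

```python
# Hypothetical sketch of a DRBench-style task record.
# Field names and values are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class Insight:
    text: str    # the ground-truth finding the agent should surface
    source: str  # where it is hidden (e.g., an email thread or shared file)

@dataclass
class DRBenchTask:
    question: str   # open-ended deep research question
    company: str    # task context: the fictional organization
    persona: str    # task context: who the agent is acting as
    domain: str     # one of the 10 enterprise domains, e.g. "Sales"
    insights: list[Insight] = field(default_factory=list)  # human-verified ground truth

example = DRBenchTask(
    question="What are the main blockers to adopting the new CRM rollout?",
    company="Acme Corp",
    persona="IT operations analyst",
    domain="Sales",
    insights=[Insight(text="Two regional teams still rely on a legacy spreadsheet workflow",
                      source="chat log")],
)
```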

The Enterprise Environment

DRBench comes with a containerized enterprise environment that integrates commonly used services behind authentication and app-specific APIs. It’s like giving your AI agent a realistic workplace to operate in! šŸ’¼šŸ”‘
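As a rough mental model of what "services behind authentication and app-specific APIs" means for an agent, here is a toy sketch of calling two such services over HTTP. The base URL, endpoints, token, and response shapes are all assumptions made up for illustration; the actual DRBench environment defines its own APIs.

```python
# Toy sketch of an agent querying apps inside a containerized enterprise stack.
# Endpoints, token, and payloads are assumptions, not the real DRBench APIs.
import requests

BASE = "http://localhost:8080"  # assumed address of the local container stack
TOKEN = "dev-token"             # assumed auth token issued by the environment

def search_mail(query: str) -> list[dict]:
    """Search a hypothetical mail service for messages matching the query."""
    resp = requests.get(
        f"{BASE}/mail/search",
        params={"q": query},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def list_cloud_files(folder: str) -> list[str]:
    """List file names in a hypothetical cloud-storage folder."""
    resp = requests.get(
        f"{BASE}/storage/{folder}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return [item["name"] for item in resp.json()]
```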

What gets scored?

DRBench evaluates AI agents on four key axes:
1. Insight Recall: How well the agent finds and reports the relevant insights.
2. Distractor Avoidance: How well the agent ignores irrelevant or distracting information.
3. Factuality: How accurate the final report is.
4. Report Quality: How well-structured and clear the final report is.
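The first two axes are easiest to picture as set overlaps between what the report mentions and what the task defines. The toy functions below illustrate that intuition only; the official scorer uses rubric-based (and judge-based) evaluation, so treat this as a simplified approximation, not DRBench's scoring code.

```python
# Toy approximation of insight recall and distractor avoidance.
# The real DRBench evaluation is rubric-based; this is only for intuition.
def insight_recall(reported: set[str], ground_truth: set[str]) -> float:
    """Fraction of ground-truth insights that appear in the report."""
    if not ground_truth:
        return 0.0
    return len(reported & ground_truth) / len(ground_truth)

def distractor_avoidance(reported: set[str], distractors: set[str]) -> float:
    """Fraction of known distractors the report correctly leaves out."""
    if not distractors:
        return 1.0
    return 1.0 - len(reported & distractors) / len(distractors)

gt = {"legacy spreadsheet workflow blocks adoption", "training budget was cut"}
noise = {"office move scheduled for Q3"}
report = {"legacy spreadsheet workflow blocks adoption"}

print(insight_recall(report, gt))           # 0.5
print(distractor_avoidance(report, noise))  # 1.0
```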

Meet DRBench Agent (DRBA)

The research team has introduced a task-oriented baseline agent, DRBA, designed to operate natively inside the DRBench environment. DRBA is organized into four components: research planning, action planning, a research loop with Adaptive Action Planning (AAP), and report writing. It’s like having a colleague to compare your AI agent’s performance with! šŸ¤šŸ’»
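To see how those four components fit together, here is a schematic skeleton of the plan-act-report flow the paper describes. The function bodies are toy placeholders written for this post, not ServiceNow's implementation of DRBA.

```python
# Schematic of DRBA's four stages; bodies are toy placeholders, not the real agent.
def research_plan(question: str, context: dict) -> list[str]:
    """Research planning: break the question into sub-questions to investigate."""
    return [f"{question} (public web)", f"{question} (internal data)"]

def action_plan(sub_question: str, tools: list[str]) -> list[dict]:
    """Action planning: map a sub-question to concrete tool actions."""
    return [{"tool": tool, "query": sub_question} for tool in tools]

def execute(action: dict) -> dict:
    """Hypothetical tool runner: performs one action in the environment."""
    return {"action": action, "evidence": "..."}

def research_loop(plan: list[str], tools: list[str]) -> list[dict]:
    """Research loop with Adaptive Action Planning: act, observe, re-plan."""
    findings = []
    for sub_q in plan:
        for action in action_plan(sub_q, tools):
            findings.append(execute(action))
            # AAP would revise the remaining actions here based on new evidence.
    return findings

def write_report(question: str, findings: list[dict]) -> str:
    """Report writing: synthesize a cited report from the collected evidence."""
    return f"Report on: {question}\nEvidence items: {len(findings)}"
```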

Why DRBench is a Game-Changer

Most ā€œdeep researchā€ agents shine on public-web question sets, but production usage requires finding the right internal needles, ignoring distractors, and citing both public and private sources under enterprise constraints. DRBench directly targets this gap, making it a practical benchmark for system builders who need end-to-end evaluation. 🌟

Key Takeaways

– DRBench evaluates deep research agents on complex, open-ended enterprise tasks that require combining public web and private company data.
– The initial release covers 15 tasks across 10 domains, each grounded in realistic user personas and organizational context.
– Tasks span heterogeneous enterprise artifacts and the open web, going beyond web-only setups.
– Reports are scored for insight recall, distractor avoidance, factual accuracy, and coherent, well-structured reporting using rubric-based evaluation.
– Code and benchmark assets are open-sourced on GitHub for reproducible evaluation and extension.

Wanna know more?

Check out the [Paper](https://arxiv.org/abs/2510.00172) and [GitHub page](https://github.com/ServiceNow/drb) for more details. And while you’re at it, feel free to follow us on [Twitter](https://twitter.com/ServiceNow) and join our [100k+ ML SubReddit](https://www.reddit.com/r/MachineLearning/) and [subscribe to our Newsletter](https://www.servicenow.com/subscribe.html). Oh, and we’re on [Telegram](https://t.me/ServiceNowAI) now too! šŸ“¢šŸ“£
