Alibaba’s Tongyi Lab has released Tongyi-DeepResearch-30B-A3B, a large language model built for long-horizon, tool-assisted information seeking on the web. The model uses a mixture-of-experts (MoE) design with approximately 30.5 billion total parameters, of which roughly 3.0 to 3.3 billion are activated per token, giving high throughput while preserving strong reasoning. It is designed to handle multi-turn research workflows, including searching, browsing, extracting, cross-verifying, and synthesizing evidence, under a ReAct-style tool-use paradigm and a “Heavy” test-time scaling mode. The release includes model weights (under the Apache-2.0 license), inference scripts, and evaluation utilities.
Benchmark Results: State-of-the-Art Performance
Tongyi DeepResearch has demonstrated state-of-the-art results on several agentic search suites designed to evaluate “deep research” agents. These include:
– Humanity’s Last Exam (HLE): 32.9
– BrowseComp: 43.4 (English) and 46.7 (Chinese)
– xbench-DeepSearch: 75
It has also shown strong performance across WebWalkerQA, GAIA, FRAMES, and SimpleQA. The team positions the model as on par with OpenAI-style deep-research agents and reports that it systematically outperforms existing proprietary and open-source agents across these tasks.
Architecture and Inference Profile
The MoE routing in Tongyi DeepResearch follows the Qwen3-MoE lineage, giving it the cost envelope of a smaller dense model while retaining specialist capacity. With a context length of 128,000 tokens, it is well suited to long, tool-augmented browsing sessions and iterative synthesis. The model operates in two inference modes:
1. ReAct (native): For direct evaluation of intrinsic reasoning and tool use (a minimal sketch of this loop follows the list).
2. IterResearch “Heavy” mode: For test-time scaling, featuring structured multi-round synthesis and reconstruction of context to minimize noise accumulation.
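To make the native ReAct mode concrete, here is a minimal, hypothetical sketch of a ReAct-style tool-use loop. It is not the released inference harness; the tool names (web_search, browse_page), the tag conventions, and the llm.chat interface are illustrative assumptions.

```python
# Minimal ReAct-style loop: the model alternates reasoning, tool calls, and
# observations until it emits a final answer. All names below are illustrative.
import json

def web_search(query: str) -> str:
    return f"(search results for: {query})"   # placeholder; a real tool would call a search API

def browse_page(url: str) -> str:
    return f"(page content of: {url})"        # placeholder; a real tool would fetch and extract the page

TOOLS = {"web_search": web_search, "browse_page": browse_page}

def react_rollout(llm, question: str, max_turns: int = 20) -> str:
    """Alternate model output and tool observations until an answer is produced."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = llm.chat(messages)  # assumed interface: returns the assistant's next message as text
        messages.append({"role": "assistant", "content": reply})
        if "<answer>" in reply:     # the model signals it has gathered enough evidence
            return reply.split("<answer>")[-1].split("</answer>")[0].strip()
        # otherwise expect a tool call, e.g. <tool_call>{"name": "web_search", "arguments": {"query": "..."}}</tool_call>
        call = json.loads(reply.split("<tool_call>")[-1].split("</tool_call>")[0])
        observation = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "user", "content": f"<tool_response>{observation}</tool_response>"})
    return "No answer produced within the turn budget."
```

The Heavy mode differs mainly in how the accumulated context is managed between rounds, which the workflow section below touches on.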
Training Pipeline: Synthetic Data and On-Policy RL
Tongyi DeepResearch is trained end-to-end as an agent, not just a chat LLM, using a fully automated, scalable data engine. This includes:
– Agentic continual pre-training (CPT): Large-scale synthetic trajectories built from curated corpora, historical tool traces, and graph-structured knowledge to teach retrieval, browsing, and multi-source fusion.
– Agentic SFT cold-start: Trajectories in ReAct and IterResearch formats for schema-consistent planning and tool use.
– On-policy RL with Group Relative Policy Optimization (GRPO): Token-level policy gradients, leave-one-out advantage estimation, and negative-sample filtering to stabilize learning in non-stationary web environments (a toy sketch of the advantage computation follows this list).
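As a rough illustration of the group-relative, leave-one-out advantage idea (a simplified sketch, not the training code; the reward values and group size are made up), each rollout's advantage is its reward minus the mean reward of the other rollouts sampled for the same query:

```python
# Toy group-relative, leave-one-out advantage: each rollout is scored against
# the mean reward of its sibling rollouts for the same prompt.
def leave_one_out_advantages(rewards: list[float]) -> list[float]:
    n = len(rewards)
    total = sum(rewards)
    # baseline for rollout i = mean reward of the other n - 1 rollouts
    return [r - (total - r) / (n - 1) for r in rewards]

# Example: 4 rollouts for one research query with binary task rewards
print(leave_one_out_advantages([1.0, 0.0, 1.0, 0.0]))  # [0.667, -0.667, 0.667, -0.667] (rounded)
```

In the reported setup these advantages feed token-level policy gradients, and uninformative negative rollouts can be filtered before the update; the snippet above only covers the baseline calculation.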
Role in Document and Web Research Workflows
Deep-research tasks demand four key capabilities: long-horizon planning, iterative retrieval and verification, low hallucination rates, and synthesis under large contexts. The IterResearch rollout mitigates context bloat and error propagation by restructuring context each “round,” while the ReAct baseline demonstrates that the behaviors are learned rather than prompt-engineered. The reported scores on HLE and BrowseComp suggest improved robustness on multi-hop, tool-mediated queries where prior agents often struggled with overfitting to prompt patterns or saturating at low depths.
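One way to picture the per-round context reconstruction is sketched below. This is a hypothetical illustration under the assumption that each round keeps only the question, an evolving report, and the most recent observation rather than the full interaction history; the agent.step interface is an assumption, not the released rollout code.

```python
# Hypothetical IterResearch-style rollout: each round rebuilds a compact workspace
# instead of appending every past observation, limiting context bloat.
def build_round_workspace(question: str, report: str, latest_observation: str) -> str:
    return (
        f"Question: {question}\n"
        f"Report so far:\n{report}\n"
        f"Latest observation:\n{latest_observation}\n"
        "Update the report, then choose the next action or give the final answer."
    )

def iterresearch_rollout(agent, question: str, max_rounds: int = 8) -> str:
    report, latest = "", ""
    for _ in range(max_rounds):
        workspace = build_round_workspace(question, report, latest)
        report, action, answer = agent.step(workspace)  # assumed interface: updated report, next action, optional answer
        if answer is not None:
            return answer
        latest = action()                               # run the chosen search/browse call
    return report
```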
Key Features of Tongyi DeepResearch-30B-A3B
– MoE efficiency at scale: Approximately 30.5 billion total parameters with 3.0 to 3.3 billion activated per token, enabling small-model inference cost with large-model capacity.
– Extended context window: 128,000 tokens for long-horizon rollouts with evidence accumulation in multi-step web research.
– Dual inference paradigms: Native ReAct for intrinsic tool-use evaluation and IterResearch “Heavy” for deeper multi-round synthesis.
– Automated agentic data engine: A fully automated synthesis pipeline powering agentic continual pre-training (CPT), supervised fine-tuning (SFT), and RL.
– On-policy RL with GRPO: Group Relative Policy Optimization with token-level policy gradients, leave-one-out advantage estimation, and selective negative-sample filtering for stability.
– Reported SOTA on deep-research suites: HLE 32.9, BrowseComp 43.4 (English) / 46.7 (Chinese), xbench-DeepSearch 75; strong results on WebWalkerQA, GAIA, FRAMES, and SimpleQA.
Summary
Tongyi DeepResearch-30B-A3B offers a comprehensive open-source stack: a MoE architecture, an extended context window, dual ReAct/IterResearch rollouts, and an automated agentic data engine paired with a GRPO RL pipeline. For teams building long-horizon research agents, it provides a practical balance of inference cost and capability, with strong reported performance on deep-research benchmarks. The model weights are available on Hugging Face, and the inference scripts, evaluation utilities, and technical details are on the project's GitHub page.
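For a concrete starting point, the released checkpoint can be loaded with Hugging Face transformers roughly as follows. This is a minimal sketch; the repository id Alibaba-NLP/Tongyi-DeepResearch-30B-A3B and the generation settings are assumptions to verify against the model card.

```python
# Minimal sketch: load the checkpoint and run a single prompt with transformers.
# Repository id and generation settings are assumptions; check the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Alibaba-NLP/Tongyi-DeepResearch-30B-A3B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize the key open questions about MoE routing."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

In practice the model is meant to be driven through the ReAct/IterResearch harness in the GitHub repository rather than a bare generate call; this snippet only checks that the weights load and respond.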