Agents-A1 reaches trillion-parameter-model performance through long-horizon task training, not a larger parameter count

For years, the AI industry has operated under a simple gospel: bigger models equal better results. A team from Shanghai AI Laboratory just published a paper that politely disagrees.

Their model, Agents-A1, packs 35 billion parameters into a Mixture-of-Experts architecture. It matches, and on several benchmarks outperforms, trillion-parameter-class models roughly 30 times its size. The gain did not come from scaling up the parameter count. It came from extending the task horizon: training the model on longer, more complex task sequences.

How a 35B model punches above its weight

The model goes through a three-stage training protocol. First comes full-domain supervised fine-tuning, where the model learns across a broad set of tasks. Then it trains with domain-level teacher models, essentially learning from specialized experts. Finally, a multi-teacher on-policy distillation stage lets the model learn from several teachers at once, with the teachers giving feedback on outputs the model itself generates rather than on a fixed dataset.
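To make the third stage more concrete, here is a minimal sketch of what a multi-teacher on-policy distillation loss could look like. The article does not give the paper's actual objective, so everything here is an assumption for illustration: the function names are hypothetical, the per-token reverse KL divergence is a common choice in on-policy distillation rather than the confirmed method, and the equal teacher weighting is a placeholder. "On-policy" means the logits are evaluated at token positions of sequences the student sampled itself.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token position."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
    )

def multi_teacher_distill_loss(student_logits, teacher_logits_list, weights=None):
    """Weighted reverse KL, averaged over token positions.

    student_logits: one logit vector per position of a student-sampled sequence.
    teacher_logits_list: for each teacher, logit vectors at the same positions.
    weights: per-teacher weights (assumed equal when not given).
    """
    n_teachers = len(teacher_logits_list)
    if weights is None:
        weights = [1.0 / n_teachers] * n_teachers
    total = 0.0
    for pos, s_logits in enumerate(student_logits):
        s_probs = softmax(s_logits)
        for w, teacher_logits in zip(weights, teacher_logits_list):
            total += w * reverse_kl(s_probs, softmax(teacher_logits[pos]))
    return total / len(student_logits)
```

In this sketch the loss is zero when every teacher agrees with the student and grows as the student's token distribution drifts away from the teachers'. A real training loop would compute the logits with the actual models and backpropagate through the student only.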

There’s also a domain-grounded knowledge-action framework baked into the architecture. This gives the model a structured way to make decisions based on actions, observations, and verified outcomes.
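One way to picture a loop built on actions, observations, and verified outcomes is the small data structure below. This is only an illustrative sketch: the class and method names are hypothetical, and the article does not describe how the paper actually represents or verifies these steps.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    action: str        # what the agent did (hypothetical free-text form)
    observation: str   # what came back from the environment
    verified: bool     # whether the outcome passed some external check

@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)

    def record(self, action: str, observation: str, verified: bool) -> None:
        """Append one action/observation pair with its verification status."""
        self.steps.append(Step(action, observation, verified))

    def verified_observations(self) -> List[str]:
        """Return only the observations whose outcomes were verified."""
        return [s.observation for s in self.steps if s.verified]
```

Under this assumption, the agent would base its next decision only on `verified_observations()`, so unverified results never become part of what it treats as known.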

The benchmark numbers tell the story

Agents-A1 scored 56.4 on SEAL-0, a benchmark designed to evaluate complex agent capabilities. On IFBench, which tests instruction-following ability, it reached 80.6. And on GAIA, which measures general AI assistant performance, it scored 96.0.

The model also supports advanced capabilities like tool usage and function calling. These are critical for real-world agent applications where a model needs to interact with APIs, databases, or external software rather than just generating text in a vacuum.
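The application side of function calling can be sketched roughly as follows: the model emits a structured request naming a tool and its arguments, and the host program looks the tool up and runs it. The JSON shape, the `get_weather` stub, and the `dispatch` helper are all assumptions made for illustration; the article does not specify the tool-call format Agents-A1 uses.

```python
import json

def get_weather(city: str) -> str:
    """Stub tool; a real one would call an external API."""
    return f"(stub) weather for {city}"

# Registry mapping tool names the model may request to Python callables.
TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Run the tool named in a JSON tool call and return a JSON result.

    Assumed input shape: {"name": "<tool>", "arguments": {...}}
    """
    call = json.loads(tool_call_json)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        # Report unknown tools back to the model instead of raising.
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOLS[name](**args)
    return json.dumps({"name": name, "result": result})
```

The string returned by `dispatch` would then be fed back to the model as the observation for its next turn.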

Why this matters beyond the benchmarks

Agents-A1 was developed by InternScience, part of the AI for Science Center at Shanghai AI Laboratory. It was released on June 30, 2026, under an Apache-2.0 open-source license. The team also published evaluation code and designed the model for compatibility with popular serving frameworks like vLLM and SGLang.

The paper’s title, “Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent,” neatly captures the thesis. The AI industry has spent billions chasing parameter counts as the primary scaling axis. This work suggests there’s a second axis, task horizon length, that may deliver comparable gains at a fraction of the cost.

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.
