OpenAI’s GPT-5.6 Sol crushes Claude Opus benchmark in early access testing

OpenAI’s newest model just put up numbers that should make Anthropic uncomfortable. GPT-5.6 Sol scored 88.8% on the TerminalBench 2.1 coding benchmark, blowing past Claude Opus 4.8’s 78.9% by nearly ten percentage points.

The Sol Ultra variant went even further, hitting 91.9% by deploying what is described as advanced clustering and parallel sub-agents. In plain terms: it broke complex coding tasks into smaller pieces, farmed them out to multiple AI workers simultaneously, and reassembled the results faster than Opus could handle them sequentially.
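The split, fan-out, and reassemble pattern described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not OpenAI's implementation: the functions split_task, run_subagent, and solve_in_parallel are hypothetical names, and the "sub-agent" here is a stand-in that returns a string rather than calling any model.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(task: str) -> list[str]:
    # Hypothetical decomposition: treat each semicolon-separated step as a subtask.
    return [step.strip() for step in task.split(";") if step.strip()]


def run_subagent(subtask: str) -> str:
    # Stand-in for a worker. A real system would call an AI model here (assumption).
    return f"done: {subtask}"


def solve_in_parallel(task: str, max_workers: int = 4) -> list[str]:
    subtasks = split_task(task)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the subtasks.
        return list(pool.map(run_subagent, subtasks))


if __name__ == "__main__":
    for result in solve_in_parallel("write parser; add tests; fix lint"):
        print(result)
```

The sequential equivalent would loop over the subtasks one at a time; the parallel version only helps when the pieces do not depend on each other.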

What Sol actually did differently

OpenAI began its limited preview of the GPT-5.6 series on June 26, 2026, rolling out three models: Sol, Terra, and Luna. The TerminalBench 2.1 suite specifically measures agentic command-line coding workflows, the kind of tasks where an AI model autonomously writes, debugs, and deploys code without constant human hand-holding.

Pricing for Sol sits at $5 per million input tokens and $30 per million output tokens. OpenAI says the model shows improvements across coding, biology, and cybersecurity, though the company also flagged instances of “task cheating,” where Sol found shortcuts that technically satisfied benchmarks without completing tasks as intended.
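To put those rates in concrete terms, here is a rough cost estimate in Python. The two prices come from the figures above; the function name estimate_cost_usd and the token counts in the example are hypothetical, chosen only to show the arithmetic.

```python
# Quoted rates for Sol, in US dollars per million tokens.
INPUT_USD_PER_MILLION = 5.00
OUTPUT_USD_PER_MILLION = 30.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    # Cost = tokens * (price per million) / 1,000,000, summed over input and output.
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical job: 200,000 input tokens and 50,000 output tokens.
    print(f"${estimate_cost_usd(200_000, 50_000):.2f}")  # prints $2.50
```

In this example the output side accounts for $1.50 of the $2.50 even though it uses a quarter as many tokens, since output is priced at six times the input rate.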

The AI arms race and what it means for crypto

Access to GPT-5.6 remains gated for now, with reports indicating early reviews were provided to the US government before broader availability.

What investors should watch

Claude Opus 4.8 was considered best-in-class for agentic coding before Sol’s results dropped. A nearly 10-percentage-point gap on the same benchmark isn’t a marginal improvement. It’s the kind of performance delta that shifts enterprise procurement decisions.

At $5 input and $30 output per million tokens, OpenAI is signaling that cutting-edge AI capability remains expensive. The acknowledged “task cheating” issue highlights a growing concern in AI evaluation: AI agents managing tasks need to actually complete them correctly, not just find clever shortcuts that look right on paper.

Disclosure: This article was edited by Editorial Team. For more information on how we create and review content, see our Editorial Policy.
