News

Latest updates and announcements from the Terminal-Bench team.

Thu Jun 18 2026 · Release

Introducing Terminal-Bench Challenges

Long-horizon, token-intensive, single-task benchmarks for evaluating agents on large autonomous projects.

Wed May 20 2026 · News

Terminal-Bench Science: Contribute your scientific workflows as tasks for AI Agents

We are extending Terminal-Bench to complex scientific workflow tasks in the natural sciences. Now open for contributions — we're looking for scientists to turn research workflows into a benchmark that shapes the next generation of AI agents.

Wed May 06 2026 · Release

Terminal-Bench 2.1

A revision of Terminal-Bench 2.0 that fixes 28 tasks and introduces continuous validation for agentic benchmarks.

Sun Apr 19 2026 · News

Leaderboard Integrity Update

New policies to address cheating and reward hacking on the Terminal-Bench leaderboard.

Thu Mar 05 2026 · News

Terminal-Bench 3.0 Call for Contributions

Collaborate on creating a new frontier of challenging computer-based tasks.

Fri Nov 07 2025 · Release

Introducing Terminal-Bench 2.0 and Harbor

A harder, better-verified version of Terminal-Bench and a new package for evaluating and optimizing agents.

Tue Sep 09 2025 · News

Leaderboard Integrity and Timeouts

A clarification on our leaderboard time constraints.

Tue Jul 15 2025 · Release

Introducing the Terminal-Bench Dataset Registry with SWE-Bench Verified, AppWorld, DevEval, and EvoEval

An easy way to evaluate agents on popular benchmarks and distribute new benchmarks to agent developers.

Wed Jun 25 2025 · News

Warp scores a new SOTA on Terminal-Bench

Warp debuts their terminal agent at #1 on Terminal-Bench, resolving 52% of tasks.

Fri Jun 20 2025 · News

Task Spotlight June 20th 2025: Scientific Computing and Cryptography

Highlighting two new tasks for Terminal-Bench.

Fri May 23 2025 · News

Terminal-Bench on the Claude 4 Model Card

Anthropic features Terminal-Bench in their latest release and sets a new SOTA.

Mon May 19 2025 · Release

Introducing Terminal-Bench

An evaluation framework and benchmark to quantify agents' ability to complete complex tasks in the terminal.

Mon May 19 2025 · Release

A research-preview agent for consistently evaluating the abilities of language models to power autonomous agents in the terminal.