News
Latest updates and announcements from the Terminal-Bench team.
Wed May 06 2026
Release
Terminal-Bench 2.1
A revision of Terminal-Bench 2.0 that fixes 28 tasks and introduces continuous validation for agentic benchmarks.
Sun Apr 19 2026
News
Leaderboard Integrity Update
New policies to address cheating and reward hacking on the Terminal-Bench leaderboard.
Sun Mar 08 2026
News
Terminal-Bench-Science: Now in Development
Extending Terminal-Bench to complex scientific workflow tasks in the natural sciences.
Thu Mar 05 2026
News
Terminal-Bench 3.0 Call for Contributions
Collaborate on creating a new frontier of challenging computer-based tasks.
Fri Nov 07 2025
Release
Introducing Terminal-Bench 2.0 and Harbor
A harder, better-verified version of Terminal-Bench and a new package for evaluating and optimizing agents.
Tue Sep 09 2025
News
Leaderboard Integrity and Timeouts
A clarification on our leaderboard time constraints.
Tue Jul 15 2025
Release
Introducing the Terminal-Bench Dataset Registry with SWE-Bench Verified, AppWorld, DevEval, and EvoEval
An easy way to evaluate agents on popular benchmarks and distribute new benchmarks to agent developers.
Wed Jun 25 2025
News
Warp scores a new SOTA on Terminal-Bench
Warp debuts its terminal agent at #1 on Terminal-Bench, resolving 52% of tasks.
Fri Jun 20 2025
News
Task Spotlight June 20th 2025: Scientific Computing and Cryptography
Highlighting two new tasks for Terminal-Bench.
Fri May 23 2025
News
Terminal-Bench on the Claude 4 Model Card
Anthropic features Terminal-Bench in their latest release and sets a new SOTA.
Mon May 19 2025
Release
Introducing Terminal-Bench
An evaluation framework and benchmark to quantify agents' ability to complete complex tasks in the terminal.
Mon May 19 2025
Release
Terminus
A research-preview agent for consistently evaluating the abilities of language models to power autonomous agents in the terminal.