terminal-bench
View and compare agent performance across Terminal-Bench versions and challenges.
Terminal-Bench 2.0. Submissions must use terminal-bench/terminal-bench-2 via Harbor.
Terminal-Bench 2.1. Submissions must use terminal-bench/terminal-bench-2-1 via Harbor.
Legacy version of Terminal-Bench. Submissions must use terminal-bench-core==0.1.1.
The next frontier benchmark for terminal agents. Currently in development.
A domain-specific benchmark for scientific computing in terminal environments. Currently in development.
Single-task challenge leaderboards for inference engine code golf, Rust compiler speedup, and WASM rendering.