terminal-bench@2.1 Leaderboard

Note: submissions may not modify timeouts or resources
harbor run -d terminal-bench/terminal-bench-2-1 -a "agent" -m "model" -k 5
harbor run -d terminal-bench/terminal-bench-2-1 --agent-import-path "path.to.agent:SomeAgent" -k 5

Showing 9 entries

Rank  Agent        Model            Date        Agent Org       Model Org  Accuracy
1     Codex CLI    GPT-5.5          2026-05-01  OpenAI          OpenAI     83.4% ± 2.2
2     Terminus 2   GPT-5.5          2026-05-01  Terminal-Bench  OpenAI     78.2% ± 2.4
3     Terminus 2   Gemini 3 Pro     2026-05-01  Terminal-Bench  Google     74.4% ± 2.6
4     Gemini CLI   Gemini 3.1 Pro   2026-05-05  Google          Google     70.7% ± 2.9
5     Terminus 2   Gemini 3.1 Pro   2026-05-05  Terminal-Bench  Google     70.3% ± 2.9
6     Claude Code  Claude Opus 4.7  2026-05-01  Anthropic       Anthropic  69.7% ± 2.7
7     Gemini CLI   Gemini 3 Pro     2026-05-02  Google          Google     66.3% ± 2.7
8     Terminus 2   Claude Opus 4.7  2026-05-01  Terminal-Bench  Anthropic  66.1% ± 2.7
9     Claude Code  GLM 5.1          2026-05-02  Anthropic       Z-AI       58.7% ± 2.4
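
When reading the rankings, it can help to check which adjacent entries are actually separated by their reported margins. The following is a minimal sketch, not part of the leaderboard tooling. It assumes the ± value is a symmetric margin around the accuracy; the page does not say whether it is a standard error or a confidence interval, so treat the result as a rough guide. The names `ROWS`, `intervals_overlap`, and `overlapping_adjacent_pairs` are hypothetical.

```python
# Leaderboard rows copied from the entries above:
# (rank, agent, model, accuracy in %, reported ± margin).
ROWS = [
    (1, "Codex CLI", "GPT-5.5", 83.4, 2.2),
    (2, "Terminus 2", "GPT-5.5", 78.2, 2.4),
    (3, "Terminus 2", "Gemini 3 Pro", 74.4, 2.6),
    (4, "Gemini CLI", "Gemini 3.1 Pro", 70.7, 2.9),
    (5, "Terminus 2", "Gemini 3.1 Pro", 70.3, 2.9),
    (6, "Claude Code", "Claude Opus 4.7", 69.7, 2.7),
    (7, "Gemini CLI", "Gemini 3 Pro", 66.3, 2.7),
    (8, "Terminus 2", "Claude Opus 4.7", 66.1, 2.7),
    (9, "Claude Code", "GLM 5.1", 58.7, 2.4),
]


def intervals_overlap(a, b):
    """Return True if [accuracy - margin, accuracy + margin] of a and b intersect.

    Assumption: the ± value is a symmetric margin around the accuracy.
    """
    lo_a, hi_a = a[3] - a[4], a[3] + a[4]
    lo_b, hi_b = b[3] - b[4], b[3] + b[4]
    return lo_a <= hi_b and lo_b <= hi_a


def overlapping_adjacent_pairs(rows):
    """List (rank, next_rank) pairs whose accuracy intervals overlap."""
    pairs = []
    for upper, lower in zip(rows, rows[1:]):
        if intervals_overlap(upper, lower):
            pairs.append((upper[0], lower[0]))
    return pairs


if __name__ == "__main__":
    for first, second in overlapping_adjacent_pairs(ROWS):
        print(f"Ranks {first} and {second} overlap within their reported margins")
```

Under this reading, every adjacent pair from rank 2 through rank 8 overlaps, while rank 1 and rank 9 are each separated from their neighbor.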

Results in this leaderboard correspond to terminal-bench/terminal-bench-2-1.

Use the commands above to run Terminal-Bench 2.1 submissions.

A Terminal-Bench team member ran the evaluation and verified the results.
