Terminal-Bench-Science
A Benchmark for Evaluating AI Agents on Computational Workflows in the Natural Sciences
We're excited to announce that Terminal-Bench-Science is now in development — extending Terminal-Bench to the complex real-world computational workflows that natural scientists run in their research labs.
About
What is Terminal-Bench-Science?
Terminal-Bench-Science (TB-Science) is a benchmark for evaluating AI agents on the complex real-world computational workflows that natural scientists run in their research labs. It builds on the success of Terminal-Bench, which has been adopted by frontier labs such as OpenAI, Anthropic, and Google DeepMind and helped drive rapid progress in AI coding agents by defining what leading labs measure and optimize for. No equivalent exists for science — until now.
Current "AI for Science" benchmarks test textbook knowledge or abstract capabilities like hypothesis generation. They do not measure whether an AI system can execute the end-to-end computational workflows that drive modern research in the natural sciences. TB-Science will close this gap by porting real workflows from leading research labs into executable benchmark tasks, evaluated in containerized environments with deterministic, programmatic verification.
Our goal is to catalyze a "Claude Code / Codex for Science" moment by giving natural scientists a direct voice in shaping AI progress: domain experts contribute real workflows, frontier labs optimize against them, and the resulting advances flow back as more capable AI tools for scientific discovery, creating a virtuous cycle between the scientists who know what matters and the labs building the next generation of AI.
Figure: the virtuous cycle of AI for Science progress: domain experts contribute complex real-world scientific workflows as tasks; tasks are used to evaluate and rank frontier AI agents/models; frontier labs invest in improving the scientific capabilities of their agents/models; and improved agents/models accelerate scientific research.
Domains
TB-Science is targeting 100+ benchmark tasks across the natural sciences, spanning the life sciences, physical sciences, earth sciences, and mathematical & computational sciences:
| Domain | Areas |
|---|---|
| Life Sciences | Biology, Medicine, Neuroscience |
| Physical Sciences | Astronomy, Chemistry & Materials, Physics |
| Earth Sciences | Atmospheric Science, Geoscience, Ocean Science |
| Mathematical & Computational Sciences | Applied Mathematics, Scientific Computing, Data Science & Statistics |
Timeline
- Q1 2026 — Project launch, initial task collection and review
- Q2 2026 — Open contribution call, extensive task collection and review, evaluation runs
- Q3 2026 — Public release and leaderboard, paper submission
Contribute
What We're Looking For
We're looking for complex, real-world computational workflows from practicing scientists across the natural sciences — including biology, chemistry, physics, earth sciences, neuroscience, medicine, and scientific computing. Each task should meet three key criteria:
- Scientifically grounded. Tasks should be drawn directly from real research workflows, not toy problems or textbook exercises. The best tasks are ones from your own research: data analysis pipelines, simulation setups, numerical solvers, model fitting, instrument data processing, image analysis, signal processing, or other computational challenges you've had to build, run, debug, or solve.
- Objectively verifiable. Every task must have concrete, checkable outputs, such as numerical results, generated files, statistical fits, or reproducible data (see the sketch after this list). We are not looking for open-ended tasks like hypothesis generation or literature review. Our goal is to drive AI progress toward a reliable scientific assistant, not to replace scientists in the creative and intellectual aspects of research.
- Genuinely difficult. We want tasks that today's best AI models and agents cannot yet reliably solve. Easy tasks don't drive progress. Hard tasks are what expose real gaps and push AI capabilities forward. Our target is for frontier models to complete only 10–20% of tasks at release, keeping the benchmark at the cutting edge of AI for Science capability.
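The sketch below is one illustration of what an objectively verifiable check could look like: a pytest-style script that compares an agent-produced output file against reference values within a tolerance. The file name `results/fit_params.json`, the parameter names, the reference values, and the tolerance are all hypothetical; the actual task and test format used by Terminal-Bench-Science is described in the Contributing Guide.

```python
# Minimal sketch of a deterministic verification script for a hypothetical
# curve-fitting task. File names, keys, and reference values are illustrative
# only; see the Contributing Guide for the real task and test format.
import json
import math
from pathlib import Path

# Reference values the task author obtained from their own validated pipeline.
EXPECTED = {"amplitude": 2.31, "decay_rate": 0.174}
REL_TOL = 1e-2  # agreement to within 1% counts as a pass


def test_fit_parameters():
    """The agent is expected to write its fitted parameters to results/fit_params.json."""
    out_file = Path("results/fit_params.json")
    assert out_file.exists(), "expected output file was not produced"

    fitted = json.loads(out_file.read_text())
    for name, expected in EXPECTED.items():
        assert name in fitted, f"missing fitted parameter: {name}"
        assert math.isclose(fitted[name], expected, rel_tol=REL_TOL), (
            f"{name}={fitted[name]} deviates from reference value {expected}"
        )
```

In a real task, the reference values would come from the contributor's own validated analysis, and the check should be deterministic (fixed seeds, pinned inputs) so that a pass or fail is unambiguous.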
How to Contribute
We welcome contributions to Terminal-Bench-Science! To maintain quality standards, we follow a curated contribution process:
- Connect — Join our Discord, introduce yourself in #tb-science, and optionally pitch your task idea in #tb-science-task-ideas to get quick feedback before investing in a full proposal. Follow #tb-science-announcements for updates and weekly meetings (Mondays, 11am PT).
- Propose — Submit your task idea through our official Task Proposal Form. Our science team will review your proposal for scientific rigor, authenticity, and alignment with the benchmark's scope and standards. All submitted proposals are posted publicly on our Task Proposal Board and in #tb-science-task-proposals, where they undergo an automatic review followed by final evaluation from a domain expert.
- Build — Once your proposal is approved, build the task and submit a Pull Request. Our engineering team will review your PR for technical correctness, reproducibility, and adherence to the task format — and will work with you iteratively through PR reviews and feedback to refine your task until it's ready to merge. Need help? Our team is available to support you at every step. See our Contributing Guide for details.
After your task is merged, we run frontier AI agents and models against it and other merged tasks to verify difficulty and calibrate scoring. Based on the results, we'll work with you to finalize your task for inclusion in the official benchmark release.
We're also looking for scientific domain expert reviewers — PIs and senior researchers who can review submitted task proposals in their area of expertise.
Get Involved
Join our Discord server. Key channels: #tb-science for discussion and questions, #tb-science-task-ideas for pitching early task ideas, #tb-science-announcements for important updates, and #tb-science-task-proposals for submitted task proposals and automatic task review summaries. We also have weekly meetings at 11am PT every Monday that you're welcome to join. Get in touch at stevendi@stanford.edu if you want to get involved as a contributor or domain expert reviewer.
Useful links:
- Task Proposal Form — submit your task idea
- Task Proposal Board — browse and discuss proposals
- Contributing Guide — task format, setup, and submission guide
- Discord — #tb-science (discussion), #tb-science-announcements (updates), #tb-science-task-proposals (proposals)
- GitHub — source code and task submissions
- Harbor — run Terminal-Bench evaluations
- Weekly Meeting — 11am PT every Monday
Acknowledgements
Terminal-Bench-Science is part of the Terminal-Bench franchise, hosted by Stanford University and the Laude Institute, and built by the Harbor Framework team and scientific community contributors. We thank the Laude Institute and 2077AI for their generous support via API credits that enable running benchmark evaluations.
Contact
For questions, feedback, or if you're interested in contributing, reach out to Steven Dillmann at stevendi@stanford.edu.
Terminal-Bench-Science is an open academic collaboration hosted by Stanford University and the Laude Institute.