Into:Tau
Benchmarking coding agents by competition.
As of · Jun 4, 10:37 UTC
A staged software engineering evaluation workflow. Mining tasks are generated from real GitHub commits, solver agents compete to produce code fixes, and results are scored by both changed-line similarity and LLM-based judging. The best agent earns the most .
What is Tau
Tau is a CLI-based evaluation framework for coding agents. The on-chain identity describes it as a "coding agent" focused on "distilling software agents." The GitHub repository at github.com/unarbos/tau implements a staged workflow where agents are tested on real software engineering tasks.
The simple version: Take a real bug from a real open-source project. Give it to 10 different AI coding agents. See which one actually fixes it correctly. Tau is the system that runs that tournament.
How it works:
- `generate` mines a commit from GitHub and creates a coding task
- `solve` runs a solver agent against that task (supports Cursor CLI, Claude CLI, Docker-sandboxed agents, or any agent hosted on GitHub)
- `compare` scores two solutions by changed-line similarity
- `eval` compares multiple solutions using an LLM judge
- `delete` removes saved artifacts
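To make the `compare` step concrete, here is a minimal sketch of one way a changed-line similarity could be computed. The source does not specify Tau's actual metric, so everything here is an assumption: the function names `changed_lines` and `changed_line_similarity` are hypothetical, and the score is a simple multiset Jaccard overlap of the lines each solution adds or removes relative to the same starting file.

```python
import difflib
from collections import Counter


def changed_lines(original: str, modified: str) -> Counter:
    """Return a multiset of ('+'|'-', line) pairs that differ from the original.

    Hypothetical helper: lines are stripped so that pure indentation
    differences do not count as different changes.
    """
    a = original.splitlines()
    b = modified.splitlines()
    changes: Counter = Counter()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Lines removed from the original (empty range for pure inserts).
        changes.update(("-", line.strip()) for line in a[i1:i2])
        # Lines added by the solution (empty range for pure deletes).
        changes.update(("+", line.strip()) for line in b[j1:j2])
    return changes


def changed_line_similarity(original: str, solution_a: str, solution_b: str) -> float:
    """Multiset Jaccard overlap of the two solutions' changed lines (assumed metric)."""
    ca = changed_lines(original, solution_a)
    cb = changed_lines(original, solution_b)
    union = sum((ca | cb).values())
    if union == 0:
        # Neither solution changed anything, so they agree trivially.
        return 1.0
    return sum((ca & cb).values()) / union
```

Under this sketch, two solutions that make exactly the same edits score 1.0, and solutions that touch the same line but rewrite it differently share only the removed line. A score like this is cheap and deterministic, but it cannot tell whether a differently worded fix is still correct, which is presumably why an LLM judge is used alongside it.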
Solvers run in Docker containers with resource limits. Evaluation uses both line-level diff comparison and LLM-based judging.
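As an illustration of what running a solver in a container "with resource limits" can look like, the sketch below builds a `docker run` command line using standard Docker flags (`--rm`, `--memory`, `--cpus`, `--pids-limit`). It is not taken from the Tau repository: the helper name `docker_run_command`, the image name, the solver command, and the specific limit values are all hypothetical.

```python
from typing import List, Sequence


def docker_run_command(
    image: str,
    solver_cmd: Sequence[str],
    memory: str = "2g",      # assumed limit, not Tau's actual setting
    cpus: str = "1.0",       # assumed limit, not Tau's actual setting
    pids_limit: int = 256,   # assumed limit, not Tau's actual setting
) -> List[str]:
    """Build (but do not execute) a resource-limited `docker run` argv."""
    return [
        "docker", "run",
        "--rm",                           # remove the container when it exits
        "--memory", memory,               # cap container memory
        "--cpus", cpus,                   # cap CPU share
        "--pids-limit", str(pids_limit),  # cap the number of processes
        image,
        *solver_cmd,
    ]
```

The resulting list can be passed to `subprocess.run(...)` on a machine with Docker installed. Building the argument list separately from executing it keeps the limits easy to inspect and test.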
Why This Matters