</p>

<p align="center"><strong>Turn AI coding runs into portable, replayable, benchmark-ready task bundles.</strong></p>

<p align="center">A practical format between raw chat logs and heavyweight benchmark platforms.</p>

<p align="center">
  <a href="#quickstart"><strong>Quick Start</strong></a> ·
  <a href="#real-bundles"><strong>Real Output</strong></a> ·
  <a href="#format-vs-alternatives"><strong>Why This Format</strong></a> ·
  <a href="./docs/bundle-format.md"><strong>Bundle Format</strong></a> ·
  <a href="./docs/sample-benchmark-report.md"><strong>Sample Report</strong></a> ·
  <a href="./ROADMAP.md"><strong>Roadmap</strong></a> ·
  <a href="./docs/branding.md"><strong>Brand Assets</strong></a>
</p>

Package a task once, inspect it later, compare tools on the same starting point, and generate benchmark-style reports from real artifacts.

It helps you:
- turn one AI coding run into a clean, shareable directory instead of leaving it scattered across screenshots, transcripts, or loose patches
- compare Codex, Claude Code, Cursor, or internal agents using metadata, hashes, and outcome fields
- generate benchmark-style reports from a folder of bundles without standing up a full evaluation platform first
- preserve enough context for reruns and comparisons without requiring token-perfect recording

It fits the gap between raw logs and full evaluation systems: light enough for day-to-day work, structured enough for replay and benchmarking.

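Under the hood, the "hashes" mentioned above need be nothing more exotic than a per-artifact SHA-256 digest, so that two bundles can be compared byte-for-byte. A minimal sketch using Node's built-in `crypto` module; the `{ path, sha256 }` record shape here is illustrative, not the actual bundle schema:

```typescript
import { createHash } from "node:crypto";
import { readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hash one artifact file; identical bytes always produce identical digests,
// which is what makes cross-tool comparison trustworthy.
function hashArtifact(path: string): { path: string; sha256: string } {
  const digest = createHash("sha256").update(readFileSync(path)).digest("hex");
  return { path, sha256: digest };
}

// Example: two files with the same content hash to the same digest.
const fileA = join(tmpdir(), "artifact-a.txt");
const fileB = join(tmpdir(), "artifact-b.txt");
writeFileSync(fileA, "hello\n");
writeFileSync(fileB, "hello\n");
const a = hashArtifact(fileA);
const b = hashArtifact(fileB);
console.log(a.sha256 === b.sha256); // same bytes, same hash
```
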
It is designed for workflows where you want to:
- inspect what happened

If you want the shortest possible proof that the project already works, this is it.

![Task Bundle workflow overview](./assets/workflow-overview.svg)

<a id="real-bundles"></a>

## See It On Real Bundles
2. Fix greeting punctuation | claude-code / claude-sonnet-4 | success | score 0.89
```

Browse the committed example report:
- [docs/sample-benchmark-report.md](./docs/sample-benchmark-report.md)
- [docs/sample-benchmark-report.zh-CN.md](./docs/sample-benchmark-report.zh-CN.md)
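
A ranking like the one shown above is, at its core, just a sort over outcome fields. A minimal sketch, assuming a simplified `{ task, tool, status, score }` outcome record (illustrative, not the actual bundle schema):

```typescript
// Simplified outcome records; the real bundle schema may differ.
interface Outcome {
  task: string;
  tool: string;
  status: "success" | "failure";
  score: number;
}

// Rank successful runs first, then by descending score.
function rank(outcomes: Outcome[]): Outcome[] {
  return [...outcomes].sort((a, b) => {
    if (a.status !== b.status) return a.status === "success" ? -1 : 1;
    return b.score - a.score;
  });
}

const ranked = rank([
  { task: "Fix greeting punctuation", tool: "claude-code", status: "success", score: 0.89 },
  { task: "Fix greeting punctuation", tool: "codex", status: "failure", score: 0.1 },
  { task: "Add hello endpoint", tool: "codex", status: "success", score: 0.95 },
]);
for (const o of ranked) {
  console.log(`${o.task} | ${o.tool} | ${o.status} | score ${o.score}`);
}
```

Because each bundle carries its outcome as structured data rather than prose, a report generator only needs to read a folder of bundles and apply an ordering like this.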

<a id="format-vs-alternatives"></a>

## How It Compares To Common Alternatives

| Need | Chat logs | Zip or tarball | Full benchmark platform | Task Bundle |
| --- | --- | --- | --- | --- |
| Share the original task and result together | Partial | Yes | Yes | Yes |
| Compare different tools on the same starting point | Weak | Manual | Yes | Yes |
| Carry artifact hashes and outcome metadata | No | No | Yes | Yes |
| Stay lightweight enough for everyday coding workflows | Yes | Yes | No | Yes |
| Grow into replay and benchmark workflows later | Weak | Weak | Yes | Yes |

## Why It Matters

Most AI coding work disappears into screenshots, transcripts, or one-off patches.
Task Bundle gives you a durable unit you can inspect, archive, compare, and validate.

- agent builders who want reproducible tasks
- eval and benchmark authors who need structured task artifacts
- teams comparing Codex, Claude Code, Cursor, or custom tools
- researchers who care about re-execution over token-perfect replay

## What Replay Means Here


You can also point `taskbundle report` at the same directory to generate a small benchmark-style leaderboard.

For a committed snapshot of that output, see [docs/sample-benchmark-report.md](./docs/sample-benchmark-report.md).

## Bundle Format At A Glance

- `bundle.json`: top-level metadata and artifact pointers