Commit ac478fb: Polish public docs and demo assets

1 parent 199f4e2 · commit ac478fb

8 files changed (+218 −17 lines)


README.md

Lines changed: 30 additions & 8 deletions

````diff
@@ -7,11 +7,13 @@
 </p>
 
 <p align="center"><strong>Turn AI coding runs into portable, replayable, benchmark-ready task bundles.</strong></p>
-<p align="center">The missing middle layer between raw chat logs and heavyweight benchmark platforms.</p>
+<p align="center">A practical format between raw chat logs and heavyweight benchmark platforms.</p>
 <p align="center">
 <a href="#quickstart"><strong>Quick Start</strong></a> ·
 <a href="#real-bundles"><strong>Real Output</strong></a> ·
+<a href="#format-vs-alternatives"><strong>Why This Format</strong></a> ·
 <a href="./docs/bundle-format.md"><strong>Bundle Format</strong></a> ·
+<a href="./docs/sample-benchmark-report.md"><strong>Sample Report</strong></a> ·
 <a href="./ROADMAP.md"><strong>Roadmap</strong></a> ·
 <a href="./docs/branding.md"><strong>Brand Assets</strong></a>
 </p>
@@ -24,13 +26,13 @@ Task Bundle is a TypeScript + Node.js CLI for teams building agents, evals, codi
 
 Package a task once, inspect it later, compare tools on the same starting point, and generate benchmark-style reports from real artifacts.
 
-Why people star it:
-- turn one AI coding run into a clean, shareable directory instead of a screenshot, transcript, or loose patch
-- compare Codex, Claude Code, Cursor, or internal agents with real metadata, hashes, and outcome fields
-- generate benchmark-style reports from a folder of bundles without building a full evaluation platform first
-- keep replay grounded in re-execution and comparison, not token-by-token theater
+It helps you:
+- turn one AI coding run into a clean, shareable directory instead of leaving it scattered across screenshots, transcripts, or loose patches
+- compare Codex, Claude Code, Cursor, or internal agents using metadata, hashes, and outcome fields
+- generate benchmark-style reports from a folder of bundles without standing up a full evaluation platform first
+- preserve enough context for reruns and comparisons without requiring token-perfect recording
 
-If you've ever wanted a format between "a raw chat log" and "a full benchmark platform", this project is that missing middle layer.
+It fits the gap between raw logs and full evaluation systems: light enough for day-to-day work, structured enough for replay and benchmarking.
 
 It is designed for workflows where you want to:
 - inspect what happened
@@ -60,6 +62,8 @@ npm run dev -- compare ./examples/hello-world-bundle ./examples/hello-world-bund
 
 If you want the shortest possible proof that the project already works, this is it.
 
+![Task Bundle workflow overview](./assets/workflow-overview.svg)
+
 <a id="real-bundles"></a>
 
 ## See It On Real Bundles
@@ -106,6 +110,22 @@ Ranking
 2. Fix greeting punctuation | claude-code / claude-sonnet-4 | success | score 0.89
 ```
 
+Browse the committed example report:
+- [docs/sample-benchmark-report.md](./docs/sample-benchmark-report.md)
+- [docs/sample-benchmark-report.zh-CN.md](./docs/sample-benchmark-report.zh-CN.md)
+
+<a id="format-vs-alternatives"></a>
+
+## How It Compares To Common Alternatives
+
+| Need | Chat logs | Zip or tarball | Full benchmark platform | Task Bundle |
+| --- | --- | --- | --- | --- |
+| Share the original task and result together | Partial | Yes | Yes | Yes |
+| Compare different tools on the same starting point | Weak | Manual | Yes | Yes |
+| Carry artifact hashes and outcome metadata | No | No | Yes | Yes |
+| Stay lightweight enough for everyday coding workflows | Yes | Yes | No | Yes |
+| Grow into replay and benchmark workflows later | Weak | Weak | Yes | Yes |
+
 ## Why It Matters
 
 Most AI coding work disappears into screenshots, transcripts, or one-off patches.
@@ -114,7 +134,7 @@ Task Bundle gives you a durable unit you can inspect, archive, compare, validate
 - agent builders who want reproducible tasks
 - eval and benchmark authors who need structured task artifacts
 - teams comparing Codex, Claude Code, Cursor, or custom tools
-- researchers who care about re-execution instead of token-by-token theater
+- researchers who care about re-execution over token-perfect replay
 
 ## What Replay Means Here
 
@@ -323,6 +343,8 @@ They represent the same task captured from different tool/model combinations so
 
 You can also point `taskbundle report` at the same directory to generate a small benchmark-style leaderboard.
 
+For a committed snapshot of that output, see [docs/sample-benchmark-report.md](./docs/sample-benchmark-report.md).
+
 ## Bundle Format At A Glance
 
 - `bundle.json`: top-level metadata and artifact pointers
````
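The README diff describes `bundle.json` only as "top-level metadata and artifact pointers". As a purely hypothetical sketch of what such a manifest could carry (field names are invented here from the tool, model, status, score, and hash values the report tables use; the real schema lives in `docs/bundle-format.md`, which this commit does not show):

```typescript
// Hypothetical manifest shape, NOT the actual taskbundle schema.
// Field names are inferred from the report columns in this commit.
type BundleManifest = {
  title: string;
  tool: string;              // e.g. "codex", "claude-code"
  model: string;             // e.g. "gpt-5"
  outcome: {
    status: "success" | "failure";
    score?: number;          // 0..1, absent when unscored
  };
  artifacts: Array<{
    path: string;            // pointer into the bundle directory
    sha256: string;          // content hash for integrity checks
  }>;
};

const example: BundleManifest = {
  title: "Fix greeting punctuation",
  tool: "codex",
  model: "gpt-5",
  outcome: { status: "success", score: 0.93 },
  artifacts: [{ path: "patch.diff", sha256: "<sha256-of-patch.diff>" }],
};

console.log(example.outcome.score); // 0.93
```

A typed manifest along these lines is what makes `compare` and `report` cheap: outcome and hash fields stay machine-readable instead of being buried in a transcript.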

README.zh-CN.md

Lines changed: 31 additions & 9 deletions

````diff
@@ -7,11 +7,13 @@
 </p>
 
 <p align="center"><strong>Turn AI coding runs into task bundles you can share, rerun, compare, and benchmark.</strong></p>
-<p align="center">It fills exactly the missing infrastructure layer between "chat logs are too scattered" and "benchmark platforms are too heavy".</p>
+<p align="center">It sits between chat logs and benchmark platforms, carrying real tasks and real results.</p>
 <p align="center">
 <a href="#quickstart"><strong>Quick Start</strong></a> ·
 <a href="#real-bundles"><strong>Real Output</strong></a> ·
+<a href="#format-vs-alternatives"><strong>Why This Format</strong></a> ·
 <a href="./docs/bundle-format.zh-CN.md"><strong>Format Guide</strong></a> ·
+<a href="./docs/sample-benchmark-report.zh-CN.md"><strong>Sample Report</strong></a> ·
 <a href="./ROADMAP.zh-CN.md"><strong>Roadmap</strong></a> ·
 <a href="./docs/branding.zh-CN.md"><strong>Brand Assets</strong></a>
 </p>
@@ -22,15 +24,15 @@
 
 Task Bundle is a TypeScript + Node.js CLI suited to agent, eval, benchmark, and reproducible-experiment workflows.
 
-Package once, and afterwards you can inspect, compare, validate, and report, and put different tools on the same starting point for a fairer comparison
+Once a run is packaged, you can inspect, compare, validate, and report on it, and easily put different tools on the same starting point for comparison
 
-People usually star it for these reasons:
-- it turns one AI coding task into a clean, stable, portable directory instead of scattered screenshots, chat logs, or patches
-- it compares results from Codex, Claude Code, Cursor, or internal tools, based on real metadata, hashes, and outcome fields
-- it generates benchmark-style reports straight from a set of bundles, with no need to build a full evaluation platform up front
-- it takes a pragmatic view of replay, emphasizing "rerunnable and comparable" rather than theatrical token-by-token reproduction
+It mainly solves these problems:
+- organize one AI coding task into a clean, stable, portable directory instead of scattering it across screenshots, chat logs, or patches
+- compare results from Codex, Claude Code, Cursor, or internal tools, with metadata, hashes, and outcome fields as the basis
+- generate benchmark-style reports straight from a set of bundles, without building a full evaluation platform first
+- preserve enough context for later reruns and comparisons, instead of relying on token-by-token recording
 
-If you have always felt that "chat logs are too scattered and benchmark platforms are too heavy", this project is that missing layer of infrastructure in between
+It fits between "chat logs are not stable enough" and "a full benchmark platform is too heavy", as a lighter but sufficiently structured option
 
 It suits these scenarios:
 - seeing what a task actually did in the end
@@ -60,6 +62,8 @@ npm run dev -- compare ./examples/hello-world-bundle ./examples/hello-world-bund
 
 If you just want to confirm whether this project actually works right now, these commands are the shortest path.
 
+![Task Bundle workflow overview](./assets/workflow-overview.svg)
+
 <a id="real-bundles"></a>
 
 ## See The Real Output
@@ -106,6 +110,22 @@ Ranking
 2. Fix greeting punctuation | claude-code / claude-sonnet-4 | success | score 0.89
 ```
 
+You can also open the sample reports already committed to the repository:
+- [docs/sample-benchmark-report.zh-CN.md](./docs/sample-benchmark-report.zh-CN.md)
+- [docs/sample-benchmark-report.md](./docs/sample-benchmark-report.md)
+
+<a id="format-vs-alternatives"></a>
+
+## How It Differs From Common Alternatives
+
+| Need | Chat logs | Zip / tarball | Full benchmark platform | Task Bundle |
+| --- | --- | --- | --- | --- |
+| Share the original task and final result together | Partially | Yes | Yes | Yes |
+| Compare different tools on the same starting point | Weak | Mostly manual | Yes | Yes |
+| Carry artifact hashes and outcome metadata | No | No | Yes | Yes |
+| Light enough to fit daily coding workflows | Yes | Yes | Not really | Yes |
+| Grow into replay / benchmark workflows later | Weak | Weak | Yes | Yes |
+
 ## Why It Deserves Attention
 
 Many AI coding results end up as just screenshots, chat logs, or a single patch, and are nearly impossible to compare reliably afterwards.
@@ -114,7 +134,7 @@ Task Bundle aims to close exactly that gap: turn one task into something you can i
 - agent authors who want reproducible experiments
 - teams doing task evals and benchmarks
 - developers comparing Codex, Claude Code, Cursor, or internal tools
-- people who refuse to be misled by "token-by-token playback" and genuinely care about rerunnability
+- people who care more about rerunnability than token-by-token replay fidelity
 
 ## What Replay Means Here
 
@@ -323,6 +343,8 @@ npm run dev -- report ./examples --out ./dist/benchmark-report.md
 
 You can also hand this directory straight to `taskbundle report` to generate a small benchmark-style leaderboard.
 
+If you want to see a report snapshot already committed to the repository, open [docs/sample-benchmark-report.zh-CN.md](./docs/sample-benchmark-report.zh-CN.md).
+
 ## The Bundle Format At A Glance
 
 - `bundle.json`: top-level metadata and artifact pointers
````

assets/workflow-overview.svg

Lines changed: 82 additions & 0 deletions

docs/branding.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -8,6 +8,8 @@ The art direction is intentionally warm-editorial rather than generic SaaS gradi
 
 - `assets/hero-banner.svg`
   Embedded at the top of the README to make the repository landing page feel like a product, not just a package listing.
+- `assets/workflow-overview.svg`
+  A second README visual that explains the capture -> inspect -> compare -> report loop in one glance.
 - `assets/social-preview.svg`
   Source artwork for GitHub social preview uploads.
 - `assets/social-preview.png`
```

docs/branding.zh-CN.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -6,6 +6,8 @@ Task Bundle ships, under the `assets/` directory, a set of display-ready
 
 - `assets/hero-banner.svg`
   The hero banner used at the top of both the English and Chinese READMEs; still editable.
+- `assets/workflow-overview.svg`
+  The second visual in the README, used to explain the capture -> inspect -> compare -> report path at a glance.
 - `assets/social-preview.svg`
   The editable source file for the GitHub social preview image.
 - `assets/social-preview.png`
```

docs/sample-benchmark-report.md

Lines changed: 35 additions & 0 deletions

````diff
@@ -0,0 +1,35 @@
+# Sample Benchmark Report
+
+This page shows what `taskbundle report` looks like against the example bundles included in this repository.
+
+## Regenerate Locally
+
+```bash
+npm run dev -- report ./examples --out ./dist/benchmark-report.md
+```
+
+## Snapshot
+
+- Bundles: 2
+- Scored bundles: 2
+- Average score: 0.91
+
+## Ranking
+
+| Rank | Title | Tool | Model | Status | Score | Events | Workspace |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| 1 | Fix greeting punctuation | codex | gpt-5 | success | 0.93 | 3 | 1 |
+| 2 | Fix greeting punctuation | claude-code | claude-sonnet-4 | success | 0.89 | 4 | 1 |
+
+## Leaderboard By Tool/Model
+
+| Tool | Model | Runs | Scored | Successes | Avg Score | Best Score |
+| --- | --- | --- | --- | --- | --- | --- |
+| codex | gpt-5 | 1 | 1 | 1 | 0.93 | 0.93 |
+| claude-code | claude-sonnet-4 | 1 | 1 | 1 | 0.89 | 0.89 |
+
+## Why This Matters
+
+- It gives the repo a benchmark-shaped artifact without forcing a full benchmark platform.
+- It shows that the example bundles are not toy files with no downstream use.
+- It makes cross-tool comparisons legible for humans before you build dashboards.
````
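The snapshot numbers in the report above are plain aggregations over the two example runs (the 0.91 average is the mean of 0.93 and 0.89). A minimal sketch of that aggregation, assuming each bundle exposes `tool`, `model`, `status`, and `score` fields (illustrative names, not the actual taskbundle implementation):

```typescript
// Aggregate a folder of run summaries into snapshot and ranking numbers.
// The Run shape here is an assumption for illustration, not the real schema.
type Run = { tool: string; model: string; status: string; score?: number };

function snapshot(runs: Run[]) {
  const scored = runs.filter((r) => typeof r.score === "number");
  const avg =
    scored.reduce((sum, r) => sum + (r.score as number), 0) / scored.length;
  // Rank scored runs from best to worst score.
  const ranking = [...scored].sort((a, b) => b.score! - a.score!);
  return { bundles: runs.length, scored: scored.length, avg, ranking };
}

const runs: Run[] = [
  { tool: "codex", model: "gpt-5", status: "success", score: 0.93 },
  { tool: "claude-code", model: "claude-sonnet-4", status: "success", score: 0.89 },
];

const s = snapshot(runs);
console.log(s.avg.toFixed(2));  // "0.91"
console.log(s.ranking[0].tool); // "codex"
```

Keeping the aggregation this simple is the point: once outcome fields are structured, a leaderboard is a filter, a mean, and a sort rather than a platform.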
docs/sample-benchmark-report.zh-CN.md

Lines changed: 35 additions & 0 deletions

````diff
@@ -0,0 +1,35 @@
+# Sample Benchmark Report
+
+This page shows roughly what you get after handing the example bundles shipped with the repository to `taskbundle report`.
+
+## Regenerate Locally
+
+```bash
+npm run dev -- report ./examples --out ./dist/benchmark-report.md
+```
+
+## Sample Snapshot
+
+- Bundles: 2
+- Scored bundles: 2
+- Average score: 0.91
+
+## Ranking
+
+| Rank | Title | Tool | Model | Status | Score | Events | Workspace |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| 1 | Fix greeting punctuation | codex | gpt-5 | success | 0.93 | 3 | 1 |
+| 2 | Fix greeting punctuation | claude-code | claude-sonnet-4 | success | 0.89 | 4 | 1 |
+
+## Tool / Model Leaderboard
+
+| Tool | Model | Runs | Scored | Successes | Avg Score | Best Score |
+| --- | --- | --- | --- | --- | --- | --- |
+| codex | gpt-5 | 1 | 1 | 1 | 0.93 | 0.93 |
+| claude-code | claude-sonnet-4 | 1 | 1 | 1 | 0.89 | 0.89 |
+
+## Why This Page Is Valuable
+
+- It gives the repository a visible benchmark-style artifact without building a full platform first.
+- It shows the example bundles are not just decoration; they can genuinely be analyzed and compared further.
+- It makes cross-tool comparison clear and readable even before any dashboard exists.
````

package.json

Lines changed: 1 addition & 0 deletions

```diff
@@ -33,6 +33,7 @@
     "docs",
     "assets/hero-banner.svg",
     "assets/social-preview.svg",
+    "assets/workflow-overview.svg",
     "templates",
     "LICENSE",
     "README.md",
```
