
Commit ed3947f

Ankit Khare and claude authored and committed
Update README with 98.650% accuracy and refined error analysis
Updated all statistics and added detailed breakdown of 5 error categories with improved transparency around dataset issues.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
1 parent ec025c5 commit ed3947f

File tree: 1 file changed (+34 −25 lines)


README.md

Lines changed: 34 additions & 25 deletions
```diff
@@ -1,18 +1,24 @@
 # ADE DocVQA Benchmark
 
-**98.501% Accuracy on DocVQA Validation Set**
+**98.650% Accuracy on DocVQA Validation Set**
 
 This repository contains our complete DocVQA benchmark implementation using Agentic Document Extraction (ADE) with DPT-2 parsing and Claude for question answering.
 
 ## 🎯 Results
 
-- **Accuracy:** 98.501% (5,257/5,337 correct, excluding questionable entries)
+- **Accuracy:** 98.650% (5,263/5,335 correct, excluding 14 dataset issues)
 - **Baseline:** 95.36% (with Playground Chat)
-- **Improvement:** +3.14 percentage points
-- **Remaining errors:** 80 non-questionable cases
+- **Improvement:** +3.29 percentage points
+- **Remaining errors:** 72 real errors (14 dataset issues excluded but visible)
 
 [**View Interactive Error Gallery →**](./gallery.html)
 
+The gallery includes:
+- 40 hardest success cases with visual grounding
+- 72 error cases with detailed analysis
+- 14 questionable dataset issues (excluded from accuracy but shown for transparency)
+- Interactive category filtering
+
 ## 📁 Repository Contents
 
 ### Main Files
```
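A quick sanity check of the updated headline figures (a minimal sketch in Python; the counts come from the diff above, and the 5,349 total is simply the 5,335 scored questions plus the 14 excluded ones):

```python
# Sanity check of the headline numbers (counts taken from the diff above).
TOTAL_QUESTIONS = 5349   # DocVQA validation set: 5,335 scored + 14 excluded
DATASET_ISSUES = 14      # questionable ground truth, excluded from scoring
REAL_ERRORS = 72         # remaining genuine failures

scored = TOTAL_QUESTIONS - DATASET_ISSUES   # 5335
correct = scored - REAL_ERRORS              # 5263
accuracy = correct / scored                 # 0.98650...

print(f"{correct}/{scored} = {accuracy:.3%}")                    # 5263/5335 = 98.650%
print(f"vs. 95.36% baseline: {accuracy * 100 - 95.36:+.2f} pp")  # +3.29 pp
```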
```diff
@@ -133,33 +139,35 @@ Key improvements:
 
 ## 📈 Performance Breakdown
 
-### By Error Category (80 remaining errors)
+### By Error Category (72 real errors)
 
-| Category | Count | % of Errors |
-|----------|-------|-------------|
-| Downstream LLM errors | 38 | 47.5% |
-| Missed Parse | 22 | 27.5% |
-| OCR errors | 13 | 16.3% |
-| Dataset issues | 7 | 8.8% |
+| Category | Count | % of Errors | Description |
+|----------|-------|-------------|-------------|
+| Incorrect Parse | 30 | 41.7% | OCR/parsing errors (character confusion, misreads) |
+| Prompt/LLM Misses | 18 | 25.0% | Reasoning or interpretation failures |
+| Not ADE Focus | 15 | 20.8% | Spatial layout questions outside ADE's core strength |
+| Missed Parse | 9 | 12.5% | Information not extracted during parsing |
+| **Dataset Issues** | **14** | **—** | **Questionable ground truth (excluded from count)** |
 
-### Error Types
+### Error Categories Explained
 
-- **LLM errors:** Reasoning, interpretation, or spatial understanding issues
-- **Missed Parse:** Information not extracted during parsing
-- **OCR errors:** Character-level recognition mistakes (O/0, I/l/1, etc.)
-- **Dataset issues:** Questionable ground truth or ambiguous questions
+- **Incorrect Parse:** OCR/parsing mistakes like character confusion (O/0, I/l/1), table misreads, or parsing artifacts
+- **Prompt/LLM Misses:** Claude gives wrong answer despite having correct parsed data - reasoning or instruction-following issues
+- **Not ADE Focus:** Questions requiring visual layout analysis, spatial reasoning, or document structure understanding beyond text extraction
+- **Missed Parse:** Information exists in document but wasn't extracted by the parser
+- **Dataset Issues:** Questionable annotations, ambiguous questions, or debatable ground truth (excluded from accuracy calculation)
 
 ## 🛠️ Model Configuration
 
-**Recommended (used for 98.501% result):**
+**Recommended (used for 98.650% result):**
 - Model: `claude-sonnet-4-20250514` (Sonnet 4.5)
 - Temperature: 0.0
 - Max tokens: 4096
 - Cost: ~$10 for full evaluation
 
 **Alternative:**
 - Model: `claude-opus-4-20250514` (Opus 4)
-- Slightly lower accuracy (98.28%) but stronger reasoning
+- Slightly lower accuracy but stronger reasoning
 - Cost: ~$100 for full evaluation
 
 ## 📂 Alternative Prompts
```
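For reviewers, the recommended configuration above maps onto the Anthropic Python SDK roughly as follows. This is a hypothetical sketch: the model ID, temperature, and token limit are from the README, while the `answer_question` wrapper, prompt wiring, and message format are illustrative assumptions rather than the repository's actual harness.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def answer_question(parsed_document: str, question: str, system_prompt: str) -> str:
    """Ask Claude one DocVQA question over ADE-parsed document text."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # recommended configuration
        max_tokens=4096,
        temperature=0.0,                   # deterministic, as documented
        system=system_prompt,              # e.g. the contents of prompt.md
        messages=[{
            "role": "user",
            "content": f"Document:\n{parsed_document}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text
```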
```diff
@@ -172,12 +180,13 @@ See `extra/prompts/` for previous iterations:
 
 ## 🔬 Reproducing Results
 
-To exactly reproduce our 98.501% result:
+To exactly reproduce our 98.650% result:
 
 1. Use the provided `parsed/` documents (same parsing output)
 2. Use `prompt.md` (final hybrid prompt)
 3. Use Claude Sonnet 4.5 (`claude-sonnet-4-20250514`)
 4. Temperature 0.0 (deterministic)
+5. Exclude 14 dataset issues from accuracy calculation (as documented in gallery)
 
 ## 📝 Citation
 
```
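Step 5 is the only step that needs code of its own; a minimal sketch of how the exclusion might look in a scoring script (the results-file layout, field names, and `score` helper are assumptions, not the repository's actual code):

```python
# Hypothetical illustration of step 5: drop known dataset-issue question IDs
# before computing accuracy.
import json

def score(results_path: str, issue_ids: set[str]) -> float:
    """Accuracy over all answered questions, excluding dataset issues."""
    with open(results_path) as f:
        results = json.load(f)  # assumed: list of {"qid": str, "correct": bool}
    scored = [r for r in results if r["qid"] not in issue_ids]
    return sum(r["correct"] for r in scored) / len(scored)

# e.g. score("results.json", issue_ids={"q00123", ...})  # -> ~0.98650
```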
```diff
@@ -194,14 +203,14 @@ If you use this benchmark or methodology, please cite:
 
 ## 🤝 Contributing
 
-We welcome contributions! Areas for improvement:
+We welcome contributions! Priority areas for improvement:
 
-- **EXTRACTION queries** - Still struggling (low success rate)
-- **VISUAL queries** - Need better visual reasoning
-- **Confidence calibration** - Model overconfident on errors
-- **Specialized prompts** - Different strategies per question type
+- **Parsing quality** - 30 incorrect parse errors (41.7% of failures)
+- **Not ADE Focus** - 15 spatial layout questions (20.8% of failures)
+- **Prompt engineering** - 18 LLM reasoning errors (25.0% of failures)
+- **Information extraction** - 9 missed parse errors (12.5% of failures)
 
-See `extra/reports/` for detailed analysis of remaining challenges.
+See the [interactive gallery](./gallery.html) for detailed error analysis with visual grounding and category filtering.
 
 ## 📄 License
 
```
