 # ADE DocVQA Benchmark
 
-**98.501% Accuracy on DocVQA Validation Set**
+**98.650% Accuracy on DocVQA Validation Set**
 
 This repository contains our complete DocVQA benchmark implementation using Agentic Document Extraction (ADE) with DPT-2 parsing and Claude for question answering.
 
 ## 🎯 Results
 
-- **Accuracy:** 98.501% (5,257/5,337 correct, excluding questionable entries)
+- **Accuracy:** 98.650% (5,263/5,335 correct, excluding 14 dataset issues)
 - **Baseline:** 95.36% (with Playground Chat)
-- **Improvement:** +3.14 percentage points
-- **Remaining errors:** 80 non-questionable cases
+- **Improvement:** +3.29 percentage points
+- **Remaining errors:** 72 real errors (14 dataset issues excluded from the count but still shown)
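+
+As a quick sanity check on the headline numbers, a minimal sketch in Python (the counts come straight from the bullets above):
+
+```python
+# Recompute the headline accuracy and the improvement over the baseline.
+correct, scored = 5_263, 5_335        # correct answers / questions scored (14 dataset issues excluded)
+accuracy = 100 * correct / scored
+print(f"{accuracy:.3f}%")             # 98.650%
+print(f"+{accuracy - 95.36:.2f} pp")  # +3.29 pp over the 95.36% Playground Chat baseline
+```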
 
 [**View Interactive Error Gallery →**](./gallery.html)
 
+The gallery includes:
+- 40 hardest success cases with visual grounding
+- 72 error cases with detailed analysis
+- 14 questionable dataset issues (excluded from accuracy but shown for transparency)
+- Interactive category filtering
+
 ## 📁 Repository Contents
 
 ### Main Files
@@ -133,33 +139,35 @@ Key improvements:
 
 ## 📈 Performance Breakdown
 
-### By Error Category (80 remaining errors)
+### By Error Category (72 real errors)
 
-| Category | Count | % of Errors |
-|----------|-------|-------------|
-| Downstream LLM errors | 38 | 47.5% |
-| Missed Parse | 22 | 27.5% |
-| OCR errors | 13 | 16.3% |
-| Dataset issues | 7 | 8.8% |
+| Category | Count | % of Errors | Description |
+|----------|-------|-------------|-------------|
+| Incorrect Parse | 30 | 41.7% | OCR/parsing errors (character confusion, misreads) |
+| Prompt/LLM Misses | 18 | 25.0% | Reasoning or interpretation failures |
+| Not ADE Focus | 15 | 20.8% | Spatial layout questions outside ADE's core strength |
+| Missed Parse | 9 | 12.5% | Information not extracted during parsing |
+| **Dataset Issues** | **14** | **—** | **Questionable ground truth (excluded from count)** |
 
-### Error Types
+### Error Categories Explained
 
-- **LLM errors:** Reasoning, interpretation, or spatial understanding issues
-- **Missed Parse:** Information not extracted during parsing
-- **OCR errors:** Character-level recognition mistakes (O/0, I/l/1, etc.)
-- **Dataset issues:** Questionable ground truth or ambiguous questions
+- **Incorrect Parse:** OCR/parsing mistakes such as character confusion (O/0, I/l/1), table misreads, or parsing artifacts
+- **Prompt/LLM Misses:** Claude answers incorrectly despite having the correct parsed data (a reasoning or instruction-following issue)
+- **Not ADE Focus:** Questions requiring visual layout analysis, spatial reasoning, or document structure understanding beyond text extraction
+- **Missed Parse:** Information exists in the document but wasn't extracted by the parser
+- **Dataset Issues:** Questionable annotations, ambiguous questions, or debatable ground truth (excluded from the accuracy calculation)
 
 ## 🛠️ Model Configuration
 
-**Recommended (used for 98.501% result):**
+**Recommended (used for 98.650% result):**
 - Model: `claude-sonnet-4-20250514` (Sonnet 4)
 - Temperature: 0.0
 - Max tokens: 4096
 - Cost: ~$10 for full evaluation
 
 **Alternative:**
 - Model: `claude-opus-4-20250514` (Opus 4)
-- Slightly lower accuracy (98.28%) but stronger reasoning
+- Slightly lower accuracy but stronger reasoning
 - Cost: ~$100 for full evaluation
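+
+For reference, a minimal sketch of a single question-answering call with the recommended settings, using the Anthropic Python SDK. The `answer` helper, the use of `prompt.md` as the system prompt, and the user-message layout are illustrative assumptions, not the repository's exact harness:
+
+```python
+import anthropic  # pip install anthropic
+
+client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
+
+def answer(question: str, parsed_doc: str) -> str:
+    """Ask Claude one DocVQA question against an ADE-parsed document."""
+    response = client.messages.create(
+        model="claude-sonnet-4-20250514",
+        temperature=0.0,                  # deterministic decoding
+        max_tokens=4096,
+        system=open("prompt.md").read(),  # the final hybrid prompt
+        messages=[{
+            "role": "user",
+            "content": f"Document:\n{parsed_doc}\n\nQuestion: {question}",
+        }],
+    )
+    return response.content[0].text
+```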
 
 ## 📂 Alternative Prompts
@@ -172,12 +180,13 @@ See `extra/prompts/` for previous iterations:
 
 ## 🔬 Reproducing Results
 
-To exactly reproduce our 98.501% result:
+To exactly reproduce our 98.650% result:
 
 1. Use the provided `parsed/` documents (same parsing output)
 2. Use `prompt.md` (final hybrid prompt)
 3. Use Claude Sonnet 4 (`claude-sonnet-4-20250514`)
 4. Temperature 0.0 (deterministic)
+5. Exclude the 14 dataset issues from the accuracy calculation (as documented in the gallery)
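+
+Step 5 in code form, as a sketch: it assumes a `results.jsonl` with one `{"question_id": ..., "correct": ...}` record per prediction and a `dataset_issues.json` listing the 14 excluded question IDs (both filenames are hypothetical):
+
+```python
+import json
+
+# Hypothetical inputs: per-question predictions and the documented dataset issues.
+results = [json.loads(line) for line in open("results.jsonl")]
+excluded = set(json.load(open("dataset_issues.json")))  # the 14 question IDs
+
+scored = [r for r in results if r["question_id"] not in excluded]
+correct = sum(r["correct"] for r in scored)
+print(f"Accuracy: {100 * correct / len(scored):.3f}% "
+      f"({correct:,}/{len(scored):,}, {len(excluded)} dataset issues excluded)")
+```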
 
 ## 📝 Citation
 
@@ -194,14 +203,14 @@ If you use this benchmark or methodology, please cite:
 
 ## 🤝 Contributing
 
-We welcome contributions! Areas for improvement:
+We welcome contributions! Priority areas for improvement:
 
-- **EXTRACTION queries** - Still struggling (low success rate)
-- **VISUAL queries** - Need better visual reasoning
-- **Confidence calibration** - Model overconfident on errors
-- **Specialized prompts** - Different strategies per question type
+- **Parsing quality** - 30 incorrect parse errors (41.7% of failures)
+- **Prompt engineering** - 18 LLM reasoning errors (25.0% of failures)
+- **Not ADE Focus** - 15 spatial layout questions (20.8% of failures)
+- **Information extraction** - 9 missed parse errors (12.5% of failures)
 
-See `extra/reports/` for detailed analysis of remaining challenges.
+See the [interactive gallery](./gallery.html) for detailed error analysis with visual grounding and category filtering.
 
 ## 📄 License
 