fix(graph): improve property graph JSON parsing robustness for LLM outputs#332

Open
linmengmeng-1314 wants to merge 1 commit into
apache:mainfrom
linmengmeng-1314:fix/graph-extract-json-parsing

@linmengmeng-1314

Summary

  • Improve _extract_and_filter_label to handle varying LLM output formats
  • Strip markdown code blocks before JSON extraction
  • Support both {"vertices":[...], "edges":[...]} (object) and flat array formats
  • Auto-convert flat arrays to the expected object structure

Problem

When using reasoning models (e.g., DeepSeek V4) for graph extraction, the LLM may return:

  1. JSON wrapped in markdown code blocks (```json ... ```), which breaks the greedy regex ({.*})
  2. A flat array [vertex, edge, ...] instead of the expected object {"vertices": [...], "edges": [...]}

Both cases cause json.JSONDecodeError and result in empty extraction output even though the LLM correctly identified entities and relationships.
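The failure can be reproduced outside the project with a small sketch. It assumes the existing extraction applies the greedy pattern ({.*}) with re.DOTALL and returns an empty result on a decode error; `naive_extract` and the sample payload are hypothetical stand-ins, not the project's code.

```python
import json
import re

# Assumption: the current code searches with this greedy pattern and DOTALL.
GREEDY_OBJECT = re.compile(r"({.*})", re.DOTALL)


def naive_extract(text):
    """Hypothetical stand-in for the pre-fix extraction step."""
    match = GREEDY_OBJECT.search(text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The capture runs from the first "{" to the last "}", so a flat
        # array loses its brackets and is no longer a single JSON value.
        return None


FENCE = "`" * 3  # built this way to avoid a literal fence inside the example

# Hypothetical LLM reply: a flat array wrapped in a markdown code block.
fenced_array = (
    FENCE + "json\n"
    '[{"type": "vertex", "label": "person", "properties": {"name": "Tom"}},\n'
    ' {"type": "edge", "label": "knows", "properties": {}}]\n'
    + FENCE
)

print(naive_extract(fenced_array))
print(naive_extract('{"vertices": [], "edges": []}'))
```

For the array reply the captured text is two objects separated by a comma, which json.loads rejects, so the entities the model found are dropped. A plain object reply still parses.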

Solution

  • Strip markdown code fences (```json / ```) before regex matching
  • Update regex to match both objects ({...}) and arrays ([...])
  • When a flat array is detected, partition items by type field into vertices and edges
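The three steps above can be sketched as follows. This is a minimal illustration, not the diff in this PR: `parse_property_graph` is a hypothetical name, and the "vertex" and "edge" values of the type field are assumptions about what the model emits.

```python
import json
import re

# Matches an opening or closing markdown fence, with an optional "json" tag.
FENCE_RE = re.compile(r"```(?:json)?", re.IGNORECASE)
# Matches either a JSON object or a JSON array, greedily, across lines.
JSON_RE = re.compile(r"(\{.*\}|\[.*\])", re.DOTALL)


def parse_property_graph(text):
    """Hypothetical helper: return {"vertices": [...], "edges": [...]}."""
    empty = {"vertices": [], "edges": []}

    # Step 1: strip markdown code fences before matching.
    cleaned = FENCE_RE.sub("", text).strip()

    # Step 2: accept both object and array payloads.
    match = JSON_RE.search(cleaned)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return empty

    # Step 3: a flat array is partitioned by each item's "type" field.
    if isinstance(data, list):
        graph = {"vertices": [], "edges": []}
        for item in data:
            if not isinstance(item, dict):
                continue
            kind = item.get("type")  # assumed values: "vertex" / "edge"
            if kind == "vertex":
                graph["vertices"].append(item)
            elif kind == "edge":
                graph["edges"].append(item)
        return graph

    # Object format: keep the expected keys, defaulting to empty lists.
    return {
        "vertices": data.get("vertices", []),
        "edges": data.get("edges", []),
    }
```

Items whose type is neither value are skipped in this sketch; whether they should be logged or rejected instead is a design choice left to the real implementation.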

Test plan

  • Test with OpenAI models (existing behavior should be preserved)
  • Test with DeepSeek models (markdown-wrapped array format)
  • Test with Ollama models
  • Verify both object and array formats are handled correctly

🤖 Generated with Claude Code

…tputs

Different LLMs return graph extraction results in varying formats:
- Some wrap JSON in markdown code blocks (```json ... ```)
- Some return a flat array of vertices/edges instead of a structured object

This causes json.JSONDecodeError when the greedy regex ({.*}) captures
invalid content from markdown-wrapped or array-formatted responses.

Changes:
- Strip markdown code blocks before JSON extraction
- Support both object ({...}) and array ([...]) JSON formats
- Auto-convert flat arrays to {"vertices": [...], "edges": [...]} format

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dosubot dosubot Bot added the size:S (this PR changes 10-29 lines, ignoring generated files) and bug (something isn't working) labels May 18, 2026
@github-actions github-actions Bot added the llm label May 18, 2026
@imbajin imbajin requested a review from Copilot May 18, 2026 13:31
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

3 participants