Four-stage pipeline for generating UBERON new term request ROBOT templates from HRA ASCTB unmapped term tables:
- Stage 1 (generate_template.py): reads xlsx/csv input, classifies parent IDs (UBERON/FMA/ASCTB-TEMP), assigns UBERON:99xxxxx provisional IDs, writes initial ROBOT template TSV + error and candidate reports.
- Stage 2 (group_terms_by_parent.py): groups template rows by parent and writes per-group JSON files for parallel subagent processing.
- Stage 3 (ntr-term-researcher agent): resolves FMA/ASCTB-TEMP parents via OLS4, checks for existing UBERON matches, writes Aristotelian definitions from Wikipedia, resolves is_a vs part_of relationship types.
- Stage 4 (merge_definitions.py): merges subagent outputs back into the template; appends confirmed/possible OLS4 matches to candidates report.
Template columns: ID, LABEL, Definition, def_xref (definition annotation), is_a, part_of, In_subset, Date, Contributor, Present_in_taxon, Wikipedia_image (foaf:depiction), xref (direct oboInOwl:hasDbXref for Wikipedia article URL + FMA ID).
Supporting agents/skills:
- ntr-term-researcher: Stage 3 subagent spec
- ontology-term-lookup: OLS4 structured search
- fetch-wiki-info: Wikidata + Wikipedia lookup
- .mcp.json: ols4, artl-mcp, playwright MCP servers
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
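The Stage 1 steps of classifying parent IDs and assigning provisional IDs could look roughly like the sketch below. It is not the code in generate_template.py: the regex patterns, the ASCTB-TEMP ID shape, the function names, and the starting number 9900001 are all assumptions made for illustration.

```python
import re
from itertools import count

# Hypothetical ID patterns; the real script's rules are not shown in this thread.
UBERON_RE = re.compile(r"^UBERON:\d{7}$")
FMA_RE = re.compile(r"^FMA:\d+$")
ASCTB_TEMP_RE = re.compile(r"^ASCTB-TEMP:\S+$")  # assumed shape of a temp ID


def classify_parent_id(parent_id: str) -> str:
    """Return the namespace a parent ID belongs to, or 'UNKNOWN'."""
    pid = parent_id.strip()
    if UBERON_RE.match(pid):
        return "UBERON"
    if FMA_RE.match(pid):
        return "FMA"
    if ASCTB_TEMP_RE.match(pid):
        return "ASCTB-TEMP"
    return "UNKNOWN"


def provisional_ids(start: int = 9900001):
    """Yield UBERON:99xxxxx provisional IDs; each number is used, then incremented."""
    for n in count(start):
        yield f"UBERON:{n:07d}"
```

A caller would create one generator per run and call `next()` on it for each new row, so no two rows share an ID.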
… plans
Phases covered:
- Phase 2: grouping vs leaf-node term distinction (linguistic rules, subagent behaviour)
- Phase 3: detect UBERON label-ID mismatches in Stage 1; new WRONG_PARENT: placeholder; multi-valued parent column splitting; subagent protocol for mismatch correction (informed by ovary run where 7/13 terms had wrong-domain UBERON parent IDs silently accepted)
- Phase 4: scale to full muscular-system table
- Phase 5: generalise to other ASCTB anatomical systems
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dragon-ai-agent
merge_definitions.py:
- Fallback path (parent resolved but rel type unknown) now leaves both is_a and part_of blank rather than double-setting them, and lists affected term labels in the summary output under 'Relationship unresolved' for curator attention
- Remove dead 'if jf.parent.name == "input"' guard: glob never matches files in subdirectories
generate_template.py:
- Remove dead write_tsv call with doubled headers that was immediately overwritten by the block below it
- Fix counter order: use counter for ID, then increment (was: increment then use counter-1)
- Remove hardcoded CONTRIBUTOR_IRI constant; add --contributor CLI arg with ORCID format validation; prompts interactively if not supplied
group_terms_by_parent.py:
- Remove derive_wikipedia_urls call and wikipedia_urls field from output JSON: parent_label is always "" so the call always returned []; the subagent derives Wikipedia URLs independently during lookup
ntr-term-researcher.md:
- Clarify that Wikipedia article page URL (not image URL) goes in xrefs at point of successful lookup, as Wikipedia:Article_Title
- Add image relevance check: verify caption/alt text confirms the image illustrates the target structure before storing it
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
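A --contributor argument with ORCID format validation might be wired up along these lines. This is a sketch, not the code in generate_template.py: the helper names `orcid_iri` and `build_parser` are hypothetical, the check covers the format only (not the ORCID checksum), and the interactive prompt mentioned in the commit is left out.

```python
import argparse
import re

# Accepts a bare ORCID or the full https://orcid.org/ IRI; the final character may be X.
ORCID_RE = re.compile(r"^(?:https://orcid\.org/)?(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$")


def orcid_iri(value: str) -> str:
    """argparse type function: normalise a valid ORCID to its IRI form."""
    match = ORCID_RE.match(value.strip())
    if not match:
        raise argparse.ArgumentTypeError(f"not a valid ORCID: {value!r}")
    return f"https://orcid.org/{match.group(1)}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the Stage 1 CLI")
    # default=None lets the caller detect a missing value and prompt for it.
    parser.add_argument("--contributor", type=orcid_iri, default=None)
    return parser
```

Putting the check in the `type=` callable means argparse reports a malformed ORCID as a normal usage error before any input file is read.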
…name flagging
Addresses issues found in the ovary branch test run where the agent:
- classified layers (corpus luteum granulosa lutein/theca) as is_a parents (should be part_of)
- accepted source-provided broad parents instead of finding more specific ones
- left ASCTB-TEMP placeholders as the only def_xref (no real PMIDs)
- did not flag pathological terms (hemorrhagic, luteinized unruptured) as out of scope
- did not normalise non-standard names ('dominance' instead of 'dominant')
ntr-term-researcher.md changes:
- Step 1 expanded: after confirming source parent, agent must search OLS4 for a more
specific parent (e.g. primary/secondary ovarian follicle vs generic ovarian follicle)
- New Step 3: scope check (pathological/dysfunctional → out_of_scope) and name check
(non-standard → name_corrections with curator-reviewable suggestion)
- New Step 5: literature search — must find at least one real PMID/DOI for def_xref;
ASCTB-TEMP placeholders explicitly disallowed as the only reference
- Step 7 (relationship resolution) rewritten with explicit structural vocabulary:
layers, zones, heads, bellies, parts, compartments, walls → ALWAYS part_of
subtypes/stages/members of grouping classes → is_a
Quick test ('is a kind of' vs 'is part of') with worked examples
- Output JSON adds: def_xrefs_to_add, out_of_scope, name_corrections keys
- Quality checks expanded with explicit rules for layers, pathology, naming
merge_definitions.py changes:
- Refactored load_subagent_outputs to return single dict (less argument tuple churn)
- New behaviour: out_of_scope terms excluded from template (not just confirmed_matches);
written to <name>-reports/out_of_scope.tsv for curator review
- New behaviour: name_corrections applied to LABEL column; original-source mapping
written to <name>-reports/name_corrections.tsv
- New behaviour: def_xrefs_to_add appended to def_xref column with deduplication
- Lookup helper accepts both source and corrected labels (agent may key by either)
- Summary output extended with new counters
CLAUDE.md changes:
- Stage 3 description updated to enumerate the new agent responsibilities
- QC checklist extended: real def_xref required, layer/part_of rule, out_of_scope
and name_corrections review steps
- Output Files Reference adds the two new report files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
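The structural-vocabulary rule from Step 7 (layers, zones, heads, bellies, parts, compartments, walls are always part_of) can be approximated with a label check like the one below. This is a sketch: the agent spec is prose, not code, the function name is hypothetical, and the example labels in the comments are illustrative.

```python
import re
from typing import Optional

# Words from the Step 7 vocabulary list; matching is on whole words only.
STRUCTURAL_PART_RE = re.compile(
    r"\b(?:layer|zone|head|belly|bellies|part|compartment|wall)s?\b",
    re.IGNORECASE,
)


def suggest_relationship(label: str) -> Optional[str]:
    """Return 'part_of' when the label names a structural part, else None.

    None means the vocabulary rule does not apply and the 'is a kind of'
    vs 'is part of' quick test still has to be made by the agent or curator.
    """
    if STRUCTURAL_PART_RE.search(label):
        return "part_of"
    return None


# suggest_relationship("granulosa lutein layer")   -> "part_of"
# suggest_relationship("primary ovarian follicle") -> None
```

Returning None rather than defaulting to is_a keeps the helper from silently asserting a subtype relationship it has no evidence for.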
Surveyed 19 existing UBERON 'muscle of X' terms. 14 (74%) use the simple 'genus + part_of some Y' pattern with UBERON:0014892 (skeletal muscle organ, vertebrate) as genus. 3 use attaches_to_part_of, 2 lack logical definition.
Decision gate passed: simple part_of pattern covers majority of existing convention. Phase 2 implementation will support genus + part_of only; attaches_to_part_of, innervated_by, and multi-axiom patterns deferred to future phases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ates
generate_template.py now classifies each input row as 'leaf' or 'group' using linguistic regex rules (GROUP_PATTERNS / LEAF_PART_PATTERNS in classify_term_type).
- Leaf rows go to <name>.template.tsv with SC/part_of directives (existing)
- Group rows go to <name>-groups.template.tsv with EC genus + EC part_of some location directives (new); genus and location columns left blank for the agent to fill
input.tsv gains a term_type column so curators can see the classification.
Smoke-tested on muscular-system: 20 group / 55 leaf rows out of 75 input terms, matching ROADMAP prediction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
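A regex-based leaf/group classifier of the kind described could be sketched as follows. The names GROUP_PATTERNS, LEAF_PART_PATTERNS and classify_term_type come from the commit message, but the patterns themselves and the precedence (part words win over group words, default is leaf) are assumptions, not the rules in generate_template.py.

```python
import re

# Hypothetical patterns for illustration only.
GROUP_PATTERNS = [
    re.compile(r"\bmuscles\b", re.IGNORECASE),    # plural, e.g. "muscles of the pharynx"
    re.compile(r"\bmuscle of\b", re.IGNORECASE),  # "<kind> muscle of <location>"
    re.compile(r"\bgroup\b", re.IGNORECASE),
]
LEAF_PART_PATTERNS = [
    re.compile(r"\b(?:head|belly|part|layer)\b", re.IGNORECASE),
]


def classify_term_type(label: str) -> str:
    """Classify a term label as 'leaf' or 'group'."""
    # Assumed precedence: a named part of a muscle is a leaf even if the
    # label also contains group-like wording.
    if any(p.search(label) for p in LEAF_PART_PATTERNS):
        return "leaf"
    if any(p.search(label) for p in GROUP_PATTERNS):
        return "group"
    return "leaf"
```

Writing the result into a term_type column, as the commit does, lets a curator override a misclassified row before Stage 2 runs.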
group_terms_by_parent.py now reads both template_initial.tsv and
template_groups_initial.tsv. Leaf rows are grouped by parent UBERON ID as
before. Grouping rows are pooled into a single 'grouping_terms' bucket since
their genus + location values are agent-determined per term, not shared by a
common parent.
Each per-term entry includes term_type ('leaf' or 'group'). Each per-group
JSON has a term_counts summary so curators can see the leaf/group split.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
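The bucketing described above can be sketched like this. The row field name `parent_id`, the output key `terms`, and the exact shape of `term_counts` are assumptions; only the 'grouping_terms' bucket name and the term_type values come from the commit message.

```python
from collections import Counter, defaultdict


def bucket_terms(leaf_rows, group_rows):
    """Group leaf rows by parent ID and pool all group rows into one bucket."""
    buckets = defaultdict(list)
    for row in leaf_rows:
        buckets[row["parent_id"]].append({**row, "term_type": "leaf"})
    for row in group_rows:
        # Genus and location are decided per term, so there is no shared parent key.
        buckets["grouping_terms"].append({**row, "term_type": "group"})
    return {
        key: {
            "terms": terms,
            "term_counts": dict(Counter(t["term_type"] for t in terms)),
        }
        for key, terms in buckets.items()
    }
```

Each value in the returned dict corresponds to one per-group JSON file handed to a subagent.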
merge_definitions.py now merges subagent outputs into both the leaf and groups
templates. Common fields (definitions, images, xrefs, def_xrefs) are applied
identically; logic columns differ:
- Leaf template: resolved_relationships -> is_a/part_of (existing)
- Groups template: group_template_rows[label] -> {genus, location} populates
the EC genus and EC part_of some location columns
Group rows missing the agent's genus+location output are flagged 'EC
incomplete' in the summary so curators can investigate.
New report: manual_curation.tsv lists group terms the agent punted (couldn't
fit the simple genus + part_of some Y pattern); includes proposed definition,
reason, and similar UBERON terms found via obo-grep for curator reference.
Refactored row processing into _apply_common_fields helper plus per-template
merge functions (merge_leaf_template, merge_groups_template) so the two
templates share definition/xref/image logic without duplication.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
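A shared helper for the fields common to both templates might look like the sketch below, including the deduplicated def_xref append from the earlier commit. The function names, the `definitions` key, and the `|` separator for multi-valued cells are assumptions; the column names and the `def_xrefs_to_add` key are taken from the commit messages.

```python
def append_unique(existing: str, additions, sep: str = "|") -> str:
    """Append values to a separator-joined cell, keeping order and dropping repeats."""
    seen = []
    for value in [*existing.split(sep), *additions]:
        value = value.strip()
        if value and value not in seen:
            seen.append(value)
    return sep.join(seen)


def apply_common_fields_sketch(row: dict, agent_output: dict) -> dict:
    """Return a copy of a template row with definition and def_xref updated."""
    label = row["LABEL"]
    updated = dict(row)  # leave the caller's row untouched
    definition = agent_output.get("definitions", {}).get(label)
    if definition:
        updated["Definition"] = definition
    updated["def_xref"] = append_unique(
        row.get("def_xref", ""),
        agent_output.get("def_xrefs_to_add", {}).get(label, []),
    )
    return updated
```

The leaf and groups merge functions would each call this helper first and then fill their own logic columns.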
…ia obo-grep
ntr-term-researcher.md updated to handle the leaf/group split introduced by
Stage 1 pre-classification:
- New top-of-file 'Term types' paragraph explaining the leaf vs group split
- Input section documents term_type field, term_counts, GROUPING_TERMS bucket
- Step 6 (Write Definitions) now branches: leaf gets Aristotelian form,
group gets collective form ('A group of muscles that...')
- Step 7 (Resolve Relationship Types) explicitly LEAF-only
- New Step 8 for GROUP terms: use awk over uberon-edit.obo to find similar
group terms; if they use 'genus + part_of some Y' pattern, populate
group_template_rows[label] with {genus, location}; otherwise punt to
manual_curation with similar UBERON stanzas as curator reference
- Output JSON gains group_template_rows and manual_curation keys
- Quality checks updated: every group term must end up in either
group_template_rows OR manual_curation
- Tools section notes obo-grep.pl may not be in PATH; awk fallback documented
CLAUDE.md updated with the dual-template flow:
- Stage 1 documents the term_type pre-classification
- Stage 3 enumerates the new agent responsibilities (steps 8 and 9)
- QC checklist split: shared / leaf-template / groups-template / reports
- Final Delivery registers both templates in uberon-odk.yaml
- Output Files Reference includes new groups template + manual_curation.tsv
- Column reference table now has separate sections for leaf and groups
ROADMAP marks Phase 2 implementation complete (pending end-to-end agent test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10 input terms processed by the new dual-template flow:
- Stage 1: pre-classified into 8 leaf + 2 group rows
- Stage 2: 8 group JSONs (7 leaf parent groups + 1 grouping_terms bucket)
- Stage 3: 8 ntr-term-researcher agents (3 needed retry due to API stream timeouts; grouping_terms handled inline after retry stalled)
- Stage 4: dual-template merge
Final output:
- 4 NEW leaf terms with Aristotelian definitions, real PMID/DOI refs, resolved is_a/part_of, Wikipedia + FMA xrefs
- 0 NEW group terms: both 'anterior vertebral muscle' and 'circular pharyngeal muscle' resolved to existing UBERON terms via Step 2 OLS4 check, validating that confirmed_match takes precedence over Step 8 EC pattern detection
- 5 confirmed_match (5 of 10 input terms already exist in UBERON):
  - ceratoglossus → UBERON:0008428
  - chondroglossus muscle → UBERON:0011535
  - costal diaphragm → UBERON:0035831
  - aryepiglotticus muscle → UBERON:0011025
  - constrictor muscle of pharynx → UBERON:0001569 (NEW from group flow)
- 1 possible_match: anterior vertebral muscle ≈ UBERON:0008549 (prevertebral muscle), flagged for curator decision (synonym vs new term)
- 1 manual_curation: anterior vertebral muscle. UBERON's similar term uses attaches_to + innervated_by axioms, not the simple genus + part_of pattern, so out of EC template scope
The test exercises: leaf+INFER parent, leaf+NEEDS_MAPPING:FMA parent, leaf+UNRESOLVABLE parent (including wrong-domain), group→confirmed_match, group→possible_match+manual_curation. All paths working as designed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude finished @dosumis's task in 5m 44s
PR Review: Add HRA Muscular NTR
Review checklist:
Summary
This PR adds ROBOT template infrastructure and four new NTR (New Term Request) leaf terms for HRA (Human Reference Atlas) muscular-system structures, along with a bulk NTR workflow toolkit. The term-level research appears careful and the definition quality is high. However, there are two issues that should be addressed before merge, and several important follow-on items.
🔴 CRITICAL
1. New templates not registered in ODK config: they will never compile
The precedent is
2. The definition research output (
and sets
🟡 IMPORTANT
3. Row 5 of
4. ASCTB-TEMP IRIs in the
All four terms carry
For example,
The ASCTB-TEMP IRI is the third token in the definition xref list.
5. Missing
Per
6.
🔵 SUGGESTIONS
7. Binary Excel file in repo root
8. ISBN references are non-standard in OBO/UBERON;
9. FMA parent for articularis genu (UBERON:0004252) is very broad
FMA:22424 "Muscle of anterior compartment of thigh" has no UBERON equivalent, so the workflow fell back to UBERON:0004252 "hindlimb stylopod muscle". The fallback is documented in the definition output and the errors report, but UBERON:0004252 is a very coarse grouping. The definition output itself suggests considering UBERON:0001377 (quadriceps femoris) and explicitly notes that "articularis genu is anatomically discrete from the quadriceps". Curator should confirm the fallback is acceptable or note it for a follow-up parent term request.
10. Using
11. The PR adds 1,700+ lines of Python scripts, JSON outputs, and documentation for the bulk NTR pipeline. This infrastructure is valuable but may belong in a separate utilities repository or be
Per-term verdict
Confirmed matches correctly excluded from template:
The term research quality and definition writing are solid. The main blockers are the missing ODK registration (templates won't compile without it) and the relationship-type inconsistency on UBERON:9900010. Addressing the missing
Two blocking issues before merge:
- Templates not wired into the build: hra-muscular.template.tsv and hra-muscular-groups.template.tsv are not registered in uberon-odk.yaml or uberon.Makefile. Without equivalent entries to the hra-skeleton precedent these terms will never compile into the ontology.
- Relationship-type conflict on UBERON:9900010: the definition workflow output (unresolvable_meningohypophyseal_artery.json) explicitly recommends part_of for dorsal part of intertransversarii laterales lumborum muscle but the template uses is_a. These need to be reconciled.
Also flagged as important (see full comment):
- UBERON:9900008 (clavicular head) has no is_a, only part_of
- ASCTB-TEMP IRIs in the def_xref column should be in xref
- term_tracker_item missing from both templates
- Report files under src/templates/hra-muscular-reports/ should not live inside the template directory
Two blocking issues before merge. 1) Templates not wired into the build: hra-muscular.template.tsv and hra-muscular-groups.template.tsv are not registered in uberon-odk.yaml or uberon.Makefile. 2) Relationship-type conflict on UBERON:9900010: workflow output says part_of but template uses is_a. See full review comment for details.
No description provided.