Add hra muscular ntr#3700

Open
dosumis wants to merge 10 commits into master from add-hra-muscular-ntr

Conversation

dosumis (Contributor) commented Apr 28, 2026

No description provided.

dosumis and others added 10 commits April 27, 2026 15:01
Four-stage pipeline for generating UBERON new term request ROBOT
templates from HRA ASCTB unmapped term tables:

Stage 1 (generate_template.py): reads xlsx/csv input, classifies parent
IDs (UBERON/FMA/ASCTB-TEMP), assigns UBERON:99xxxxx provisional IDs,
writes initial ROBOT template TSV + error and candidate reports.

Stage 2 (group_terms_by_parent.py): groups template rows by parent and
writes per-group JSON files for parallel subagent processing.

Stage 3 (ntr-term-researcher agent): resolves FMA/ASCTB-TEMP parents via
OLS4, checks for existing UBERON matches, writes Aristotelian definitions
from Wikipedia, resolves is_a vs part_of relationship types.

Stage 4 (merge_definitions.py): merges subagent outputs back into the
template; appends confirmed/possible OLS4 matches to candidates report.
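
As a rough illustration of the Stage 1 behaviour described above, parent classification and provisional ID assignment could look like the sketch below. The CURIE patterns, the `UNKNOWN` fallback, and the function names `classify_parent` and `provisional_ids` are assumptions made for illustration; the actual logic lives in generate_template.py and may differ.

```python
import re

# Assumed CURIE shapes; the real patterns in generate_template.py may differ.
PARENT_PATTERNS = {
    "UBERON": re.compile(r"^UBERON:\d{7}$"),
    "FMA": re.compile(r"^FMA:\d+$"),
    "ASCTB-TEMP": re.compile(r"^ASCTB-TEMP[:_].+$"),
}


def classify_parent(parent_id: str) -> str:
    """Return the namespace of a parent ID, or 'UNKNOWN' if nothing matches."""
    cleaned = parent_id.strip()
    for namespace, pattern in PARENT_PATTERNS.items():
        if pattern.match(cleaned):
            return namespace
    return "UNKNOWN"


def provisional_ids(start: int = 9900001):
    """Yield UBERON:99xxxxx provisional IDs, using the counter before incrementing."""
    counter = start
    while True:
        yield f"UBERON:{counter:07d}"
        counter += 1
```

Rows whose parent is classified as FMA or ASCTB-TEMP are the ones Stage 3 would later try to resolve through OLS4.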

Template columns: ID, LABEL, Definition, def_xref (definition annotation),
is_a, part_of, In_subset, Date, Contributor, Present_in_taxon,
Wikipedia_image (foaf:depiction), xref (direct oboInOwl:hasDbXref for
Wikipedia article URL + FMA ID).

Supporting agents/skills:
- ntr-term-researcher: Stage 3 subagent spec
- ontology-term-lookup: OLS4 structured search
- fetch-wiki-info: Wikidata + Wikipedia lookup
- .mcp.json: ols4, artl-mcp, playwright MCP servers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… plans

Phases covered:
- Phase 2: grouping vs leaf-node term distinction (linguistic rules, subagent behaviour)
- Phase 3: detect UBERON label-ID mismatches in Stage 1; new WRONG_PARENT: placeholder;
  multi-valued parent column splitting; subagent protocol for mismatch correction
  (informed by ovary run where 7/13 terms had wrong-domain UBERON parent IDs silently accepted)
- Phase 4: scale to full muscular-system table
- Phase 5: generalise to other ASCTB anatomical systems

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dragon-ai-agent
merge_definitions.py:
- Fallback path (parent resolved but rel type unknown) now leaves both
  is_a and part_of blank rather than double-setting them, and lists
  affected term labels in the summary output under 'Relationship
  unresolved' for curator attention
- Remove dead 'if jf.parent.name == "input"' guard — glob never matches
  files in subdirectories

generate_template.py:
- Remove dead write_tsv call with doubled headers that was immediately
  overwritten by the block below it
- Fix counter order: use counter for ID, then increment (was: increment
  then use counter-1)
- Remove hardcoded CONTRIBUTOR_IRI constant; add --contributor CLI arg
  with ORCID format validation; prompts interactively if not supplied
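
A minimal sketch of how a `--contributor` option with ORCID format validation and an interactive fallback might be wired up with argparse. It assumes the accepted form is a full `https://orcid.org/` IRI and checks only the shape, not the checksum; the helper names `orcid_iri`, `build_parser` and `resolve_contributor` are hypothetical and not taken from generate_template.py.

```python
import argparse
import re

# Assumed accepted form: a full ORCID IRI. The real script may accept other forms.
ORCID_IRI = re.compile(r"^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_iri(value: str) -> str:
    """argparse type callback: reject values that are not ORCID-shaped IRIs."""
    if not ORCID_IRI.match(value):
        raise argparse.ArgumentTypeError(f"not an ORCID IRI: {value!r}")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the --contributor option")
    parser.add_argument("--contributor", type=orcid_iri, default=None)
    return parser


def resolve_contributor(args: argparse.Namespace) -> str:
    """Use the CLI value if given, otherwise prompt and validate the reply."""
    if args.contributor is not None:
        return args.contributor
    return orcid_iri(input("Contributor ORCID IRI: ").strip())
```

Validating in the `type` callback means a malformed value is rejected by argparse before any template rows are written.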

group_terms_by_parent.py:
- Remove derive_wikipedia_urls call and wikipedia_urls field from output
  JSON — parent_label is always "" so the call always returned []; the
  subagent derives Wikipedia URLs independently during lookup

ntr-term-researcher.md:
- Clarify that Wikipedia article page URL (not image URL) goes in xrefs
  at point of successful lookup, as Wikipedia:Article_Title
- Add image relevance check: verify caption/alt text confirms the image
  illustrates the target structure before storing it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…name flagging

Addresses issues found in the ovary branch test run where the agent:
- classified layers (corpus luteum granulosa lutein/theca) as is_a parents (should be part_of)
- accepted source-provided broad parents instead of finding more specific ones
- left ASCTB-TEMP placeholders as the only def_xref (no real PMIDs)
- did not flag pathological terms (hemorrhagic, luteinized unruptured) as out of scope
- did not normalise non-standard names ('dominance' instead of 'dominant')

ntr-term-researcher.md changes:
- Step 1 expanded: after confirming source parent, agent must search OLS4 for a more
  specific parent (e.g. primary/secondary ovarian follicle vs generic ovarian follicle)
- New Step 3: scope check (pathological/dysfunctional → out_of_scope) and name check
  (non-standard → name_corrections with curator-reviewable suggestion)
- New Step 5: literature search — must find at least one real PMID/DOI for def_xref;
  ASCTB-TEMP placeholders explicitly disallowed as the only reference
- Step 7 (relationship resolution) rewritten with explicit structural vocabulary:
  layers, zones, heads, bellies, parts, compartments, walls → ALWAYS part_of
  subtypes/stages/members of grouping classes → is_a
  Quick test ('is a kind of' vs 'is part of') with worked examples
- Output JSON adds: def_xrefs_to_add, out_of_scope, name_corrections keys
- Quality checks expanded with explicit rules for layers, pathology, naming
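
The structural-vocabulary rule in Step 7 is applied by the agent from a prose spec, but the same idea can be sketched as a small heuristic. The regex and the function name `suggest_relationship` are assumptions for illustration; the word list is taken from the Step 7 wording above.

```python
import re

# Structural nouns from the Step 7 wording; matching them with a regex is an
# assumption, since the agent itself works from a prose specification.
STRUCTURAL_NOUN = re.compile(
    r"\b(layers?|zones?|heads?|belly|bellies|parts?|compartments?|walls?)\b",
    re.IGNORECASE,
)


def suggest_relationship(label: str):
    """Return 'part_of' for structural-subdivision labels, else None (undecided)."""
    if STRUCTURAL_NOUN.search(label):
        return "part_of"
    return None
```

A `None` result deliberately does not default to is_a: those labels would still need the 'is a kind of' versus 'is part of' quick test.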

merge_definitions.py changes:
- Refactored load_subagent_outputs to return single dict (less argument tuple churn)
- New behaviour: out_of_scope terms excluded from template (not just confirmed_matches);
  written to <name>-reports/out_of_scope.tsv for curator review
- New behaviour: name_corrections applied to LABEL column; original-source mapping
  written to <name>-reports/name_corrections.tsv
- New behaviour: def_xrefs_to_add appended to def_xref column with deduplication
- Lookup helper accepts both source and corrected labels (agent may key by either)
- Summary output extended with new counters
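
The def_xrefs_to_add behaviour can be pictured as an order-preserving merge into a multi-valued cell. This is a sketch only: it assumes the cell is pipe-separated, and `append_def_xrefs` is a hypothetical name rather than a function confirmed to exist in merge_definitions.py.

```python
def append_def_xrefs(existing: str, to_add, sep: str = "|") -> str:
    """Append new xrefs to a separator-delimited cell, keeping first-seen order."""
    merged = []
    seen = set()
    for xref in [*existing.split(sep), *to_add]:
        xref = xref.strip()
        # Skip empty tokens and anything already present in the cell.
        if xref and xref not in seen:
            seen.add(xref)
            merged.append(xref)
    return sep.join(merged)
```

Keeping first-seen order means references already in the template stay where the curator last saw them, and new ones are only ever added at the end.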

CLAUDE.md changes:
- Stage 3 description updated to enumerate the new agent responsibilities
- QC checklist extended: real def_xref required, layer/part_of rule, out_of_scope
  and name_corrections review steps
- Output Files Reference adds the two new report files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Surveyed 19 existing UBERON 'muscle of X' terms. 14 (74%) use the simple
'genus + part_of some Y' pattern with UBERON:0014892 (skeletal muscle organ,
vertebrate) as genus. 3 use attaches_to_part_of, 2 lack logical definition.

Decision gate passed: simple part_of pattern covers majority of existing
convention. Phase 2 implementation will support genus + part_of only;
attaches_to_part_of, innervated_by, and multi-axiom patterns deferred to
future phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ates

generate_template.py now classifies each input row as 'leaf' or 'group' using
linguistic regex rules (GROUP_PATTERNS / LEAF_PART_PATTERNS in classify_term_type).

- Leaf rows go to <name>.template.tsv with SC/part_of directives (existing)
- Group rows go to <name>-groups.template.tsv with EC genus + EC part_of some
  location directives (new) — genus and location columns left blank for the
  agent to fill

input.tsv gains a term_type column so curators can see the classification.
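
To make the leaf/group pre-classification concrete, here is a hedged sketch of a regex-based classifier. The names GROUP_PATTERNS, LEAF_PART_PATTERNS and classify_term_type come from the commit message, but the patterns themselves are invented for illustration and are not the ones in generate_template.py.

```python
import re

# Hypothetical patterns: the real GROUP_PATTERNS / LEAF_PART_PATTERNS are not
# shown in this PR text, so these only illustrate the approach.
GROUP_PATTERNS = [
    re.compile(r"\bmuscles\b", re.IGNORECASE),    # plural collective
    re.compile(r"\bmuscle of\b", re.IGNORECASE),  # 'muscle of X' grouping
    re.compile(r"\bgroup\b", re.IGNORECASE),
]
LEAF_PART_PATTERNS = [
    re.compile(r"\b(head|belly|part|layer) of\b", re.IGNORECASE),
]


def classify_term_type(label: str) -> str:
    """Return 'leaf' or 'group'; named-part labels win over grouping cues."""
    if any(p.search(label) for p in LEAF_PART_PATTERNS):
        return "leaf"
    if any(p.search(label) for p in GROUP_PATTERNS):
        return "group"
    return "leaf"
```

Checking the leaf-part patterns first keeps a label such as 'head of ... muscle' from being pulled into the groups template by a grouping cue later in the string.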

Smoke-tested on muscular-system: 20 group / 55 leaf rows out of 75 input terms,
matching ROADMAP prediction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
group_terms_by_parent.py now reads both template_initial.tsv and
template_groups_initial.tsv. Leaf rows are grouped by parent UBERON ID as
before. Grouping rows are pooled into a single 'grouping_terms' bucket since
their genus + location values are agent-determined per term, not shared by a
common parent.

Each per-term entry includes term_type ('leaf' or 'group'). Each per-group
JSON has a term_counts summary so curators can see the leaf/group split.
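
The bucketing described above could be sketched as follows. The row keys (`parent`, `label`), the `terms` key in the output, and the function name `bucket_rows` are assumptions; only `term_type`, `term_counts` and the `grouping_terms` bucket name come from the commit message.

```python
from collections import Counter, defaultdict


def bucket_rows(leaf_rows, grouping_rows):
    """Bucket leaf rows by parent ID and pool all grouping rows in one bucket."""
    buckets = defaultdict(list)
    for row in leaf_rows:
        buckets[row["parent"]].append({**row, "term_type": "leaf"})
    for row in grouping_rows:
        buckets["grouping_terms"].append({**row, "term_type": "group"})
    # Attach a per-bucket summary so curators can see the leaf/group split.
    return {
        key: {
            "terms": terms,
            "term_counts": dict(Counter(t["term_type"] for t in terms)),
        }
        for key, terms in buckets.items()
    }
```

Each value in the returned dict corresponds to what one per-group JSON file would hold.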

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
merge_definitions.py now merges subagent outputs into both the leaf and groups
templates. Common fields (definitions, images, xrefs, def_xrefs) are applied
identically; logic columns differ:

- Leaf template: resolved_relationships -> is_a/part_of (existing)
- Groups template: group_template_rows[label] -> {genus, location} populates
  the EC genus and EC part_of some location columns

Group rows missing the agent's genus+location output are flagged 'EC
incomplete' in the summary so curators can investigate.

New report: manual_curation.tsv lists group terms the agent punted (couldn't
fit the simple genus + part_of some Y pattern); includes proposed definition,
reason, and similar UBERON terms found via obo-grep for curator reference.

Refactored row processing into _apply_common_fields helper plus per-template
merge functions (merge_leaf_template, merge_groups_template) so the two
templates share definition/xref/image logic without duplication.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ia obo-grep

ntr-term-researcher.md updated to handle the leaf/group split introduced by
Stage 1 pre-classification:

- New top-of-file 'Term types' paragraph explaining the leaf vs group split
- Input section documents term_type field, term_counts, GROUPING_TERMS bucket
- Step 6 (Write Definitions) now branches: leaf gets Aristotelian form,
  group gets collective form ('A group of muscles that...')
- Step 7 (Resolve Relationship Types) explicitly LEAF-only
- New Step 8 for GROUP terms: use awk over uberon-edit.obo to find similar
  group terms; if they use 'genus + part_of some Y' pattern, populate
  group_template_rows[label] with {genus, location}; otherwise punt to
  manual_curation with similar UBERON stanzas as curator reference
- Output JSON gains group_template_rows and manual_curation keys
- Quality checks updated: every group term must end up in either
  group_template_rows OR manual_curation
- Tools section notes obo-grep.pl may not be in PATH; awk fallback documented

CLAUDE.md updated with the dual-template flow:
- Stage 1 documents the term_type pre-classification
- Stage 3 enumerates the new agent responsibilities (steps 8 and 9)
- QC checklist split: shared / leaf-template / groups-template / reports
- Final Delivery registers both templates in uberon-odk.yaml
- Output Files Reference includes new groups template + manual_curation.tsv
- Column reference table now has separate sections for leaf and groups

ROADMAP marks Phase 2 implementation complete (pending end-to-end agent test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10 input terms processed by the new dual-template flow:
- Stage 1: pre-classified into 8 leaf + 2 group rows
- Stage 2: 8 group JSONs (7 leaf parent groups + 1 grouping_terms bucket)
- Stage 3: 8 ntr-term-researcher agents (3 needed retry due to API stream
  timeouts; grouping_terms handled inline after retry stalled)
- Stage 4: dual-template merge

Final output:
- 4 NEW leaf terms with Aristotelian definitions, real PMID/DOI refs,
  resolved is_a/part_of, Wikipedia + FMA xrefs
- 0 NEW group terms — both 'anterior vertebral muscle' and
  'circular pharyngeal muscle' resolved to existing UBERON terms via
  Step 2 OLS4 check, validating that confirmed_match takes precedence
  over Step 8 EC pattern detection
- 5 confirmed_match (5 of 10 input terms already exist in UBERON):
  - ceratoglossus → UBERON:0008428
  - chondroglossus muscle → UBERON:0011535
  - costal diaphragm → UBERON:0035831
  - aryepiglotticus muscle → UBERON:0011025
  - constrictor muscle of pharynx → UBERON:0001569 (NEW from group flow)
- 1 possible_match: anterior vertebral muscle ≈ UBERON:0008549 (prevertebral
  muscle) — flagged for curator decision (synonym vs new term)
- 1 manual_curation: anterior vertebral muscle — UBERON's similar term uses
  attaches_to + innervated_by axioms, not the simple genus + part_of
  pattern, so out of EC template scope

The test exercises: leaf+INFER parent, leaf+NEEDS_MAPPING:FMA parent,
leaf+UNRESOLVABLE parent (including wrong-domain), group→confirmed_match,
group→possible_match+manual_curation. All paths working as designed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ai4c-agent (Bot) commented Apr 28, 2026

Claude finished @dosumis's task in 5m 44s.


PR Review: Add HRA Muscular NTR

Review checklist:

  • Hierarchy and parents are consistent
  • Definitions and supporting references look adequate
  • Obsolete or replacement handling is safe
  • OBO and ODK conventions appear preserved
  • No obvious CI, auth, or workflow regression

Summary

This PR adds ROBOT template infrastructure and four new NTR (New Term Request) leaf terms for HRA (Human Reference Atlas) muscular-system structures, along with a bulk NTR workflow toolkit. The term-level research appears careful and the definition quality is high. However, there are two issues that should be addressed before merge, and several important follow-on items.


🔴 CRITICAL

1. New templates not registered in ODK config — they will never compile

src/templates/hra-muscular.template.tsv and src/templates/hra-muscular-groups.template.tsv are added, but neither appears in src/ontology/uberon-odk.yaml nor has a corresponding rule in src/ontology/uberon.Makefile.

The precedent is hra-skeleton.template.tsv, which is listed under use_template: true in uberon-odk.yaml and has a custom $(COMPONENTSDIR)/hra_skeleton.owl rule in uberon.Makefile. Without equivalent registration the four NTR terms will never be built into the ontology.
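
A crude pre-merge check for this kind of omission is to confirm that each template file name is mentioned somewhere in the ODK config. The sketch below does a plain substring search rather than parsing the YAML, so it can only detect a missing mention, not a correct registration; `unregistered_templates` is a hypothetical helper name.

```python
def unregistered_templates(template_names, config_text: str):
    """Return template file names that never appear in the ODK config text."""
    return [name for name in template_names if name not in config_text]
```

In practice `config_text` would be the contents of src/ontology/uberon-odk.yaml, and a non-empty result would fail the check.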

2. dorsal part of intertransversarii laterales lumborum muscle — relationship type mismatch between workflow output and template

The definition research output (bulk_ntr_workflow/outputs/definitions/unresolvable_meningohypophyseal_artery.json) explicitly says:

"Note that 'dorsal part of' implies a subdivision, hence part_of relationship rather than is_a."

and sets resolved_relationships: "part_of". Yet src/templates/hra-muscular.template.tsv (row 6, UBERON:9900010) places UBERON:0008242 in the is_a column, not the part_of column. The research output and the final template are in direct conflict. Curator should decide the correct relationship and make the template consistent.


🟡 IMPORTANT

3. clavicular head of pectoralis major muscle (UBERON:9900008) — no is_a in template

Row 5 of hra-muscular.template.tsv has an empty is_a column and only part_of UBERON:0002381. In OBO format every term needs at least one explicit is_a (or a logical definition from which the reasoner can infer one). The template currently produces a term with only a BFO:0000050 some pectoralis_major SubClassOf axiom and no direct type. A parent such as "muscle head" (if such a class exists or should be created) or at minimum a generic skeletal muscle parent should be added.

4. ASCTB-TEMP IRIs in the def_xref column

All four terms carry https://purl.org/ccf/ASCTB-TEMP_<label> in the def_xref column (the column mapped to oboInOwl:hasDbXref on the definition). These are ephemeral provenance tracking IRIs from the HRA ASCT+B tables, not stable literature citations. They should be moved to the general xref column (or dropped) so they are not asserted as definition sources.

For example, anterior cervical intertransversarii muscle cites:

https://fipat.library.dal.ca/ta2/|ISBN:9780323393225|https://purl.org/ccf/ASCTB-TEMP_anterior-cervical-intertransversarii-muscle|PMID:12650404

The ASCTB-TEMP IRI is the third token in the definition xref list.
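
One way to apply this suggestion mechanically is to partition the def_xref cell into citations and ASCTB-TEMP provenance IRIs, so the latter can be written to the general xref column or dropped. A minimal sketch, assuming the pipe-separated layout shown in the example above; `split_def_xrefs` is a hypothetical name.

```python
ASCTB_TEMP_PREFIX = "https://purl.org/ccf/ASCTB-TEMP_"


def split_def_xrefs(def_xref_cell: str, sep: str = "|"):
    """Separate ASCTB-TEMP provenance IRIs from the remaining definition xrefs."""
    tokens = [t.strip() for t in def_xref_cell.split(sep) if t.strip()]
    provenance = [t for t in tokens if t.startswith(ASCTB_TEMP_PREFIX)]
    citations = [t for t in tokens if not t.startswith(ASCTB_TEMP_PREFIX)]
    # First element is the cleaned def_xref cell; second is the list to relocate.
    return sep.join(citations), provenance
```

The function leaves the order of the remaining citations unchanged, so the cleaned cell differs from the original only by the removed token.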

5. Missing term_tracker_item in both templates

Per CLAUDE.md and Uberon conventions, terms should link back to the originating GitHub issue with a term_tracker_item property. Neither template has a column for this. The hra-skeleton precedent and general UBERON practice require it for traceable provenance.

6. hra-muscular-reports/ placed under src/templates/

src/templates/hra-muscular-reports/ contains pipeline outputs (candidates.tsv, errors.tsv, input.tsv, manual_curation.tsv), not template files. These will confuse any tool that iterates src/templates/*.tsv or src/templates/**/*.tsv. Reports should live in bulk_ntr_workflow/outputs/ or similar, not inside the template directory.


🔵 SUGGESTIONS

7. Binary Excel file in repo root

hra_unmapped-asct-term-list-with-refs.xlsx (277 KB) is added to the repository root. Binary blobs in a git repo grow the clone size permanently. If this is source data, consider storing it under src/datasets/ with a brief README, or keeping it outside git (linked from the issue tracker).

8. ISBN:9780323393225 as a def_xref

ISBN references are non-standard in OBO/UBERON; PMID: and doi: are the expected citation forms. The TA2 URL (https://fipat.library.dal.ca/ta2/) is already included and serves as the Terminologia Anatomica reference. The ISBN could be removed from def_xref or moved to a comment.

9. FMA parent for articularis genu (UBERON:0004252) is very broad

FMA:22424 "Muscle of anterior compartment of thigh" has no UBERON equivalent, so the workflow fell back to UBERON:0004252 "hindlimb stylopod muscle". The fallback is documented in the definition output and the errors report, but UBERON:0004252 is a very coarse grouping. The definition output itself suggests considering UBERON:0001377 (quadriceps femoris) and explicitly notes that "articularis genu is anatomically discrete from the quadriceps". Curator should confirm the fallback is acceptable or note it for a follow-up parent term request.

10. @playwright/mcp@latest in .mcp.json is a floating version

Using @latest means the Playwright MCP server version is not pinned. Different contributors will pull different versions, which can affect reproducibility of the research workflow. Pinning to a specific semver is safer for shared infrastructure.

11. bulk_ntr_workflow/ adds substantial non-ontology content

The PR adds 1,700+ lines of Python scripts, JSON outputs, and documentation for the bulk NTR pipeline. This infrastructure is valuable but may belong in a separate utilities repository or be .gitignored when done. If it stays, the bulk_ntr_workflow/CLAUDE.md should be moved to somewhere that does not accidentally extend the project-level CLAUDE.md scope.


Per-term verdict

| NTR ID | Label | Parent | Status |
| --- | --- | --- | --- |
| UBERON:9900001 | anterior cervical intertransversarii muscle | is_a UBERON:0002324 (muscle of back) | ✅ Reasonable; parent broad but defensible |
| UBERON:9900003 | articularis genu muscle | is_a UBERON:0004252 (hindlimb stylopod muscle) | 🟡 Fallback parent; FMA:22437 xref looks correct |
| UBERON:9900008 | clavicular head of pectoralis major muscle | part_of UBERON:0002381 (pectoralis major) | 🔴 No is_a |
| UBERON:9900010 | dorsal part of intertransversarii laterales lumborum muscle | template: is_a UBERON:0008242; workflow output: part_of UBERON:0008242 | 🔴 Relationship type conflict |

Confirmed matches correctly excluded from template: aryepiglottic part of oblique arytenoid muscle (→ UBERON:0011025), costal part of respiratory diaphragm muscle (→ UBERON:0035831), ceratoglossus of hyoglossus muscle (→ UBERON:0008428), chondroglossus of hyoglossus muscle (→ UBERON:0011535), circular pharyngeal muscle (→ UBERON:0001569).


The term research quality and definition writing are solid. The main blockers are the missing ODK registration (templates won't compile without it) and the relationship-type inconsistency on UBERON:9900010. Addressing the missing is_a on UBERON:9900008 and the ASCTB-TEMP def_xrefs would bring the templates into line with Uberon conventions.


@ai4c-agent ai4c-agent Bot left a comment

Two blocking issues before merge:

  1. Templates not wired into the build: hra-muscular.template.tsv and hra-muscular-groups.template.tsv are not registered in uberon-odk.yaml or uberon.Makefile. Without equivalent entries to the hra-skeleton precedent these terms will never compile into the ontology.

  2. Relationship-type conflict on UBERON:9900010 — the definition workflow output (unresolvable_meningohypophyseal_artery.json) explicitly recommends part_of for dorsal part of intertransversarii laterales lumborum muscle but the template uses is_a. These need to be reconciled.

Also flagged as important (see full comment):

  • UBERON:9900008 (clavicular head) has no is_a, only part_of
  • ASCTB-TEMP IRIs in the def_xref column should be in xref
  • term_tracker_item missing from both templates
  • Report files under src/templates/hra-muscular-reports/ should not live inside the template directory

@ai4c-agent ai4c-agent Bot left a comment

Two blocking issues before merge. 1) Templates not wired into the build: hra-muscular.template.tsv and hra-muscular-groups.template.tsv are not registered in uberon-odk.yaml or uberon.Makefile. 2) Relationship-type conflict on UBERON:9900010: workflow output says part_of but template uses is_a. See full review comment for details.
