Add hra ovary ntr #3698

Draft
dosumis wants to merge 8 commits into master from add-hra-ovary-ntr

dosumis commented Apr 27, 2026

Testing an agentic workflow for adding bulk new term requests (NTRs) from HRA as ROBOT-templated terms.

dosumis and others added 8 commits April 27, 2026 14:26
Adds bulk_ntr_workflow/ — a 4-stage pipeline for converting HRA ASCTB
unmapped terms into UBERON-ready ROBOT templates with QC reports.

Stages:
  1. generate_template.py  — reads xlsx/csv, classifies parent IDs
     (UBERON/FMA/ASCTB-TEMP), assigns UBERON:99xxxxx NTR IDs, writes
     initial template to src/templates/ and TSV reports to <name>-reports/
  2. group_terms_by_parent.py — groups by parent for parallel subagents
  3. ntr-term-researcher subagent — OLS4 existing-term check, Wikipedia
     definition lookup, relationship inference, FMA+ASCTB-TEMP parent resolution
  4. merge_definitions.py  — merges subagent JSON back into template
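
The Stage 1 steps listed above (classifying parent IDs and assigning NTR IDs) might look roughly like the following. This is a minimal sketch, not the code in generate_template.py: the regular expressions, the `UNKNOWN` fallback, the function names, and the starting ID are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the real script may accept other ID shapes.
_UBERON = re.compile(r"^UBERON:\d{7}$")
_FMA = re.compile(r"^FMA:\d+$")
_ASCTB_TEMP = re.compile(r"^ASCTB-TEMP:\S+$")


def classify_parent_id(parent_id: str) -> str:
    """Return 'UBERON', 'FMA', 'ASCTB-TEMP', or 'UNKNOWN' for a parent CURIE."""
    value = parent_id.strip()
    if _UBERON.match(value):
        return "UBERON"
    if _FMA.match(value):
        return "FMA"
    if _ASCTB_TEMP.match(value):
        return "ASCTB-TEMP"
    return "UNKNOWN"


def assign_ntr_ids(labels, start=9900000):
    """Assign sequential UBERON:99xxxxx IDs to labels, in input order."""
    return {label: f"UBERON:{start + i:07d}" for i, label in enumerate(labels)}
```

Keeping an explicit `UNKNOWN` bucket is one way to route malformed parent IDs into an errors report instead of silently dropping them.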

Also adds:
  - .claude/agents/ntr-term-researcher.md  (Stage 3 subagent)
  - .claude/agents/ontology-term-lookup.md (OLS4 helper subagent)
  - .claude/skills/fetch-wiki-info/         (Wikipedia/Wikidata skill)
  - .mcp.json  (OLS4, playwright, artl-mcp servers)
  - src/templates/hra-muscular.template.tsv (initial Stage 1 output, muscular system)
  - src/templates/hra-muscular-reports/     (input, errors, candidates TSVs)

Signed-off-by: @dragon-ai-agent
Adds --limit arg to generate_template.py and runs the 4-stage pipeline
on the first 10 muscular-system terms from hra_unmapped-asct-term-list:

Stage 1: generated hra-muscular-test template + reports
Stage 2: 8 groups (all 3 parent-ID types + wrong parent cases)
Stage 3: 8 parallel subagents: OLS4 matching, Wikipedia lookup, definitions
Stage 4: merged into 6-row final template; 4 confirmed matches excluded

Key findings:
- 4 confirmed UBERON matches: ceratoglossus (UBERON:0008428), chondroglossus
  (UBERON:0011535), costal diaphragm (UBERON:0035831), aryepiglotticus
  (UBERON:0011025)
- FMA parents resolved: FMA:22424→UBERON:0004252, FMA:46583→UBERON:0008575
- Wrong parents corrected: UBERON:0000975 (sternum)→UBERON:0001103 (diaphragm);
  meningohypophyseal artery→UBERON:0008242 (lower back muscle)
- All 6 new terms have content-rich Aristotelian definitions and resolved
  relationship types (no PENDING/INFER remaining)

Also fixes merge_definitions.py to tolerate field-name variations in
subagent JSON outputs (ntr_label/label, matched_id/uberon_id,
existing_term_match dict form).
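
One way to tolerate the field-name variations named above is to look up each value under several candidate keys. The sketch below is illustrative only: the helper names, the output schema, and the `id` key inside the `existing_term_match` dict are assumptions, not the actual merge_definitions.py logic.

```python
def first_present(record: dict, *keys, default=""):
    """Return the first non-empty value found under any of the given keys."""
    for key in keys:
        value = record.get(key)
        if value not in (None, ""):
            return value
    return default


def normalise_subagent_record(record: dict) -> dict:
    """Map field-name variants in a subagent JSON record onto one schema."""
    match = record.get("existing_term_match")
    # existing_term_match may arrive as a dict; the 'id' key is an assumption.
    match_id = match.get("id", "") if isinstance(match, dict) else ""
    return {
        "label": first_present(record, "ntr_label", "label"),
        "matched_id": first_present(
            record, "matched_id", "uberon_id", default=match_id
        ),
    }
```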

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a new 'xref' column (A oboInOwl:hasDbXref SPLIT=|) to the ROBOT
template for direct term annotations, distinct from def_xref which
annotates only the definition.

The column holds pipe-separated values populated by the subagent:
  Wikipedia:Article_Title  — article page for the specific term
  FMA:NNNNN               — FMA ID from OLS4 cross-references

Changes:
- generate_template.py: new TEMPLATE_HEADERS/DIRECTIVES entry + blank cell
- merge_definitions.py: reads 'xrefs' key from subagent JSON; COL_TERMREF=11
- ntr-term-researcher.md: documents when/how to populate xrefs output key
- CLAUDE.md: adds xref row to column reference table

Xrefs column is empty in the current test run; subagents on future
groups will populate it once they record Wikipedia article URLs + FMA IDs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a term's own IRI in the source table is an FMA IRI, Stage 1 now
pre-populates the xref column with the extracted FMA:NNNNN curie rather
than leaving it to the subagent alone.

Stage 4 merge appends subagent-provided xrefs (Wikipedia URL + FMA IDs
found via OLS4) to the pre-existing cell value, deduplicating by
pipe-segment, so neither source overwrites the other.
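
The append-and-deduplicate behaviour described above could be sketched as follows. The function name and signature are hypothetical, and the IDs in the usage line are made-up placeholders rather than values from the dataset.

```python
def merge_xrefs(existing_cell: str, new_xrefs, sep: str = "|") -> str:
    """Append new xrefs to a pipe-separated cell, keeping first-seen order."""
    merged = []
    seen = set()
    for segment in list(existing_cell.split(sep)) + list(new_xrefs):
        segment = segment.strip()
        # Skip empty segments (for example from an empty cell) and repeats.
        if segment and segment not in seen:
            seen.add(segment)
            merged.append(segment)
    return sep.join(merged)


# Illustrative values only: the Stage 1 value stays first, the duplicate is dropped.
print(merge_xrefs("FMA:12345", ["Wikipedia:Article_Title", "FMA:12345"]))
```

Preserving first-seen order keeps the Stage 1 value at the front of the cell, so repeated runs of the merge leave the template unchanged.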

This covers the case where future input files have FMA IRIs in the 'as'
column. The current ASCTB-TEMP-only dataset produces empty xref cells in
Stage 1 as before; subagent xrefs continue to be added in Stage 4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviews all 75 muscular-system terms; classifies ~22 as grouping terms
and ~49 as leaf nodes using latent knowledge. Documents linguistic cues,
borderline cases, required workflow changes (term_type column, subagent
behaviour split), draft rule set, and Phases 3-5.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Processes ovarycsv.csv via the bulk NTR workflow:
- Stage 1: generates hra-ovary.template.tsv from 13 ASCTB-TEMP ovary terms
- Stage 3: 9 subagent groups resolve definitions, correct wrong parent IDs
  (source CSV had anatomically wrong parent IDs — bones for follicle terms —
  subagents found correct UBERON parents via OLS4 and biological context)
- Stage 4: merges definitions; 0 PENDING, 0 INFER remaining

New terms (UBERON:9900100–9900112):
  corona radiata, corpus luteum granulosa lutein/theca layers,
  cumulus oophorus oocyte complex, dominant/early/pre-selection/selection
  antral follicles, hemorrhagic anovulatory follicle, luteinized unruptured
  follicle, multilayer/transitional primary/primordial ovarian follicles

Also:
- Fix generate_template.py to handle UTF-8 BOM in CSV input (utf-8-sig)
- Add hra-ovary-prefixes.owl (declares foaf:depiction for Wikipedia images)
- Register hra_ovary.owl in uberon-odk.yaml and uberon.Makefile
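
For context on the BOM fix: decoding with `utf-8-sig` strips a leading byte-order mark, which would otherwise become part of the first column header. A minimal sketch, assuming hypothetical column names (`label`, `parent`) that are not taken from ovarycsv.csv:

```python
import csv
import io


def read_rows(raw_bytes: bytes):
    """Decode CSV bytes with utf-8-sig so a BOM does not pollute the first header."""
    text = raw_bytes.decode("utf-8-sig")
    return list(csv.DictReader(io.StringIO(text)))


# A CSV exported with a BOM (EF BB BF) in front of the header row.
sample = b"\xef\xbb\xbflabel,parent\ncorona radiata,UBERON:0001305\n"
rows = read_rows(sample)
```

With plain `utf-8` the first key would be `"\ufefflabel"`, so a lookup such as `row["label"]` would fail on BOM-prefixed input.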

Curator notes (see candidates.tsv):
- corona radiata: label collision with UBERON:0004682 synonym; consider
  "corona radiata of cumulus oophorus" as primary label
- cumulus oophorus oocyte complex: possible overlap with UBERON:0001306
- multilayer ovarian follicle: possible overlap with UBERON:0000036

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dragon-ai-agent
Source input file from HRA ASCTB unmapped terms for the ovary table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dragon-ai-agent
@dosumis dosumis requested a review from matentzn as a code owner April 27, 2026 15:58
ai4c-agent Bot commented Apr 27, 2026

Claude finished @dosumis's task in 5m 33s.


PR Review: Add HRA Ovary NTR

Review checklist:

  • Hierarchy and parents are consistent — issues found (see below)
  • Definitions and supporting references look adequate — issues found
  • Obsolete or replacement handling is safe — N/A (no obsoletions)
  • OBO and ODK conventions appear preserved — minor issues
  • No obvious CI, auth, or workflow regression — OK (muscular template not wired in)

Summary

This PR adds 13 new ovary NTR terms (UBERON:9900100–9900112) via a ROBOT template, together with the bulk NTR workflow tooling used to generate them. The workflow approach is well-structured, but there are two critical ontological errors in the final template that must be fixed before merge, plus several important hierarchy and metadata issues.


🔴 CRITICAL

1. Corpus luteum layers classified as types of corpus luteum (src/templates/hra-ovary.template.tsv, rows 4–5)

corpus luteum granulosa lutein layer (UBERON:9900101) and corpus luteum granulosa theca layer (UBERON:9900102) both have is_a UBERON:0002512 (corpus luteum). A layer of the corpus luteum is part of it, not a type of it. Because the parent sits in the is_a column, the template generates SubClassOf(UBERON:0002512), which asserts that a granulosa lutein layer IS A corpus luteum. That assertion is incorrect.

Fix: move UBERON:0002512 from the is_a column to the part_of column for both rows, so the template generates SubClassOf(BFO:0000050 some UBERON:0002512).
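
If the fix is applied programmatically rather than by hand, it could be sketched as below. The column keys (`ID`, `is_a`, `part_of`) and the function name are assumptions for illustration; the real template headers may differ.

```python
def move_parent_to_part_of(rows, term_ids, parent_id,
                           is_a_col="is_a", part_of_col="part_of"):
    """Return copies of rows with parent_id moved from is_a to part_of for term_ids."""
    fixed = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        if row.get("ID") in term_ids and row.get(is_a_col) == parent_id:
            row[is_a_col] = ""
            row[part_of_col] = parent_id
        fixed.append(row)
    return fixed


# Illustrative rows only, using the IDs named in this review.
rows = [
    {"ID": "UBERON:9900101", "is_a": "UBERON:0002512", "part_of": ""},
    {"ID": "UBERON:9900104", "is_a": "UBERON:0001305", "part_of": ""},
]
fixed = move_parent_to_part_of(
    rows, {"UBERON:9900101", "UBERON:9900102"}, "UBERON:0002512"
)
```

Note that this leaves the is_a cell empty for the moved rows. Whether those terms then need a different is_a parent is a curation decision that the sketch does not address.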


🟡 IMPORTANT

2. cumulus oophorus oocyte complex (9900103) classified is_a ovarian follicle (hra-ovary.template.tsv, row 6)

The COC is a structure that resides within an antral follicle; it is not a type of ovarian follicle (UBERON:0001305). The correct relationship is part_of UBERON:0001305. Move the parent ID to the part_of column.

3. Missing real def_xrefs for three terms

corona radiata (9900100), corpus luteum granulosa lutein layer (9900101), and corpus luteum granulosa theca layer (9900102) have only the ASCTB-TEMP placeholder IRI as their definition xref. These placeholders do not resolve to actual publications. Per UBERON policy ("all terms should have definitions with at least one definition xref, ideally a PMID"), real references (PMIDs or DOIs) must be added.

4. transitional primary ovarian follicle (9900111) has an overly broad parent

Parent is UBERON:0001305 (ovarian follicle, generic). There is already UBERON:0000035 (primary ovarian follicle) in the ontology — a transitional primary follicle is clearly a subtype of the primary follicle stage, not just any ovarian follicle. Update is_a to UBERON:0000035.

5. multilayer ovarian follicle (9900108) has an overly broad parent

Parent is UBERON:0001305 (ovarian follicle). This term describes a preantral follicle at an advanced stage before antrum formation, which maps closely to UBERON:0000036 (secondary ovarian follicle, defined as "two or more layers of granulosa cells, up to the onset of antrum formation"). Consider is_a UBERON:0000036.

6. Pathological terms may not belong in UBERON

hemorrhagic anovulatory follicle (9900106) and luteinized unruptured follicle (9900107) describe dysfunctional/pathological follicle states ("fails to ovulate", "fails to rupture"). UBERON is a normal-anatomy ontology. These terms may be more appropriate in MONDO or as PATO-qualified terms. At minimum, a justification for including pathological structure types in UBERON should be provided.

7. Non-standard term name: dominance antral follicle (9900104)

The standard anatomical term is dominant antral follicle, not "dominance". The source HRA data used "dominance" but UBERON should normalise to the standard form; "dominance antral follicle" can be kept as a synonym if needed.


🔵 SUGGESTION

8. hra-muscular.template.tsv contains unresolved placeholders

src/templates/hra-muscular.template.tsv still has [PENDING] definitions and INFER:, UNRESOLVABLE:, NEEDS_MAPPING: prefix values throughout. It is not registered in the Makefile/ODK YAML, so it won't break the build, but committing an incomplete template is misleading. Either complete it or keep it out of src/templates/ until ready.
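
A check like the following could flag such a template before it is committed. It is a minimal sketch: the marker list mirrors the prefixes named above, while the function name and the sample rows are hypothetical.

```python
PLACEHOLDER_MARKERS = ("[PENDING]", "INFER:", "UNRESOLVABLE:", "NEEDS_MAPPING:")


def find_placeholders(tsv_text: str):
    """Return (line_number, marker) pairs for every placeholder found in the text."""
    hits = []
    for line_number, line in enumerate(tsv_text.splitlines(), start=1):
        for cell in line.split("\t"):
            for marker in PLACEHOLDER_MARKERS:
                if marker in cell:
                    hits.append((line_number, marker))
    return hits


# Made-up sample rows; a non-empty result means the template is not ready.
sample = (
    "ID\tdefinition\n"
    "UBERON:9900001\t[PENDING]\n"
    "UBERON:9900002\tINFER:part_of\n"
    "UBERON:9900003\tA muscle.\n"
)
hits = find_placeholders(sample)
```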

9. ovarycsv.csv is in the repo root

This source file (ovarycsv.csv) should live in bulk_ntr_workflow/source_data/ alongside the workflow that uses it, not in the repository root.

10. .mcp.json project-level MCP configuration

Adding .mcp.json to the repo root configures artl-mcp, ols4 (external HTTP), and playwright (via npx) for all Claude Code users in this repo. This is intentional for the NTR workflow, but worth confirming with the team that this is the desired scope (project-wide vs. personal ~/.claude/ config).

11. Intermediate workflow artifacts in bulk_ntr_workflow/outputs/definitions/

This directory contains many per-group JSON files that are processing intermediates. Consider adding bulk_ntr_workflow/outputs/ to .gitignore and only committing the final template_final.tsv.

12. Missing trailing newline in src/ontology/uberon-odk.yaml

The file still ends without a trailing newline after the PR's additions.


Merge recommendation

🔴 Request changes — the two corpus-luteum-layer hierarchy errors (#1) and the COC hierarchy error (#2) are logical axiom mistakes that will produce incorrect OWL class assertions once the template is run, and the missing real references (#3) violate UBERON metadata policy. These must be fixed before the template is finalised.

ai4c-agent Bot left a comment

This PR adds 13 new HRA ovary NTR terms and bulk NTR workflow tooling. There are two critical ontological errors that must be fixed before merge:

  1. Corpus luteum layers in wrong relationship (rows 4–5 of hra-ovary.template.tsv): UBERON:9900101 and UBERON:9900102 use is_a UBERON:0002512 (corpus luteum) but should use part_of UBERON:0002512 — a layer is part of the corpus luteum, not a type of it.

  2. Cumulus oophorus oocyte complex hierarchy (row 6): UBERON:9900103 is classified is_a UBERON:0001305 (ovarian follicle) but the COC is part of a follicle, not a type of follicle. Should be part_of UBERON:0001305.

Additionally: three terms lack real definition xrefs (only have ASCTB-TEMP placeholders); transitional primary ovarian follicle should use UBERON:0000035 as parent rather than the generic UBERON:0001305; dominance antral follicle should be renamed to dominant antral follicle; and two terms (hemorrhagic anovulatory follicle, luteinized unruptured follicle) describe pathological states that may not belong in a normal-anatomy ontology.

See full review in the PR comment for details on all issues.
