bug(medcat): CU-869ckx6dr Allow for better supervised training by mart-r · Pull Request #374 · CogStack/cogstack-nlp

mart-r · 2026-03-24T12:30:36Z

During supervised training, the document and the entity provided to the component being trained (currently only the linker) are not aligned with each other.
There are 2 distinct issues:

- Since the document is generated by the regular inference procedure, it has .ner_ents and .linked_ents lists, but these are not aligned with what the annotated dataset specifies.
- The entity passed to the trained component is always a new instance with new state instead of being reused.
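
To make the two issues concrete, here is a minimal toy sketch. ToyEntity, ToyDoc and train_step are hypothetical stand-ins invented for illustration and are not the MedCAT classes or API; the only assumption is that a trained component mutates the entity it receives.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEntity:
    """Hypothetical stand-in for a mutable entity (not the MedCAT class)."""
    start: int
    end: int
    cui: str = ""


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document that holds entity lists."""
    text: str
    ner_ents: list = field(default_factory=list)
    linked_ents: list = field(default_factory=list)


def train_step(doc: ToyDoc, entity: ToyEntity, cui: str) -> None:
    """Pretend trained component: records the label on the entity it is given."""
    entity.cui = cui


# Issue 1: inference found one span, but the annotation refers to another.
inferred = ToyEntity(0, 6, "C_SEVERE")
doc = ToyDoc("severe chest pain", ner_ents=[inferred], linked_ents=[inferred])

# Issue 2: the annotated entity is a fresh instance the document knows nothing about.
annotated = ToyEntity(7, 17)
train_step(doc, annotated, "C_CHEST_PAIN")

# The document's entity list neither contains the annotated entity
# nor reflects the state written to it during training.
annotated_is_in_doc = any(e is annotated for e in doc.linked_ents)
doc_spans = [(e.start, e.end) for e in doc.linked_ents]
print(annotated_is_in_doc, doc_spans)
```

In this toy setup, anything the component looks up through doc.linked_ents disagrees with the entity it was handed.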

This PR fixes the above by:

- Preparing the .ner_ents and .linked_ents for the document when doing supervised training (they will contain the same entities for now)
- Reworking how entities are created in this context so that they can be reused (so the MutableEntity is within the .linked_ents)
  - This involved creating a new method, entity_from_tokens_in_doc, for the pipe and tokenizers
  - And deprecating the old one (entity_from_tokens)

It also adds a few tests to support the above and updates other tests to match the changes.
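
The approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the function names entity_from_tokens_in_doc and entity_from_tokens are borrowed from the description, but their signatures and bodies here are invented, and ToyEntity, ToyDoc and prepare_doc_for_training are hypothetical helpers.

```python
import warnings
from dataclasses import dataclass, field


@dataclass
class ToyEntity:
    """Hypothetical stand-in for a mutable entity (not the MedCAT class)."""
    start: int
    end: int
    cui: str = ""


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document that holds entity lists."""
    text: str
    ner_ents: list = field(default_factory=list)
    linked_ents: list = field(default_factory=list)


def prepare_doc_for_training(doc: ToyDoc, annotations: list) -> None:
    """Replace inference-time entities with ones built from the annotations."""
    ents = [ToyEntity(start, end) for start, end in annotations]
    doc.ner_ents = list(ents)
    doc.linked_ents = list(ents)  # both lists hold the same instances


def entity_from_tokens_in_doc(doc: ToyDoc, start: int, end: int) -> ToyEntity:
    """Assumed behaviour: reuse the entity already stored in the document."""
    for ent in doc.linked_ents:
        if (ent.start, ent.end) == (start, end):
            return ent
    # Assumption for this sketch: register a new entity if none matches.
    ent = ToyEntity(start, end)
    doc.ner_ents.append(ent)
    doc.linked_ents.append(ent)
    return ent


def entity_from_tokens(start: int, end: int) -> ToyEntity:
    """Old toy behaviour: always a new instance, now flagged as deprecated."""
    warnings.warn(
        "entity_from_tokens is deprecated; use entity_from_tokens_in_doc",
        DeprecationWarning,
        stacklevel=2,
    )
    return ToyEntity(start, end)


inferred = ToyEntity(0, 6, "C_SEVERE")
doc = ToyDoc("severe chest pain", ner_ents=[inferred], linked_ents=[inferred])
prepare_doc_for_training(doc, [(7, 17)])

entity = entity_from_tokens_in_doc(doc, 7, 17)
entity.cui = "C_CHEST_PAIN"  # state written during training stays in the document
print(entity is doc.linked_ents[0], doc.linked_ents[0].cui)
```

Because the entity handed to the component is the same object that sits in the document's lists, whatever state training writes to it is visible through the document as well.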


github-actions bot added 15 commits March 24, 2026 12:25

- 760b9e1 CU-869ckx6dr: Add extra test to trainer to make sure it tests on multiple projects
- 4104863 CU-869ckx6dr: Add new method for reuse of entities when getting based on tokens
- e3c6f77 CU-869ckx6dr: Add simple test for entity persistence in document
- 214c9b1 CU-869ckx6dr: Small addition to test
- 3832ecb CU-869ckx6dr: Prepare document with appropriate entities at training time
- 523d6b9 CU-869ckx6dr: Update tests to work with new setup
- 97a532c CU-869ckx6dr: Add a new test for entities in add_and_train_concept.
- 5ff0464 CU-869ckx6dr: Add deprecation warning to old / unused entity_from_tokens method in pipe
- 5cd5862 CU-869ckx6dr: Add deprecation warning to old / unused entity_from_tokens method in tokenizers
- c3c3822 CU-869ckx6dr: Deprecate unused method on a protocol level as well
- b03adc3 CU-869ckx6dr: Fix linting issue
- 242c602 CU-869ckx6dr: Fix minor issues in test-time supervised training data
- 1ac9df9 CU-869ckx6dr: Add new test for order of training examples
- ec32f5d CU-869ckx6dr: Minor changes to trainer tests
- 46c9b88 CU-869ckx6dr: Allow a little longer for the relcat tutorial to run

mart-r wants to merge 15 commits into main from bug/medcat/CU-869ckx6dr-allow-for-better-supervised-training
