bug(medcat): CU-869ckx6dr Allow for better supervised training#374

Open
mart-r wants to merge 15 commits into main from
bug/medcat/CU-869ckx6dr-allow-for-better-supervised-training

mart-r (Collaborator) commented Mar 24, 2026

During supervised training, the document and the entity passed to the component being trained (currently only the linker) are not aligned with each other.
There are two distinct issues:

  • Because the document is produced by the regular inference procedure, its .ner_ents and .linked_ents lists are populated, but they do not match what the annotated dataset specifies.
  • The entity passed to the trained component is always a new instance with fresh state instead of being reused.

This PR fixes the above by:

  • Preparing the .ner_ents and .linked_ents lists on the document when doing supervised training (for now, both contain the same entities)
  • Reworking entity creation in this context so that these entities can be reused (so the MutableEntity passed to the component is the one within .linked_ents)
    • This involved adding a new method, entity_from_tokens_in_doc, to the pipe and tokenizers
    • And deprecating the old one (entity_from_tokens)
  • Adding a few tests to support the above and updating existing tests to match the changes
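
To illustrate the intent of the two fixes, here is a minimal toy sketch. It is not MedCAT code: `ToyEntity`, `ToyDoc`, `prepare_doc_for_supervised_training`, and `entity_from_span_in_doc` are hypothetical stand-ins, and the real `entity_from_tokens_in_doc` may have a different signature and lookup logic. The sketch only shows the idea that the document's entity lists are rebuilt from the annotations and that the entity handed to the trained component is the same object held in `.linked_ents`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEntity:
    """Hypothetical stand-in for a mutable entity (not MedCAT's MutableEntity)."""
    start: int
    end: int
    cui: str = ""
    context_seen: bool = False  # example of state a component might set


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a document holding the two entity lists."""
    text: str
    ner_ents: list = field(default_factory=list)
    linked_ents: list = field(default_factory=list)


def prepare_doc_for_supervised_training(doc, annotations):
    """Replace inference-time entities with ones built from the annotations."""
    ents = [ToyEntity(a["start"], a["end"], a["cui"]) for a in annotations]
    doc.ner_ents = list(ents)
    doc.linked_ents = list(ents)  # both lists hold the same entity objects
    return doc


def entity_from_span_in_doc(doc, start, end):
    """Return the entity the doc already holds for this span, if any."""
    for ent in doc.linked_ents:
        if (ent.start, ent.end) == (start, end):
            return ent
    return ToyEntity(start, end)  # assumption: fall back to a fresh instance


# The doc arrives with an entity left over from regular inference.
doc = ToyDoc("patient has diabetes", linked_ents=[ToyEntity(0, 7, "C_INFERRED")])
annotations = [{"start": 12, "end": 20, "cui": "C_ANNOTATED"}]
prepare_doc_for_supervised_training(doc, annotations)

# The trained component receives the entity that lives in the doc,
# so any state it sets is visible through doc.linked_ents.
ent = entity_from_span_in_doc(doc, 12, 20)
ent.context_seen = True
```

In this toy version the lookup is by character span; how the actual method matches tokens to existing entities is not specified in this description.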
