Hash property or equality check for `InMemoryLookupKB`s #13892

cdjameson · 2025-11-07T22:28:10Z

Wondering if it would be possible to implement consistent hashing or an equality check (i.e. an `__eq__` method) for `InMemoryLookupKB`?

I have a couple of different use-cases. The first is that my KnowledgeBase data flows through a few steps in a few different formats: from an RDF graph, parsed into a spaCy KB, and then into a custom serialization format. I have methods to convert from the KB to the other formats and then check for equality, but there are a couple of contexts where it would be good to check the hash of the KB itself instead of having to convert both. The other is that we have deployed spaCy models (with EntityLinkers as the main result), and we have an API that wants to update the KnowledgeBase of the Language. Currently, I'm just building the KB anew with the data and then replacing it wholesale, but this is very slow. It would be great to first check whether anything has even changed.

For now, I just look at the alias and entity definitions using something like `sorted(set(kb.get_alias_strings())) == sorted(all_aliases_from_my_data)` and `sorted(set(kb.get_entity_strings())) == sorted(all_entities_from_my_data)`. But recently we've started using corpus data to update prior probabilities, so now I also need a pairwise check to see whether `kb.get_prior_prob` has changed, etc. All said, writing this custom check wasn't bad, but I worry that I would still be missing things that could matter, like the KB's Vocab, which itself doesn't have consistent `__eq__` checking.
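To make the check above concrete, here is a minimal sketch of the kind of fingerprint I have in mind. It is duck-typed and only calls `get_entity_strings()`, `get_alias_strings()` and `get_prior_prob(entity, alias)`; the function name `kb_fingerprint`, the `pairs` argument and the rounding to six decimals are my own assumptions, not anything spaCy provides. It does not cover entity vectors, entity frequencies or the Vocab, and without `pairs` it checks every entity/alias combination, which is quadratic.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def kb_fingerprint(
    kb,
    pairs: Optional[Iterable[Tuple[str, str]]] = None,
    ndigits: int = 6,
) -> str:
    """Return a SHA-256 hex digest over entities, aliases and prior probabilities.

    `kb` is anything exposing get_entity_strings(), get_alias_strings() and
    get_prior_prob(entity, alias). `pairs` (hypothetical helper argument) lets
    the caller restrict which (entity, alias) pairs are checked.
    """
    entities = sorted(set(kb.get_entity_strings()))
    aliases = sorted(set(kb.get_alias_strings()))
    if pairs is None:
        # Fallback: every combination. Quadratic, so pass `pairs` for large KBs.
        pairs = ((e, a) for e in entities for a in aliases)

    digest = hashlib.sha256()
    for entity in entities:
        digest.update(b"E\x00" + entity.encode("utf8") + b"\x00")
    for alias in aliases:
        digest.update(b"A\x00" + alias.encode("utf8") + b"\x00")
    for entity, alias in sorted(set(pairs)):
        prob = kb.get_prior_prob(entity, alias)
        if not prob:
            # Skip pairs with no (or zero) prior so unrelated pairs do not matter.
            continue
        # Round so tiny float noise does not change the digest (assumption).
        record = f"P\x00{entity}\x00{alias}\x00{prob:.{ndigits}f}\x00"
        digest.update(record.encode("utf8"))
    return digest.hexdigest()
```

Comparing `kb_fingerprint(old_kb) == kb_fingerprint(new_kb)` would then replace the two `sorted(...)` comparisons plus the pairwise prior check, with the same blind spots as before.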

In light of the second use-case, I suppose a stretch ask would be whether it is possible to add methods that more directly enable diffing logic, like `__contains__`. I haven't tried to implement this for my own software yet, but I assume that it would have the same challenges as the binary check I'm doing: even if I implement checks of the form `ent_id in kb.get_entity_strings() for ent_id in all_entities_from_my_data` and `alias in kb.get_alias_strings() for alias in all_aliases_from_my_data`, that won't be checking all the data that's in the KB.
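For illustration, the membership-style diff described above could look roughly like the sketch below. `KBDiff` and `diff_kb_strings` are hypothetical names of mine, not spaCy APIs; the code only relies on `get_entity_strings()` and `get_alias_strings()`, so it has exactly the limitation mentioned: prior probabilities, vectors and the Vocab are not compared.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class KBDiff:
    """String-level differences between a KB and the incoming data."""

    missing_entities: FrozenSet[str]  # in my data, not in the KB
    extra_entities: FrozenSet[str]    # in the KB, not in my data
    missing_aliases: FrozenSet[str]
    extra_aliases: FrozenSet[str]

    @property
    def is_empty(self) -> bool:
        # True when entity and alias strings match exactly.
        return not (
            self.missing_entities
            or self.extra_entities
            or self.missing_aliases
            or self.extra_aliases
        )


def diff_kb_strings(kb, entities: Iterable[str], aliases: Iterable[str]) -> KBDiff:
    """Compare the KB's entity and alias strings against the expected ones."""
    kb_entities = frozenset(kb.get_entity_strings())
    kb_aliases = frozenset(kb.get_alias_strings())
    wanted_entities = frozenset(entities)
    wanted_aliases = frozenset(aliases)
    return KBDiff(
        missing_entities=wanted_entities - kb_entities,
        extra_entities=kb_entities - wanted_entities,
        missing_aliases=wanted_aliases - kb_aliases,
        extra_aliases=kb_aliases - wanted_aliases,
    )
```

If `diff_kb_strings(kb, all_entities_from_my_data, all_aliases_from_my_data).is_empty` is true, the strings agree, but a rebuild could still be needed if anything else in the KB changed.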

In summary, there are partial workarounds that avoid the complexity of handling every bit of data and every strange edge case, but methods that would fully solve the problem seem reasonable given the type of data that KBs hold.

P.S. I am aware that changing the KB without updating the EntityLinker model (Thinc ML model) itself would cause problems. I've handled that downstream problem in a variety of other ways. I can see how that would be an answer that works for most cases though: "if you're updating the KnowledgeBase, you are just training a whole new Language anyways".


Replies: 0 comments
