Hash property or equality check for InMemoryLookupKBs
#13892
cdjameson
started this conversation in
New Features & Project Ideas
Replies: 0 comments
Wondering if it would be possible to implement consistent hashing or an equality check (i.e. an `__eq__` method) for `InMemoryLookupKB`?

I have a couple of different use-cases. The first is that I have a data flow of the KnowledgeBase data through a few steps in a few different formats: from an RDF graph, parsed into a spaCy KB, and then into a custom serialization format. I have methods to convert from the KB to the other formats to then check for equality, but there are a couple of contexts where it would be good to check the hash of the KB itself instead of having to convert both. The other is that we have deployed spaCy models (including EntityLinkers as the main result), and we have an API that wants to update the KnowledgeBase of the Language. Currently, I'm just building the KB anew with the data and then setting that wholesale, but this is very slow. It would be great to first check whether anything has even changed.
For now, I just look at the alias and entity definitions using something like `sorted(set(kb.get_alias_strings())) == sorted(all_aliases_from_my_data)` and `sorted(set(kb.get_entity_strings())) == sorted(all_entities_from_my_data)`. But recently we've started using corpus data to update prior probabilities, so now I also need to check pairwise whether `kb.get_prior_prob` has changed, etc. All said, writing this custom check wasn't bad, but I worry that I would still be missing things that could matter, like the KB's `Vocab`, which itself doesn't have consistent `__eq__` checking.

In light of the second use-case, I suppose a stretch ask would be whether it is possible to add methods that more directly enable diffing logic, like `__contains__`. I haven't tried to implement this for my own software yet, but I assume that it would have the same challenges as the binary check I'm doing; even if I implement `ent_id in kb.get_entity_strings() for ent_id in all_entities_from_my_data` and `alias in kb.get_alias_strings() for alias in all_aliases_from_my_data` type of checks, that won't be checking all the data that's in the KB.

In summary, there are partial workarounds that don't require a lot of craziness to handle every bit of data and strange edge cases, but some methods that would fully solve the problem seem reasonable given the type of data that KBs are holding.
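To make the partial workaround concrete, here is a rough sketch of the kind of check described above. It only uses the three getters already mentioned (`get_entity_strings`, `get_alias_strings`, `get_prior_prob`). The helper names `kb_snapshot`, `kb_fingerprint` and `kb_diff` are hypothetical and not part of spaCy. It assumes `get_prior_prob(entity, alias)` returns 0 for unlinked pairs, and it deliberately does not cover entity vectors, entity frequencies or the `Vocab`, so it is not a full equality check.

```python
import hashlib
import json


def kb_snapshot(kb):
    """Collect what is reachable through the string-based getters.

    Assumes `kb` exposes get_entity_strings(), get_alias_strings() and
    get_prior_prob(entity, alias). Entity vectors, frequencies and the
    Vocab are NOT included, so two KBs can still differ there.
    """
    entities = sorted(set(kb.get_entity_strings()))
    aliases = sorted(set(kb.get_alias_strings()))
    priors = {}
    # Pairwise check: O(len(aliases) * len(entities)), slow for large KBs.
    for alias in aliases:
        for entity in entities:
            prob = float(kb.get_prior_prob(entity, alias))
            if prob > 0.0:
                # Round so tiny float noise does not count as a change.
                priors[(alias, entity)] = round(prob, 6)
    return {"entities": entities, "aliases": aliases, "priors": priors}


def kb_fingerprint(kb):
    """Order-independent SHA-256 hex digest of the snapshot."""
    snap = kb_snapshot(kb)
    payload = json.dumps(
        {
            "entities": snap["entities"],
            "aliases": snap["aliases"],
            "priors": sorted([a, e, p] for (a, e), p in snap["priors"].items()),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def kb_diff(old_kb, new_kb):
    """Report which entities, aliases and (alias, entity) priors differ."""
    old, new = kb_snapshot(old_kb), kb_snapshot(new_kb)
    changed = sorted(
        key
        for key in set(old["priors"]) | set(new["priors"])
        if old["priors"].get(key) != new["priors"].get(key)
    )
    return {
        "entities_added": sorted(set(new["entities"]) - set(old["entities"])),
        "entities_removed": sorted(set(old["entities"]) - set(new["entities"])),
        "aliases_added": sorted(set(new["aliases"]) - set(old["aliases"])),
        "aliases_removed": sorted(set(old["aliases"]) - set(new["aliases"])),
        "priors_changed": changed,
    }
```

With something like this, `kb_fingerprint(deployed_kb) == kb_fingerprint(candidate_kb)` could gate the slow rebuild, and `kb_diff` could drive an incremental update. The gaps listed above are exactly why a built-in `__eq__` or hash would be preferable.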
P.S. I am aware that changing the KB without updating the EntityLinker `model` (Thinc ML model) itself would cause problems. I've handled that downstream problem in a variety of other ways. I can see how that would be an answer that works for most cases though: "if you're updating the KnowledgeBase, you are just training a whole new Language anyways".