Description
Problem
After a merge, a merged entity has multiple names in its occurrence pool (e.g. "Meridian Technologies" and "Meridian Tech"), but only the most-frequent canonical name is stored on Entity.name and embedded. The other names are silently dropped.
In prepare_embeddings, only the canonical name is embedded:

    all_names = sorted({e.name for g in graphs for e in g.entities.values()})

In propagate, similarity is initialised from that single name:

    name_a = graph_a.entities[eid_a].name
    emb_a = name_embeddings.get(name_a)

This means that if an already-merged entity's canonical name is not the closest match to a name in another graph, similarity is under-estimated.
Correct behaviour
For entities with multiple known names, the initial similarity between two entities should be:
max(cosine_sim(emb_a_i, emb_b_j) for emb_a_i in all_embs_a for emb_b_j in all_embs_b)
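A minimal sketch of that rule, assuming name_embeddings is a plain dict from name to vector (as the `.get` call above suggests). The helper names `cosine_sim` and `initial_similarity`, and the 0.0 fallback when no embedding is found, are illustrative assumptions rather than the existing code:

```python
import math
from itertools import product


def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def initial_similarity(names_a, names_b, name_embeddings):
    """Max pairwise cosine similarity over all known names of two entities.

    Names without an embedding are skipped. Returning 0.0 when either side
    has no embedded name is an assumption made for this sketch.
    """
    embs_a = [name_embeddings[n] for n in names_a if n in name_embeddings]
    embs_b = [name_embeddings[n] for n in names_b if n in name_embeddings]
    if not embs_a or not embs_b:
        return 0.0
    return max(cosine_sim(a, b) for a, b in product(embs_a, embs_b))
```

With single-name entities this reduces to the current behaviour, since the max is taken over exactly one pair.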
Fix
- Add a names: set[str] field to Entity (alongside the canonical name).
- Populate it from all occurrence names at load/merge time.
- In prepare_embeddings, embed all names (not just canonical ones).
- In propagate, compute initial sigma as max pairwise similarity across all name embeddings for each entity pair.
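The first three steps could look roughly like the sketch below. The `Entity` and `Graph` classes here are stripped-down stand-ins, not the real definitions, and `collect_all_names` is a hypothetical helper showing how the `all_names` expression in prepare_embeddings might change:

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str                                      # canonical (most frequent) name
    names: set[str] = field(default_factory=set)   # every known surface form

    def __post_init__(self):
        # Copy the input set and make sure the canonical name is included.
        self.names = set(self.names) | {self.name}


@dataclass
class Graph:
    entities: dict[str, Entity] = field(default_factory=dict)


def collect_all_names(graphs):
    """Every name of every entity, deduplicated and sorted, ready to embed."""
    return sorted({n for g in graphs for e in g.entities.values() for n in e.names})
```

Keeping the canonical name inside `names` means single-name entities need no special case: their set has one element and the embedding step sees the same input as before.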
When this matters
The current pipeline runs one pass of matching on the original per-article graphs (which are always single-source, so single-name). The bug only bites if merged graphs are fed back into a second matching pass (iterative refinement). It's latent today but will silently degrade quality if iterative matching is added.