Replies: 2 comments
-
|
FWIW, here are GPT's thoughts on all this: Understanding Sequential Non-Overlapping GitHub Artifact IDsThis pattern makes sense given how GitHub IDs behaved circa 2022–2023. What’s Likely Happening
Quick Validations to Run-- 1) Identify ID “families” by prefix/length
SELECT
to_artifact_source_id,
LENGTH(to_artifact_source_id) AS len,
SUBSTR(to_artifact_source_id,1,8) AS prefix,
MIN(time) AS first_seen,
MAX(time) AS last_seen,
COUNT(*) AS n_events
FROM int_events__github
WHERE to_artifact_id = '<artifact_id>'
GROUP BY 1,2,3
ORDER BY first_seen-- 2) Check correlation with event producers
SELECT
DATE_TRUNC('day',time) AS d,
ANY_VALUE(event_source_id) IGNORE NULLS,
COUNT(*) AS n,
APPROX_DISTINCT(to_artifact_source_id) AS n_ids
FROM int_events__github
WHERE to_artifact_namespace='<owner>' AND to_artifact_name='<repo>'
GROUP BY 1
ORDER BY 1You’ll typically see two or more ID families (e.g., old base64 vs new Recommended Remediation
Helper Function for Alias Rangesdef build_node_id_alias_ranges(df):
"""
df: columns ['repo_url','to_artifact_source_id','time']
returns: alias ranges per node_id
"""
d = df.copy()
d['time'] = pd.to_datetime(d['time'])
g = (
d.sort_values('time')
.groupby(['repo_url','to_artifact_source_id'])
.agg(valid_from=('time','min'),
valid_to=('time','max'),
n=('time','size'))
.reset_index()
.sort_values(['repo_url','valid_from'])
)
return gBottom LineYour observation is consistent with GitHub ID migration artifacts leaking into event streams. |
Beta Was this translation helpful? Give feedback.
-
|
Discussion from 2025-10-28: So one part of this problem is that we our |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
In March 2025, we made an important change to standardize how we arrive at an
artifact_idacross different sources. Previously, in the case of GitHub data -- and only in the case of GitHub data -- we included theartifact_source_idin the hashing function. This id mapped to thenode_idfield in gharchive data.Back then, the primary implication is that you would NOT be able to derive the OSO
artifact_idfrom a GitHub URL. You would need to ping the GitHub API or do some historic lookup to get thenode_id, and then derive your OSOartifact_id. As result, it was difficult to join different registries of GitHub URLs (eg, OSSD and OP Atlas or Crypto Ecosystems).@ravenac95 asked about this again in October 2025, and here was my response in Discord:
artifacts_by_project_v1for https://github.com/opensource-observer/oso, you should also get https://github.com/hypercerts-org/oso as an artifact that we own.Yesterday, I started doing something along these lines to track repo lineage / renaming events and noticed something even stranger (see notebook here if you want to play around).
Here's a good example:
There are 10 distinct
artifact_source_idfor the same repo over this two month period.If we take each
artifact_source_idand its corresponding minimum and maximum timestamps, we can see that all intervals appear sequential and non-overlapping (each min_time begins after the previous max_time ends).GitHub does NOT have different repo ids for branches of the same repo.
However, I did find a discussion about GitHub at one point (around this time) about this very issue:
I'd have to do more digging to see how pervasive this issue is. In the meantime, it seems like moving to the URL-based method of deriving an OSO
artifact_idwas definitely the right call!Beta Was this translation helpful? Give feedback.
All reactions