[ENG-9831] Cedar metadata record creation/deletion in share for collection submission#11710

Open
ihorsokhanexoft wants to merge 9 commits into CenterForOpenScience:feature/es2-consolidation from ihorsokhanexoft:feature/ENG-9831

Conversation


@ihorsokhanexoft ihorsokhanexoft commented Apr 23, 2026

@adlius adlius requested a review from aaxelb April 29, 2026 17:31

@aaxelb aaxelb left a comment


mainly: tests

also, suggestions

Comment thread api/share/utils.py Outdated
shtrove_ingest_url(),
params={
    'focus_iri': iri,
    'record_identifier': f"CedarMetadataRecord:{cedar_record.guid._id}:{cedar_record.template.cedar_id}",

since it's important for the record_identifier to match when created/deleted, best move this to a function like _shtrove_cedar_record_identifier(...) (maybe just below the existing _shtrove_record_identifier) for use both places

also, tho i know i suggested this format in the first place, on reflection would prefer something like f'{cedar_record.guid._id}/CedarMetadataRecord:{cedar_record.template.cedar_id}' (to parallel the other supplementary record_identifiers that start with osfid/)
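A minimal sketch of the suggested helper, assuming the attribute access shown in the snippet above (`guid._id`, `template.cedar_id`); the `SimpleNamespace` objects below stand in for a real `CedarMetadataRecord` and are for illustration only:

```python
from types import SimpleNamespace

def _shtrove_cedar_record_identifier(cedar_record) -> str:
    # single source of truth for the identifier, so create and delete
    # always agree; parallels other supplementary identifiers ("osfid/...")
    return f'{cedar_record.guid._id}/CedarMetadataRecord:{cedar_record.template.cedar_id}'

# stand-in objects for illustration only
_record = SimpleNamespace(
    guid=SimpleNamespace(_id='abc12'),
    template=SimpleNamespace(cedar_id='https://repo.metadatacenter.org/templates/xyz'),
)
identifier = _shtrove_cedar_record_identifier(_record)
```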

Comment thread api/share/utils.py Outdated
Comment on lines +79 to +87
graph = Graph()
iri = referent.get_semantic_iri()
full_metadata = {
'@id': iri,
OSF.hasCedarRecord: cedar_record.metadata,
}
graph.parse(data=json.dumps(full_metadata), format='json-ld')

serialized_data = graph.serialize(format='turtle')

these lines would make sense as a separate function with its own tests

Comment thread api/share/utils.py Outdated
cedar_record = CedarMetadataRecord.objects.filter(pk=cedar_record_pk).first()
if not cedar_record:
return


if a CedarMetadataRecord is deleted, will this stop before sending the DELETE to shtrove? if a deleted record isn't still in the database, might be better for this task's args to take the cedar template id instead of the record id, so can build the record_identifier without having to load the cedar record

tests would help make sure, either way

Comment thread osf/models/provider.py
Comment thread osf/models/collection_submission.py Outdated
Comment on lines +469 to +479
for cedar_record in self.guid.cedar_metadata_records.filter(
    is_published=True,
    template__should_index_for_search=True,
):
    enqueue_task(share_update_cedar_metadata_record.s(self.guid._id, cedar_record.pk))

for cedar_record in self.guid.cedar_metadata_records.filter(
    models.Q(is_published=False) | models.Q(template__should_index_for_search=False)
):
    enqueue_task(share_delete_cedar_metadata_record.s(self.guid._id, cedar_record.pk))


enqueueing tasks from the model here seems brittle... there are other times when these would probably be expected to get indexed for search (e.g. management commands or osf-admin buttons reindexing the object described by the cedar record)

would be more reliable to move this to a followup task after task__update_share completes -- perhaps enqueued here, when _next_partition is None: https://github.com/ihorsokhanexoft/osf.io/blob/06ef8687d1cb004c7eff54c7434e0e26006ef276/api/share/utils.py#L176

(i understand wanting to avoid redundant updates, and the whole update_share chain could be optimized to better avoid redundant updates, but it's not terribly expensive and better to send too often than not enough)

Comment thread api/share/utils.py Outdated
enqueue_task(task__update_share.s(_osfguid_value))


@celery_app.task()

these new tasks should both have similar retry behavior to task__update_share when the request fails -- could move this block to a reusable function https://github.com/ihorsokhanexoft/osf.io/blob/06ef8687d1cb004c7eff54c7434e0e26006ef276/api/share/utils.py#L152-L171
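The actual retry mechanics live at the linked lines of `task__update_share`; as an illustrative-only sketch of the kind of decision logic a reusable helper could centralize (the policy and numbers here are assumptions, not the existing behavior):

```python
def should_retry_share_response(status_code: int, retries_so_far: int, max_retries: int = 5) -> bool:
    """Decide whether a failed shtrove request is worth retrying.

    Illustrative policy only: retry rate limits and server errors,
    give up on client errors and after max_retries attempts.
    """
    if retries_so_far >= max_retries:
        return False
    return status_code == 429 or status_code >= 500
```

Each task would then call the shared helper and invoke its own `self.retry(...)` when the helper says to, keeping the backoff policy in one place.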

@ihorsokhanexoft ihorsokhanexoft requested a review from aaxelb May 5, 2026 12:18

@aaxelb aaxelb left a comment


looking better! two small requests

appreciate the testing effort (tho it could have been easier)

# fetch urls from result and assign ns prefixes based on order of appearance in result to make test resilient to changes in order of namespace declaration in turtle output
for url in urls_to_find.keys():
    urls_to_find[url] = result[result.index(url) - 3]

ohh yeah i forgot this disappointing part of rdflib, the unstable turtle serializer -- but instead of complicating test logic like this (since namespace prefixes and object ordering don't really matter) i've found it better to assert on the graph parsed from the given turtle

maybe move this _assert_equivalent_turtle method to reusable assert_equivalent_turtle (with passthru label kwarg instead of filename) in osf_tests/metadata/_utils.py and import from there, then write this test with a simpler "expected" block string (without having to match namespaces by hand)

Comment thread api/share/utils.py
Comment on lines +193 to +208
for cedar_record in _osfid_instance.cedar_metadata_records.filter(
    is_published=True,
    template__should_index_for_search=True,
):
    enqueue_task(share_update_cedar_metadata_record.s(_osfid_instance._id, cedar_record.pk))

for cedar_record in _osfid_instance.cedar_metadata_records.filter(
    Q(is_published=False) | Q(template__should_index_for_search=False),
):
    enqueue_task(
        share_delete_cedar_metadata_record.s(
            cedar_record.guid._id,
            cedar_record._id,
            cedar_record.template.cedar_id,
        ),
    )

these'll be run multiple times, for each osfmap partition -- probably want them in an else block, run when _next_partition is None (after the last osfmap partition has been sent)

also for cleanliness sake, would make sense in a separate function

_next_partition = _next_osfmap_partition(_osfmap_partition)
if _next_partition is not None:
   ...
else:  # schedule non-osfmap supplements
   _schedule_cedar_record_updates(_osfid_instance)
