[ENG-9831] Cedar metadata record creation/deletion in share for collection submission #11710
Conversation
aaxelb
left a comment
mainly: tests
also, suggestions
```python
shtrove_ingest_url(),
params={
    'focus_iri': iri,
    'record_identifier': f"CedarMetadataRecord:{cedar_record.guid._id}:{cedar_record.template.cedar_id}",
```
since it's important for the record_identifier to match when created/deleted, it'd be best to move this to a function like `_shtrove_cedar_record_identifier(...)` (maybe just below the existing `_shtrove_record_identifier`) for use in both places
also, though i know i suggested this format in the first place, on reflection i'd prefer something like `f'{cedar_record.guid._id}/CedarMetadataRecord:{cedar_record.template.cedar_id}'` (to parallel the other supplementary record_identifiers that start with `osfid/`)
```python
graph = Graph()
iri = referent.get_semantic_iri()
full_metadata = {
    '@id': iri,
    OSF.hasCedarRecord: cedar_record.metadata,
}
graph.parse(data=full_metadata, format='json-ld')

serialized_data = graph.serialize(format='turtle')
```
these lines would make sense as a separate function with its own tests
```python
cedar_record = CedarMetadataRecord.objects.filter(pk=cedar_record_pk).first()
if not cedar_record:
    return
```
if a CedarMetadataRecord is deleted, will this stop before sending the DELETE to shtrove? if a deleted record isn't still in the database, it might be better for this task's args to take the cedar template id instead of the record id, so the record_identifier can be built without having to load the cedar record
tests would help make sure, either way
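a sketch of that idea (names and signature are hypothetical, with the request-sending step injected so the shape is testable): when the task's args carry the guid and the template id, there's no database lookup and no early `return` for an already-deleted record:

```python
def share_delete_cedar_metadata_record(osfguid_value, cedar_template_id, send_delete):
    # hypothetical task body: everything the DELETE needs is derivable from
    # the two ids, so deleting a record that's already gone from the
    # database still notifies shtrove
    record_identifier = f'{osfguid_value}/CedarMetadataRecord:{cedar_template_id}'
    send_delete(record_identifier=record_identifier)
```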
```python
for cedar_record in self.guid.cedar_metadata_records.filter(
    is_published=True,
    template__should_index_for_search=True
):
    enqueue_task(share_update_cedar_metadata_record.s(self.guid._id, cedar_record.pk))

for cedar_record in self.guid.cedar_metadata_records.filter(
    models.Q(is_published=False) | models.Q(template__should_index_for_search=True)
):
    enqueue_task(share_delete_cedar_metadata_record.s(self.guid._id, cedar_record.pk))
```
enqueueing tasks from the model here seems brittle... there are other times when these would probably be expected to get indexed for search (e.g. management commands or osf-admin buttons reindexing the object described by the cedar record)
would be more reliable to move this to a followup task after task__update_share completes -- perhaps enqueued here, when _next_partition is None: https://github.com/ihorsokhanexoft/osf.io/blob/06ef8687d1cb004c7eff54c7434e0e26006ef276/api/share/utils.py#L176
(i understand wanting to avoid redundant updates, and the whole update_share chain could be optimized to better avoid redundant updates, but it's not terribly expensive and better to send too often than not enough)
```python
enqueue_task(task__update_share.s(_osfguid_value))


@celery_app.task()
```
these new tasks should both have similar retry behavior to task__update_share when the request fails -- could move this block to a reusable function https://github.com/ihorsokhanexoft/osf.io/blob/06ef8687d1cb004c7eff54c7434e0e26006ef276/api/share/utils.py#L152-L171
aaxelb
left a comment
looking better! two small requests
appreciate the testing effort (though it could have been easier)
```python
for url in urls_to_find.keys():
    urls_to_find[url] = result[result.index(url) - 3]

# fetch urls from result and assign ns prefixes based on order of appearance in result
# to make test resilient to changes in order of namespace declaration in turtle output
```
ohh yeah, i forgot this disappointing part of rdflib, the unstable turtle serializer -- but instead of complicating test logic like this (since namespace prefixes and object ordering don't really matter), i've found it better to assert on the graph parsed from the given turtle
maybe move this `_assert_equivalent_turtle` method to a reusable `assert_equivalent_turtle` (with a passthru label kwarg instead of filename) in osf_tests/metadata/_utils.py and import it from there, then write this test with a simpler "expected" block string (without having to match namespaces by hand)
```python
for cedar_record in _osfid_instance.cedar_metadata_records.filter(
    is_published=True,
    template__should_index_for_search=True,
):
    enqueue_task(share_update_cedar_metadata_record.s(_osfid_instance._id, cedar_record.pk))

for cedar_record in _osfid_instance.cedar_metadata_records.filter(
    Q(is_published=False) | Q(template__should_index_for_search=False),
):
    enqueue_task(
        share_delete_cedar_metadata_record.s(
            cedar_record.guid._id,
            cedar_record._id,
            cedar_record.template.cedar_id,
        ),
    )
```
these'll be run multiple times, for each osfmap partition -- probably want them in an else block, run when _next_partition is None (after the last osfmap partition has been sent)
also for cleanliness sake, would make sense in a separate function
```python
_next_partition = _next_osfmap_partition(_osfmap_partition)
if _next_partition is not None:
    ...
else:  # schedule non-osfmap supplements
    _schedule_cedar_record_updates(_osfid_instance)
```
https://openscience.atlassian.net/browse/ENG-9831