[CEP 21] Run-exports in sharded Repodata #108
Merged: wolfv merged 6 commits into conda:main from baszalmstra:run_exports_in_sharded_repodata on Mar 20, 2025
Changes from 2 commits
Commits (6):
e4599bf run_exports in sharded repodata (baszalmstra)
5b491b1 run export patching (baszalmstra)
97a7de5 Update cep-0016-2.md (baszalmstra)
0231214 Update cep-0016-2.md (wolfv)
d1de8d9 finalize cep (baszalmstra)
f646217 small style fix (baszalmstra)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| <table> | ||
| <tr><td> Title </td><td> Run-exports in sharded Repodata. </td></tr> | ||
| <tr><td> Status </td><td> Draft </td></tr> | ||
| <tr><td> Author(s) </td><td> Bas Zalmstra <[email protected]></td></tr> | ||
| <tr><td> Created </td><td> Jan 16, 2025</td></tr> | ||
| <tr><td> Updated </td><td> Jan 16, 2025</td></tr> | ||
| <tr><td> Discussion </td><td> </td></tr> | ||
| <tr><td> Implementation </td><td> </td></tr> | ||
| </table> | ||
|
|
||
| # Run-exports in sharded Repodata | ||
|
|
||
| We propose to add run-export information to the sharded repodata shards. | ||
|
|
||
| ## Motivation | ||
|
|
||
| When building conda packages the build infrastructure needs to extract run-export information from conda packages in the host- and build environments. | ||
| Run-export information is stored in a package and can be extracted by downloading the package and extracting the `run_exports.json` file. | ||
| Even with the ability to stream parts of `.conda` files, this is a relatively resource-intensive operation. | ||
|
|
||
| [CEP-12](https://github.com/conda/ceps/blob/main/cep-0012.md) formalized a `run_exports.json` file that is stored next to the `repodata.json` file. | ||
| However, not all channels on the default server (conda.anaconda.org) provide this information, which means falling back to downloading the packages and extracting the information from them. It is possible to extract the data by reading the file only sparsely, but the overhead is still relatively large. | ||
|
|
||
| Having two separate files also poses some problems as extra mechanisms have to be introduced in the build infrastructure to manage and sync both files on the build machines. | ||
|
|
||
| ## Specification | ||
|
|
||
| CEP-12 mentions the following reasons for splitting the information into two files: | ||
|
|
||
| > * It would require extending the repodata schema, currently not formally standardized. | ||
| > * It would increase the size of the already heavy repodata.json files. | ||
| > * (Typed) repodata parsers would need to be updated to handle the new field. | ||
| We propose that these reasons no longer hold with [sharded repodata](https://github.com/conda/ceps/blob/main/cep-0016.md). | ||
|
|
||
| **It would require extending the repodata schema, currently not formally standardized.** | ||
|
|
||
| We propose to add a `run_export` field to each record that mimics the specification from CEP-12. | ||
|
|
||
| If the `run_export` field is not present in the record it means no `run_export` information is stored with the record, and a fallback mechanism should be used to acquire the run-export information. | ||
|
|
||
| Since *adding* a field will not break existing parsers we feel this is safe and does not require a schema change. | ||
|
|
||
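A minimal sketch of how a build tool might consume the field, assuming records are plain dictionaries. The key name `run_exports`, the helper `run_exports_for`, and the `fallback` callable are hypothetical; the text above spells the field `run_export`, while the patch example in this document uses `run_exports`. The point it illustrates is that an absent field means "unknown", while a present but empty value means "known to have no run exports".

```python
from typing import Any, Callable, Dict

# Assumption: the record key is spelled "run_exports", as in CEP-12.
RUN_EXPORTS_KEY = "run_exports"

Record = Dict[str, Any]


def run_exports_for(record: Record, fallback: Callable[[Record], dict]) -> dict:
    """Return the run exports for a record, using the fallback only when unknown."""
    if RUN_EXPORTS_KEY in record:
        # Present (even if empty): the repodata is authoritative.
        return record[RUN_EXPORTS_KEY]
    # Absent: the state is unknown, so acquire it some other way,
    # for example from run_exports.json or from the package itself.
    return fallback(record)
```

With this shape, a record carrying `"run_exports": {}` never triggers a package download, which is the case the fallback is meant to avoid.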
| **It would increase the size of the already heavy repodata.json files.** | ||
|
|
||
| The size of the shards would grow if run-exports are added. | ||
|
|
||
| Let's take a look at the current sizes of run_exports.json and repodata.json files. | ||
|
|
||
| | channel + subdir | size of repodata.json | size of run_exports.json | size of repodata.json.zst | size of run_exports.json.zst | | ||
| |------------------|------------------------|--------------------------|--------------------------|-----------------------------| | ||
| | conda-forge + linux-64 | 254 MB | 34.7 MB (11%) | 38.4 MB | 2.2 MB (5%) | | ||
| | conda-forge + noarch | 107 MB | 13.6 MB (11%) | 16.7 MB | 0.9 MB (5%) | | ||
| | conda-forge + osx-arm64 | 99.8 MB | 11.8 MB (11%) | 12.6 MB | 0.8 MB (6%) | | ||
| | conda-forge + win-64 | 185 MB | 22.1 MB (11%) | 24.7 MB | 1.4 MB (5%) | | ||
|
|
||
| Since the repodata shards are also compressed, we can conclude that in practice adding run-export information would increase the size of the repodata shards by roughly 5-6%. | ||
|
|
||
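The compressed-size estimate can be checked with a few lines of arithmetic. This sketch copies the two `.zst` columns from the table above and, like the text, treats the ratio of the standalone compressed files as a proxy for shard growth; the variable and function names are illustrative only.

```python
# (repodata.json.zst, run_exports.json.zst) sizes in MB, copied from the table.
compressed_sizes_mb = {
    "conda-forge + linux-64": (38.4, 2.2),
    "conda-forge + noarch": (16.7, 0.9),
    "conda-forge + osx-arm64": (12.6, 0.8),
    "conda-forge + win-64": (24.7, 1.4),
}


def relative_growth_percent(repodata_mb: float, run_exports_mb: float) -> float:
    """Size of the run-export data relative to the repodata, in percent."""
    return 100.0 * run_exports_mb / repodata_mb


growth = {
    subdir: round(relative_growth_percent(*sizes), 1)
    for subdir, sizes in compressed_sizes_mb.items()
}

for subdir, percent in growth.items():
    print(f"{subdir}: +{percent}%")
```

Every subdir lands between 5% and 6.5%, which matches the rounded figures in the table.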
| With the introduction of sharded repodata in [CEP-16](https://github.com/conda/ceps/blob/main/cep-0016.md) the issues with size (and scale) have been effectively mitigated. Adding 5-6% to the total size of the shards will not pose a risk since all advantages of sharded repodata mentioned in the CEP still hold. | ||
|
|
||
| **(Typed) repodata parsers would need to be updated to handle the new field.** | ||
|
|
||
| Unless parsers are very strict about unknown fields (which was not a requirement for sharded repodata shards), this will not pose an issue. Since the absence of the field means that the state of the run-exports is unknown, adding the field does not require a schema change. | ||
|
|
||
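To illustrate why a tolerant typed parser is unaffected, here is a small sketch. The `Record` dataclass and `parse_record` are hypothetical and far smaller than a real repodata record; the sketch only assumes that the parser drops keys it does not know about.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List


@dataclass
class Record:
    # A deliberately tiny, pre-existing record type with no run-export field.
    name: str
    version: str
    depends: List[str]


def parse_record(raw: Dict[str, Any]) -> Record:
    """Build a Record from raw data, ignoring keys the type does not declare."""
    known = {f.name for f in fields(Record)}
    return Record(**{key: value for key, value in raw.items() if key in known})
```

A record that additionally carries run-export data parses to the same `Record` as before; only a parser that rejects unknown keys would need to change.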
| ## Patching | ||
|
|
||
| With the run-exports part of the repodata, we propose to also allow repodata patching of these fields. The original run-exports can still be extracted from the packages. We see no reason why run-export information should be treated differently from other patchable information stored in the repodata (like the dependencies). | ||
|
|
||
| To facilitate these patches a new file is added to the repodata-patches package called `run_exports_patch_instructions.json` so that `run_exports` can be modified per package: | ||
|
|
||
| ```json | ||
| { | ||
| "packages": { | ||
| "_libarchive_static_for_cph-3.3.3-h0921cf1_1.tar.bz2": { | ||
| "run_exports": { | ||
| "weak": ["libarchive >=3,<4"] | ||
| } | ||
| }, | ||
| ... | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
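A rough sketch of how such a patch file could be applied to records keyed by package filename. The function name is hypothetical, and the sketch assumes that a patch replaces the record's `run_exports` value wholesale; the text does not say whether patches replace or merge.

```python
import copy
from typing import Any, Dict

RUN_EXPORTS_KEY = "run_exports"


def apply_run_exports_patches(
    records: Dict[str, Dict[str, Any]],
    patch_instructions: Dict[str, Any],
) -> Dict[str, Dict[str, Any]]:
    """Return a patched copy of `records`; the input is left untouched."""
    patched = copy.deepcopy(records)
    for filename, changes in patch_instructions.get("packages", {}).items():
        # Skip patches for packages that are not in this set of records.
        if filename in patched and RUN_EXPORTS_KEY in changes:
            # Assumption: the patched value replaces the original entirely.
            patched[filename][RUN_EXPORTS_KEY] = copy.deepcopy(changes[RUN_EXPORTS_KEY])
    return patched
```

Returning a copy keeps the unpatched run-exports available, in line with the statement that the original values can still be extracted from the packages.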
|
||
| For backwards compatibility reasons we propose *not* to add the patches to the existing `patch_instructions.json` file, because older implementations do not have the ability to filter certain instructions. Without that ability the `run_exports` patches would end up in the `repodata.json` file, which is undesirable. | ||
|
|
||
| For a channel to provide run-export patches, it MUST contain a `run_exports.json` file. | ||
| For build tools to support patched run exports, they MUST always query the sharded repodata, or use a `run_exports.json` file as the source of truth for the run exports. | ||
| If neither sharded repodata nor a `run_exports.json` file is present, build tools can assume no run export patches exist. | ||
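The three rules above can be sketched as a small decision function. The enum and function names are hypothetical, and preferring sharded repodata over `run_exports.json` when both exist is an assumption; the text allows either as the source of truth.

```python
from enum import Enum


class RunExportsSource(Enum):
    SHARDED_REPODATA = "sharded repodata"
    RUN_EXPORTS_JSON = "run_exports.json"
    PACKAGE = "package contents (no patches exist)"


def pick_run_exports_source(
    has_sharded_repodata: bool,
    has_run_exports_json: bool,
) -> RunExportsSource:
    """Choose where a build tool reads (possibly patched) run exports from."""
    if has_sharded_repodata:
        # Assumption: shards are preferred when both sources are available.
        return RunExportsSource.SHARDED_REPODATA
    if has_run_exports_json:
        return RunExportsSource.RUN_EXPORTS_JSON
    # Neither is present: no patches can exist, so the unpatched
    # run exports stored in the package are sufficient.
    return RunExportsSource.PACKAGE
```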
Replicating the `run_export` field in the repodata would result in a lot of `"run_exports": {}` entries. These will compress well, but would it be more efficient to store a top-level field that indicates that records have an empty `run_exports` unless they are explicitly declared? This would add complexity at the benefit of smaller(?) repodata.
We think that for shards, using `msgpack + zst`, it's fine.
Plus the shards have much improved caching behavior (content-addressed) so that total download will always be much much lower vs. the current situation.