-
Notifications
You must be signed in to change notification settings - Fork 421
fix: correct repodata_record.json metadata for explicit installs (#4095) #4110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add comprehensive tests verifying that PackageInfo::from_url() populates the defaulted_keys field to mark which fields have stub/default values. Tests added for: - Conda packages (.conda, .tar.bz2): Standard stub fields - Wheel packages (.whl): Standard + build/build_string - TarGz packages (.tar.gz): Same as wheels - Git URLs with #egg=: Most fields defaulted (minimal info from URL) - Git URLs WITHOUT #egg=: Also includes 'name' in defaulted_keys Each test verifies "_initialized" sentinel presence (required for all paths) plus appropriate stub field markers. Test Status: ALL TESTS FAIL (expected RED phase) - test cases: 1 | 1 failed - assertions: 6 | 6 failed (all for missing _initialized sentinel) Related: mamba-org#4095
…REEN)
Populate the defaulted_keys field in PackageInfo creation paths to mark
which fields have stub/default values. This enables write_repodata_record()
to erase these stubs before merging with index.json.
The "_initialized" sentinel is placed first in defaulted_keys for all paths.
This enables fail-hard verification: write_repodata_record() will throw if
_initialized is missing, catching any code path that fails to set defaulted_keys.
Changes in from_url() (package_info.cpp):
- Conda (.conda, .tar.bz2): _initialized + build_number, license, timestamp,
track_features, depends, constrains
- Wheel (.whl): Above + build, build_string (not in wheel filename)
- TarGz (.tar.gz): Same as wheels
- Git URLs: _initialized + version, channel, subdir, fn, build, build_string,
build_number, license, timestamp, track_features, depends, constrains
- Git URLs WITHOUT #egg=: Also includes 'name' (cannot be derived from URL)
Changes in make_package_info() (helpers.cpp):
- Set defaulted_keys = {"_initialized"} for solver-derived packages
- Solver packages have authoritative metadata from channel repodata
- Only _initialized signals "trust all fields" (including patches)
Updated header comment (package_info.hpp):
- Removed outdated WARNING about make_package_info not setting defaulted_keys
- New comment accurately describes the current behavior
The asymmetry is intentional:
- URL-derived: list stub fields to erase
- Solver-derived: trust all fields, erase nothing
Test Status: ALL TESTS PASS (GREEN phase)
- test cases: 3 | 3 passed
- assertions: 164 | 164 passed
Tests fixed:
- PackageInfo::from_url populates defaulted_keys (all 6 sections)
Related: mamba-org#4095
…on (RED)
Add tests demonstrating bugs in write_repodata_record():
C++ Tests (test_package_fetcher.cpp):
1. URL-derived metadata test:
- Creates PackageInfo from URL (has stub defaults)
- Creates tarball with correct index.json values
- EXPECTS: repodata_record.json has correct values from index.json
- Tests: license, timestamp, build_number
2. Channel patch preservation (depends):
- Creates solver-derived PackageInfo with depends=[]
- defaulted_keys = {"_initialized"} (only sentinel = trust all)
- EXPECTS: Empty depends preserved in repodata_record.json
- Verifies channel patches are NOT overwritten by index.json
3. Channel patch preservation (constrains):
- Same as above but for constrains field
Python Integration Tests (test_constructor.py):
1. test_url_derived_metadata_from_index_json
2. test_consistent_field_presence
3. test_track_features_omitted_when_empty
Note: Python tests PASS because constructor.cpp uses index.json as base
when no cached repodata exists. This validates the existing behavior.
Python test improvements:
- Use try-finally in teardown for robust env var restoration
Test Status:
- C++ tests: 3 | 0 passed | 3 failed (RED - demonstrates bugs)
- Python tests: 3 | 3 passed (validates assumptions)
Related: mamba-org#4095
…p (GREEN)
Fix URL-derived package corruption and preserve channel patches by using
the defaulted_keys mechanism in both write_repodata_record() and constructor.cpp.
Implementation in package_fetcher.cpp:
1. FAIL-HARD CHECK: Verify "_initialized" sentinel is present
- Throws std::logic_error if missing
- Catches any code path that fails to set defaulted_keys
- Re-throw std::logic_error from extract() to propagate programming bugs
2. ERASE STUB FIELDS: Remove fields listed in defaulted_keys before merge
- URL-derived: erases build_number, license, timestamp, etc.
- Solver-derived: defaulted_keys = {"_initialized"}, nothing erased
- This preserves channel patches with intentionally empty arrays
3. MERGE WITH INDEX.JSON: insert() fills erased fields
- URL-derived: gets correct values from index.json
- Solver-derived: all fields already present, index.json ignored
Implementation in constructor.cpp:
Same defaulted_keys erasure logic BUT no fail-hard check (intentional asymmetry).
Tests added:
- PackageFetcher::write_repodata_record fails without _initialized
Test Status: ALL TESTS PASS (GREEN phase)
- C++ tests: 4 | 4 passed (21 assertions)
- Python tests: 3 | 3 passed
Tests fixed:
- write_repodata_record uses index.json for URL-derived metadata
- write_repodata_record preserves channel patched empty depends
- write_repodata_record preserves channel patched empty constrains
Related: mamba-org#4095
Add test demonstrating that EXISTING corrupted caches (from v2.1.1-v2.4.0) are NOT detected and healed by current code. The bug: has_valid_extracted_dir() only validates checksums, not metadata correctness. Corrupted caches with stub values (timestamp=0, license="") are considered valid and used as-is. Test scenario: 1. Create package tarball with CORRECT index.json (MIT license, etc.) 2. Create CORRUPTED repodata_record.json in cache (simulating bug) - timestamp=0, license="", build_number=0 3. Create PackageFetcher 4. EXPECT: needs_extract() returns true (corruption detected) 5. ACTUAL: returns false (corruption NOT detected) Healing flow (after fix): 1. has_valid_extracted_dir() detects corruption signature 2. Returns valid=false → needs_extract() returns true 3. Package re-extracted from tarball 4. write_repodata_record() writes correct values Test Status: TEST FAILS (expected RED phase) - test cases: 5 | 4 passed | 1 failed - assertions: 23 | 22 passed | 1 failed - needs_extract() returns false (should return true) Related: mamba-org#4095
Add corruption detection to has_valid_extracted_dir() to heal caches corrupted by v2.1.1-v2.4.0 bug. Corruption signature: timestamp == 0 AND license == "" When detected: 1. Log info message about the corruption detection 2. Return valid=false to invalidate the cache entry 3. This triggers re-extraction from the tarball 4. write_repodata_record() then writes correct values using index.json Why this signature is reliable (low false-positive risk): - timestamp=0 is extremely rare for legitimate packages (build tools set it) - Both conditions must be true (reduces false positives significantly) - license="" alone could be legitimate (some packages have no license) - timestamp=0 alone could theoretically be legitimate - Together they reliably indicate v2.1.1-v2.4.0 corruption Even if a false positive occurs: - Only consequence is unnecessary re-extraction (wasted work) - NOT data corruption - re-extracted package has correct values - This is acceptable given the rarity of the combination Added is_number()/is_string() type checks for defensive programming to prevent potential type_error exceptions if fields have unexpected types. Test Status: ALL TESTS PASS (GREEN phase) - test cases: 5 | 5 passed - assertions: 28 | 28 passed Tests fixed: - PackageFetcher heals existing corrupted cache Related: mamba-org#4095
Add tests verifying repodata_record.json consistency and checksum handling: New Tests: 1. depends/constrains presence test: - Tests that depends and constrains are always present as arrays - Even when index.json lacks these fields (like nlohmann_json-abi) - Matches conda behavior for field consistency - Status: FAILS (depends key not present when missing from index.json) 2. track_features omission test: - Tests that track_features is omitted when empty - Matches conda behavior to reduce JSON noise - Status: PASSES (current behavior already correct) 3. both checksums presence test: - Tests that both md5 and sha256 are always present - Should compute checksums from tarball when not available - Status: FAILS (md5/sha256 not computed when missing) 4. noarch backfill test: - Tests that noarch field is backfilled from index.json - Validates existing merge behavior for conditionally-written fields - Status: PASSES (validates assumption) 5. size from tarball test: - Tests that size=0 is filled from actual tarball file size - Validates existing special-case handling - Status: PASSES (validates assumption) 6. no false positive healing test: - Tests that timestamp=0 with license="MIT" is NOT healed - Verifies corruption signature requires BOTH conditions - Status: PASSES (validates healing specificity) Test Status: PARTIAL FAILURES (expected RED phase) - test cases: 11 | 9 passed | 2 failed - assertions: 53 | 51 passed | 2 failed - depends/constrains: FAIL - checksums: FAIL - Others: PASS (validate assumptions) Related: mamba-org#4095
…GREEN) - Add logic to ensure depends and constrains are always present as arrays - Ensure track_features is omitted when empty (not empty string) - Compute and fill md5/sha256 checksums from tarball when missing - Apply consistent field handling to constructor.cpp This matches conda behavior where: - depends and constrains are always present (even as empty arrays) - track_features is only included when non-empty - Both checksums are always available Test Status: ALL TESTS PASS (GREEN phase) - C++ test cases: 11 | 11 passed - C++ assertions: 61 | 61 passed - Python test cases: 3 | 3 passed Tests fixed: - PackageFetcher::write_repodata_record ensures depends/constrains present - PackageFetcher::write_repodata_record ensures both checksums Related: mamba-org#4095
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4110 +/- ##
==========================================
+ Coverage 63.47% 64.25% +0.77%
==========================================
Files 315 315
Lines 38632 39363 +731
Branches 2966 2991 +25
==========================================
+ Hits 24521 25292 +771
+ Misses 14042 14002 -40
Partials 69 69 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…d_keys (RED) Add tests demonstrating that lockfile-parsed PackageInfo objects are missing the "_initialized" sentinel in defaulted_keys for both: - conda-lock YAML format (env_lockfile_conda.cpp) - mambajs JSON format (env_lockfile_mambajs.cpp) PURPOSE: Verify that PackageInfo objects created from lockfile parsing have the "_initialized" sentinel, which is required for fail-hard verification in write_repodata_record(). MOTIVATION: Issue mamba-org#4095 - after refactoring env_lockfile.cpp into separate files for conda and mambajs formats, both code paths need to properly set defaulted_keys: - Conda format: calls from_url() but wasn't copying defaulted_keys - Mambajs format: builds PackageInfo manually without setting defaulted_keys Test Status: BOTH TESTS FAIL (expected RED phase) - test cases: 2 | 2 failed - assertions: 6 | 4 passed | 2 failed Related: mamba-org#4095
Fix both lockfile parsing implementations to properly set defaulted_keys
with the "_initialized" sentinel required by write_repodata_record().
Changes:
- env_lockfile_conda.cpp: Copy defaulted_keys from from_url() result
- env_lockfile_mambajs.cpp: Set defaulted_keys = {"_initialized"}
(lockfile data is authoritative, so no fields are "defaulted")
This fix is required after the env_lockfile.cpp refactoring that split
the implementation into separate files for conda and mambajs formats.
Test Status: ALL TESTS PASS (GREEN phase)
- lockfile_packages_have_initialized_in_defaulted_keys-conda: PASSED
- lockfile_packages_have_initialized_in_defaulted_keys-mambajs: PASSED
- All existing tests: PASSED
Tests fixed:
- test_lockfile[condalock-True] (Python integration test that was failing in CI)
Related: mamba-org#4095
82b18da to
da17b04
Compare
Add comprehensive tests verifying that defaulted_keys can be stored in and retrieved from a libsolv solvable via the solv-cpp wrapper. Tests added for: - Basic storage and retrieval of defaulted_keys list - Empty defaulted_keys preservation - Overwriting defaulted_keys - Single element (the common case for channel-derived packages) PURPOSE: The defaulted_keys field must survive the round-trip through libsolv when a PackageInfo is added to the solver database and later retrieved from the solution. This test ensures the wrapper provides the necessary storage mechanism. MOTIVATION: Issue mamba-org#4095 - URL-derived packages lose their defaulted_keys when going through the solver because libsolv doesn't natively store this field. The fix will use SOLVABLE_KEYWORDS to store a serialized representation. Test Status: COMPILATION FAILS (expected RED phase) - error: 'ObjSolvableView' has no member named 'set_defaulted_keys' - error: 'ObjSolvableView' has no member named 'defaulted_keys' Related: mamba-org#4095
Implement set_defaulted_keys() and defaulted_keys() methods in the libsolv solvable wrapper. The vector of keys is serialized as a comma-separated string using the SOLVABLE_KEYWORDS field for storage. SOLVABLE_KEYWORDS is repurposed for conda-specific metadata since it's not used in the conda ecosystem. This follows the pattern of other conda-specific fields stored in libsolv solvables. Implementation details: - set_defaulted_keys(): Serializes vector to "key1,key2,key3" format - defaulted_keys(): Parses comma-separated string back to vector - Empty list stores empty string (distinguishes from "not set") - Requires internalize() call before values are readable This allows the defaulted_keys field to survive the round-trip through libsolv when a PackageInfo is added to the solver database and later retrieved from the solution. Test Status: ALL TESTS PASS (GREEN phase) - test cases: 1 | 1 passed - assertions: 117 | 117 passed (including 6 new for defaulted_keys) Tests fixed: - Create a solvable > Set and get defaulted_keys (all 4 sections) Related: mamba-org#4095
…repodata (RED) Add comprehensive tests for defaulted_keys handling in the libsolv Database: 1. "Database preserves defaulted_keys" - Tests round-trip preservation: - Full defaulted_keys list survives database add/retrieve (FAILS) - Empty defaulted_keys becomes ["_initialized"] via fallback (passes) - Single _initialized is preserved (passes) 2. "Channel-derived packages from repodata have _initialized" - Tests that packages loaded from channel repodata JSON have ["_initialized"] (passes) The first test case reveals the bug: when a PackageInfo with a full defaulted_keys list (like from from_url()) is added via add_repo_from_packages() and retrieved via for_each_package_in_repo(), the list is lost and replaced with just ["_initialized"]. This is the root cause of issue mamba-org#4095: URL-derived packages lose their metadata tracking when going through the solver, causing write_repodata_record() to preserve stub values instead of replacing them with index.json data. SEMANTICS of defaulted_keys: - Empty: INVALID (missing "_initialized" sentinel) - ["_initialized"]: Properly initialized, trust all fields - ["_initialized", "field1", ...]: These fields have stub values to replace Test Status: 1 FAILING, 3 PASSING (expected RED phase) - test cases: 4 | 3 passed | 1 failed - assertions: 79 | 78 passed | 1 failed - FAILING: "defaulted_keys survives round-trip through database" Related: mamba-org#4095
Implement the complete fix for issue mamba-org#4095 by ensuring defaulted_keys survives the round-trip through libsolv and is properly initialized for all package sources. Changes in helpers.cpp: 1. set_solvable(PackageInfo) - Store defaulted_keys in solvable: Added `solv.set_defaulted_keys(pkg.defaulted_keys)` to preserve the field when a PackageInfo is added to the solver database. 2. make_package_info() - Read from solvable with fallback: Changed from always setting ["_initialized"] to reading from the solvable via `s.defaulted_keys()`. Added fallback to ["_initialized"] for backward compatibility with old .solv cache files that don't have defaulted_keys stored. 3. set_solvable(JSON template) - Set ["_initialized"] for channel repodata: Added `solv.set_defaulted_keys({ "_initialized" })` to mark packages loaded from channel repodata JSON as having authoritative metadata. This is required for write_repodata_record() to work correctly. Changes in specs.cpp (Python bindings): 4. PackageInfo constructor - Default to ["_initialized"]: Changed the default value from empty vector to ["_initialized"] for backward compatibility. Before this fix, make_package_info() always returned PackageInfo with ["_initialized"], so Python users creating PackageInfo manually should get the same default behavior. The fix ensures: - URL-derived packages: Full defaulted_keys list preserved through solver - Channel repodata packages: Have ["_initialized"] (trust all fields) - Old .solv cache files: Fallback to ["_initialized"] - Python-created PackageInfo: Default to ["_initialized"] Test Status: ALL TESTS PASS (GREEN phase) - test cases: 552 | 551 passed | 1 skipped - assertions: 27806 | 27806 passed Tests fixed: - Database preserves defaulted_keys (all 3 sections) - Channel-derived packages from repodata have _initialized Closes: mamba-org#4095
|
238e7ea is green up to a flaky connection error Still todo: more thoroughly vet the last 6 commits, and revise the commit description accordingly. Brief summary: I'm enabling libsolv to propagate |
I do not know C++. This PR was extremely heavily LLM-assisted (Claude in Cursor). As a result, please review this PR skeptically and critically. I have spent significantly more time than typical attempting to verify correctness to the extent I'm able.
Description
This PR fixes #4095: incorrect
repodata_record.jsonprecedence for explicit installs.Background
Since v2.1.1, libmamba switched to "repodata-first" when constructing
info/repodata_record.json(intended to honor channel repodata patches overinfo/index.json). However, for explicit installs (URL-based), the "repodata" object is a URL-derived stub containing placeholder/zero values—not real channel repodata. Making that stub authoritative causedrepodata_record.jsonto be written with incomplete/zeroed fields.Affected versions:
depends/constrainsfixed, butlicense,timestamp,build_number,track_featuresstill corrupted)The v2.3.3 partial fix (#4071) removed empty
depends/constrainsarrays so they could be backfilled fromindex.json. However, this approach has two problems:depends/constrainsto[]are silently overwritten byindex.jsonvaluesbuild_number,license,timestamp, andtrack_featuresremained corruptedSolution
This PR leverages the existing
defaulted_keysfield (originally introduced in PR #1120 for signature verification, but no longer populated since the 2023 libsolv wrapper refactor) to distinguish URL-derived stub values from authoritative solver/channel values:URL-derived packages (
from_url()):defaulted_keyslists stub field names (e.g.,["_initialized", "license", "timestamp", ...]). These fields are erased before merging withindex.json, allowing correct values to be filled in.Solver-derived packages (
make_package_info()):defaulted_keys = ["_initialized"]only. The sentinel signals "trust ALL fields"—nothing is erased, preserving channel patches including intentionally empty arrays.The
_initializedsentinel enables fail-hard verification:write_repodata_record()throwsstd::logic_errorif it's missing, catching any code path that fails to properly initializedefaulted_keys.Additionally, this PR adds cache healing to detect and fix corrupted caches from v2.1.1–v2.4.0 using a corruption signature (
timestamp == 0 AND license == "").Review Recommendation
Please review commit-by-commit. Each commit has an extended commit message with:
The commits follow TDD (Test-Driven Development) with alternating RED (failing tests) and GREEN (fix) commits:
defaulted_keyspopulation infrom_url()defaulted_keysinfrom_url()andmake_package_info()defaulted_keysinwrite_repodata_record()andconstructor.cpphas_valid_extracted_dir()Commits 07-08 add consistency improvements: ensuring
depends/constrainsare always present as arrays, omittingtrack_featureswhen empty (matching conda behavior), and computing missing checksums from the tarball.Type of Change
Checklist
pre-commit run --alllocally in the source folder and confirmed that there are no linter errors.