fix: correct repodata_record.json metadata for explicit installs (#4095) #4110

maresb · 2025-11-29T21:13:05Z

⚠️ Important Disclosure

I do not know C++. This PR was extremely heavily LLM-assisted (Claude in Cursor). As a result, please review this PR skeptically and critically. I have spent significantly more time than typical attempting to verify correctness to the extent I'm able.

Description

This PR fixes #4095: incorrect repodata_record.json precedence for explicit installs.

Background

Since v2.1.1, libmamba switched to "repodata-first" when constructing info/repodata_record.json (intended to honor channel repodata patches over info/index.json). However, for explicit installs (URL-based), the "repodata" object is a URL-derived stub containing placeholder/zero values—not real channel repodata. Making that stub authoritative caused repodata_record.json to be written with incomplete/zeroed fields.

Affected versions:

v2.1.1–v2.3.2: Full corruption (all stub fields written)
v2.3.3–v2.4.0: Partial fix via Fix empty depends/constrains when installing explicit spec files #4071 (depends/constrains fixed, but license, timestamp, build_number, track_features still corrupted)

The v2.3.3 partial fix (#4071) removed empty depends/constrains arrays so they could be backfilled from index.json. However, this approach has two problems:

Channel patches undone: Patches that intentionally set depends/constrains to [] are silently overwritten by index.json values
Other fields still wrong: build_number, license, timestamp, and track_features remained corrupted

Solution

This PR leverages the existing defaulted_keys field (originally introduced in PR #1120 for signature verification, but no longer populated since the 2023 libsolv wrapper refactor) to distinguish URL-derived stub values from authoritative solver/channel values:

URL-derived packages (from_url()): defaulted_keys lists stub field names (e.g., ["_initialized", "license", "timestamp", ...]). These fields are erased before merging with index.json, allowing correct values to be filled in.
Solver-derived packages (make_package_info()): defaulted_keys = ["_initialized"] only. The sentinel signals "trust ALL fields"—nothing is erased, preserving channel patches including intentionally empty arrays.

The _initialized sentinel enables fail-hard verification: write_repodata_record() throws std::logic_error if it's missing, catching any code path that fails to properly initialize defaulted_keys.

Additionally, this PR adds cache healing to detect and fix corrupted caches from v2.1.1–v2.4.0 using a corruption signature (timestamp == 0 AND license == "").

Review Recommendation

Please review commit-by-commit. Each commit has an extended commit message with:

Detailed explanation of the changes
Test status (assertions passed/failed)
Rationale for design decisions

The commits follow TDD (Test-Driven Development) with alternating RED (failing tests) and GREEN (fix) commits:

Commit	Type	Description
01	RED	Tests for `defaulted_keys` population in `from_url()`
02	GREEN	Populate `defaulted_keys` in `from_url()` and `make_package_info()`
03	RED	Tests for URL-derived metadata and channel patch preservation
04	GREEN	Use `defaulted_keys` in `write_repodata_record()` and `constructor.cpp`
05	RED	Test for cache healing
06	GREEN	Detect corrupted cache in `has_valid_extracted_dir()`
07	RED	Tests for consistent field presence and checksums
08	GREEN	Ensure consistent field presence and compute missing checksums

Commits 07-08 add consistency improvements: ensuring depends/constrains are always present as arrays, omitting track_features when empty (matching conda behavior), and computing missing checksums from the tarball.

Type of Change

Bugfix

Checklist

My code follows the general style and conventions of the codebase, ensuring consistency
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings
I have run pre-commit run --all locally in the source folder and confirmed that there are no linter errors.
I have added tests that prove my fix is effective or that my feature works
New and existing tests pass locally with my changes

Add comprehensive tests verifying that PackageInfo::from_url() populates the defaulted_keys field to mark which fields have stub/default values. Tests added for: - Conda packages (.conda, .tar.bz2): Standard stub fields - Wheel packages (.whl): Standard + build/build_string - TarGz packages (.tar.gz): Same as wheels - Git URLs with #egg=: Most fields defaulted (minimal info from URL) - Git URLs WITHOUT #egg=: Also includes 'name' in defaulted_keys Each test verifies "_initialized" sentinel presence (required for all paths) plus appropriate stub field markers. Test Status: ALL TESTS FAIL (expected RED phase) - test cases: 1 | 1 failed - assertions: 6 | 6 failed (all for missing _initialized sentinel) Related: mamba-org#4095

…REEN) Populate the defaulted_keys field in PackageInfo creation paths to mark which fields have stub/default values. This enables write_repodata_record() to erase these stubs before merging with index.json. The "_initialized" sentinel is placed first in defaulted_keys for all paths. This enables fail-hard verification: write_repodata_record() will throw if _initialized is missing, catching any code path that fails to set defaulted_keys. Changes in from_url() (package_info.cpp): - Conda (.conda, .tar.bz2): _initialized + build_number, license, timestamp, track_features, depends, constrains - Wheel (.whl): Above + build, build_string (not in wheel filename) - TarGz (.tar.gz): Same as wheels - Git URLs: _initialized + version, channel, subdir, fn, build, build_string, build_number, license, timestamp, track_features, depends, constrains - Git URLs WITHOUT #egg=: Also includes 'name' (cannot be derived from URL) Changes in make_package_info() (helpers.cpp): - Set defaulted_keys = {"_initialized"} for solver-derived packages - Solver packages have authoritative metadata from channel repodata - Only _initialized signals "trust all fields" (including patches) Updated header comment (package_info.hpp): - Removed outdated WARNING about make_package_info not setting defaulted_keys - New comment accurately describes the current behavior The asymmetry is intentional: - URL-derived: list stub fields to erase - Solver-derived: trust all fields, erase nothing Test Status: ALL TESTS PASS (GREEN phase) - test cases: 3 | 3 passed - assertions: 164 | 164 passed Tests fixed: - PackageInfo::from_url populates defaulted_keys (all 6 sections) Related: mamba-org#4095

…on (RED) Add tests demonstrating bugs in write_repodata_record(): C++ Tests (test_package_fetcher.cpp): 1. URL-derived metadata test: - Creates PackageInfo from URL (has stub defaults) - Creates tarball with correct index.json values - EXPECTS: repodata_record.json has correct values from index.json - Tests: license, timestamp, build_number 2. Channel patch preservation (depends): - Creates solver-derived PackageInfo with depends=[] - defaulted_keys = {"_initialized"} (only sentinel = trust all) - EXPECTS: Empty depends preserved in repodata_record.json - Verifies channel patches are NOT overwritten by index.json 3. Channel patch preservation (constrains): - Same as above but for constrains field Python Integration Tests (test_constructor.py): 1. test_url_derived_metadata_from_index_json 2. test_consistent_field_presence 3. test_track_features_omitted_when_empty Note: Python tests PASS because constructor.cpp uses index.json as base when no cached repodata exists. This validates the existing behavior. Python test improvements: - Use try-finally in teardown for robust env var restoration Test Status: - C++ tests: 3 | 0 passed | 3 failed (RED - demonstrates bugs) - Python tests: 3 | 3 passed (validates assumptions) Related: mamba-org#4095

…p (GREEN) Fix URL-derived package corruption and preserve channel patches by using the defaulted_keys mechanism in both write_repodata_record() and constructor.cpp. Implementation in package_fetcher.cpp: 1. FAIL-HARD CHECK: Verify "_initialized" sentinel is present - Throws std::logic_error if missing - Catches any code path that fails to set defaulted_keys - Re-throw std::logic_error from extract() to propagate programming bugs 2. ERASE STUB FIELDS: Remove fields listed in defaulted_keys before merge - URL-derived: erases build_number, license, timestamp, etc. - Solver-derived: defaulted_keys = {"_initialized"}, nothing erased - This preserves channel patches with intentionally empty arrays 3. MERGE WITH INDEX.JSON: insert() fills erased fields - URL-derived: gets correct values from index.json - Solver-derived: all fields already present, index.json ignored Implementation in constructor.cpp: Same defaulted_keys erasure logic BUT no fail-hard check (intentional asymmetry). Tests added: - PackageFetcher::write_repodata_record fails without _initialized Test Status: ALL TESTS PASS (GREEN phase) - C++ tests: 4 | 4 passed (21 assertions) - Python tests: 3 | 3 passed Tests fixed: - write_repodata_record uses index.json for URL-derived metadata - write_repodata_record preserves channel patched empty depends - write_repodata_record preserves channel patched empty constrains Related: mamba-org#4095

Add test demonstrating that EXISTING corrupted caches (from v2.1.1-v2.4.0) are NOT detected and healed by current code. The bug: has_valid_extracted_dir() only validates checksums, not metadata correctness. Corrupted caches with stub values (timestamp=0, license="") are considered valid and used as-is. Test scenario: 1. Create package tarball with CORRECT index.json (MIT license, etc.) 2. Create CORRUPTED repodata_record.json in cache (simulating bug) - timestamp=0, license="", build_number=0 3. Create PackageFetcher 4. EXPECT: needs_extract() returns true (corruption detected) 5. ACTUAL: returns false (corruption NOT detected) Healing flow (after fix): 1. has_valid_extracted_dir() detects corruption signature 2. Returns valid=false → needs_extract() returns true 3. Package re-extracted from tarball 4. write_repodata_record() writes correct values Test Status: TEST FAILS (expected RED phase) - test cases: 5 | 4 passed | 1 failed - assertions: 23 | 22 passed | 1 failed - needs_extract() returns false (should return true) Related: mamba-org#4095

Add corruption detection to has_valid_extracted_dir() to heal caches corrupted by v2.1.1-v2.4.0 bug. Corruption signature: timestamp == 0 AND license == "" When detected: 1. Log info message about the corruption detection 2. Return valid=false to invalidate the cache entry 3. This triggers re-extraction from the tarball 4. write_repodata_record() then writes correct values using index.json Why this signature is reliable (low false-positive risk): - timestamp=0 is extremely rare for legitimate packages (build tools set it) - Both conditions must be true (reduces false positives significantly) - license="" alone could be legitimate (some packages have no license) - timestamp=0 alone could theoretically be legitimate - Together they reliably indicate v2.1.1-v2.4.0 corruption Even if a false positive occurs: - Only consequence is unnecessary re-extraction (wasted work) - NOT data corruption - re-extracted package has correct values - This is acceptable given the rarity of the combination Added is_number()/is_string() type checks for defensive programming to prevent potential type_error exceptions if fields have unexpected types. Test Status: ALL TESTS PASS (GREEN phase) - test cases: 5 | 5 passed - assertions: 28 | 28 passed Tests fixed: - PackageFetcher heals existing corrupted cache Related: mamba-org#4095

Add tests verifying repodata_record.json consistency and checksum handling: New Tests: 1. depends/constrains presence test: - Tests that depends and constrains are always present as arrays - Even when index.json lacks these fields (like nlohmann_json-abi) - Matches conda behavior for field consistency - Status: FAILS (depends key not present when missing from index.json) 2. track_features omission test: - Tests that track_features is omitted when empty - Matches conda behavior to reduce JSON noise - Status: PASSES (current behavior already correct) 3. both checksums presence test: - Tests that both md5 and sha256 are always present - Should compute checksums from tarball when not available - Status: FAILS (md5/sha256 not computed when missing) 4. noarch backfill test: - Tests that noarch field is backfilled from index.json - Validates existing merge behavior for conditionally-written fields - Status: PASSES (validates assumption) 5. size from tarball test: - Tests that size=0 is filled from actual tarball file size - Validates existing special-case handling - Status: PASSES (validates assumption) 6. no false positive healing test: - Tests that timestamp=0 with license="MIT" is NOT healed - Verifies corruption signature requires BOTH conditions - Status: PASSES (validates healing specificity) Test Status: PARTIAL FAILURES (expected RED phase) - test cases: 11 | 9 passed | 2 failed - assertions: 53 | 51 passed | 2 failed - depends/constrains: FAIL - checksums: FAIL - Others: PASS (validate assumptions) Related: mamba-org#4095

…GREEN) - Add logic to ensure depends and constrains are always present as arrays - Ensure track_features is omitted when empty (not empty string) - Compute and fill md5/sha256 checksums from tarball when missing - Apply consistent field handling to constructor.cpp This matches conda behavior where: - depends and constrains are always present (even as empty arrays) - track_features is only included when non-empty - Both checksums are always available Test Status: ALL TESTS PASS (GREEN phase) - C++ test cases: 11 | 11 passed - C++ assertions: 61 | 61 passed - Python test cases: 3 | 3 passed Tests fixed: - PackageFetcher::write_repodata_record ensures depends/constrains present - PackageFetcher::write_repodata_record ensures both checksums Related: mamba-org#4095

codecov · 2025-11-29T21:29:58Z

Codecov Report

❌ Patch coverage is 96.21110% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.25%. Comparing base (48a330f) to head (238e7ea).

Files with missing lines	Patch %	Lines
micromamba/src/constructor.cpp	0.00%	20 Missing ⚠️
libmamba/src/core/package_fetcher.cpp	90.90%	3 Missing ⚠️
libmamba/src/specs/package_info.cpp	85.71%	2 Missing ⚠️
...ibmamba/tests/src/solver/libsolv/test_database.cpp	96.61%	2 Missing ⚠️
libmambapy/bindings/specs.cpp	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4110      +/-   ##
==========================================
+ Coverage   63.47%   64.25%   +0.77%     
==========================================
  Files         315      315              
  Lines       38632    39363     +731     
  Branches     2966     2991      +25     
==========================================
+ Hits        24521    25292     +771     
+ Misses      14042    14002      -40     
  Partials       69       69

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…d_keys (RED) Add tests demonstrating that lockfile-parsed PackageInfo objects are missing the "_initialized" sentinel in defaulted_keys for both: - conda-lock YAML format (env_lockfile_conda.cpp) - mambajs JSON format (env_lockfile_mambajs.cpp) PURPOSE: Verify that PackageInfo objects created from lockfile parsing have the "_initialized" sentinel, which is required for fail-hard verification in write_repodata_record(). MOTIVATION: Issue mamba-org#4095 - after refactoring env_lockfile.cpp into separate files for conda and mambajs formats, both code paths need to properly set defaulted_keys: - Conda format: calls from_url() but wasn't copying defaulted_keys - Mambajs format: builds PackageInfo manually without setting defaulted_keys Test Status: BOTH TESTS FAIL (expected RED phase) - test cases: 2 | 2 failed - assertions: 6 | 4 passed | 2 failed Related: mamba-org#4095

Fix both lockfile parsing implementations to properly set defaulted_keys with the "_initialized" sentinel required by write_repodata_record(). Changes: - env_lockfile_conda.cpp: Copy defaulted_keys from from_url() result - env_lockfile_mambajs.cpp: Set defaulted_keys = {"_initialized"} (lockfile data is authoritative, so no fields are "defaulted") This fix is required after the env_lockfile.cpp refactoring that split the implementation into separate files for conda and mambajs formats. Test Status: ALL TESTS PASS (GREEN phase) - lockfile_packages_have_initialized_in_defaulted_keys-conda: PASSED - lockfile_packages_have_initialized_in_defaulted_keys-mambajs: PASSED - All existing tests: PASSED Tests fixed: - test_lockfile[condalock-True] (Python integration test that was failing in CI) Related: mamba-org#4095

Add comprehensive tests verifying that defaulted_keys can be stored in and retrieved from a libsolv solvable via the solv-cpp wrapper. Tests added for: - Basic storage and retrieval of defaulted_keys list - Empty defaulted_keys preservation - Overwriting defaulted_keys - Single element (the common case for channel-derived packages) PURPOSE: The defaulted_keys field must survive the round-trip through libsolv when a PackageInfo is added to the solver database and later retrieved from the solution. This test ensures the wrapper provides the necessary storage mechanism. MOTIVATION: Issue mamba-org#4095 - URL-derived packages lose their defaulted_keys when going through the solver because libsolv doesn't natively store this field. The fix will use SOLVABLE_KEYWORDS to store a serialized representation. Test Status: COMPILATION FAILS (expected RED phase) - error: 'ObjSolvableView' has no member named 'set_defaulted_keys' - error: 'ObjSolvableView' has no member named 'defaulted_keys' Related: mamba-org#4095

Implement set_defaulted_keys() and defaulted_keys() methods in the libsolv solvable wrapper. The vector of keys is serialized as a comma-separated string using the SOLVABLE_KEYWORDS field for storage. SOLVABLE_KEYWORDS is repurposed for conda-specific metadata since it's not used in the conda ecosystem. This follows the pattern of other conda-specific fields stored in libsolv solvables. Implementation details: - set_defaulted_keys(): Serializes vector to "key1,key2,key3" format - defaulted_keys(): Parses comma-separated string back to vector - Empty list stores empty string (distinguishes from "not set") - Requires internalize() call before values are readable This allows the defaulted_keys field to survive the round-trip through libsolv when a PackageInfo is added to the solver database and later retrieved from the solution. Test Status: ALL TESTS PASS (GREEN phase) - test cases: 1 | 1 passed - assertions: 117 | 117 passed (including 6 new for defaulted_keys) Tests fixed: - Create a solvable > Set and get defaulted_keys (all 4 sections) Related: mamba-org#4095

…repodata (RED) Add comprehensive tests for defaulted_keys handling in the libsolv Database: 1. "Database preserves defaulted_keys" - Tests round-trip preservation: - Full defaulted_keys list survives database add/retrieve (FAILS) - Empty defaulted_keys becomes ["_initialized"] via fallback (passes) - Single _initialized is preserved (passes) 2. "Channel-derived packages from repodata have _initialized" - Tests that packages loaded from channel repodata JSON have ["_initialized"] (passes) The first test case reveals the bug: when a PackageInfo with a full defaulted_keys list (like from from_url()) is added via add_repo_from_packages() and retrieved via for_each_package_in_repo(), the list is lost and replaced with just ["_initialized"]. This is the root cause of issue mamba-org#4095: URL-derived packages lose their metadata tracking when going through the solver, causing write_repodata_record() to preserve stub values instead of replacing them with index.json data. SEMANTICS of defaulted_keys: - Empty: INVALID (missing "_initialized" sentinel) - ["_initialized"]: Properly initialized, trust all fields - ["_initialized", "field1", ...]: These fields have stub values to replace Test Status: 1 FAILING, 3 PASSING (expected RED phase) - test cases: 4 | 3 passed | 1 failed - assertions: 79 | 78 passed | 1 failed - FAILING: "defaulted_keys survives round-trip through database" Related: mamba-org#4095

Implement the complete fix for issue mamba-org#4095 by ensuring defaulted_keys survives the round-trip through libsolv and is properly initialized for all package sources. Changes in helpers.cpp: 1. set_solvable(PackageInfo) - Store defaulted_keys in solvable: Added `solv.set_defaulted_keys(pkg.defaulted_keys)` to preserve the field when a PackageInfo is added to the solver database. 2. make_package_info() - Read from solvable with fallback: Changed from always setting ["_initialized"] to reading from the solvable via `s.defaulted_keys()`. Added fallback to ["_initialized"] for backward compatibility with old .solv cache files that don't have defaulted_keys stored. 3. set_solvable(JSON template) - Set ["_initialized"] for channel repodata: Added `solv.set_defaulted_keys({ "_initialized" })` to mark packages loaded from channel repodata JSON as having authoritative metadata. This is required for write_repodata_record() to work correctly. Changes in specs.cpp (Python bindings): 4. PackageInfo constructor - Default to ["_initialized"]: Changed the default value from empty vector to ["_initialized"] for backward compatibility. Before this fix, make_package_info() always returned PackageInfo with ["_initialized"], so Python users creating PackageInfo manually should get the same default behavior. The fix ensures: - URL-derived packages: Full defaulted_keys list preserved through solver - Channel repodata packages: Have ["_initialized"] (trust all fields) - Old .solv cache files: Fallback to ["_initialized"] - Python-created PackageInfo: Default to ["_initialized"] Test Status: ALL TESTS PASS (GREEN phase) - test cases: 552 | 551 passed | 1 skipped - assertions: 27806 | 27806 passed Tests fixed: - Database preserves defaulted_keys (all 3 sections) - Channel-derived packages from repodata have _initialized Closes: mamba-org#4095

maresb · 2025-12-01T16:05:33Z

238e7ea is green up to a flaky connection error

Still todo: more thoroughly vet the last 6 commits, and revise the commit description accordingly.

Brief summary: I'm enabling libsolv to propagate defaulted_keys because there are situations where the solver receives explicit-style URLs, e.g. specifying a URL on the command line, and so without this (or some other mechanism) we still risk writing bad stub values.

maresb added 8 commits November 29, 2025 21:34

github-actions bot added the release::bug_fixes For PRs fixing bugs label Nov 29, 2025

maresb added 2 commits November 29, 2025 22:53

maresb force-pushed the fix/repodata-record-metadata-4095 branch from 82b18da to da17b04 Compare November 29, 2025 22:21

maresb marked this pull request as draft November 30, 2025 00:00

maresb added 4 commits December 1, 2025 13:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: correct repodata_record.json metadata for explicit installs (#4095) #4110

fix: correct repodata_record.json metadata for explicit installs (#4095) #4110

Uh oh!

maresb commented Nov 29, 2025 •

edited

Loading

Uh oh!

codecov bot commented Nov 29, 2025 •

edited

Loading

Uh oh!

maresb commented Dec 1, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fix: correct repodata_record.json metadata for explicit installs (#4095) #4110

Are you sure you want to change the base?

fix: correct repodata_record.json metadata for explicit installs (#4095) #4110

Uh oh!

Conversation

maresb commented Nov 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Important Disclosure

Description

Background

Solution

Review Recommendation

Type of Change

Checklist

Uh oh!

codecov bot commented Nov 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

maresb commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

maresb commented Nov 29, 2025 •

edited

Loading

codecov bot commented Nov 29, 2025 •

edited

Loading

maresb commented Dec 1, 2025 •

edited

Loading