Skip to content

Conversation

@maresb
Copy link
Contributor

@maresb maresb commented Nov 29, 2025

⚠️ Important Disclosure

I do not know C++. This PR was extremely heavily LLM-assisted (Claude in Cursor). As a result, please review this PR skeptically and critically. I have spent significantly more time than typical attempting to verify correctness to the extent I'm able.

Description

This PR fixes #4095: incorrect repodata_record.json precedence for explicit installs.

Background

Since v2.1.1, libmamba switched to "repodata-first" when constructing info/repodata_record.json (intended to honor channel repodata patches over info/index.json). However, for explicit installs (URL-based), the "repodata" object is a URL-derived stub containing placeholder/zero values—not real channel repodata. Making that stub authoritative caused repodata_record.json to be written with incomplete/zeroed fields.

Affected versions:

The v2.3.3 partial fix (#4071) removed empty depends/constrains arrays so they could be backfilled from index.json. However, this approach has two problems:

  1. Channel patches undone: Patches that intentionally set depends/constrains to [] are silently overwritten by index.json values
  2. Other fields still wrong: build_number, license, timestamp, and track_features remained corrupted

Solution

This PR leverages the existing defaulted_keys field (originally introduced in PR #1120 for signature verification, but no longer populated since the 2023 libsolv wrapper refactor) to distinguish URL-derived stub values from authoritative solver/channel values:

  • URL-derived packages (from_url()): defaulted_keys lists stub field names (e.g., ["_initialized", "license", "timestamp", ...]). These fields are erased before merging with index.json, allowing correct values to be filled in.

  • Solver-derived packages (make_package_info()): defaulted_keys = ["_initialized"] only. The sentinel signals "trust ALL fields"—nothing is erased, preserving channel patches including intentionally empty arrays.

The _initialized sentinel enables fail-hard verification: write_repodata_record() throws std::logic_error if it's missing, catching any code path that fails to properly initialize defaulted_keys.

Additionally, this PR adds cache healing to detect and fix corrupted caches from v2.1.1–v2.4.0 using a corruption signature (timestamp == 0 AND license == "").

Review Recommendation

Please review commit-by-commit. Each commit has an extended commit message with:

  • Detailed explanation of the changes
  • Test status (assertions passed/failed)
  • Rationale for design decisions

The commits follow TDD (Test-Driven Development) with alternating RED (failing tests) and GREEN (fix) commits:

Commit Type Description
01 RED Tests for defaulted_keys population in from_url()
02 GREEN Populate defaulted_keys in from_url() and make_package_info()
03 RED Tests for URL-derived metadata and channel patch preservation
04 GREEN Use defaulted_keys in write_repodata_record() and constructor.cpp
05 RED Test for cache healing
06 GREEN Detect corrupted cache in has_valid_extracted_dir()
07 RED Tests for consistent field presence and checksums
08 GREEN Ensure consistent field presence and compute missing checksums

Commits 07-08 add consistency improvements: ensuring depends/constrains are always present as arrays, omitting track_features when empty (matching conda behavior), and computing missing checksums from the tarball.

Type of Change

  • Bugfix

Checklist

  • My code follows the general style and conventions of the codebase, ensuring consistency
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have run pre-commit run --all locally in the source folder and confirmed that there are no linter errors.
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes

Add comprehensive tests verifying that PackageInfo::from_url() populates
the defaulted_keys field to mark which fields have stub/default values.

Tests added for:
- Conda packages (.conda, .tar.bz2): Standard stub fields
- Wheel packages (.whl): Standard + build/build_string
- TarGz packages (.tar.gz): Same as wheels
- Git URLs with #egg=: Most fields defaulted (minimal info from URL)
- Git URLs WITHOUT #egg=: Also includes 'name' in defaulted_keys

Each test verifies "_initialized" sentinel presence (required for all paths)
plus appropriate stub field markers.

Test Status: ALL TESTS FAIL (expected RED phase)
- test cases: 1 | 1 failed
- assertions: 6 | 6 failed (all for missing _initialized sentinel)

Related: mamba-org#4095
…REEN)

Populate the defaulted_keys field in PackageInfo creation paths to mark
which fields have stub/default values. This enables write_repodata_record()
to erase these stubs before merging with index.json.

The "_initialized" sentinel is placed first in defaulted_keys for all paths.
This enables fail-hard verification: write_repodata_record() will throw if
_initialized is missing, catching any code path that fails to set defaulted_keys.

Changes in from_url() (package_info.cpp):
- Conda (.conda, .tar.bz2): _initialized + build_number, license, timestamp,
  track_features, depends, constrains
- Wheel (.whl): Above + build, build_string (not in wheel filename)
- TarGz (.tar.gz): Same as wheels
- Git URLs: _initialized + version, channel, subdir, fn, build, build_string,
  build_number, license, timestamp, track_features, depends, constrains
- Git URLs WITHOUT #egg=: Also includes 'name' (cannot be derived from URL)

Changes in make_package_info() (helpers.cpp):
- Set defaulted_keys = {"_initialized"} for solver-derived packages
- Solver packages have authoritative metadata from channel repodata
- Only _initialized signals "trust all fields" (including patches)

Updated header comment (package_info.hpp):
- Removed outdated WARNING about make_package_info not setting defaulted_keys
- New comment accurately describes the current behavior

The asymmetry is intentional:
- URL-derived: list stub fields to erase
- Solver-derived: trust all fields, erase nothing

Test Status: ALL TESTS PASS (GREEN phase)
- test cases: 3 | 3 passed
- assertions: 164 | 164 passed

Tests fixed:
- PackageInfo::from_url populates defaulted_keys (all 6 sections)

Related: mamba-org#4095
…on (RED)

Add tests demonstrating bugs in write_repodata_record():

C++ Tests (test_package_fetcher.cpp):

1. URL-derived metadata test:
   - Creates PackageInfo from URL (has stub defaults)
   - Creates tarball with correct index.json values
   - EXPECTS: repodata_record.json has correct values from index.json
   - Tests: license, timestamp, build_number

2. Channel patch preservation (depends):
   - Creates solver-derived PackageInfo with depends=[]
   - defaulted_keys = {"_initialized"} (only sentinel = trust all)
   - EXPECTS: Empty depends preserved in repodata_record.json
   - Verifies channel patches are NOT overwritten by index.json

3. Channel patch preservation (constrains):
   - Same as above but for constrains field

Python Integration Tests (test_constructor.py):

1. test_url_derived_metadata_from_index_json
2. test_consistent_field_presence
3. test_track_features_omitted_when_empty

Note: Python tests PASS because constructor.cpp uses index.json as base
when no cached repodata exists. This validates the existing behavior.

Python test improvements:
- Use try-finally in teardown for robust env var restoration

Test Status:
- C++ tests: 3 | 0 passed | 3 failed (RED - demonstrates bugs)
- Python tests: 3 | 3 passed (validates assumptions)

Related: mamba-org#4095
…p (GREEN)

Fix URL-derived package corruption and preserve channel patches by using
the defaulted_keys mechanism in both write_repodata_record() and constructor.cpp.

Implementation in package_fetcher.cpp:

1. FAIL-HARD CHECK: Verify "_initialized" sentinel is present
   - Throws std::logic_error if missing
   - Catches any code path that fails to set defaulted_keys
   - Re-throw std::logic_error from extract() to propagate programming bugs

2. ERASE STUB FIELDS: Remove fields listed in defaulted_keys before merge
   - URL-derived: erases build_number, license, timestamp, etc.
   - Solver-derived: defaulted_keys = {"_initialized"}, nothing erased
   - This preserves channel patches with intentionally empty arrays

3. MERGE WITH INDEX.JSON: insert() fills erased fields
   - URL-derived: gets correct values from index.json
   - Solver-derived: all fields already present, index.json ignored

Implementation in constructor.cpp:
Same defaulted_keys erasure logic BUT no fail-hard check (intentional asymmetry).

Tests added:
- PackageFetcher::write_repodata_record fails without _initialized

Test Status: ALL TESTS PASS (GREEN phase)
- C++ tests: 4 | 4 passed (21 assertions)
- Python tests: 3 | 3 passed

Tests fixed:
- write_repodata_record uses index.json for URL-derived metadata
- write_repodata_record preserves channel patched empty depends
- write_repodata_record preserves channel patched empty constrains

Related: mamba-org#4095
Add test demonstrating that EXISTING corrupted caches (from v2.1.1-v2.4.0)
are NOT detected and healed by current code.

The bug: has_valid_extracted_dir() only validates checksums, not metadata
correctness. Corrupted caches with stub values (timestamp=0, license="")
are considered valid and used as-is.

Test scenario:
1. Create package tarball with CORRECT index.json (MIT license, etc.)
2. Create CORRUPTED repodata_record.json in cache (simulating bug)
   - timestamp=0, license="", build_number=0
3. Create PackageFetcher
4. EXPECT: needs_extract() returns true (corruption detected)
5. ACTUAL: returns false (corruption NOT detected)

Healing flow (after fix):
1. has_valid_extracted_dir() detects corruption signature
2. Returns valid=false → needs_extract() returns true
3. Package re-extracted from tarball
4. write_repodata_record() writes correct values

Test Status: TEST FAILS (expected RED phase)
- test cases: 5 | 4 passed | 1 failed
- assertions: 23 | 22 passed | 1 failed
- needs_extract() returns false (should return true)

Related: mamba-org#4095
Add corruption detection to has_valid_extracted_dir() to heal caches
corrupted by v2.1.1-v2.4.0 bug.

Corruption signature: timestamp == 0 AND license == ""

When detected:
1. Log info message about the corruption detection
2. Return valid=false to invalidate the cache entry
3. This triggers re-extraction from the tarball
4. write_repodata_record() then writes correct values using index.json

Why this signature is reliable (low false-positive risk):
- timestamp=0 is extremely rare for legitimate packages (build tools set it)
- Both conditions must be true (reduces false positives significantly)
- license="" alone could be legitimate (some packages have no license)
- timestamp=0 alone could theoretically be legitimate
- Together they reliably indicate v2.1.1-v2.4.0 corruption

Even if a false positive occurs:
- Only consequence is unnecessary re-extraction (wasted work)
- NOT data corruption - re-extracted package has correct values
- This is acceptable given the rarity of the combination

Added is_number()/is_string() type checks for defensive programming to
prevent potential type_error exceptions if fields have unexpected types.

Test Status: ALL TESTS PASS (GREEN phase)
- test cases: 5 | 5 passed
- assertions: 28 | 28 passed

Tests fixed:
- PackageFetcher heals existing corrupted cache

Related: mamba-org#4095
Add tests verifying repodata_record.json consistency and checksum handling:

New Tests:

1. depends/constrains presence test:
   - Tests that depends and constrains are always present as arrays
   - Even when index.json lacks these fields (like nlohmann_json-abi)
   - Matches conda behavior for field consistency
   - Status: FAILS (depends key not present when missing from index.json)

2. track_features omission test:
   - Tests that track_features is omitted when empty
   - Matches conda behavior to reduce JSON noise
   - Status: PASSES (current behavior already correct)

3. both checksums presence test:
   - Tests that both md5 and sha256 are always present
   - Should compute checksums from tarball when not available
   - Status: FAILS (md5/sha256 not computed when missing)

4. noarch backfill test:
   - Tests that noarch field is backfilled from index.json
   - Validates existing merge behavior for conditionally-written fields
   - Status: PASSES (validates assumption)

5. size from tarball test:
   - Tests that size=0 is filled from actual tarball file size
   - Validates existing special-case handling
   - Status: PASSES (validates assumption)

6. no false positive healing test:
   - Tests that timestamp=0 with license="MIT" is NOT healed
   - Verifies corruption signature requires BOTH conditions
   - Status: PASSES (validates healing specificity)

Test Status: PARTIAL FAILURES (expected RED phase)
- test cases: 11 | 9 passed | 2 failed
- assertions: 53 | 51 passed | 2 failed
- depends/constrains: FAIL
- checksums: FAIL
- Others: PASS (validate assumptions)

Related: mamba-org#4095
…GREEN)

- Add logic to ensure depends and constrains are always present as arrays
- Ensure track_features is omitted when empty (not empty string)
- Compute and fill md5/sha256 checksums from tarball when missing
- Apply consistent field handling to constructor.cpp

This matches conda behavior where:
- depends and constrains are always present (even as empty arrays)
- track_features is only included when non-empty
- Both checksums are always available

Test Status: ALL TESTS PASS (GREEN phase)
- C++ test cases: 11 | 11 passed
- C++ assertions: 61 | 61 passed
- Python test cases: 3 | 3 passed

Tests fixed:
- PackageFetcher::write_repodata_record ensures depends/constrains present
- PackageFetcher::write_repodata_record ensures both checksums

Related: mamba-org#4095
@github-actions github-actions bot added the release::bug_fixes For PRs fixing bugs label Nov 29, 2025
@codecov
Copy link

codecov bot commented Nov 29, 2025

Codecov Report

❌ Patch coverage is 96.21110% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.25%. Comparing base (48a330f) to head (238e7ea).

Files with missing lines Patch % Lines
micromamba/src/constructor.cpp 0.00% 20 Missing ⚠️
libmamba/src/core/package_fetcher.cpp 90.90% 3 Missing ⚠️
libmamba/src/specs/package_info.cpp 85.71% 2 Missing ⚠️
...ibmamba/tests/src/solver/libsolv/test_database.cpp 96.61% 2 Missing ⚠️
libmambapy/bindings/specs.cpp 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4110      +/-   ##
==========================================
+ Coverage   63.47%   64.25%   +0.77%     
==========================================
  Files         315      315              
  Lines       38632    39363     +731     
  Branches     2966     2991      +25     
==========================================
+ Hits        24521    25292     +771     
+ Misses      14042    14002      -40     
  Partials       69       69              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…d_keys (RED)

Add tests demonstrating that lockfile-parsed PackageInfo objects are
missing the "_initialized" sentinel in defaulted_keys for both:
- conda-lock YAML format (env_lockfile_conda.cpp)
- mambajs JSON format (env_lockfile_mambajs.cpp)

PURPOSE: Verify that PackageInfo objects created from lockfile parsing
have the "_initialized" sentinel, which is required for fail-hard
verification in write_repodata_record().

MOTIVATION: Issue mamba-org#4095 - after refactoring env_lockfile.cpp into separate
files for conda and mambajs formats, both code paths need to properly set
defaulted_keys:
- Conda format: calls from_url() but wasn't copying defaulted_keys
- Mambajs format: builds PackageInfo manually without setting defaulted_keys

Test Status: BOTH TESTS FAIL (expected RED phase)
- test cases: 2 | 2 failed
- assertions: 6 | 4 passed | 2 failed

Related: mamba-org#4095
Fix both lockfile parsing implementations to properly set defaulted_keys
with the "_initialized" sentinel required by write_repodata_record().

Changes:
- env_lockfile_conda.cpp: Copy defaulted_keys from from_url() result
- env_lockfile_mambajs.cpp: Set defaulted_keys = {"_initialized"}
  (lockfile data is authoritative, so no fields are "defaulted")

This fix is required after the env_lockfile.cpp refactoring that split
the implementation into separate files for conda and mambajs formats.

Test Status: ALL TESTS PASS (GREEN phase)
- lockfile_packages_have_initialized_in_defaulted_keys-conda: PASSED
- lockfile_packages_have_initialized_in_defaulted_keys-mambajs: PASSED
- All existing tests: PASSED

Tests fixed:
- test_lockfile[condalock-True] (Python integration test that was failing in CI)

Related: mamba-org#4095
@maresb maresb force-pushed the fix/repodata-record-metadata-4095 branch from 82b18da to da17b04 Compare November 29, 2025 22:21
@maresb maresb marked this pull request as draft November 30, 2025 00:00
Add comprehensive tests verifying that defaulted_keys can be stored in
and retrieved from a libsolv solvable via the solv-cpp wrapper.

Tests added for:
- Basic storage and retrieval of defaulted_keys list
- Empty defaulted_keys preservation
- Overwriting defaulted_keys
- Single element (the common case for channel-derived packages)

PURPOSE: The defaulted_keys field must survive the round-trip through
libsolv when a PackageInfo is added to the solver database and later
retrieved from the solution. This test ensures the wrapper provides
the necessary storage mechanism.

MOTIVATION: Issue mamba-org#4095 - URL-derived packages lose their defaulted_keys
when going through the solver because libsolv doesn't natively store this
field. The fix will use SOLVABLE_KEYWORDS to store a serialized representation.

Test Status: COMPILATION FAILS (expected RED phase)
- error: 'ObjSolvableView' has no member named 'set_defaulted_keys'
- error: 'ObjSolvableView' has no member named 'defaulted_keys'

Related: mamba-org#4095
Implement set_defaulted_keys() and defaulted_keys() methods in the
libsolv solvable wrapper. The vector of keys is serialized as a
comma-separated string using the SOLVABLE_KEYWORDS field for storage.

SOLVABLE_KEYWORDS is repurposed for conda-specific metadata since it's
not used in the conda ecosystem. This follows the pattern of other
conda-specific fields stored in libsolv solvables.

Implementation details:
- set_defaulted_keys(): Serializes vector to "key1,key2,key3" format
- defaulted_keys(): Parses comma-separated string back to vector
- Empty list stores empty string (distinguishes from "not set")
- Requires internalize() call before values are readable

This allows the defaulted_keys field to survive the round-trip through
libsolv when a PackageInfo is added to the solver database and later
retrieved from the solution.

Test Status: ALL TESTS PASS (GREEN phase)
- test cases: 1 | 1 passed
- assertions: 117 | 117 passed (including 6 new for defaulted_keys)

Tests fixed:
- Create a solvable > Set and get defaulted_keys (all 4 sections)

Related: mamba-org#4095
…repodata (RED)

Add comprehensive tests for defaulted_keys handling in the libsolv Database:

1. "Database preserves defaulted_keys" - Tests round-trip preservation:
   - Full defaulted_keys list survives database add/retrieve (FAILS)
   - Empty defaulted_keys becomes ["_initialized"] via fallback (passes)
   - Single _initialized is preserved (passes)

2. "Channel-derived packages from repodata have _initialized" - Tests that
   packages loaded from channel repodata JSON have ["_initialized"] (passes)

The first test case reveals the bug: when a PackageInfo with a full
defaulted_keys list (like from from_url()) is added via add_repo_from_packages()
and retrieved via for_each_package_in_repo(), the list is lost and replaced
with just ["_initialized"].

This is the root cause of issue mamba-org#4095: URL-derived packages lose their
metadata tracking when going through the solver, causing write_repodata_record()
to preserve stub values instead of replacing them with index.json data.

SEMANTICS of defaulted_keys:
- Empty: INVALID (missing "_initialized" sentinel)
- ["_initialized"]: Properly initialized, trust all fields
- ["_initialized", "field1", ...]: These fields have stub values to replace

Test Status: 1 FAILING, 3 PASSING (expected RED phase)
- test cases: 4 | 3 passed | 1 failed
- assertions: 79 | 78 passed | 1 failed
- FAILING: "defaulted_keys survives round-trip through database"

Related: mamba-org#4095
Implement the complete fix for issue mamba-org#4095 by ensuring defaulted_keys
survives the round-trip through libsolv and is properly initialized for
all package sources.

Changes in helpers.cpp:

1. set_solvable(PackageInfo) - Store defaulted_keys in solvable:
   Added `solv.set_defaulted_keys(pkg.defaulted_keys)` to preserve the
   field when a PackageInfo is added to the solver database.

2. make_package_info() - Read from solvable with fallback:
   Changed from always setting ["_initialized"] to reading from the
   solvable via `s.defaulted_keys()`. Added fallback to ["_initialized"]
   for backward compatibility with old .solv cache files that don't have
   defaulted_keys stored.

3. set_solvable(JSON template) - Set ["_initialized"] for channel repodata:
   Added `solv.set_defaulted_keys({ "_initialized" })` to mark packages
   loaded from channel repodata JSON as having authoritative metadata.
   This is required for write_repodata_record() to work correctly.

Changes in specs.cpp (Python bindings):

4. PackageInfo constructor - Default to ["_initialized"]:
   Changed the default value from empty vector to ["_initialized"] for
   backward compatibility. Before this fix, make_package_info() always
   returned PackageInfo with ["_initialized"], so Python users creating
   PackageInfo manually should get the same default behavior.

The fix ensures:
- URL-derived packages: Full defaulted_keys list preserved through solver
- Channel repodata packages: Have ["_initialized"] (trust all fields)
- Old .solv cache files: Fallback to ["_initialized"]
- Python-created PackageInfo: Default to ["_initialized"]

Test Status: ALL TESTS PASS (GREEN phase)
- test cases: 552 | 551 passed | 1 skipped
- assertions: 27806 | 27806 passed

Tests fixed:
- Database preserves defaulted_keys (all 3 sections)
- Channel-derived packages from repodata have _initialized

Closes: mamba-org#4095
@maresb
Copy link
Contributor Author

maresb commented Dec 1, 2025

238e7ea is green up to a flaky connection error

Still todo: more thoroughly vet the last 6 commits, and revise the commit description accordingly.

Brief summary: I'm enabling libsolv to propagate defaulted_keys because there are situations where the solver receives explicit-style URLs, e.g. specifying a URL on the command line, and so without this (or some other mechanism) we still risk writing bad stub values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release::bug_fixes For PRs fixing bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect repodata_record.json precedence for explicit installs

1 participant