
fix: add PyTorch version verification with hardened error handling #116

Open
max4c wants to merge 2 commits into main from fix/pytorch-version-verification

Conversation

Contributor

@max4c max4c commented Apr 8, 2026

Summary

  • Adds build-time verification that the installed PyTorch version matches the requested version, preventing pip from silently falling back to older versions
  • Hardens all failure paths with explicit, actionable error messages

Based on the excellent work by @dentity007 in #115. Their original commit is preserved in this branch.

Problem

pip silently falls back to older PyTorch versions when wheels are missing from a CUDA-specific index. This caused the runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404 template to ship PyTorch 2.4.1 instead of 2.8.0, making B200 GPUs (sm_100) completely unusable.

See #114, #98, #101 for user reports.

Changes

  1. Version verification (from #115): a post-install RUN step extracts the expected torch version from the TORCH build arg, compares it against torch.__version__, and fails the build on a mismatch
  2. Hardened error handling: Every failure path now produces a clear, actionable error message instead of opaque exit codes:
    • grep extraction failure → reports the TORCH arg value
    • Empty extracted version → reports the TORCH arg value
    • torch import failure → tells user to check pip output
    • Empty installed version → explicit error
    • Version mismatch → reports expected, installed, wheel source, and check URL
  3. Defensive regex: Word-boundary anchor (\b) and head -1 to future-proof extraction
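The verification flow above can be sketched as a small shell fragment. This is a minimal sketch, not the PR's literal code: the function names are invented here, the exact error wording differs, and the `installed` argument stands in for what the real RUN step would read via `python -c 'import torch; print(torch.__version__)'`.

```shell
#!/bin/sh
# Extract the numeric version from a requirement string such as
# "torch==2.8.0+cu128". The \b anchor and head -1 keep the match strict
# even if the arg later grows extra version-like tokens.
extract_version() {
  printf '%s\n' "$1" | grep -oE '\b[0-9]+\.[0-9]+\.[0-9]+' | head -1
}

# verify TORCH_ARG INSTALLED_VERSION
# Every failure path prints an actionable message and returns non-zero,
# mirroring the hardened error handling described in the PR.
verify() {
  expected=$(extract_version "$1")
  if [ -z "$expected" ]; then
    echo "ERROR: could not extract a version from TORCH='$1'" >&2
    return 1
  fi
  installed=$2
  installed=${installed%%+*}   # strip a local tag like +cu128 before comparing
  if [ -z "$installed" ]; then
    echo "ERROR: torch reported an empty version; check the pip install output" >&2
    return 1
  fi
  if [ "$installed" != "$expected" ]; then
    echo "ERROR: expected torch $expected but found $installed (pip may have fallen back to an older wheel; check the wheel index)" >&2
    return 1
  fi
  echo "torch version OK: $installed"
}
```

In the real build the mismatch branch would also report the wheel source and a URL to check, per the Changes list above.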

Test plan

  • Build a pytorch image with valid TORCH arg — verify it succeeds
  • Build with intentionally wrong wheel source — verify mismatch error with all fields populated
  • Build with empty TORCH arg — verify clear "could not extract" error
  • Rebuild existing cu128/torch280 matrix entry — verify it now succeeds (wheels exist)
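For the first two items, invocations along these lines would exercise both paths. All names here are hypothetical (build-arg names, image tags, and the specific index URLs are assumptions; the repo's actual docker-bake.hcl targets may differ):

```shell
# Expected to succeed: the requested index carries matching wheels
docker build --build-arg TORCH="torch==2.8.0" \
  --build-arg INDEX_URL="https://download.pytorch.org/whl/cu128" \
  -t pytorch-verify-ok .

# Expected to fail with the mismatch error: point pip at an index that
# lacks the requested wheels (e.g. an older CUDA index) to force fallback
docker build --build-arg TORCH="torch==2.8.0" \
  --build-arg INDEX_URL="https://download.pytorch.org/whl/cu118" \
  -t pytorch-verify-fail .
```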

Fixes #114
Supersedes #115

🤖 Generated with Claude Code

dentity007 and others added 2 commits April 2, 2026 10:49
When PyTorch wheels are missing from a CUDA-specific index, pip silently
installs an older version instead of failing. This caused the PyTorch 2.8.0
cu128 template to ship with PyTorch 2.4.1, making B200 GPUs (sm_100)
completely unusable.

This adds a post-install verification step that fails the build if the
installed torch version does not match the requested version.

Fixes #114
Related: #98, #101
Add word-boundary anchor, non-empty guards, and actionable error
messages for every failure path so broken builds are immediately
diagnosable instead of producing opaque exit codes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dentity007

Hey @max4c - thanks so much for picking this up and expanding it. The hardened error handling is much better than what I had in #115, and the test plan covering both the failure and success cases is really thorough. Also appreciate the explicit credit in the description, that was very kind of you.

One small note that might be useful context while this is under review: the same docker-bake.hcl wheel-source issue also affected the runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404 template, which delivered torch 2.4.1 on B200 pods instead of 2.8.0 (filed as support ticket #35526 back on April 1). Same root cause, so I think once this verification check lands and the images rebuild, it would fix both templates together.

No rush on this - just flagging in case it's helpful for whoever is planning the rebuild scope. Happy to retest on a B200 once it ships, just let me know and I'll confirm on my end.

Thank you and the team for the work on this. The verification check is exactly the kind of guardrail that would have caught all three open issues (#98, #101, #114) at build time. Really nice improvement for the platform.




Development

Successfully merging this pull request may close these issues.

PyTorch 2.8.0 cu128 template installs 2.4.1 — missing cu128 wheels
