fix: add PyTorch version verification with hardened error handling #116
Conversation
When PyTorch wheels are missing from a CUDA-specific index, pip silently installs an older version instead of failing. This caused the PyTorch 2.8.0 cu128 template to ship with PyTorch 2.4.1, making B200 GPUs (sm_100) completely unusable. This adds a post-install verification step that fails the build if the installed torch version does not match the requested version. Fixes #114 Related: #98, #101
Add word-boundary anchor, non-empty guards, and actionable error messages for every failure path so broken builds are immediately diagnosable instead of producing opaque exit codes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hey @max4c - thanks so much for picking this up and expanding it. The hardened error handling is much better than what I had in #115, and the test plan covering both the failure and success cases is really thorough. Also appreciate the explicit credit in the description, that was very kind of you. One small note that might be useful context while this is under review: the same docker-bake.hcl wheel-source issue also affected the runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404 template, which delivered torch 2.4.1 on B200 pods instead of 2.8.0 (filed as support ticket #35526 back on April 1). Same root cause, so I think once this verification check lands and the images rebuild, it would fix both templates together. No rush on this - just flagging in case it's helpful for whoever is planning the rebuild scope. Happy to retest on a B200 once it ships, just let me know and I'll confirm on my end. Thank you and the team for the work on this. The verification check is exactly the kind of guardrail that would have caught all three open issues (#98, #101, #114) at build time. Really nice improvement for the platform.
Summary
Based on the excellent work by @dentity007 in #115. Their original commit is preserved in this branch.
Problem
pip silently falls back to older PyTorch versions when wheels are missing from a CUDA-specific index. This caused the runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404 template to ship PyTorch 2.4.1 instead of 2.8.0, making B200 GPUs (sm_100) completely unusable. See #114, #98, #101 for user reports.
Changes
- Adds a post-install verification step that reads the TORCH build arg, compares it against torch.__version__, and fails the build on mismatch
- Uses a word-boundary anchor (\b) and head -1 to future-proof version extraction

Test plan
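The verification described above might be sketched roughly as follows (the function name, error messages, and exact-match policy are assumptions for illustration, not the PR's actual code; in the real image this would run inside the Docker build right after the pip install step):

```shell
# Hypothetical sketch of the post-install version check.
verify_torch_version() {
  requested="$1"  # the TORCH build arg, e.g. "2.8.0"
  reported="$2"   # e.g. output of: python -c 'import torch; print(torch.__version__)'
  # Word-boundary anchor so "2.8.0+cu128" yields "2.8.0"; head -1 guards
  # against any extra lines in future torch version output.
  installed="$(printf '%s\n' "$reported" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\b' | head -1)"
  if [ -z "$installed" ]; then
    echo "ERROR: could not parse a torch version from '$reported'" >&2
    return 1
  fi
  if [ "$installed" != "$requested" ]; then
    echo "ERROR: requested torch $requested but found $installed" >&2
    return 1
  fi
  echo "OK: torch $installed matches requested $requested"
}

verify_torch_version "2.8.0" "2.8.0+cu128"                      # matches
verify_torch_version "2.8.0" "2.4.1+cu121" || echo "mismatch: build would fail"
```

The non-empty guard on the extracted version is what turns an opaque parse failure into an actionable error instead of a silent pass.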
Fixes #114
Supersedes #115
🤖 Generated with Claude Code