RHAIENG-1512, RHAIENG-406: docs(tests): document developer-centric testing procedures for opendatahub-io/notebooks pytorch workbenches #2771
Conversation
…sting procedures for opendatahub-io/notebooks pytorch workbenches https://issues.redhat.com/browse/RHAIENG-1512 https://issues.redhat.com/browse/RHAIENG-406
Skipping CI for Draft Pull Request.
Walkthrough
Two new Jupyter notebooks are added to the tests/manual/ directory for GPU environment testing and PyTorch workflows. The GPU test notebook detects installed ML frameworks and GPU hardware configuration. The PyTorch test notebook demonstrates FashionMNIST training, basic tensor operations, and GPU-focused execution workflows.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
✨ Finishing touches: 🧪 Generate unit tests (beta)
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: ide-developer. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/manual/gpu-test-notebook.ipynb (1 hunks)
tests/manual/pytorch-test-notebook.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
tests/manual/gpu-test-notebook.ipynb
35-35: Do not use bare except
(E722)
57-57: Do not use bare except
(E722)
tests/manual/pytorch-test-notebook.ipynb
97-97: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)?
(RUF003)
119-119: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)?
(RUF003)
348-348: Avoid specifying long messages outside the exception class
(TRY003)
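For context on the TRY003 finding: the usual fix is to move the long message into a custom exception class so call sites pass data rather than prose. A minimal sketch of that pattern (class name and message are illustrative, not from the notebook):

class UnsupportedDeviceError(RuntimeError):
    """Raised when the detected device is not covered by this test."""
    def __init__(self, device: str) -> None:
        super().__init__(f"unsupported device: {device!r}")

try:
    raise UnsupportedDeviceError("xpu")  # caller passes data, not a long message
except UnsupportedDeviceError as exc:
    print(exc)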
🔇 Additional comments (1)
tests/manual/pytorch-test-notebook.ipynb (1)
216-217: The torch.accelerator API is appropriate for this project. It was introduced in PyTorch 2.6, and your project uses PyTorch 2.7.1, which fully supports this modern unified device detection API. No changes are needed.
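For reference, a minimal sketch of that unified detection API (assumes PyTorch >= 2.6; not taken from the notebook itself):

import torch

# torch.accelerator abstracts over CUDA, ROCm, XPU, and MPS backends
device = (torch.accelerator.current_accelerator()
          if torch.accelerator.is_available() else torch.device("cpu"))
print(f"running on {device}")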
| "def import_pytorch():\n", | ||
| " global framework, gpu_type\n", | ||
| " import torch\n", | ||
| " framework = \"PYTORCH\"\n", | ||
| "\n", | ||
| " if torch.cuda.is_available():\n", | ||
| " # Check if it's actually ROCm by looking at device name\n", | ||
| " try:\n", | ||
| " device_name = torch.cuda.get_device_name(0)\n", | ||
| "\n", | ||
| " # Check for AMD/ROCm indicators in device name\n", | ||
| " if any(keyword in device_name.upper() for keyword in ['AMD', 'RADEON', 'INSTINCT', 'MI300', 'MI250', 'MI100']):\n", | ||
| " gpu_type = \"ROCM\"\n", | ||
| " else:\n", | ||
| " gpu_type = \"CUDA\"\n", | ||
| " except:\n", | ||
| " # Fallback: check if ROCm version exists\n", | ||
| " if hasattr(torch.version, 'hip') and torch.version.hip is not None:\n", | ||
| " gpu_type = \"ROCM\"\n", | ||
| " else:\n", | ||
| " gpu_type = \"CUDA\"\n", | ||
| " elif hasattr(torch.version, 'hip') and torch.version.hip is not None:\n", | ||
| " gpu_type = \"ROCM\"\n", | ||
| "\n", |
Replace bare except with specific exception type.
The bare except clause at line 77 will catch system exits and keyboard interrupts, which is usually unintended. Catch Exception instead to allow system signals to propagate.
Apply this diff to improve error handling:
         try:
             device_name = torch.cuda.get_device_name(0)

             # Check for AMD/ROCm indicators in device name
             if any(keyword in device_name.upper() for keyword in ['AMD', 'RADEON', 'INSTINCT', 'MI300', 'MI250', 'MI100']):
                 gpu_type = "ROCM"
             else:
                 gpu_type = "CUDA"
-        except:
+        except Exception:
             # Fallback: check if ROCm version exists
             if hasattr(torch.version, 'hip') and torch.version.hip is not None:
                 gpu_type = "ROCM"
             else:
                 gpu_type = "CUDA"
🤖 Prompt for AI Agents
In tests/manual/gpu-test-notebook.ipynb around lines 62 to 85, replace the bare
"except:" at line ~77 with a specific exception handler to avoid catching
system-exiting exceptions; change it to "except Exception:" or "except Exception
as e:" (and use e if you want to inspect/log it) so only regular exceptions are
caught and KeyboardInterrupt/SystemExit can propagate.
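To illustrate why this matters, a small standalone sketch: KeyboardInterrupt and SystemExit derive from BaseException, so a bare except swallows them, while except Exception lets them propagate:

def probe() -> str:
    try:
        raise SystemExit  # stands in for Ctrl-C / interpreter shutdown
    except Exception:     # does NOT match: SystemExit subclasses BaseException
        return "swallowed"  # a bare 'except:' would reach this instead

try:
    probe()
except SystemExit:
    print("SystemExit propagated past 'except Exception', as intended")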
| "def import_tensorflow():\n", | ||
| " global framework, gpu_type\n", | ||
| " import tensorflow as tf\n", | ||
| " framework = \"TENSORFLOW\"\n", | ||
| "\n", | ||
| " if len(tf.config.list_physical_devices('GPU')) > 0:\n", | ||
| " try:\n", | ||
| " gpu_details = tf.config.experimental.get_device_details(tf.config.list_physical_devices('GPU')[0])\n", | ||
| " if 'AMD' in str(gpu_details) or 'ROCm' in str(gpu_details):\n", | ||
| " gpu_type = \"ROCM\"\n", | ||
| " else:\n", | ||
| " gpu_type = \"CUDA\"\n", | ||
| " except:\n", | ||
| " gpu_type = \"CUDA\"\n", |
Replace bare except with specific exception type.
The bare except clause at line 98 will catch system exits and keyboard interrupts. Use Exception instead.
Apply this diff:
         try:
             gpu_details = tf.config.experimental.get_device_details(tf.config.list_physical_devices('GPU')[0])
             if 'AMD' in str(gpu_details) or 'ROCm' in str(gpu_details):
                 gpu_type = "ROCM"
             else:
                 gpu_type = "CUDA"
-        except:
+        except Exception:
             gpu_type = "CUDA"
🤖 Prompt for AI Agents
In tests/manual/gpu-test-notebook.ipynb around lines 86 to 99, there is a bare
except that can catch system-exiting exceptions; replace the bare "except:" with
"except Exception:" (or "except Exception as e:" if you want to inspect/log the
error) so only standard exceptions are caught and interrupts/exit signals are
not swallowed.
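If the caught error is worth recording rather than silently defaulting to CUDA, a hedged variant of the suggested fix (the logger name is illustrative; tensorflow is assumed importable as in the notebook):

import logging
import tensorflow as tf

logger = logging.getLogger("gpu-test-notebook")  # illustrative name

def detect_gpu_type() -> str:
    """Return "ROCM" or "CUDA", logging (not hiding) detection failures."""
    try:
        details = tf.config.experimental.get_device_details(
            tf.config.list_physical_devices('GPU')[0])
        return "ROCM" if 'AMD' in str(details) or 'ROCm' in str(details) else "CUDA"
    except Exception as exc:  # broad on purpose, but logged
        logger.warning("could not read GPU details, assuming CUDA: %s", exc)
        return "CUDA"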
| "if tensorflow:\n", | ||
| " from tensorflow.python.client import device_lib\n", | ||
| " local_device_protos = device_lib.list_local_devices()\n", | ||
| " gpu_devices = [x.name for x in local_device_protos if x.device_type == 'GPU']\n", | ||
| "else:\n", | ||
| " if torch.cuda.is_available():\n", | ||
| " gpu_count = torch.cuda.device_count()\n", | ||
| " gpu_devices = [f\"GPU:{i} ({torch.cuda.get_device_name(i)})\" for i in range(gpu_count)]\n", | ||
| "\n", | ||
| "for gpu in gpu_devices:\n", | ||
| " print(f\"- {gpu}\")" |
Guard against potential NameError when torch is not imported.
If both framework imports fail during setup, the else branch at line 222 will execute but torch may not be defined, causing a NameError at line 223.
Apply this diff to add a safety check:
 if tensorflow:
     from tensorflow.python.client import device_lib
     local_device_protos = device_lib.list_local_devices()
     gpu_devices = [x.name for x in local_device_protos if x.device_type == 'GPU']
-else:
+elif pytorch:
     if torch.cuda.is_available():
         gpu_count = torch.cuda.device_count()
         gpu_devices = [f"GPU:{i} ({torch.cuda.get_device_name(i)})" for i in range(gpu_count)]
+else:
+    gpu_devices = []
🤖 Prompt for AI Agents
In tests/manual/gpu-test-notebook.ipynb around lines 218 to 228, the else branch
assumes torch is defined which can raise a NameError if torch was not imported;
initialize gpu_devices = [] before the branching and replace the torch usage
with a safe check such as verifying "torch" exists in globals() (or try
importing torch inside the branch) and then checking torch.cuda.is_available();
if torch is missing or CUDA not available leave gpu_devices empty to avoid
exceptions and still allow the for-loop to run safely.
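A self-contained sketch of the defensive pattern the prompt describes, using importlib to avoid the NameError entirely (variable names follow the notebook; the probe itself is an assumption, not the notebook's actual cell):

import importlib.util

gpu_devices = []  # safe default so the print loop below always works
if importlib.util.find_spec("tensorflow") is not None:
    from tensorflow.python.client import device_lib
    gpu_devices = [x.name for x in device_lib.list_local_devices()
                   if x.device_type == 'GPU']
elif importlib.util.find_spec("torch") is not None:
    import torch
    if torch.cuda.is_available():
        gpu_devices = [f"GPU:{i} ({torch.cuda.get_device_name(i)})"
                       for i in range(torch.cuda.device_count())]

for gpu in gpu_devices:
    print(f"- {gpu}")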
@jiridanek: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
https://issues.redhat.com/browse/RHAIENG-1512
https://issues.redhat.com/browse/RHAIENG-406
Description
GPU Test Notebook
This notebook aims to provide a very basic testing perspective on Jupyter notebooks with GPU support, in such a way that:
PyTorch Test Notebook
This notebook aims to provide a very basic testing perspective on Jupyter notebooks with PyTorch installed, in such a way that it properly executes the following PyTorch scripts:
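As a rough illustration of this kind of check (a minimal sketch using a random FashionMNIST-shaped batch; illustrative, not the notebook's actual cells):

import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# basic tensor operation check
x = torch.rand(3, 3, device=device)
assert torch.allclose(x @ torch.eye(3, device=device), x)

# one tiny training step on random data standing in for FashionMNIST batches
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

images = torch.rand(8, 1, 28, 28, device=device)   # FashionMNIST-shaped batch
labels = torch.randint(0, 10, (8,), device=device)
loss = loss_fn(model(images), labels)
opt.zero_grad()
loss.backward()
opt.step()
print(f"training step ok, loss={loss.item():.4f}")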
How Has This Been Tested?
https://issues.redhat.com/browse/RHAIENG-1512?focusedId=28655065&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-28655065
Self checklist (all need to be checked):
Run make test (gmake on macOS) before asking for review.
Changes to Dockerfile.konflux files should be done in odh/notebooks and automatically synced to rhds/notebooks. For Konflux-specific changes, modify Dockerfile.konflux files directly in rhds/notebooks, as these require special attention in the downstream repository and flow to the upcoming RHOAI release.
Merge criteria: