
Conversation


@jiridanek jiridanek commented Dec 12, 2025

https://issues.redhat.com/browse/RHAIENG-1512
https://issues.redhat.com/browse/RHAIENG-406

Description

GPU Test Notebook

This notebook provides a basic smoke test for Jupyter notebooks with GPU support. It verifies that:

  • the notebook environment is set up correctly
  • the installed Python version is reported
  • the installed TensorFlow or PyTorch package version is found
  • the GPU appears on the device list
  • the GPU driver tool (nvidia-smi or rocm-smi) loads properly
  • the CUDA/ROCm drivers are installed
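The driver checks in the list above can be sketched in plain Python. This is an illustrative sketch, not the notebook's actual code — the function names are ours, and it assumes only that the driver CLIs, when present, are on PATH:

```python
import shutil
import subprocess

def detect_gpu_driver_tool():
    """Return the first GPU driver CLI found on PATH, or None."""
    for tool in ("nvidia-smi", "rocm-smi"):
        if shutil.which(tool) is not None:
            return tool
    return None

def driver_loads_properly(tool):
    """Run the driver CLI and report whether it exits cleanly."""
    try:
        result = subprocess.run([tool], capture_output=True, timeout=30)
        return result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False

tool = detect_gpu_driver_tool()
print(f"GPU driver tool: {tool or 'none found (CPU-only environment)'}")
```

On a CPU-only machine this prints the fallback message rather than failing, which matches the notebook's goal of reporting the environment rather than asserting a GPU exists.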

PyTorch Test Notebook

This notebook provides a basic smoke test for Jupyter notebooks with PyTorch installed, by executing the following PyTorch scripts:

  • PyTorch quickstart for beginners
  • PyTorch tests (basic operations on GPU)
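The "basic operations on GPU" item can be illustrated with a short device-aware snippet. This is a hedged sketch (not the notebook's code) that degrades gracefully when PyTorch or a GPU is absent:

```python
try:
    import torch
    # Pick the GPU when available, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.rand(3, 3, device=device)
    b = torch.rand(3, 3, device=device)
    shape = tuple((a @ b).shape)  # matrix multiply on the chosen device
except ImportError:
    device, shape = "unavailable", None

print(f"device={device}, result shape={shape}")
```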

How Has This Been Tested?

https://issues.redhat.com/browse/RHAIENG-1512?focusedId=28655065&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-28655065

Self checklist (all need to be checked):

  • Ensure that you have run make test (gmake on macOS) before asking for review
  • Changes to everything except Dockerfile.konflux files should be done in odh/notebooks and automatically synced to rhds/notebooks. For Konflux-specific changes, modify Dockerfile.konflux files directly in rhds/notebooks as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • Tests
    • Added GPU environment testing notebook for framework and GPU detection (CUDA/ROCm support).
    • Added PyTorch testing notebook with model training, evaluation, and GPU operation workflows for comprehensive testing coverage.


@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai bot commented Dec 12, 2025

Walkthrough

Two new Jupyter notebooks are added to the tests/manual/ directory for GPU environment testing and PyTorch workflows. The GPU test notebook detects installed ML frameworks and GPU hardware configuration. The PyTorch test notebook demonstrates FashionMNIST training, basic tensor operations, and GPU-focused execution workflows.

Changes

Cohort / File(s): GPU testing and PyTorch demonstrations — tests/manual/gpu-test-notebook.ipynb, tests/manual/pytorch-test-notebook.ipynb

Summary: Two new Jupyter notebooks are added. The GPU test notebook includes environment setup, framework detection (PyTorch/TensorFlow), GPU type identification (CUDA/ROCm/CPU-only), and driver checks (nvidia-smi/rocm-smi). The PyTorch notebook includes a FashionMNIST training pipeline, model save/load, basic GPU tensor operations, and comprehensive GPU operation demonstrations with device-aware execution and detailed logging.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review GPU device detection logic and fallback paths for missing frameworks or drivers
  • Verify model training correctness in the PyTorch notebook (loss computation, backpropagation, evaluation)
  • Confirm GPU tensor operations handle device availability gracefully and provide meaningful output for verification
  • Check environment variable setup and framework/GPU type detection logic for edge cases

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check: ✅ Passed — The title clearly describes the main change: adding documentation and testing notebooks for PyTorch workbenches in the form of two Jupyter notebooks (GPU test and PyTorch test notebooks).
  • Description check: ✅ Passed — The description covers the purpose of both notebooks, references relevant issues, includes testing information via a Jira link, and completes the self-checklist and merge criteria sections from the template.
  • Docstring Coverage: ✅ Passed — No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

@github-actions github-actions bot added the review-requested GitHub Bot creates notification on #pr-review-ai-ide-team slack channel label Dec 12, 2025
@jiridanek jiridanek marked this pull request as ready for review December 12, 2025 15:54
@openshift-ci openshift-ci bot requested review from atheo89 and dibryant December 12, 2025 15:54
@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ide-developer
Once this PR has been reviewed and has the lgtm label, please assign jiridanek for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added size/xxl and removed size/xxl labels Dec 12, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6859b5a and 5c28f94.

📒 Files selected for processing (2)
  • tests/manual/gpu-test-notebook.ipynb (1 hunks)
  • tests/manual/pytorch-test-notebook.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
tests/manual/gpu-test-notebook.ipynb

35-35: Do not use bare except

(E722)


57-57: Do not use bare except

(E722)

tests/manual/pytorch-test-notebook.ipynb

97-97: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)?

(RUF003)


119-119: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)?

(RUF003)


348-348: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
tests/manual/pytorch-test-notebook.ipynb (1)

216-217: The torch.accelerator API is appropriate for this project. It was introduced in PyTorch 2.6, and your project uses PyTorch 2.7.1, which fully supports this modern unified device detection API. No changes are needed.
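A minimal sketch of that unified device-detection call, with a guard so it also runs on builds that predate `torch.accelerator` or lack PyTorch entirely (the fallback logic here is ours, added for illustration):

```python
try:
    import torch
    accelerator = getattr(torch, "accelerator", None)  # present in PyTorch >= 2.6
    if accelerator is not None and accelerator.is_available():
        device = accelerator.current_accelerator()
    else:
        device = torch.device("cpu")
    label = str(device)
except ImportError:
    label = "torch not installed"

print(f"selected device: {label}")
```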

Comment on lines +62 to +85
"def import_pytorch():\n",
" global framework, gpu_type\n",
" import torch\n",
" framework = \"PYTORCH\"\n",
"\n",
" if torch.cuda.is_available():\n",
" # Check if it's actually ROCm by looking at device name\n",
" try:\n",
" device_name = torch.cuda.get_device_name(0)\n",
"\n",
" # Check for AMD/ROCm indicators in device name\n",
" if any(keyword in device_name.upper() for keyword in ['AMD', 'RADEON', 'INSTINCT', 'MI300', 'MI250', 'MI100']):\n",
" gpu_type = \"ROCM\"\n",
" else:\n",
" gpu_type = \"CUDA\"\n",
" except:\n",
" # Fallback: check if ROCm version exists\n",
" if hasattr(torch.version, 'hip') and torch.version.hip is not None:\n",
" gpu_type = \"ROCM\"\n",
" else:\n",
" gpu_type = \"CUDA\"\n",
" elif hasattr(torch.version, 'hip') and torch.version.hip is not None:\n",
" gpu_type = \"ROCM\"\n",
"\n",

⚠️ Potential issue | 🟡 Minor

Replace bare except with specific exception type.

The bare except clause at line 77 will catch system exits and keyboard interrupts, which is usually unintended. Catch Exception instead to allow system signals to propagate.

Apply this diff to improve error handling:

         try:
             device_name = torch.cuda.get_device_name(0)

             # Check for AMD/ROCm indicators in device name
             if any(keyword in device_name.upper() for keyword in ['AMD', 'RADEON', 'INSTINCT', 'MI300', 'MI250', 'MI100']):
                 gpu_type = "ROCM"
             else:
                 gpu_type = "CUDA"
-        except:
+        except Exception:
             # Fallback: check if ROCm version exists
             if hasattr(torch.version, 'hip') and torch.version.hip is not None:
                 gpu_type = "ROCM"
             else:
                 gpu_type = "CUDA"
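Beyond the exception fix, the keyword check in the snippet above can be pulled out into a pure function, which makes it unit-testable without a GPU. A sketch — the keyword list is copied from the notebook code, the function name is ours:

```python
def classify_gpu_type(device_name):
    """Classify a CUDA-visible device as ROCm or CUDA by its name."""
    rocm_keywords = ('AMD', 'RADEON', 'INSTINCT', 'MI300', 'MI250', 'MI100')
    if any(keyword in device_name.upper() for keyword in rocm_keywords):
        return "ROCM"
    return "CUDA"

print(classify_gpu_type("AMD Instinct MI300X"))    # ROCM
print(classify_gpu_type("NVIDIA A100-SXM4-80GB"))  # CUDA
```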

Comment on lines +86 to +99
"def import_tensorflow():\n",
" global framework, gpu_type\n",
" import tensorflow as tf\n",
" framework = \"TENSORFLOW\"\n",
"\n",
" if len(tf.config.list_physical_devices('GPU')) > 0:\n",
" try:\n",
" gpu_details = tf.config.experimental.get_device_details(tf.config.list_physical_devices('GPU')[0])\n",
" if 'AMD' in str(gpu_details) or 'ROCm' in str(gpu_details):\n",
" gpu_type = \"ROCM\"\n",
" else:\n",
" gpu_type = \"CUDA\"\n",
" except:\n",
" gpu_type = \"CUDA\"\n",

⚠️ Potential issue | 🟡 Minor

Replace bare except with specific exception type.

The bare except clause at line 98 will catch system exits and keyboard interrupts. Use Exception instead.

Apply this diff:

         try:
             gpu_details = tf.config.experimental.get_device_details(tf.config.list_physical_devices('GPU')[0])
             if 'AMD' in str(gpu_details) or 'ROCm' in str(gpu_details):
                 gpu_type = "ROCM"
             else:
                 gpu_type = "CUDA"
-        except:
+        except Exception:
             gpu_type = "CUDA"
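The difference between the two forms can be demonstrated in isolation. A small sketch (names are illustrative, not from the notebook) where an ordinary error triggers the fallback but SystemExit still propagates:

```python
def detect_with_fallback(probe):
    """Call a probe; on ordinary errors fall back to 'CUDA'.

    `except Exception` leaves KeyboardInterrupt and SystemExit alone,
    which is the point of the suggested change.
    """
    try:
        return probe()
    except Exception:
        return "CUDA"  # fallback, as in the notebook snippet

def broken_probe():
    raise RuntimeError("driver query failed")

def exiting_probe():
    raise SystemExit(1)

print(detect_with_fallback(broken_probe))  # falls back to CUDA
try:
    detect_with_fallback(exiting_probe)
except SystemExit:
    print("SystemExit propagated, as intended")
```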

Comment on lines +218 to +228
"if tensorflow:\n",
" from tensorflow.python.client import device_lib\n",
" local_device_protos = device_lib.list_local_devices()\n",
" gpu_devices = [x.name for x in local_device_protos if x.device_type == 'GPU']\n",
"else:\n",
" if torch.cuda.is_available():\n",
" gpu_count = torch.cuda.device_count()\n",
" gpu_devices = [f\"GPU:{i} ({torch.cuda.get_device_name(i)})\" for i in range(gpu_count)]\n",
"\n",
"for gpu in gpu_devices:\n",
" print(f\"- {gpu}\")"

⚠️ Potential issue | 🟠 Major

Guard against potential NameError when torch is not imported.

If both framework imports fail during setup, the else branch at line 222 will execute but torch may not be defined, causing a NameError at line 223.

Apply this diff to add a safety check:

 if tensorflow:
     from tensorflow.python.client import device_lib
     local_device_protos = device_lib.list_local_devices()
     gpu_devices = [x.name for x in local_device_protos if x.device_type == 'GPU']
-else:
+elif pytorch:
     if torch.cuda.is_available():
         gpu_count = torch.cuda.device_count()
         gpu_devices = [f"GPU:{i} ({torch.cuda.get_device_name(i)})" for i in range(gpu_count)]
+else:
+    gpu_devices = []
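The guarded shape of that logic can also be sketched as a small helper that takes the module as a parameter, so "torch missing" is simply `None` and can never raise a NameError (the name and signature are ours, for illustration):

```python
def list_gpu_devices(torch_mod=None):
    """List CUDA device labels; empty list when torch is absent or has no GPU."""
    if torch_mod is None or not torch_mod.cuda.is_available():
        return []
    return [f"GPU:{i} ({torch_mod.cuda.get_device_name(i)})"
            for i in range(torch_mod.cuda.device_count())]

print(list_gpu_devices(None))  # [] — no NameError when torch is missing
```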

@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

@jiridanek: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/images
Commit: 5c28f94
Required: true
Rerun command: /test images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jiridanek jiridanek merged commit d647c4f into main Dec 16, 2025
20 of 23 checks passed
@jiridanek jiridanek deleted the jiridanek-patch-2 branch December 16, 2025 10:42

Labels

lgtm · review-requested (GitHub Bot creates notification on #pr-review-ai-ide-team slack channel) · size/xxl
