
Conversation


@jiridanek jiridanek commented Dec 12, 2025

https://issues.redhat.com/browse/RHAIENG-1512
https://issues.redhat.com/browse/RHAIENG-406

Description

GPU Test Notebook

This notebook provides a basic smoke test for Jupyter notebooks with GPU support. It verifies that:

  • the notebook environment is set up correctly
  • the installed Python version is reported
  • the installed TensorFlow or PyTorch package version is found
  • the GPU appears on the device list
  • the GPU driver tool (nvidia-smi or rocm-smi) loads properly
  • the CUDA/ROCm drivers are installed
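The driver checks in the list above can be sketched in plain Python. This is an illustrative sketch, not the notebook's actual code — the function names are ours, and it assumes only that the driver CLIs, when present, are on PATH:

```python
import shutil
import subprocess

def detect_gpu_driver_tool():
    """Return the first GPU driver CLI found on PATH, or None."""
    for tool in ("nvidia-smi", "rocm-smi"):
        if shutil.which(tool) is not None:
            return tool
    return None

def driver_loads_properly(tool):
    """Run the driver CLI and report whether it exits cleanly."""
    try:
        result = subprocess.run([tool], capture_output=True, timeout=30)
        return result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False

tool = detect_gpu_driver_tool()
print(f"GPU driver tool: {tool or 'none found (CPU-only environment)'}")
```

On a CPU-only machine this prints the fallback message rather than failing, which matches the notebook's goal of reporting the environment rather than asserting a GPU exists.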

PyTorch Test Notebook

This notebook provides a basic smoke test for Jupyter notebooks with PyTorch installed, by executing the following PyTorch scripts:

  • PyTorch quickstart for beginners
  • PyTorch tests (basic operations on GPU)
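The "basic operations on GPU" item can be illustrated with a short device-aware snippet. This is a hedged sketch (not the notebook's code) that degrades gracefully when PyTorch or a GPU is absent:

```python
try:
    import torch
    # Pick the GPU when available, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.rand(3, 3, device=device)
    b = torch.rand(3, 3, device=device)
    shape = tuple((a @ b).shape)  # matrix multiply on the chosen device
except ImportError:
    device, shape = "unavailable", None

print(f"device={device}, result shape={shape}")
```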

How Has This Been Tested?

https://issues.redhat.com/browse/RHAIENG-1512?focusedId=28655065&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-28655065

Self checklist (all need to be checked):

  • Ensure that you have run make test (gmake on macOS) before asking for review
  • Changes to everything except Dockerfile.konflux files should be done in odh/notebooks and automatically synced to rhds/notebooks. For Konflux-specific changes, modify Dockerfile.konflux files directly in rhds/notebooks as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • Tests
    • Added GPU environment testing notebook for framework and GPU detection (CUDA/ROCm support).
    • Added PyTorch testing notebook with model training, evaluation, and GPU operation workflows for comprehensive testing coverage.


@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai bot commented Dec 12, 2025

Walkthrough

Two new Jupyter notebooks are added to the tests/manual/ directory for GPU environment testing and PyTorch workflows. The GPU test notebook detects installed ML frameworks and GPU hardware configuration. The PyTorch test notebook demonstrates FashionMNIST training, basic tensor operations, and GPU-focused execution workflows.

Changes

Cohort / File(s): GPU testing and PyTorch demonstrations — tests/manual/gpu-test-notebook.ipynb, tests/manual/pytorch-test-notebook.ipynb

Summary: Two new Jupyter notebooks are added. The GPU test notebook includes environment setup, framework detection (PyTorch/TensorFlow), GPU type identification (CUDA/ROCm/CPU-only), and driver checks (nvidia-smi/rocm-smi). The PyTorch notebook includes a FashionMNIST training pipeline, model save/load, basic GPU tensor operations, and comprehensive GPU operation demonstrations with device-aware execution and detailed logging.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review GPU device detection logic and fallback paths for missing frameworks or drivers
  • Verify model training correctness in the PyTorch notebook (loss computation, backpropagation, evaluation)
  • Confirm GPU tensor operations handle device availability gracefully and provide meaningful output for verification
  • Check environment variable setup and framework/GPU type detection logic for edge cases

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check: ✅ Passed — The title clearly describes the main change: adding documentation and testing notebooks for PyTorch workbenches in the form of two Jupyter notebooks (GPU test and PyTorch test notebooks).
  • Description check: ✅ Passed — The description covers the purpose of both notebooks, references relevant issues, includes testing information via a Jira link, and completes the self-checklist and merge criteria sections from the template.
  • Docstring Coverage: ✅ Passed — No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

@github-actions github-actions bot added the review-requested GitHub Bot creates notification on #pr-review-ai-ide-team slack channel label Dec 12, 2025
@jiridanek jiridanek marked this pull request as ready for review December 12, 2025 15:54
@openshift-ci openshift-ci bot requested review from atheo89 and dibryant December 12, 2025 15:54
@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ide-developer
Once this PR has been reviewed and has the lgtm label, please assign jiridanek for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added size/xxl and removed size/xxl labels Dec 12, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6859b5a and 5c28f94.

📒 Files selected for processing (2)
  • tests/manual/gpu-test-notebook.ipynb (1 hunks)
  • tests/manual/pytorch-test-notebook.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
tests/manual/gpu-test-notebook.ipynb

35-35: Do not use bare except

(E722)


57-57: Do not use bare except

(E722)

tests/manual/pytorch-test-notebook.ipynb

97-97: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)?

(RUF003)


119-119: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)?

(RUF003)


348-348: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
tests/manual/pytorch-test-notebook.ipynb (1)

216-217: The torch.accelerator API is appropriate for this project. It was introduced in PyTorch 2.6, and your project uses PyTorch 2.7.1, which fully supports this modern unified device detection API. No changes are needed.
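A minimal sketch of that unified device-detection call, with a guard so it also runs on builds that predate `torch.accelerator` or lack PyTorch entirely (the fallback logic here is ours, added for illustration):

```python
try:
    import torch
    accelerator = getattr(torch, "accelerator", None)  # present in PyTorch >= 2.6
    if accelerator is not None and accelerator.is_available():
        device = accelerator.current_accelerator()
    else:
        device = torch.device("cpu")
    label = str(device)
except ImportError:
    label = "torch not installed"

print(f"selected device: {label}")
```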

Comment on lines +62 to +85
"def import_pytorch():\n",
" global framework, gpu_type\n",
" import torch\n",
" framework = \"PYTORCH\"\n",
"\n",
" if torch.cuda.is_available():\n",
" # Check if it's actually ROCm by looking at device name\n",
" try:\n",
" device_name = torch.cuda.get_device_name(0)\n",
"\n",
" # Check for AMD/ROCm indicators in device name\n",
" if any(keyword in device_name.upper() for keyword in ['AMD', 'RADEON', 'INSTINCT', 'MI300', 'MI250', 'MI100']):\n",
" gpu_type = \"ROCM\"\n",
" else:\n",
" gpu_type = \"CUDA\"\n",
" except:\n",
" # Fallback: check if ROCm version exists\n",
" if hasattr(torch.version, 'hip') and torch.version.hip is not None:\n",
" gpu_type = \"ROCM\"\n",
" else:\n",
" gpu_type = \"CUDA\"\n",
" elif hasattr(torch.version, 'hip') and torch.version.hip is not None:\n",
" gpu_type = \"ROCM\"\n",
"\n",

⚠️ Potential issue | 🟡 Minor

Replace bare except with specific exception type.

The bare except clause at line 77 will catch system exits and keyboard interrupts, which is usually unintended. Catch Exception instead to allow system signals to propagate.

Apply this diff to improve error handling:

         try:
             device_name = torch.cuda.get_device_name(0)

             # Check for AMD/ROCm indicators in device name
             if any(keyword in device_name.upper() for keyword in ['AMD', 'RADEON', 'INSTINCT', 'MI300', 'MI250', 'MI100']):
                 gpu_type = "ROCM"
             else:
                 gpu_type = "CUDA"
-        except:
+        except Exception:
             # Fallback: check if ROCm version exists
             if hasattr(torch.version, 'hip') and torch.version.hip is not None:
                 gpu_type = "ROCM"
             else:
                 gpu_type = "CUDA"
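Beyond the exception fix, the keyword check in the snippet above can be pulled out into a pure function, which makes it unit-testable without a GPU. A sketch — the keyword list is copied from the notebook code, the function name is ours:

```python
def classify_gpu_type(device_name):
    """Classify a CUDA-visible device as ROCm or CUDA by its name."""
    rocm_keywords = ('AMD', 'RADEON', 'INSTINCT', 'MI300', 'MI250', 'MI100')
    if any(keyword in device_name.upper() for keyword in rocm_keywords):
        return "ROCM"
    return "CUDA"

print(classify_gpu_type("AMD Instinct MI300X"))    # ROCM
print(classify_gpu_type("NVIDIA A100-SXM4-80GB"))  # CUDA
```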

Comment on lines +86 to +99
"def import_tensorflow():\n",
" global framework, gpu_type\n",
" import tensorflow as tf\n",
" framework = \"TENSORFLOW\"\n",
"\n",
" if len(tf.config.list_physical_devices('GPU')) > 0:\n",
" try:\n",
" gpu_details = tf.config.experimental.get_device_details(tf.config.list_physical_devices('GPU')[0])\n",
" if 'AMD' in str(gpu_details) or 'ROCm' in str(gpu_details):\n",
" gpu_type = \"ROCM\"\n",
" else:\n",
" gpu_type = \"CUDA\"\n",
" except:\n",
" gpu_type = \"CUDA\"\n",

⚠️ Potential issue | 🟡 Minor

Replace bare except with specific exception type.

The bare except clause at line 98 will catch system exits and keyboard interrupts. Use Exception instead.

Apply this diff:

         try:
             gpu_details = tf.config.experimental.get_device_details(tf.config.list_physical_devices('GPU')[0])
             if 'AMD' in str(gpu_details) or 'ROCm' in str(gpu_details):
                 gpu_type = "ROCM"
             else:
                 gpu_type = "CUDA"
-        except:
+        except Exception:
             gpu_type = "CUDA"
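The difference between the two forms can be demonstrated in isolation. A small sketch (names are illustrative, not from the notebook) where an ordinary error triggers the fallback but SystemExit still propagates:

```python
def detect_with_fallback(probe):
    """Call a probe; on ordinary errors fall back to 'CUDA'.

    `except Exception` leaves KeyboardInterrupt and SystemExit alone,
    which is the point of the suggested change.
    """
    try:
        return probe()
    except Exception:
        return "CUDA"  # fallback, as in the notebook snippet

def broken_probe():
    raise RuntimeError("driver query failed")

def exiting_probe():
    raise SystemExit(1)

print(detect_with_fallback(broken_probe))  # falls back to CUDA
try:
    detect_with_fallback(exiting_probe)
except SystemExit:
    print("SystemExit propagated, as intended")
```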

Comment on lines +218 to +228
"if tensorflow:\n",
" from tensorflow.python.client import device_lib\n",
" local_device_protos = device_lib.list_local_devices()\n",
" gpu_devices = [x.name for x in local_device_protos if x.device_type == 'GPU']\n",
"else:\n",
" if torch.cuda.is_available():\n",
" gpu_count = torch.cuda.device_count()\n",
" gpu_devices = [f\"GPU:{i} ({torch.cuda.get_device_name(i)})\" for i in range(gpu_count)]\n",
"\n",
"for gpu in gpu_devices:\n",
" print(f\"- {gpu}\")"

⚠️ Potential issue | 🟠 Major

Guard against potential NameError when torch is not imported.

If both framework imports fail during setup, the else branch at line 222 will execute but torch may not be defined, causing a NameError at line 223.

Apply this diff to add a safety check:

 if tensorflow:
     from tensorflow.python.client import device_lib
     local_device_protos = device_lib.list_local_devices()
     gpu_devices = [x.name for x in local_device_protos if x.device_type == 'GPU']
-else:
+elif pytorch:
     if torch.cuda.is_available():
         gpu_count = torch.cuda.device_count()
         gpu_devices = [f"GPU:{i} ({torch.cuda.get_device_name(i)})" for i in range(gpu_count)]
+else:
+    gpu_devices = []
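The guarded shape of that logic can also be sketched as a small helper that takes the module as a parameter, so "torch missing" is simply `None` and can never raise a NameError (the name and signature are ours, for illustration):

```python
def list_gpu_devices(torch_mod=None):
    """List CUDA device labels; empty list when torch is absent or has no GPU."""
    if torch_mod is None or not torch_mod.cuda.is_available():
        return []
    return [f"GPU:{i} ({torch_mod.cuda.get_device_name(i)})"
            for i in range(torch_mod.cuda.device_count())]

print(list_gpu_devices(None))  # [] — no NameError when torch is missing
```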

@openshift-ci
Contributor

openshift-ci bot commented Dec 12, 2025

@jiridanek: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/images
Commit: 5c28f94
Required: true
Rerun command: /test images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jiridanek jiridanek merged commit d647c4f into main Dec 16, 2025
20 of 23 checks passed
@jiridanek jiridanek deleted the jiridanek-patch-2 branch December 16, 2025 10:42

Labels

lgtm · review-requested (GitHub Bot creates notification on #pr-review-ai-ide-team slack channel) · size/xxl
