Commit df0af13

feat: Add GPU environment evaluation functions
This commit introduces a comprehensive set of functions to detect and evaluate the GPU environment on a Dataproc node. These functions are designed to replace the existing OS/version checks and hardcoded assumptions within the spark-rapids initialization action.

The new `evaluate_gpu_setup` function and its helpers will:
- Detect GPU hardware, NVIDIA drivers, and CUDA toolkit versions.
- Check for the presence of key system packages (dkms, headers, etc.).
- Inspect Conda environments for GPU-related packages (TensorFlow, PyTorch, XGBoost, etc.).
- Verify Spark RAPIDS JAR installations.
- Check YARN and Spark configurations related to GPU resources.
- Assess secure boot status and MOK key enrollment.

The output is a JSON file (`/tmp/gpu_evaluation.json`) summarizing the environment, which will be used by subsequent refactored parts of the init action to make informed decisions about driver installation, package management, and configuration.

This change is the first step in refactoring the spark-rapids init action to be more robust and less dependent on image-specific details.
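
To illustrate the detect-then-summarize flow described in the message, here is a minimal sketch in Python. It is not the code from this commit: the names `run_quiet` and `sketch_gpu_evaluation` and every JSON key are hypothetical, and it covers only a few of the listed checks (driver, CUDA compiler, dkms, secure boot). It assumes that `nvidia-smi`, `nvcc`, `dkms`, and `mokutil` may or may not be on the PATH, and records `None` or `False` when a tool is absent.

```python
import json
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_quiet(cmd):
    """Return stripped stdout of cmd, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def sketch_gpu_evaluation(output_path):
    """Hypothetical helper: probe a few GPU-related facts and write them as JSON."""
    # Ask the NVIDIA driver for its version; one line is printed per GPU.
    driver = run_quiet(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    )
    # mokutil reports whether UEFI secure boot is enabled.
    secure_boot = run_quiet(["mokutil", "--sb-state"])

    report = {
        # All key names below are illustrative assumptions.
        "nvidia_smi_present": shutil.which("nvidia-smi") is not None,
        "driver_version": driver.splitlines()[0] if driver else None,
        "nvcc_present": shutil.which("nvcc") is not None,
        "dkms_present": shutil.which("dkms") is not None,
        "secure_boot_state": secure_boot,
    }
    Path(output_path).write_text(json.dumps(report, indent=2, sort_keys=True))
    return report


if __name__ == "__main__":
    # Write to a scratch directory rather than /tmp/gpu_evaluation.json.
    target = Path(tempfile.mkdtemp()) / "gpu_evaluation.json"
    print(json.dumps(sketch_gpu_evaluation(target), indent=2, sort_keys=True))
```

Writing a single JSON summary up front lets later steps branch on recorded facts (for example, skipping driver installation when a driver version is already present) instead of re-probing the system or inferring state from the OS image name.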
1 parent 54d7aaa commit df0af13

1 file changed

Lines changed: 1697 additions & 598 deletions
