@shvbsle
Contributor

@shvbsle shvbsle commented Oct 3, 2025

Issue #, if available:

Description of changes:

  1. Adds a complementary test case for EKS NVIDIA GPU AMIs that support vGPU software:
    fix(al2023): install grid drivers from ec2 grid runfile awslabs/amazon-eks-ami#2450
  2. Increases the unit-test timeout to 10 minutes. The unit tests kept exceeding the default timeout of 5 minutes by a few seconds, as seen below, which made them flaky.

The new unit test checks whether vGPU instances have NVIDIA vGPU drivers installed with a valid license status.
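The license check described above can be sketched roughly as follows. This is not the code from this PR. It assumes that the vGPU guest driver reports a line of the form `License Status : Licensed (...)` in the output of `nvidia-smi -q`. The function names `vgpu_license_is_valid` and `check_vgpu_license` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the real test in gpu_unit_tests may be implemented differently.

# Reads `nvidia-smi -q` style text on stdin and succeeds only if a
# "License Status" line reports "Licensed".
# Assumption: the vGPU guest driver prints a line such as
#   License Status : Licensed (Expiry: ...)
vgpu_license_is_valid() {
  # Matching ": Licensed" directly after the colon means that
  # "License Status : Unlicensed" does not match.
  grep -Eq 'License Status[[:space:]]*:[[:space:]]*Licensed'
}

# Hypothetical wrapper: query the driver and evaluate the result.
# If nvidia-smi is missing or prints nothing, the check fails.
check_vgpu_license() {
  nvidia-smi -q | vgpu_license_is_valid
}
```

Keeping the parser separate from the `nvidia-smi` call means the matching logic can be exercised with canned text on a machine without a GPU.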

Testing done:

  1. Unit tests pass on non-vGPU instances and skip the vGPU tests:
> go test ./nvidia --test.timeout=60m --test.v --test.run=^TestSingleNodeUnitTest$/unit-test --nvidiaTestImage=${TEST_IMAGE}  --nodeType=g5.8xlarge --tags=e2e
=== RUN   TestSingleNodeUnitTest
=== RUN   TestSingleNodeUnitTest/unit-test
=== RUN   TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds
=== NAME  TestSingleNodeUnitTest/unit-test
    unit_test.go:84: Test log for unit-test-job:
    unit_test.go:85: # Running tests in gpu_unit_tests/tests/test_basic.sh
        ok - test_01_device_query
        ok - test_02_vector_add
        ok - test_03_nvbandwidth
        ok - test_04_dcgm_diagnostics
        # Running tests in gpu_unit_tests/tests/test_sysinfo.sh
        ok - test_numa_topo_topo
        ok - test_nvidia_gpu_count
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
        ok - test_nvidia_gpu_throttled
        ok - test_nvidia_gpu_unused
        ok - test_nvidia_persistence_status
        ok - test_nvidia_smi_topo
          % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                         Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
        ok - test_nvidia_vgpu_license_status # skip This test only applies to vGPU instances (g6f.*, gr6f.*)
        
--- PASS: TestSingleNodeUnitTest (370.10s)
    --- PASS: TestSingleNodeUnitTest/unit-test (370.10s)
        --- PASS: TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds (370.02s)
PASS
ok      github.com/aws/aws-k8s-tester/test/cases/nvidia 382.604s
  2. Unit tests pass on a g6f.4xlarge instance with vGPU:
> go test ./nvidia --test.timeout=60m --test.v --test.run=^TestSingleNodeUnitTest$/unit-test --nvidiaTestImage=${TEST_IMAGE}  --nodeType=g6f.4xlarge --tags=e2e
=== RUN   TestSingleNodeUnitTest
=== RUN   TestSingleNodeUnitTest/unit-test
=== RUN   TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds
=== NAME  TestSingleNodeUnitTest/unit-test
    unit_test.go:84: Test log for unit-test-job:
    unit_test.go:85: # Running tests in gpu_unit_tests/tests/test_basic.sh
        ok - test_01_device_query
        ok - test_02_vector_add
        ok - test_03_nvbandwidth
        ok - test_04_dcgm_diagnostics # skip This test does not apply to vGPU instances (g6f.*, gr6f.*)
        # Running tests in gpu_unit_tests/tests/test_sysinfo.sh
        ok - test_numa_topo_topo
        ok - test_nvidia_gpu_count
        ok - test_nvidia_gpu_throttled # skip This test does not apply to vGPU instances (g6f.*, gr6f.*)
        ok - test_nvidia_gpu_unused
        ok - test_nvidia_persistence_status
        ok - test_nvidia_smi_topo
        ok - test_nvidia_vgpu_license_status
        
--- PASS: TestSingleNodeUnitTest (155.13s)
    --- PASS: TestSingleNodeUnitTest/unit-test (155.13s)
        --- PASS: TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds (155.02s)
PASS
ok      github.com/aws/aws-k8s-tester/test/cases/nvidia 168.188s
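
The two runs above differ only in which tests are skipped per instance type. A minimal sketch of that gating is shown below. It assumes the instance type is passed in as an argument, since how the real tests discover it is not shown here. The helper names `is_vgpu_instance_type` and `run_vgpu_only_test` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the way the instance type is obtained are assumptions.

# Succeeds for instance types matching g6f.* or gr6f.*, the patterns
# named in the skip messages in the logs.
is_vgpu_instance_type() {
  case "$1" in
    g6f.*|gr6f.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: run_vgpu_only_test <instance-type> <test-name> <command...>
# Prints a TAP-style result line. On non-vGPU instance types the command
# is not run and the test is reported as skipped.
run_vgpu_only_test() {
  local instance_type="$1" name="$2"
  shift 2
  if ! is_vgpu_instance_type "$instance_type"; then
    echo "ok - ${name} # skip This test only applies to vGPU instances (g6f.*, gr6f.*)"
    return 0
  fi
  if "$@"; then
    echo "ok - ${name}"
  else
    echo "not ok - ${name}"
    return 1
  fi
}
```

For example, `run_vgpu_only_test g5.8xlarge test_nvidia_vgpu_license_status true` prints the skip line without running the command, while the same call with `g6f.4xlarge` runs it and reports the result.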

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@shvbsle shvbsle marked this pull request as draft October 3, 2025 23:23
…nces and increase unit test timeout to 10 mins
@shvbsle shvbsle marked this pull request as ready for review October 22, 2025 06:17
@shvbsle shvbsle requested review from mselim00 and ndbaker1 October 22, 2025 06:17
@shvbsle shvbsle merged commit 270c7b8 into aws:main Oct 23, 2025
10 checks passed
@shvbsle shvbsle deleted the licensecheck branch October 23, 2025 13:30