CUDA_ERROR_DEVICE_UNAVAILABLE when starting LocalCUDACluster with UCX on H100 (Numba cuDevicePrimaryCtxRetain) #618

@AtlasMaroc

Description

When trying to launch a LocalCUDACluster, I run into the following error. Here is the code:

import dask
import time
import gc

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

%%time
preprocessing_gpus = "0,1"
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=preprocessing_gpus,
                           threads_per_worker=10,
                           protocol="ucx",
                           rmm_pool_size="10GB",
                           rmm_maximum_pool_size="70GB",
                           rmm_allocator_external_lib_list="cupy",
                           )
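For context, the Client imported above would be attached to the cluster roughly as in the hypothetical cell below; it is never reached, because the LocalCUDACluster call itself fails:

# Hypothetical follow-up cell, never reached: LocalCUDACluster() above raises first.
client = Client(cluster)
print(client.dashboard_link)  # would print the Dask dashboard URL once workers start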

This is the error I am getting:

2025-11-28 08:00:40,746 - distributed.worker - ERROR - Failed to log closing event
Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 528, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/utils.py", line 1923, in wait_for
    return await fut
           ^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/worker.py", line 1406, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 723, in listen
    listener = await listen(
               ^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 751, in start
    init_once()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 263, in init_once
    numba.cuda.current_context()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 225, in get_context
    return _runtime.get_or_create_context(devnum)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 145, in get_or_create_context
    return self._get_or_create_context_uncached(devnum)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 162, in _get_or_create_context_uncached
    return self._activate_context_for(0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 182, in _activate_context_for
    newctx = gpu.get_primary_context()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 718, in get_primary_context
    hctx = driver.cuDevicePrimaryCtxRetain(self.id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 393, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 456, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/worker.py", line 1555, in close
    self.log_event(self.address, {"action": "closing-worker", "reason": reason})
                   ^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 635, in address
    raise ValueError("cannot get address of non-running Server")
ValueError: cannot get address of non-running Server
2025-11-28 08:00:40,752 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 528, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/utils.py", line 1923, in wait_for
    return await fut
           ^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/worker.py", line 1406, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 723, in listen
    listener = await listen(
               ^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 751, in start
    init_once()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 263, in init_once
    numba.cuda.current_context()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 225, in get_context
    return _runtime.get_or_create_context(devnum)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 145, in get_or_create_context
    return self._get_or_create_context_uncached(devnum)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 162, in _get_or_create_context_uncached
    return self._activate_context_for(0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 182, in _activate_context_for
    newctx = gpu.get_primary_context()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 718, in get_primary_context
    hctx = driver.cuDevicePrimaryCtxRetain(self.id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 393, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 456, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 969, in run
    async with worker:
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 542, in __aenter__
    await self
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2025-11-28 08:00:40,803 - distributed.nanny - ERROR - Failed to start process
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 905, in _wait_until_connected
    raise msg["exception"]
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 969, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 542, in __aenter__
    await self
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
2025-11-28 08:00:43,359 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x14f92e08c290>>, <Task finished name='Task-73' coro=<SpecCluster._correct_state_internal() done, defined at /sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/deploy/spec.py:352> exception=RuntimeError('Nanny failed to start.')>)
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 528, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/utils.py", line 1923, in wait_for
    return await fut
           ^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 905, in _wait_until_connected
    raise msg["exception"]
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 969, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 542, in __aenter__
    await self
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
          ^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/tornado/ioloop.py", line 782, in _discard_future_result
    future.result()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/deploy/spec.py", line 396, in _correct_state_internal
    await asyncio.gather(*worker_futs)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/asyncio/tasks.py", line 684, in _wrap_awaitable
    return await awaitable
           ^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
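The innermost failure in both tracebacks is numba.cuda.current_context() inside distributed_ucxx.ucxx.init_once(), which retains the device's primary context via cuDevicePrimaryCtxRetain. A minimal standalone check, assuming the same environment and a fresh Python process, would be something like:

# Minimal standalone check, run in a fresh Python process (assumption: same
# environment as the failing worker). This reproduces the call that fails
# inside distributed_ucxx.ucxx.init_once().
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # one of the GPUs from preprocessing_gpus

from numba import cuda

ctx = cuda.current_context()  # internally calls cuDevicePrimaryCtxRetain
print(cuda.get_current_device().name, ctx.get_memory_info())

If this standalone call raises the same CUDA_ERROR_DEVICE_UNAVAILABLE, the failure is independent of dask-cuda and UCX and points at the device state rather than the cluster setup.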

I checked nvidia-smi and got the following output:

Fri Nov 28 08:56:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:3D:00.0 Off |                  Off |
| N/A   29C    P0            108W /  700W |     527MiB /  81559MiB |      0%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:4E:00.0 Off |                  Off |
| N/A   26C    P0             65W /  700W |       5MiB /  81559MiB |      0%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2027935      C   ...w_environment/sc_gpu/bin/python3.12        516MiB |
+-----------------------------------------------------------------------------------------+

The GPU/NIC topology reported by nvidia-smi topo -m is:

	GPU0	GPU1	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV6	PXB	NODE	SYS	SYS	SYS	SYS	0-11	0		N/A
GPU1	NV6	 X 	NODE	PXB	SYS	SYS	SYS	SYS	0-11	0		N/A
NIC0	PXB	NODE	 X 	NODE	SYS	SYS	SYS	SYS				
NIC1	NODE	PXB	NODE	 X 	SYS	SYS	SYS	SYS				
NIC2	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE				
NIC3	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE				
NIC4	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE				
NIC5	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5

I also checked the compute mode with nvidia-smi -q -d COMPUTE | grep -A2 "Compute Mode":

    Compute Mode                          : Exclusive_Process

GPU 00000000:4E:00.0
    Compute Mode                          : Exclusive_Process
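Both GPUs therefore report Exclusive_Process compute mode, in which only one process at a time can hold a CUDA context on a device; the 516 MiB python3.12 process on GPU 0 in the nvidia-smi output above is presumably the Jupyter kernel itself. A programmatic way to confirm the compute mode per device (a sketch, assuming the nvidia-ml-py package, imported as pynvml, is installed) is:

# Sketch: query the compute mode of each GPU via NVML
# (assumes nvidia-ml-py / pynvml is installed).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mode = pynvml.nvmlDeviceGetComputeMode(handle)
        exclusive = mode == pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS
        print(f"GPU {i}: compute mode {mode} (Exclusive_Process={exclusive})")
finally:
    pynvml.nvmlShutdown()

If the devices could be switched to Default compute mode (nvidia-smi -c DEFAULT, which needs administrative rights), multiple processes could share each GPU again; whether that is an option here depends on how the cluster is administered.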
