CUDA_ERROR_DEVICE_UNAVAILABLE when starting LocalCUDACluster with UCX on H100 (Numba cuDevicePrimaryCtxRetain) #618

@AtlasMaroc

Description

When trying to launch a LocalCUDACluster, I run into the following error. Here is the code:

import dask
import time
import gc

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

%%time
preprocessing_gpus = "0,1"
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=preprocessing_gpus,
                           threads_per_worker=10,
                           protocol="ucx",
                           rmm_pool_size="10GB",
                           rmm_maximum_pool_size="70GB",
                           rmm_allocator_external_lib_list="cupy",
                           )
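For context, the Client imported above would be attached to the cluster roughly as in the hypothetical cell below; it is never reached, because the LocalCUDACluster call itself fails:

# Hypothetical follow-up cell, never reached: LocalCUDACluster() above raises first.
client = Client(cluster)
print(client.dashboard_link)  # would print the Dask dashboard URL once workers start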

This is the error I am getting:

2025-11-28 08:00:40,746 - distributed.worker - ERROR - Failed to log closing event
Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 528, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/utils.py", line 1923, in wait_for
    return await fut
           ^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/worker.py", line 1406, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 723, in listen
    listener = await listen(
               ^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 751, in start
    init_once()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 263, in init_once
    numba.cuda.current_context()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 225, in get_context
    return _runtime.get_or_create_context(devnum)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 145, in get_or_create_context
    return self._get_or_create_context_uncached(devnum)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 162, in _get_or_create_context_uncached
    return self._activate_context_for(0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 182, in _activate_context_for
    newctx = gpu.get_primary_context()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 718, in get_primary_context
    hctx = driver.cuDevicePrimaryCtxRetain(self.id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 393, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 456, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/worker.py", line 1555, in close
    self.log_event(self.address, {"action": "closing-worker", "reason": reason})
                   ^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 635, in address
    raise ValueError("cannot get address of non-running Server")
ValueError: cannot get address of non-running Server
2025-11-28 08:00:40,752 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 528, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/utils.py", line 1923, in wait_for
    return await fut
           ^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/worker.py", line 1406, in start_unsafe
    await self.listen(start_address, **kwargs)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 723, in listen
    listener = await listen(
               ^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/comm/core.py", line 256, in _
    await self.start()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 751, in start
    init_once()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 263, in init_once
    numba.cuda.current_context()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 225, in get_context
    return _runtime.get_or_create_context(devnum)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 145, in get_or_create_context
    return self._get_or_create_context_uncached(devnum)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 162, in _get_or_create_context_uncached
    return self._activate_context_for(0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 182, in _activate_context_for
    newctx = gpu.get_primary_context()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 718, in get_primary_context
    hctx = driver.cuDevicePrimaryCtxRetain(self.id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 393, in safe_cuda_api_call
    return self._check_cuda_python_error(fname, libfn(*args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 456, in _check_cuda_python_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 969, in run
    async with worker:
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 542, in __aenter__
    await self
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2025-11-28 08:00:40,803 - distributed.nanny - ERROR - Failed to start process
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 905, in _wait_until_connected
    raise msg["exception"]
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 969, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 542, in __aenter__
    await self
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
2025-11-28 08:00:43,359 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x14f92e08c290>>, <Task finished name='Task-73' coro=<SpecCluster._correct_state_internal() done, defined at /sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/deploy/spec.py:352> exception=RuntimeError('Nanny failed to start.')>)
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 528, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/utils.py", line 1923, in wait_for
    return await fut
           ^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 905, in _wait_until_connected
    raise msg["exception"]
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 969, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 542, in __aenter__
    await self
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
          ^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/tornado/ioloop.py", line 782, in _discard_future_result
    future.result()
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/deploy/spec.py", line 396, in _correct_state_internal
    await asyncio.gather(*worker_futs)
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/asyncio/tasks.py", line 684, in _wrap_awaitable
    return await awaitable
           ^^^^^^^^^^^^^^^
  File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
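The innermost failure in both tracebacks is numba.cuda.current_context() inside distributed_ucxx.ucxx.init_once(), which retains the device's primary context via cuDevicePrimaryCtxRetain. A minimal standalone check, assuming the same environment and a fresh Python process, would be something like:

# Minimal standalone check, run in a fresh Python process (assumption: same
# environment as the failing worker). This reproduces the call that fails
# inside distributed_ucxx.ucxx.init_once().
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # one of the GPUs from preprocessing_gpus

from numba import cuda

ctx = cuda.current_context()  # internally calls cuDevicePrimaryCtxRetain
print(cuda.get_current_device().name, ctx.get_memory_info())

If this standalone call raises the same CUDA_ERROR_DEVICE_UNAVAILABLE, the failure is independent of dask-cuda and UCX and points at the device state rather than the cluster setup.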

I checked nvidia-smi and got the following output:

Fri Nov 28 08:56:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:3D:00.0 Off |                  Off |
| N/A   29C    P0            108W /  700W |     527MiB /  81559MiB |      0%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:4E:00.0 Off |                  Off |
| N/A   26C    P0             65W /  700W |       5MiB /  81559MiB |      0%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2027935      C   ...w_environment/sc_gpu/bin/python3.12        516MiB |
+-----------------------------------------------------------------------------------------+

The GPU/NIC topology reported by nvidia-smi topo -m is:

	GPU0	GPU1	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV6	PXB	NODE	SYS	SYS	SYS	SYS	0-11	0		N/A
GPU1	NV6	 X 	NODE	PXB	SYS	SYS	SYS	SYS	0-11	0		N/A
NIC0	PXB	NODE	 X 	NODE	SYS	SYS	SYS	SYS				
NIC1	NODE	PXB	NODE	 X 	SYS	SYS	SYS	SYS				
NIC2	SYS	SYS	SYS	SYS	 X 	PIX	NODE	NODE				
NIC3	SYS	SYS	SYS	SYS	PIX	 X 	NODE	NODE				
NIC4	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE				
NIC5	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5

I also checked the compute mode with nvidia-smi -q -d COMPUTE | grep -A2 "Compute Mode":

    Compute Mode                          : Exclusive_Process

GPU 00000000:4E:00.0
    Compute Mode                          : Exclusive_Process
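Both GPUs therefore report Exclusive_Process compute mode, in which only one process at a time can hold a CUDA context on a device; the 516 MiB python3.12 process on GPU 0 in the nvidia-smi output above is presumably the Jupyter kernel itself. A programmatic way to confirm the compute mode per device (a sketch, assuming the nvidia-ml-py package, imported as pynvml, is installed) is:

# Sketch: query the compute mode of each GPU via NVML
# (assumes nvidia-ml-py / pynvml is installed).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mode = pynvml.nvmlDeviceGetComputeMode(handle)
        exclusive = mode == pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS
        print(f"GPU {i}: compute mode {mode} (Exclusive_Process={exclusive})")
finally:
    pynvml.nvmlShutdown()

If the devices could be switched to Default compute mode (nvidia-smi -c DEFAULT, which needs administrative rights), multiple processes could share each GPU again; whether that is an option here depends on how the cluster is administered.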
