When trying to launch a LocalCUDACluster, I am running into an error. Here is the code:
import dask
import time
import gc
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
%%time
preprocessing_gpus = "0,1"
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=preprocessing_gpus,
    threads_per_worker=10,
    protocol="ucx",
    rmm_pool_size="10GB",
    rmm_maximum_pool_size="70GB",
    rmm_allocator_external_lib_list="cupy",
)
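For comparison, here is a stripped-down variant (not part of my original run, just a sketch assuming the same environment) that uses the same two devices but the default TCP protocol and no RMM options, to check whether the failure is tied to the UCX/RMM configuration:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

if __name__ == "__main__":
    # Same devices as above, but default protocol and no RMM pool settings.
    cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
    client = Client(cluster)
    print(client)  # shows the scheduler address and number of connected workers
    client.close()
    cluster.close()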
Running the original cell above, this is the error I am getting:
2025-11-28 08:00:40,746 - distributed.worker - ERROR - Failed to log closing event
Traceback (most recent call last):
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 528, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/utils.py", line 1923, in wait_for
return await fut
^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self.listen(start_address, **kwargs)
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 723, in listen
listener = await listen(
^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/comm/core.py", line 256, in _
await self.start()
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 751, in start
init_once()
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 263, in init_once
numba.cuda.current_context()
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 225, in get_context
return _runtime.get_or_create_context(devnum)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 145, in get_or_create_context
return self._get_or_create_context_uncached(devnum)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 162, in _get_or_create_context_uncached
return self._activate_context_for(0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 182, in _activate_context_for
newctx = gpu.get_primary_context()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 718, in get_primary_context
hctx = driver.cuDevicePrimaryCtxRetain(self.id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 393, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 456, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/worker.py", line 1555, in close
self.log_event(self.address, {"action": "closing-worker", "reason": reason})
^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 635, in address
raise ValueError("cannot get address of non-running Server")
ValueError: cannot get address of non-running Server
2025-11-28 08:00:40,752 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 528, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/utils.py", line 1923, in wait_for
return await fut
^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/worker.py", line 1406, in start_unsafe
await self.listen(start_address, **kwargs)
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 723, in listen
listener = await listen(
^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/comm/core.py", line 256, in _
await self.start()
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 751, in start
init_once()
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed_ucxx/ucxx.py", line 263, in init_once
numba.cuda.current_context()
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 225, in get_context
return _runtime.get_or_create_context(devnum)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 145, in get_or_create_context
return self._get_or_create_context_uncached(devnum)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 162, in _get_or_create_context_uncached
return self._activate_context_for(0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/devices.py", line 182, in _activate_context_for
newctx = gpu.get_primary_context()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 718, in get_primary_context
hctx = driver.cuDevicePrimaryCtxRetain(self.id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 393, in safe_cuda_api_call
return self._check_cuda_python_error(fname, libfn(*args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/numba_cuda/numba/cuda/cudadrv/driver.py", line 456, in _check_cuda_python_error
raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 969, in run
async with worker:
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 542, in __aenter__
await self
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2025-11-28 08:00:40,803 - distributed.nanny - ERROR - Failed to start process
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 452, in instantiate
result = await self.process.start()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 759, in start
msg = await self._wait_until_connected(uid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 905, in _wait_until_connected
raise msg["exception"]
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 969, in run
async with worker:
^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 542, in __aenter__
await self
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
2025-11-28 08:00:43,359 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x14f92e08c290>>, <Task finished name='Task-73' coro=<SpecCluster._correct_state_internal() done, defined at /sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/deploy/spec.py:352> exception=RuntimeError('Nanny failed to start.')>)
numba.cuda.cudadrv.driver.CudaAPIError: [46] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_DEVICE_UNAVAILABLE
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 528, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/utils.py", line 1923, in wait_for
return await fut
^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 369, in start_unsafe
response = await self.instantiate()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 452, in instantiate
result = await self.process.start()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 759, in start
msg = await self._wait_until_connected(uid)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 905, in _wait_until_connected
raise msg["exception"]
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/nanny.py", line 969, in run
async with worker:
^^^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 542, in __aenter__
await self
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/tornado/ioloop.py", line 758, in _run_callback
ret = callback()
^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/tornado/ioloop.py", line 782, in _discard_future_result
future.result()
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/deploy/spec.py", line 396, in _correct_state_internal
await asyncio.gather(*worker_futs)
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/asyncio/tasks.py", line 684, in _wrap_awaitable
return await awaitable
^^^^^^^^^^^^^^^
File "/sc/arion/scratch/elmoum02/new_environment/sc_gpu/lib/python3.12/site-packages/distributed/core.py", line 536, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
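Per the traceback, each worker fails while calling numba.cuda.current_context() during UCX initialization. A minimal diagnostic sketch (not something I ran for this report; the spawned-subprocess approach and device list are my assumptions) that attempts the same context creation from a fresh process per device:

import os
import multiprocessing as mp

def try_context(dev: str) -> None:
    # Restrict the child process to a single device, roughly what dask-cuda does per worker.
    os.environ["CUDA_VISIBLE_DEVICES"] = dev
    from numba import cuda  # import after setting CUDA_VISIBLE_DEVICES
    cuda.current_context()  # raises CudaAPIError if no context can be created
    print(f"GPU {dev}: context created OK")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    for dev in ("0", "1"):
        p = ctx.Process(target=try_context, args=(dev,))
        p.start()
        p.join()
        print(f"GPU {dev}: exit code {p.exitcode}")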
I checked nvidia-smi and got the following output:
Fri Nov 28 08:56:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:3D:00.0 Off | Off |
| N/A 29C P0 108W / 700W | 527MiB / 81559MiB | 0% E. Process |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:4E:00.0 Off | Off |
| N/A 26C P0 65W / 700W | 5MiB / 81559MiB | 0% E. Process |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2027935 C ...w_environment/sc_gpu/bin/python3.12 516MiB |
+-----------------------------------------------------------------------------------------+
and the GPU/NIC topology matrix (nvidia-smi topo -m):
GPU0 GPU1 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV6 PXB NODE SYS SYS SYS SYS 0-11 0 N/A
GPU1 NV6 X NODE PXB SYS SYS SYS SYS 0-11 0 N/A
NIC0 PXB NODE X NODE SYS SYS SYS SYS
NIC1 NODE PXB NODE X SYS SYS SYS SYS
NIC2 SYS SYS SYS SYS X PIX NODE NODE
NIC3 SYS SYS SYS SYS PIX X NODE NODE
NIC4 SYS SYS SYS SYS NODE NODE X NODE
NIC5 SYS SYS SYS SYS NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
nvidia-smi -q -d COMPUTE | grep -A2 "Compute Mode"
Compute Mode : Exclusive_Process
GPU 00000000:4E:00.0
Compute Mode : Exclusive_Process
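The same compute-mode check can also be done programmatically; a small sketch, assuming pynvml (nvidia-ml-py) is installed in the environment:

import pynvml

# Map NVML compute-mode constants to readable names.
MODES = {
    pynvml.NVML_COMPUTEMODE_DEFAULT: "Default",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_THREAD: "Exclusive_Thread (deprecated)",
    pynvml.NVML_COMPUTEMODE_PROHIBITED: "Prohibited",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: "Exclusive_Process",
}

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mode = pynvml.nvmlDeviceGetComputeMode(handle)
        print(f"GPU {i}: {MODES.get(mode, mode)}")
finally:
    pynvml.nvmlShutdown()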