docs/backends/trtllm/kv-cache-transfer.md

In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:

## Default Method: NIXL

By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as the backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
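
Because NIXL is the default, no extra configuration is usually needed. If you want to pin the backend explicitly, a minimal engine-configuration sketch might look like the following (this assumes `NIXL` is an accepted value for `cache_transceiver_config.backend`, alongside the `UCX` and `DEFAULT` values described below):

```yaml
# Engine configuration YAML (sketch): pin the KV cache transfer backend to NIXL.
# `NIXL` as a backend value is an assumption mirroring the UCX/DEFAULT options below.
cache_transceiver_config:
  backend: NIXL
```
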
### Specify Backends for NIXL

TODO: Add instructions for how to specify different backends for NIXL.

## Alternative Method: UCX

TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. There are two ways to enable UCX as the KV cache transfer backend (see the configuration sketch after this list):

1. **Recommended:** Set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.

2. Alternatively, set the environment variable `TRTLLM_USE_UCX_KVCACHE=1` and configure `cache_transceiver_config.backend: DEFAULT` in the engine configuration YAML.
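
A minimal sketch of the recommended option is shown below; only the `cache_transceiver_config.backend` setting is taken from the steps above, and any other keys in your engine configuration are unaffected:

```yaml
# Engine configuration YAML (sketch): option 1, select UCX explicitly.
cache_transceiver_config:
  backend: UCX

# Option 2 (alternative): leave `backend: DEFAULT` here and export
# TRTLLM_USE_UCX_KVCACHE=1 in the worker environment instead.
```
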

This flexibility allows users to choose the most suitable method for their deployment and compatibility requirements.