
Commit c9d7d95

chore: Upgrade to tensorrt-llm==1.2.0rc3 (#4645)
1 parent d52f070 commit c9d7d95

File tree

4 files changed (+11 −34 lines)


container/Dockerfile.trtllm

Lines changed: 0 additions & 2 deletions
```diff
@@ -199,8 +199,6 @@ ENV NIXL_LIB_DIR=$NIXL_PREFIX/lib/${ARCH_ALT}-linux-gnu
 ENV NIXL_PLUGIN_DIR=$NIXL_LIB_DIR/plugins
 # workaround for pickle lib issue
 ENV OMPI_MCA_coll_ucc_enable=0
-# Use UCX KVCACHE by default
-ENV TRTLLM_USE_UCX_KVCACHE=1
 
 ARG DYNAMO_COMMIT_SHA
 ENV DYNAMO_COMMIT_SHA=$DYNAMO_COMMIT_SHA
```
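The image no longer forces UCX for KV cache transfer. Users who still want the previous default can opt back in at runtime; a minimal sketch (the variable name comes from the removed Dockerfile line — note the updated docs spell the newer knob `TRTLLM_USE_UCX_KV_CACHE`):

```shell
# Re-enable the old UCX default inside a running container (sketch).
# TRTLLM_USE_UCX_KVCACHE is the variable the Dockerfile used to set.
export TRTLLM_USE_UCX_KVCACHE=1
[ "$TRTLLM_USE_UCX_KVCACHE" = "1" ] && echo "UCX KV cache transfer forced on"
```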

container/build.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -98,7 +98,7 @@ TRTLLM_GIT_URL=""
 DEFAULT_TENSORRTLLM_INDEX_URL="https://pypi.nvidia.com/"
 # TODO: Remove the version specification from here and use the ai-dynamo[trtllm] package.
 # Need to update the Dockerfile.trtllm to use the ai-dynamo[trtllm] package.
-DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.2.0rc2"
+DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.2.0rc3"
 TENSORRTLLM_PIP_WHEEL=""
 
 VLLM_BASE_IMAGE="nvcr.io/nvidia/cuda-dl-base"
```
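The pinned default above can be overridden per build via the script's `--tensorrtllm-pip-wheel` flag. A sketch of how the default and an override interact, assuming the usual empty-default fallback pattern (variable names are taken from the diff; the fallback expression is an illustration, not the script's exact code):

```shell
# Default pinned by this commit.
DEFAULT_TENSORRTLLM_PIP_WHEEL="tensorrt-llm==1.2.0rc3"
# Populated when the caller passes --tensorrtllm-pip-wheel; empty otherwise.
TENSORRTLLM_PIP_WHEEL=""
# Fall back to the default when no override was given (illustrative).
WHEEL="${TENSORRTLLM_PIP_WHEEL:-$DEFAULT_TENSORRTLLM_PIP_WHEEL}"
echo "$WHEEL"   # → tensorrt-llm==1.2.0rc3
```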

docs/backends/trtllm/kv-cache-transfer.md

Lines changed: 9 additions & 30 deletions
````diff
@@ -21,39 +21,18 @@ limitations under the License.
 
 In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
 
-## Default Method: UCX
-By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode workers. UCX provides high-performance communication optimized for GPU-to-GPU transfers.
+## Default Method: NIXL
+By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
 
-## Beta Method: NIXL
-TensorRT-LLM also supports using **NIXL** (NVIDIA Inference Xfer Library) for KV cache transfer. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
+### Specify Backends for NIXL
 
-**Note:** NIXL support in TensorRT-LLM is currently beta and may have some sharp edges.
+TODO: Add instructions for how to specify different backends for NIXL.
 
-## Using NIXL for KV Cache Transfer
+## Alternative Method: UCX
 
-**Note:** NIXL version shipped with current dynamo is not supported by tensorrt-llm<=1.2.0rc2. In order to use NIXL backend for KV cache transfer, users are required to build container image with tensorrt-llm>=1.2.0rc3.
+TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. There are two ways to enable UCX as the KV cache transfer backend:
 
-To enable NIXL for KV cache transfer in disaggregated serving:
+1. **Recommended:** Set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
+2. Alternatively, set the environment variable `TRTLLM_USE_UCX_KV_CACHE=1` and configure `cache_transceiver_config.backend: DEFAULT` in the engine configuration YAML.
 
-1. **Build the container with NIXL support (tensorrt-llm==1.2.0rc3):**
-   ```bash
-   ./container/build.sh --framework trtllm \
-       --tensorrtllm-pip-wheel tensorrt-llm==1.2.0rc3
-   ```
-
-2. **Run the containerized environment:**
-   See [run container](./README.md#run-container) section to learn how to start the container image built in previous step.
-
-   Within container, unset `TRTLLM_USE_UCX_KVCACHE` variable so NIXL can be used instead of UCX.
-
-   ```bash
-   unset TRTLLM_USE_UCX_KVCACHE
-   ```
-
-3. **Start the disaggregated service:**
-   See [disaggregated serving](./README.md#disaggregated-serving) to see how to start the deployment.
-
-4. **Send the request:**
-   See [client](./README.md#client) section to learn how to send the request to deployment.
-
-**Important:** Ensure that ETCD and NATS services are running before starting the service.
+This flexibility allows users to choose the most suitable method for their deployment and compatibility requirements.
````
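The two UCX-enablement options added to the docs can be sketched as engine-config fragments, written here from a shell heredoc so the block is self-contained (`engine_config.yaml` and `engine_config_default.yaml` are hypothetical filenames; the `cache_transceiver_config.backend` key and the env var name are taken from the diff):

```shell
# Method 1 (recommended): select UCX explicitly in the engine config.
cat > engine_config.yaml <<'EOF'
cache_transceiver_config:
  backend: UCX
EOF

# Method 2 (alternative): keep backend DEFAULT and force UCX via the env var.
export TRTLLM_USE_UCX_KV_CACHE=1
cat > engine_config_default.yaml <<'EOF'
cache_transceiver_config:
  backend: DEFAULT
EOF

grep -q 'backend: UCX' engine_config.yaml && echo "UCX selected in config"
```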

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -50,7 +50,7 @@ Repository = "https://github.com/ai-dynamo/dynamo.git"
 [project.optional-dependencies]
 trtllm =[
     "uvloop",
-    "tensorrt-llm==1.2.0rc2",
+    "tensorrt-llm==1.2.0rc3",
 ]
 
 vllm = [
```

0 commit comments
