docs/backends/trtllm/kv-cache-transfer.md

In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:

## Default Method: NIXL

By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as the backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
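
Because NIXL is the default, no extra configuration is usually needed. If you want to pin the backend explicitly, a minimal engine-configuration sketch might look like the following (this assumes `NIXL` is an accepted value for `cache_transceiver_config.backend`, alongside the `UCX` and `DEFAULT` values described below):

```yaml
# Engine configuration YAML (sketch): pin the KV cache transfer backend to NIXL.
# `NIXL` as a backend value is an assumption mirroring the UCX/DEFAULT options below.
cache_transceiver_config:
  backend: NIXL
```
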
### Specify Backends for NIXL

TODO: Add instructions for how to specify different backends for NIXL.

## Alternative Method: UCX

TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. There are two ways to enable UCX as the KV cache transfer backend (see the configuration sketch after this list):

1. **Recommended:** Set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.

2. Alternatively, set the environment variable `TRTLLM_USE_UCX_KVCACHE=1` and configure `cache_transceiver_config.backend: DEFAULT` in the engine configuration YAML.
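
A minimal sketch of the recommended option is shown below; only the `cache_transceiver_config.backend` setting is taken from the steps above, and any other keys in your engine configuration are unaffected:

```yaml
# Engine configuration YAML (sketch): option 1, select UCX explicitly.
cache_transceiver_config:
  backend: UCX

# Option 2 (alternative): leave `backend: DEFAULT` here and export
# TRTLLM_USE_UCX_KVCACHE=1 in the worker environment instead.
```
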

This flexibility allows users to choose the most suitable method for their deployment and compatibility requirements.