Commit 3a3d1c5

build: Update for 25.01, TRTLLM v0.17.0.post1, and fix HF_HOME setting (#104)

Co-authored-by: Ryan McCormick <[email protected]>

1 parent 758dec4
File tree

18 files changed: +1494 -236 lines changed

README.md

Lines changed: 48 additions & 14 deletions

````diff
@@ -1,3 +1,30 @@
+<!--
+# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -->
+
 # Triton Command Line Interface (Triton CLI)
 > [!NOTE]
 > Triton CLI is currently in BETA. Its features and functionality are likely
@@ -22,8 +49,8 @@ and running the CLI from within the latest corresponding `tritonserver`
 container image, which should have all necessary system dependencies installed.
 
 For vLLM and TRT-LLM, you can use their respective images:
-- `nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3`
-- `nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3`
+- `nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3`
+- `nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3`
 
 If you decide to run the CLI on the host or in a custom image, please
 see this list of [additional dependencies](#additional-dependencies-for-custom-environments)
@@ -38,6 +65,7 @@ matrix below:
 
 | Triton CLI Version | TRT-LLM Version | Triton Container Tag |
 |:------------------:|:---------------:|:--------------------:|
+|       0.1.2        |  v0.17.0.post1  |        25.01         |
 |       0.1.1        |     v0.14.0     |        24.10         |
 |       0.1.0        |     v0.13.0     |        24.09         |
 |       0.0.11       |     v0.12.0     |        24.08         |
@@ -60,7 +88,7 @@ It is also possible to install from a specific branch name, a commit hash
 or a tag name. For example to install `triton_cli` with a specific tag:
 
 ```bash
-GIT_REF="0.1.1"
+GIT_REF="0.1.2"
 pip install git+https://github.com/triton-inference-server/triton_cli.git@${GIT_REF}
 ```
 
@@ -95,7 +123,7 @@ triton -h
 triton import -m gpt2
 
 # Start server pointing at the default model repository
-triton start --image nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3
+triton start --image nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3
 
 # Infer with CLI
 triton infer -m gpt2 --prompt "machine learning is"
@@ -120,6 +148,12 @@ minutes.
 > Also, usage of certain restricted models like Llama models requires authentication
 > in Huggingface through either `huggingface-cli login` or setting the `HF_TOKEN`
 > environment variable.
+>
+> If your Hugging Face cache is not located at `${HOME}/.cache/huggingface`, you can
+> point the CLI at it by setting `HF_HOME`, for example:
+>
+> `export HF_HOME=path/to/your/huggingface/cache`
+>
 
 ### Model Sources
 
@@ -175,26 +209,26 @@ docker run -ti \
 --shm-size=1g --ulimit memlock=-1 \
 -v ${HOME}/models:/root/models \
 -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3
+nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3
 
 # Install the Triton CLI
-pip install git+https://github.com/triton-inference-server/triton_cli.git@0.1.1
+pip install git+https://github.com/triton-inference-server/triton_cli.git@0.1.2
 
 # Authenticate with huggingface for restricted models like Llama-2 and Llama-3
 huggingface-cli login
 
 # Generate a Triton model repository containing a vLLM model config
 triton remove -m all
-triton import -m llama-3-8b-instruct --backend vllm
+triton import -m llama-3.1-8b-instruct --backend vllm
 
 # Start Triton pointing at the default model repository
 triton start
 
 # Interact with model
-triton infer -m llama-3-8b-instruct --prompt "machine learning is"
+triton infer -m llama-3.1-8b-instruct --prompt "machine learning is"
 
 # Profile model with GenAI-Perf
-triton profile -m llama-3-8b-instruct --backend vllm
+triton profile -m llama-3.1-8b-instruct --backend vllm
 ```
 
 ### Serving a TRT-LLM Model
@@ -240,26 +274,26 @@ docker run -ti \
 -v /tmp:/tmp \
 -v ${HOME}/models:/root/models \
 -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
+nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
 
 # Install the Triton CLI
-pip install git+https://github.com/triton-inference-server/triton_cli.git@0.1.0
+pip install git+https://github.com/triton-inference-server/triton_cli.git@0.1.2
 
 # Authenticate with huggingface for restricted models like Llama-2 and Llama-3
 huggingface-cli login
 
 # Build TRT LLM engine and generate a Triton model repository pointing at it
 triton remove -m all
-triton import -m llama-3-8b-instruct --backend tensorrtllm
+triton import -m llama-3.1-8b-instruct --backend tensorrtllm
 
 # Start Triton pointing at the default model repository
 triton start
 
 # Interact with model
-triton infer -m llama-3-8b-instruct --prompt "machine learning is"
+triton infer -m llama-3.1-8b-instruct --prompt "machine learning is"
 
 # Profile model with GenAI-Perf
-triton profile -m llama-3-8b-instruct --backend tensorrtllm
+triton profile -m llama-3.1-8b-instruct --backend tensorrtllm
 ```
 
 ## Additional Dependencies for Custom Environments
````
pyproject.toml

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,4 +1,4 @@
-# Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -51,7 +51,7 @@ dependencies = [
     "grpcio>=1.67.0",
     # Use explicit client version matching genai-perf version for tagged release
     "tritonclient[all] == 2.51",
-    "genai-perf @ git+https://github.com/triton-inference-server/perf_analyzer.git@r24.10#subdirectory=genai-perf",
+    "genai-perf @ git+https://github.com/triton-inference-server/perf_analyzer.git@r25.01#subdirectory=genai-perf",
     # Misc deps
     "directory-tree == 0.0.4", # may remove in future
     # https://github.com/docker/docker-py/issues/3256#issuecomment-2376439000
```

src/triton_cli/__init__.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,4 +1,4 @@
-# Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -24,4 +24,4 @@
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
-__version__ = "0.1.1"
+__version__ = "0.1.2"
```

src/triton_cli/common.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-# Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -47,5 +47,5 @@ class TritonCLIException(Exception):
 # Model Repository
 DEFAULT_MODEL_REPO: Path = Path.home() / "models"
 DEFAULT_HF_CACHE: Path = Path.home() / ".cache" / "huggingface"
-HF_CACHE: Path = Path(os.environ.get("TRANSFORMERS_CACHE", DEFAULT_HF_CACHE))
+HF_CACHE: Path = Path(os.environ.get("HF_HOME", DEFAULT_HF_CACHE))
 SUPPORTED_BACKENDS: set = {"vllm", "tensorrtllm"}
```
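This hunk is the "fix HF_HOME setting" part of the commit: the Hugging Face ecosystem has deprecated the `TRANSFORMERS_CACHE` environment variable in favor of `HF_HOME`, so the CLI now reads the latter. A minimal sketch of a resolver that honors both variables during a transition period; the `resolve_hf_cache` helper is hypothetical and not part of this commit:

```python
import os
from pathlib import Path

DEFAULT_HF_CACHE = Path.home() / ".cache" / "huggingface"


def resolve_hf_cache() -> Path:
    # Prefer HF_HOME (the current convention), then fall back to the
    # deprecated TRANSFORMERS_CACHE, then to the default cache location.
    for var in ("HF_HOME", "TRANSFORMERS_CACHE"):
        value = os.environ.get(var)
        if value:
            return Path(value)
    return DEFAULT_HF_CACHE
```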

src/triton_cli/docker/Dockerfile

Lines changed: 29 additions & 3 deletions

```diff
@@ -1,11 +1,37 @@
+# Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
 # TRT-LLM image contains engine building and runtime dependencies
-FROM nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
+FROM nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
 
 # Setup vLLM Triton backend
 RUN mkdir -p /opt/tritonserver/backends/vllm && \
-    git clone -b r24.10 https://github.com/triton-inference-server/vllm_backend.git /tmp/vllm_backend && \
+    git clone -b r25.01 https://github.com/triton-inference-server/vllm_backend.git /tmp/vllm_backend && \
     cp -r /tmp/vllm_backend/src/* /opt/tritonserver/backends/vllm && \
     rm -r /tmp/vllm_backend
 
 # vLLM runtime dependencies
-RUN pip install "vllm==0.5.3.post1" "setuptools==74.0.0"
+RUN pip install "vllm==0.6.3.post1" "setuptools==74.0.0"
```

src/triton_cli/repository.py

Lines changed: 5 additions & 1 deletion

```diff
@@ -1,4 +1,4 @@
-# Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -372,6 +372,10 @@ def __build_trtllm_engine(self, huggingface_id: str, engines_path: Path):
         # TODO: Investigate if LLM is internally saving a copy to a temp dir
         engine.save(str(engines_path))
 
+        # TRT-LLM v0.17.0+ requires explicitly calling shutdown() to stop the
+        # MPI blocking thread; otherwise the engine process won't exit.
+        engine.shutdown()
+
     def __create_model_repository(
         self, name: str, version: int = 1, backend: str = None
     ):
```
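For context, here is a minimal sketch of the build-save-shutdown sequence this hunk completes, assuming the `tensorrt_llm` `LLM` API as of v0.17.0; the model ID and engine path are illustrative, and the CLI's actual arguments may differ:

```python
from tensorrt_llm import LLM

# Build a TRT-LLM engine from a Hugging Face checkpoint (illustrative ID).
engine = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Serialize the built engine to disk (illustrative path).
engine.save("/root/engines/llama-3.1-8b-instruct")

# With TRT-LLM v0.17.0+, explicitly stop the MPI blocking thread so the
# process can exit; without this call the engine process hangs on exit.
engine.shutdown()
```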

src/triton_cli/templates/trt_llm/postprocessing/1/model.py

Lines changed: 1 addition & 7 deletions

```diff
@@ -132,13 +132,7 @@ def execute(self, requests):
             for batch_idx, beam_tokens in enumerate(token_batch):
                 for beam_idx, tokens in enumerate(beam_tokens):
                     seq_len = sequence_lengths[idx][batch_idx][beam_idx]
-                    # Exclude fake ids in multimodal models
-                    fake_id_len = 0
-                    for i in range(seq_len):
-                        if tokens[i] < self.tokenizer.vocab_size:
-                            fake_id_len = i
-                            break
-                    list_of_tokens.append(tokens[fake_id_len:seq_len])
+                    list_of_tokens.append(tokens[:seq_len])
                 req_idx_offset += 1
 
         req_idx_offsets.append(req_idx_offset)
```
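The deleted loop skipped leading "fake" placeholder token IDs emitted for multimodal models; the replacement keeps the whole valid prefix of each beam. A small self-contained sketch of the new slicing behavior, using made-up token IDs and lengths:

```python
import numpy as np

# Padded token IDs, shape (batch, beams, max_len); values are made up.
tokens = np.array([[[101, 7592, 2088, 0, 0],
                    [101, 7592, 999, 2088, 0]]])
# Valid length of each beam, shape (batch, beams).
sequence_lengths = np.array([[3, 4]])

for batch_idx, beam_tokens in enumerate(tokens):
    for beam_idx, beam in enumerate(beam_tokens):
        seq_len = sequence_lengths[batch_idx][beam_idx]
        # Keep only the valid prefix; trailing padding is dropped.
        print(beam[:seq_len])  # [101 7592 2088], then [101 7592 999 2088]
```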

src/triton_cli/templates/trt_llm/postprocessing/config.pbtxt

Lines changed: 1 addition & 1 deletion

```diff
@@ -67,4 +67,4 @@ instance_group [
     count: ${postprocessing_instance_count}
     kind: KIND_CPU
   }
-]
\ No newline at end of file
+]
```
