Commit b8de595

feat: integrate openai frontend into triton cli (#106)
1 parent 3a3d1c5 commit b8de595

File tree: 12 files changed, +516 −85 lines


README.md

Lines changed: 64 additions & 1 deletion
@@ -38,7 +38,7 @@ Server.

 ## Table of Contents

-| [Pre-requisites](#pre-requisites) | [Installation](#installation) | [Quickstart](#quickstart) | [Serving LLM Models](#serving-llm-models) | [Serving a vLLM Model](#serving-a-vllm-model) | [Serving a TRT-LLM Model](#serving-a-trt-llm-model) | [Additional Dependencies for Custom Environments](#additional-dependencies-for-custom-environments) | [Known Limitations](#known-limitations) |
+| [Pre-requisites](#pre-requisites) | [Installation](#installation) | [Quickstart](#quickstart) | [Serving LLM Models](#serving-llm-models) | [Serving a vLLM Model](#serving-a-vllm-model) | [Serving a TRT-LLM Model](#serving-a-trt-llm-model) | [Serving a LLM model with OpenAI API](#serving-a-llm-model-with-openai-api) | [Additional Dependencies for Custom Environments](#additional-dependencies-for-custom-environments) | [Known Limitations](#known-limitations) |

 ## Pre-requisites

@@ -295,6 +295,69 @@ triton infer -m llama-3.1-8b-instruct --prompt "machine learning is"
 # Profile model with GenAI-Perf
 triton profile -m llama-3.1-8b-instruct --backend tensorrtllm
 ```
+## Serving a LLM model with OpenAI API
+
+The Triton CLI can also start the Triton server with an [OpenAI RESTful API Frontend](https://github.com/triton-inference-server/server/tree/main/python/openai).
+
+Triton Server's OpenAI Frontend supports the following API endpoints:
+
+- [POST /v1/chat/completions](https://platform.openai.com/docs/api-reference/chat/create)
+- [POST /v1/completions](https://platform.openai.com/docs/api-reference/completions/create)
+- [GET /v1/models](https://platform.openai.com/docs/api-reference/models/list)
+- [GET /v1/models/{model_name}](https://platform.openai.com/docs/api-reference/models/retrieve)
+- GET /metrics
+
+To start the Triton server with an OpenAI RESTful API Frontend, add `--frontend openai` to the `triton start` command:
+```bash
+triton start --frontend openai
+```
+By default, the server and its OpenAI API can be accessed at `http://localhost:9000`.
+
+> [!NOTE]
+> There may be more than one LLM model in the model repository, and each model may have its own tokenizer_config.json.
+> OpenAI's `/v1/chat/completions` API requires a chat template from a tokenizer. By default, Triton CLI will
+> automatically search the model repository for a tokenizer to use for the chat template. If you'd like to use
+> a specific tokenizer's chat template, specify the tokenizer with `--openai-chat-template-tokenizer {huggingface id or path to the tokenizer directory}`.
+>
+> e.g. `triton start --frontend openai --openai-chat-template-tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct`
+
+#### Example
+
+```bash
+docker run -ti \
+  --gpus all \
+  --network=host \
+  --shm-size=1g --ulimit memlock=-1 \
+  -v /tmp:/tmp \
+  -v ${HOME}/models:/root/models \
+  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
+  nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3

+# Install the Triton CLI
+pip install git+https://github.com/triton-inference-server/triton_cli.git@main

+# Authenticate with huggingface for restricted models like Llama-2 and Llama-3
+huggingface-cli login

+# Build TRT LLM engine and generate a Triton model repository pointing at it
+triton remove -m all
+triton import -m llama-3.1-8b-instruct --backend tensorrtllm
+# For vllm backend:
+# triton import -m llama-3.1-8b-instruct --backend vllm

+# Start Triton with an OpenAI RESTful API Frontend
+triton start --frontend openai

+# Interact with model at http://localhost:9000
+curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
+  "model": "llama-3.1-8b-instruct",
+  "messages": [{"role": "user", "content": "What is machine learning?"}]
+}'

+# Profile model with GenAI-Perf
+triton profile -m llama-3.1-8b-instruct --service-kind openai --endpoint-type chat --url localhost:9000 --streaming
+```
+
 ## Additional Dependencies for Custom Environments

 When using Triton CLI outside of official Triton NGC containers, you may
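Because the new frontend speaks the OpenAI protocol, the same endpoints can also be exercised from Python with the `openai` client library instead of `curl`. The snippet below is a minimal sketch, assuming `pip install openai`, a server started with `triton start --frontend openai` on the default port 9000, and the `llama-3.1-8b-instruct` model imported as in the README example above:

```python
# Minimal sketch: talk to the OpenAI-compatible frontend with the `openai` client.
# Assumes the server from the README example above is running on localhost:9000.
from openai import OpenAI

# The client requires an api_key argument; a local frontend like this one
# typically does not check it.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

# GET /v1/models
for model in client.models.list():
    print(model.id)

# POST /v1/chat/completions
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is machine learning?"}],
)
print(response.choices[0].message.content)
```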

src/triton_cli/common.py

Lines changed: 7 additions & 0 deletions
@@ -38,12 +38,19 @@ class TritonCLIException(Exception):

 # Server
 DEFAULT_TRITONSERVER_PATH: str = "tritonserver"
+DEFAULT_TRITONSERVER_OPENAI_FRONTEND_PATH: str = (
+    "/opt/tritonserver/python/openai/openai_frontend/main.py"
+)
+
 ## Server Docker
 DEFAULT_SHM_SIZE: str = "1G"
 # A custom image containing both vLLM and TRT-LLM dependencies,
 # defined in triton_cli/docker/Dockerfile.
 DEFAULT_TRITONSERVER_IMAGE: str = "triton_llm"

+# Serving Frontend
+SUPPORTED_FRONTEND: set = {"kserve", "openai"}
+
 # Model Repository
 DEFAULT_MODEL_REPO: Path = Path.home() / "models"
 DEFAULT_HF_CACHE: Path = Path.home() / ".cache" / "huggingface"
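For orientation, a small sketch of how these new constants can be inspected (assuming `triton_cli` is installed; the frontend script path only exists inside the Triton server container images):

```python
# Sketch: inspect the constants added above. Assumes triton_cli is installed.
from pathlib import Path

from triton_cli.common import (
    DEFAULT_TRITONSERVER_OPENAI_FRONTEND_PATH,
    SUPPORTED_FRONTEND,
)

print(SUPPORTED_FRONTEND)  # {'kserve', 'openai'}

# The OpenAI frontend script ships inside the Triton server container image;
# outside a container this check will usually print False.
print(Path(DEFAULT_TRITONSERVER_OPENAI_FRONTEND_PATH).is_file())
```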

src/triton_cli/parser.py

Lines changed: 18 additions & 1 deletion
@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-# Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -40,6 +40,7 @@
     DEFAULT_MODEL_REPO,
     DEFAULT_TRITONSERVER_IMAGE,
     LOGGER_NAME,
+    SUPPORTED_FRONTEND,
     TritonCLIException,
 )
 from triton_cli.client.client import InferenceServerException, TritonClient
@@ -156,6 +157,22 @@ def add_server_start_args(subcommands):
         default=300,
         help="Maximum number of seconds to wait for server startup. (Default: 300)",
     )
+    subcommand.add_argument(
+        "--frontend",
+        choices=SUPPORTED_FRONTEND,
+        type=str,
+        required=False,
+        default="kserve",
+        help=f"The inference API frontend to use when starting the triton server. Default is the KServe api frontend. Choices: '{SUPPORTED_FRONTEND}'.",
+    )
+    subcommand.add_argument(
+        "--openai-chat-template-tokenizer",
+        type=str,
+        required=False,
+        # TODO: Should probably set a default tokenizer, like 'hf-internal-testing/llama-tokenizer', since not all tokenizers have a chat template
+        default=None,
+        help="HuggingFace ID or local folder path of the tokenizer to use for chat templates with the OpenAI API frontend. If no tokenizer is specified, it searches for and selects an LLM model's tokenizer from the model repository.",
+    )


 def add_model_args(subcommands):
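As an illustration of how the new flags behave, here is a self-contained argparse sketch that mirrors the choices and defaults added above (standalone, not the actual CLI wiring):

```python
# Standalone sketch mirroring the new `triton start` options added above.
import argparse

SUPPORTED_FRONTEND = {"kserve", "openai"}

parser = argparse.ArgumentParser(prog="triton start")
parser.add_argument("--frontend", choices=SUPPORTED_FRONTEND, type=str, default="kserve")
parser.add_argument("--openai-chat-template-tokenizer", type=str, default=None)

args = parser.parse_args(
    ["--frontend", "openai",
     "--openai-chat-template-tokenizer", "meta-llama/Meta-Llama-3.1-8B-Instruct"]
)
print(args.frontend)                        # openai
print(args.openai_chat_template_tokenizer)  # meta-llama/Meta-Llama-3.1-8B-Instruct

# Anything outside SUPPORTED_FRONTEND is rejected by argparse with
# "invalid choice", so unsupported frontends fail at parse time.
```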

src/triton_cli/server/server_config.py

Lines changed: 65 additions & 4 deletions
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3

-# Copyright 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2020-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -14,6 +14,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+from triton_cli.common import (
+    DEFAULT_TRITONSERVER_PATH,
+    DEFAULT_TRITONSERVER_OPENAI_FRONTEND_PATH,
+)
+

 class TritonServerConfig:
     """
@@ -73,12 +78,19 @@ class TritonServerConfig:
         "tensorflow-version",
     ]

-    def __init__(self):
+    def __init__(self, server_path=None):
         """
         Construct TritonServerConfig
+
+        Parameters
+        ----------
+        server_path: string
+            path to the triton server binary. Default is "tritonserver" if unset.
         """

         self._server_args = {k: None for k in self.server_arg_keys}
+        self._server_path = server_path if server_path else DEFAULT_TRITONSERVER_PATH
+        self._server_name = "Triton Inference Server"

     @classmethod
     def allowed_keys(cls):
@@ -172,6 +184,16 @@ def server_args(self):

         return self._server_args

+    def server_path(self) -> str:
+        """
+        Returns
+        -------
+        str
+            A path to the triton server binary or script
+        """
+
+        return self._server_path
+
     # TODO: Investigate what parameters are supported with TRT LLM's launching style.
     # For example, explicit launch mode is not. See the TRTLLMUtils class for a list of
     # supported args.
@@ -231,6 +253,45 @@ def __setitem__(self, key, value):
             self._server_args[kebab_cased_key] = value
         else:
             raise Exception(
-                f"The argument '{key}' to the Triton Inference "
-                "Server is not currently supported."
+                f"The argument '{key}' to the {self._server_name}"
+                " is not currently supported."
             )
+
+
+class TritonOpenAIServerConfig(TritonServerConfig):
+    """
+    A config class to set arguments to the Triton Inference
+    Server with OpenAI RESTful API. An argument set to None will use the server default.
+    """
+
+    server_arg_keys = [
+        # triton server args
+        "tritonserver-log-verbose-level",
+        "host",
+        "backend",
+        "tokenizer",
+        "model-repository",
+        # uvicorn args
+        "openai-port",
+        "uvicorn-log-level",
+        # kserve frontend args
+        "enable-kserve-frontends",
+        "kserve-http-port",
+        "kserve-grpc-port",
+    ]
+
+    def __init__(self, server_path=None):
+        """
+        Construct TritonOpenAIServerConfig
+
+        Parameters
+        ----------
+        server_path: string
+            path to the Triton OpenAI Server python script. Default is "/opt/tritonserver/python/openai/openai_frontend/main.py" if unset.
+        """
+
+        self._server_args = {k: None for k in self.server_arg_keys}
+        self._server_path = (
+            server_path if server_path else DEFAULT_TRITONSERVER_OPENAI_FRONTEND_PATH
+        )
+        self._server_name = "Triton Inference Server with OpenAI RESTful API"
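A short usage sketch of the new config class (assuming `triton_cli` is installed and importable); it exercises only the key validation and the `server_path()` accessor introduced above:

```python
# Sketch: build an OpenAI-frontend server config and read back its launch path.
# Assumes triton_cli is installed so the import below resolves.
from triton_cli.server.server_config import TritonOpenAIServerConfig

config = TritonOpenAIServerConfig()
config["model-repository"] = "/root/models"
config["tokenizer"] = "meta-llama/Meta-Llama-3.1-8B-Instruct"
config["openai-port"] = "9000"

# With no server_path argument, the config points at the bundled frontend script.
print(config.server_path())
# /opt/tritonserver/python/openai/openai_frontend/main.py

# Keys outside server_arg_keys are rejected by __setitem__:
# config["not-a-real-flag"] = "1"  # raises "... is not currently supported."
```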

src/triton_cli/server/server_docker.py

Lines changed: 1 addition & 3 deletions
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3

-# Copyright 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2020-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -23,7 +23,6 @@
 from .server_utils import TritonServerUtils
 from triton_cli.common import (
     HF_CACHE,
-    DEFAULT_TRITONSERVER_PATH,
     DEFAULT_TRITONSERVER_IMAGE,
     LOGGER_NAME,
 )
@@ -164,7 +163,6 @@ def start(self, env=None):
         }
         # Construct run command
         command = self._server_utils.get_launch_command(
-            tritonserver_path=DEFAULT_TRITONSERVER_PATH,
             server_config=self._server_config,
             cmd_as_list=False,
             env_cmds=env_cmds,

src/triton_cli/server/server_factory.py

Lines changed: 56 additions & 18 deletions
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3

-# Copyright 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2020-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -15,19 +15,17 @@
 # limitations under the License.

 import logging
-import os
 import shutil

 from .server_local import TritonServerLocal
 from .server_docker import TritonServerDocker
-from .server_config import TritonServerConfig
+from .server_config import TritonServerConfig, TritonOpenAIServerConfig
 from triton_cli.common import (
     DEFAULT_SHM_SIZE,
-    DEFAULT_TRITONSERVER_PATH,
     LOGGER_NAME,
     TritonCLIException,
 )
-
+from .server_utils import TRTLLMUtils, VLLMUtils

 logger = logging.getLogger(LOGGER_NAME)

@@ -82,7 +80,7 @@ def create_server_docker(
         )

     @staticmethod
-    def create_server_local(path, config, gpus=None):
+    def create_server_local(config, gpus=None):
         """
         Parameters
         ----------
@@ -99,7 +97,7 @@ def create_server_local(path, config, gpus=None):
         TritonServerLocal
         """

-        return TritonServerLocal(path=path, config=config, gpus=gpus)
+        return TritonServerLocal(config=config, gpus=gpus)

     @staticmethod
     def get_server_handle(config, gpus=None):
@@ -130,15 +128,10 @@ def get_server_handle(config, gpus=None):

     @staticmethod
     def _get_local_server_handle(config, gpus):
-        tritonserver_path = DEFAULT_TRITONSERVER_PATH
-        TritonServerFactory._validate_triton_server_path(tritonserver_path)
+        triton_config = TritonServerFactory._get_triton_server_config(config)
+        TritonServerFactory._validate_triton_server_path(triton_config.server_path())

-        triton_config = TritonServerConfig()
-        triton_config["model-repository"] = config.model_repository
-        if config.verbose:
-            triton_config["log-verbose"] = "1"
         server = TritonServerFactory.create_server_local(
-            path=tritonserver_path,
             config=triton_config,
             gpus=gpus,
         )
@@ -147,10 +140,7 @@

     @staticmethod
     def _get_docker_server_handle(config, gpus):
-        triton_config = TritonServerConfig()
-        triton_config["model-repository"] = os.path.abspath(config.model_repository)
-        if config.verbose:
-            triton_config["log-verbose"] = "1"
+        triton_config = TritonServerFactory._get_triton_server_config(config)

         server = TritonServerFactory.create_server_docker(
             image=config.image,
@@ -174,3 +164,51 @@ def _validate_triton_server_path(tritonserver_path):
             raise TritonCLIException(
                 f"Either the binary {tritonserver_path} is invalid, not on the PATH, or does not have the correct permissions."
             )
+
+    @staticmethod
+    def _get_triton_server_config(config):
+        if config.frontend == "openai":
+            triton_config = TritonOpenAIServerConfig()
+            triton_config["model-repository"] = config.model_repository
+
+            triton_config["tokenizer"] = (
+                TritonServerFactory._get_openai_chat_template_tokenizer(config)
+            )
+
+            if config.verbose:
+                triton_config["tritonserver-log-verbose-level"] = "1"
+        else:
+            triton_config = TritonServerConfig()
+            triton_config["model-repository"] = config.model_repository
+            if config.verbose:
+                triton_config["log-verbose"] = "1"
+
+        return triton_config
+
+    @staticmethod
+    def _get_openai_chat_template_tokenizer(config):
+        """
+        Raises an exception if a tokenizer is not specified and cannot be found for the OpenAI frontend
+        """
+        if config.openai_chat_template_tokenizer:
+            return config.openai_chat_template_tokenizer
+
+        logger.info(
+            "OpenAI frontend's tokenizer for chat template is not specified, searching for an available tokenizer in the model repository."
+        )
+        trtllm_utils = TRTLLMUtils(config.model_repository)
+        vllm_utils = VLLMUtils(config.model_repository)

+        if trtllm_utils.has_trtllm_model():
+            tokenizer_path = trtllm_utils.get_engine_path()
+        elif vllm_utils.has_vllm_model():
+            tokenizer_path = vllm_utils.get_vllm_model_huggingface_id_or_path()
+        else:
+            raise TritonCLIException(
+                "Unable to find a tokenizer to start the Triton OpenAI RESTful API, please use '--openai-chat-template-tokenizer' to specify a tokenizer."
+            )

+        logger.info(
+            f"Found tokenizer in '{tokenizer_path}' after searching for the tokenizer in the model repository"
+        )
+        return tokenizer_path
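To make the frontend branching concrete, here is a hedged sketch of driving `_get_triton_server_config` with a stand-in for the parsed CLI arguments (`SimpleNamespace` mimics the attributes the factory reads; passing an explicit tokenizer avoids the model-repository search):

```python
# Sketch: exercise the new frontend branching with a stand-in args object.
# Assumes triton_cli is installed so the import below resolves.
from types import SimpleNamespace

from triton_cli.server.server_factory import TritonServerFactory

args = SimpleNamespace(
    frontend="openai",
    model_repository="/root/models",
    openai_chat_template_tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct",
    verbose=True,
)

config = TritonServerFactory._get_triton_server_config(args)
print(type(config).__name__)  # TritonOpenAIServerConfig
print(config.server_path())   # /opt/tritonserver/python/openai/openai_frontend/main.py

# With frontend="kserve" (the default), a plain TritonServerConfig is returned
# and the chat-template tokenizer lookup is skipped entirely.
```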

0 commit comments
