Assumes you have the following set up: Docker, the Nvidia Container Toolkit, an Nvidia GPU, and conda (this project was tested on Ubuntu 22.04).
WARNING: It is strongly recommended to use a separate Python environment, as `tensorrt` installation/updates can mess up existing packages, and it is easier to start over from a clean install.
cd $PROJECT_DIR # i.e. the repo root folder
conda create -n compile python=3.12
conda activate compile
pip install -r requirements_compile.txt

cd $PROJECT_DIR
docker compose build
# Start the container running idly in the background.
docker compose up -d

# The FastAPI application serving as a proxy to the triton server, plus some start scripts
app/
# The model repository for Triton
# NOTE: Follows a strict directory structure of `models/<model_name>/<model_version>/`.
# The `<model_name>` MUST agree with the `name` in the `config.pbtxt` file (see the example below).
models/
# Jupyter notebooks (to run locally for compiling models and testing)
notebooks/
# Helper functions/scripts
src/
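For reference, the directory-name/`name` agreement looks like this in a minimal `config.pbtxt` for `simple_net`. The repo's own `models/simple_net/config.pbtxt` is the source of truth; the tensor names, dtypes, and dims below are illustrative placeholders only:

```
# Lives at models/simple_net/config.pbtxt - illustrative placeholders only.
name: "simple_net"            # MUST match the models/simple_net/ directory name
platform: "pytorch_libtorch"  # TorchScript model served by the libtorch backend
max_batch_size: 8
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 3 ]
  }
]
```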
Run `notebooks/01_compile_pt__simple_net.ipynb` to export the simple neural network model and output a TorchScript `model.pt` file under `models/simple_net/1/`.
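The notebook is the source of truth for the export; conceptually it boils down to something like the sketch below (the `SimpleNet` definition and input shape here are placeholders, not the project's actual network):

```python
# Sketch of a TorchScript export into the Triton model repository.
# SimpleNet and the example input shape are placeholders for illustration.
from pathlib import Path

import torch
import torch.nn as nn


class SimpleNet(nn.Module):  # placeholder definition
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 3)

    def forward(self, x):
        return self.fc(x)


model = SimpleNet().eval()
example_input = torch.randn(1, 4)

# Trace to TorchScript and save it where Triton expects the version-1 artifact.
traced = torch.jit.trace(model, example_input)
out_dir = Path("models/simple_net/1")
out_dir.mkdir(parents=True, exist_ok=True)
traced.save(str(out_dir / "model.pt"))
```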
# Enter the container
docker exec -it triton /bin/bash
# Start the triton server (simple_net model only)
./dev_start.sh
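`dev_start.sh` is the source of truth for the exact flags; it presumably boils down to a `tritonserver` invocation roughly like this (the repository path inside the container is an assumption):

```bash
# Sketch only - see dev_start.sh for the real command.
tritonserver \
  --model-repository=/models \
  --model-control-mode=explicit \
  --load-model=simple_net
```

With `--model-control-mode=explicit`, only the models named via `--load-model` are loaded, which is why later steps append more `--load-model` flags.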
Use `notebooks/00_test_fastapi_proxy_output.ipynb` to validate that the Triton outputs are as expected.

Run `notebooks/02_compile_onnx__finbert.ipynb`. This will:
- Export the FinBERT model to ONNX (a rough sketch of this step follows this list)
- Optimize the ONNX model using TensorRT and save it to the model repository
- Save the tokenizer as a "model" in the model repository
- Note that the prediction pipeline (a.k.a. "ensemble" model in Nvidia lingo) is already set up in `models/finbert/`.
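As a rough illustration of the export step (the notebook is authoritative; the model ID, output path, tensor names, and opset below are assumptions):

```python
# Sketch of exporting FinBERT to ONNX for the Triton model repository.
# Model ID, output path, tensor names, and opset are assumptions - follow the notebook.
from pathlib import Path

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "ProsusAI/finbert"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()
model.config.return_dict = False  # return plain tuples, which trace more cleanly

dummy = tokenizer("TensorRT makes inference fast.", return_tensors="pt")
out_path = Path("models/finbert-model/1/model.onnx")  # assumed location
out_path.parent.mkdir(parents=True, exist_ok=True)

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    str(out_path),
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=17,
)
```

The TensorRT optimization and the tokenizer "model" are handled by the same notebook and are not sketched here.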
Before running it, modify the `tritonserver` command in `dev_start.sh` to also load the FinBERT models:
--load-model=finbert-model \
--load-model=finbert-tokenizer \
--load-model=finbert \

Validate using `notebooks/00_test_fastapi_proxy_output.ipynb`.
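In addition to the notebook, you can sanity-check the server by hitting Triton's HTTP endpoint directly (assuming port 8000 is published by the compose file). The input/output tensor names and shapes below are assumptions; check `models/finbert/config.pbtxt` for the real ones:

```python
# Hedged sketch of a direct request to the `finbert` ensemble over Triton's HTTP API.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Raw text goes in as a BYTES tensor; the tensor name and shape are assumptions.
text = np.array([["Shares surged after the earnings beat."]], dtype=object)
inp = httpclient.InferInput("TEXT", [1, 1], "BYTES")
inp.set_data_from_numpy(text)

result = client.infer("finbert", inputs=[inp])
print(result.as_numpy("probabilities"))  # output name is an assumption
```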
Before running it again, modify the `tritonserver` command in `dev_start.sh` to also load the TensorRT-optimized FinBERT model:
--load-model=finbert-trt-model \

Modify the `instance_group` in `models/finbert-trt-model/config.pbtxt` by varying the number of instances and observing GPU usage via `nvidia-smi`. Validate that GPU memory does not increase linearly with the number of instances, but by less - i.e. that the weight sharing is working (a sketch for generating concurrent load follows the reference links at the end):
instance_group [
{
kind: KIND_GPU
count: 4
}
]

References:
- https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html
- https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md
- https://pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/torch_compile_transformers_example.html
- https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_aio_infer_client.py
- https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags
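Finally, to actually exercise several model instances at once while watching `nvidia-smi` in another terminal, a small async load generator along the lines of the `simple_http_aio_infer_client.py` example linked above can be used. The model name and tensor names are the same assumptions as in the earlier sketch; point it at whichever model you loaded:

```python
# Hedged sketch: fire concurrent requests so multiple instance_group instances get used.
import asyncio

import numpy as np
import tritonclient.http.aio as aioclient


async def one_request(client):
    text = np.array([["Margins contracted despite record revenue."]], dtype=object)
    inp = aioclient.InferInput("TEXT", [1, 1], "BYTES")  # name/shape are assumptions
    inp.set_data_from_numpy(text)
    await client.infer("finbert", inputs=[inp])  # model name is an assumption


async def main(n_requests: int = 64) -> None:
    client = aioclient.InferenceServerClient(url="localhost:8000")
    try:
        # Issue all requests concurrently; Triton spreads them across instances.
        await asyncio.gather(*(one_request(client) for _ in range(n_requests)))
    finally:
        await client.close()


asyncio.run(main())
```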