Hello.
I have pulled the 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-tensorrtllm0.21.0-cu128 container to serve a Llama-based model.
Here is the content of my serving.properties file:
engine=MPI
option.rolling_batch=trtllm
option.trust_remote_code=true
option.max_input_len=32768
option.max_output_len=32768
option.max_num_tokens=32768
option.max_rolling_batch_size=32
option.tensor_parallel_degree=1
option.enable_streaming=false
The server starts and the launch succeeds:
INFO WorkerPool scaling up min workers by 1 (from 0 to 1) workers. Total range is min 1 to max 1
INFO WorkerThread Starting worker thread WT-0001 for model mymodel (M-0001, READY) on device gpu(0)
INFO ModelServer Initialize BOTH server with: EpollServerSocketChannel.
INFO ModelServer BOTH API bind to: http://0.0.0.0:8080
Any request, with a valid or invalid schema, causes the server to crash. For example, a call of roughly the following shape (sketch below) triggers it.
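For reference, a minimal sketch of the client call (the /invocations path is an assumption on my side; the host, port, and model_name value match the logs below):

import requests

# Hypothetical reproduction: any JSON body that carries a field the
# ChatCompletionRequest schema forbids (here model_name) is enough to
# bring the worker down instead of producing a 4xx response.
payload = {
    "model_name": "mymodel",
    "messages": [{"role": "user", "content": "Hello"}],
}
resp = requests.post("http://localhost:8080/invocations", json=payload)
print(resp.status_code, resp.text)

Server log, from the tail of engine initialization through the crash: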
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4608.01 MiB for execution context memory.
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] gatherContextLogits: 0
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] gatherGenerationLogits: 0
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15380 (MiB)
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] [MemUsageChange] Allocated 95.78 MB GPU memory for runtime buffers.
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] [MemUsageChange] Allocated 128.63 MB GPU memory for decoder.
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 44.52 GiB, available: 23.82 GiB, extraCostMemory: 0.00 GiB
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 5488
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Max KV cache pages per sequence: 4096 [window size=131072]
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Number of tokens per block: 32.
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] [MemUsageChange] Allocated 21.44 GiB for max tokens in paged KV cache (175616).
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:INFO::trtllm service initialized
INFO AsyncRequestManager process is not ready
INFO PyProcess Model [mymodel] initialized.
INFO PyModel mymodel model loaded in 58569 ms.
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:ERROR::Input parsing failed
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:Traceback (most recent call last):
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/lmi_trtllm/trtllm_async_service.py", line 147, in inference
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: processed_request = self.preprocess_requests(inputs)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/lmi_trtllm/trtllm_async_service.py", line 123, in preprocess_requests
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: request = ChatCompletionRequest(**decoded_payload)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/usr/local/lib/python3.12/dist-packages/pydantic/main.py", line 253, in __init__
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:pydantic_core._pydantic_core.ValidationError: 1 validation error for ChatCompletionRequest
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:model_name
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: Extra inputs are not permitted [type=extra_forbidden, input_value='mymodel', input_type=str]
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: For further information visit https://errors.pydantic.dev/2.11/v/extra_forbidden
ERROR Connection Exception reading Output from python process
java.lang.NullPointerException: Cannot invoke "ai.djl.python.engine.Request.addResponse(byte[], java.util.Map)" because "request" is null
at ai.djl.python.engine.AsyncRequestManager.sendInferenceResponse(AsyncRequestManager.java:129) ~[python-0.33.0.jar:?]
at ai.djl.python.engine.AsyncRequestManager.addOutput(AsyncRequestManager.java:86) ~[python-0.33.0.jar:?]
at ai.djl.python.engine.Connection$RequestHandler.channelRead0(Connection.java:460) ~[python-0.33.0.jar:?]
at ai.djl.python.engine.Connection$RequestHandler.channelRead0(Connection.java:443) ~[python-0.33.0.jar:?]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) [netty-codec-4.1.119.Final.jar:4.1.119.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) [netty-codec-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799) [netty-transport-classes-epoll-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.epoll.EpollDomainSocketChannel$EpollDomainUnsafe.epollInReady(EpollDomainSocketChannel.java:138) [netty-transport-classes-epoll-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501) [netty-transport-classes-epoll-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399) [netty-transport-classes-epoll-4.1.119.Final.jar:4.1.119.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.119.Final.jar:4.1.119.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.119.Final.jar:4.1.119.Final]
at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:ERROR::<bound method PythonAsyncEngine.receive_requests of <djl_python.python_async_engine.PythonAsyncEngine object at 0x7fba4a2d4e60>> failed. Details Connection disconnected
ERROR PyProcess predict[init=false] exception: java.util.concurrent.ExecutionException
INFO PyProcess Stop process: 0:123, failure=true
INFO PyProcess Failure count: 0
ERROR PyProcess Restarting python worker
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:ERROR::<bound method PythonAsyncEngine.send_responses of <djl_python.python_async_engine.PythonAsyncEngine object at 0x7fba4a2d4e60>> failed. Details
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:ERROR::djl async engine terminated with error Traceback (most recent call last):
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/python_async_engine.py", line 121, in catch_all
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: func()
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/python_async_engine.py", line 51, in receive_requests
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: inputs, function_name = self._prepare_inputs()
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/python_sync_engine.py", line 101, in _prepare_inputs
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: inputs.read(self.cl_socket)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/inputs.py", line 221, in read
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: prop_size = retrieve_short(conn)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/inputs.py", line 60, in retrieve_short
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: data = retrieve_buffer(conn, 2)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/inputs.py", line 36, in retrieve_buffer
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: raise ValueError("Connection disconnected")
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:ValueError: Connection disconnected
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:INFO::djl async engine terminated
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:INFO::128 - Python process finished
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Refreshed the MPI local session
WARN PyProcess W-123-mymodel-stderr: --------------------------------------------------------------------------
WARN PyProcess W-123-mymodel-stderr: Primary job terminated normally, but 1 process returned
WARN PyProcess W-123-mymodel-stderr: a non-zero exit code. Per user-direction, the job has been aborted.
WARN PyProcess W-123-mymodel-stderr: --------------------------------------------------------------------------
WARN PyProcess W-123-mymodel-stderr: --------------------------------------------------------------------------
WARN PyProcess W-123-mymodel-stderr: mpirun detected that one or more processes exited with non-zero status, thus causing
WARN PyProcess W-123-mymodel-stderr: the job to be terminated. The first process to do so was:
WARN PyProcess W-123-mymodel-stderr:
WARN PyProcess W-123-mymodel-stderr: Process name: [[27905,1],0]
WARN PyProcess W-123-mymodel-stderr: Exit code: 1
WARN PyProcess W-123-mymodel-stderr: --------------------------------------------------------------------------
INFO PyProcess ReaderThread(0) stopped - W-123-mymodel-stdout
INFO PyProcess ReaderThread(0) stopped - W-123-mymodel-stderr
WARN InferenceRequestHandler RequestId=[6d0d67cc-fea0-44d1-8af2-d9563c16a490] Chunk reading interrupted
java.lang.IllegalStateException: Read chunk timeout.
at ai.djl.inference.streaming.ChunkedBytesSupplier.next(ChunkedBytesSupplier.java:79) ~[api-0.33.0.jar:?]
at ai.djl.inference.streaming.ChunkedBytesSupplier.nextChunk(ChunkedBytesSupplier.java:93) ~[api-0.33.0.jar:?]
at ai.djl.serving.http.InferenceRequestHandler.sendOutput(InferenceRequestHandler.java:418) ~[serving-0.33.0.jar:?]
at ai.djl.serving.http.InferenceRequestHandler.lambda$runJob$5(InferenceRequestHandler.java:313) ~[serving-0.33.0.jar:?]
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) [?:?]
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) [?:?]
at java.base/java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:483) [?:?]
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) [?:?]
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182) [?:?]
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) [?:?]
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) [?:?]
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) [?:?]
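The immediate trigger is the pydantic validation failure on model_name; the async engine then treats the broken connection as fatal and restarts the worker. The validation part reproduces in isolation. Below is a minimal sketch; this ChatCompletionRequest is a simplified stand-in for the real class in djl_python, kept only to mirror the extra-fields-forbidden behavior:

from pydantic import BaseModel, ConfigDict

# Simplified stand-in for djl_python's ChatCompletionRequest; the real
# schema has many more fields, but the relevant part is that extra
# inputs are forbidden, so model_name is rejected outright.
class ChatCompletionRequest(BaseModel):
    model_config = ConfigDict(extra="forbid")
    messages: list

# Raises pydantic_core.ValidationError with
#   "Extra inputs are not permitted [type=extra_forbidden, ...]"
# exactly as in the server log above.
ChatCompletionRequest(messages=[], model_name="mymodel")

Dropping model_name from the payload (or, assuming the schema follows the OpenAI chat-completions convention, sending it as model) avoids this particular validation error on the client side. The underlying problem remains, though: a schema violation in a single request should surface as an error response, not terminate the Python worker and reload the model.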