Hello.
I have pulled the 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-tensorrtllm0.21.0-cu128 container to serve a Llama-based model.
Here is the content of my serving.properties file:
engine=MPI
option.rolling_batch=trtllm
option.trust_remote_code=true
option.max_input_len=32768
option.max_output_len=32768
option.max_num_tokens=32768
option.max_rolling_batch_size=32
option.tensor_parallel_degree=1
option.enable_streaming=false
The server starts and the launch succeeds:
INFO WorkerPool scaling up min workers by 1 (from 0 to 1) workers. Total range is min 1 to max 1
INFO WorkerThread Starting worker thread WT-0001 for model mymodel (M-0001, READY) on device gpu(0)
INFO ModelServer Initialize BOTH server with: EpollServerSocketChannel.
INFO ModelServer BOTH API bind to: http://0.0.0.0:8080
Any request, with a valid or invalid schema, causes the server to crash. For example, a call of roughly the following shape (sketch below) triggers it.
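For reference, a minimal sketch of the client call (the /invocations path is an assumption on my side; the host, port, and model_name value match the logs below):

import requests

# Hypothetical reproduction: any JSON body that carries a field the
# ChatCompletionRequest schema forbids (here model_name) is enough to
# bring the worker down instead of producing a 4xx response.
payload = {
    "model_name": "mymodel",
    "messages": [{"role": "user", "content": "Hello"}],
}
resp = requests.post("http://localhost:8080/invocations", json=payload)
print(resp.status_code, resp.text)

Server log, from the tail of engine initialization through the crash: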
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] [MemUsageChange] Allocated 4608.01 MiB for execution context memory.
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] gatherContextLogits: 0
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] gatherGenerationLogits: 0
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15380 (MiB)
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] [MemUsageChange] Allocated 95.78 MB GPU memory for runtime buffers.
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] [MemUsageChange] Allocated 128.63 MB GPU memory for decoder.
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 44.52 GiB, available: 23.82 GiB, extraCostMemory: 0.00 GiB
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 5488
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] before Create KVCacheManager cacheTransPreAllocaSize:0
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Max KV cache pages per sequence: 4096 [window size=131072]
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Number of tokens per block: 32.
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] [MemUsageChange] Allocated 21.44 GiB for max tokens in paged KV cache (175616).
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:INFO::trtllm service initialized
INFO AsyncRequestManager process is not ready
INFO PyProcess Model [mymodel] initialized.
INFO PyModel mymodel model loaded in 58569 ms.
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:ERROR::Input parsing failed
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:Traceback (most recent call last):
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/lmi_trtllm/trtllm_async_service.py", line 147, in inference
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: processed_request = self.preprocess_requests(inputs)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/lmi_trtllm/trtllm_async_service.py", line 123, in preprocess_requests
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: request = ChatCompletionRequest(**decoded_payload)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/usr/local/lib/python3.12/dist-packages/pydantic/main.py", line 253, in __init__
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:pydantic_core._pydantic_core.ValidationError: 1 validation error for ChatCompletionRequest
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:model_name
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: Extra inputs are not permitted [type=extra_forbidden, input_value='mymodel', input_type=str]
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: For further information visit https://errors.pydantic.dev/2.11/v/extra_forbidden
ERROR Connection Exception reading Output from python process
java.lang.NullPointerException: Cannot invoke "ai.djl.python.engine.Request.addResponse(byte[], java.util.Map)" because "request" is null
at ai.djl.python.engine.AsyncRequestManager.sendInferenceResponse(AsyncRequestManager.java:129) ~[python-0.33.0.jar:?]
at ai.djl.python.engine.AsyncRequestManager.addOutput(AsyncRequestManager.java:86) ~[python-0.33.0.jar:?]
at ai.djl.python.engine.Connection$RequestHandler.channelRead0(Connection.java:460) ~[python-0.33.0.jar:?]
at ai.djl.python.engine.Connection$RequestHandler.channelRead0(Connection.java:443) ~[python-0.33.0.jar:?]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) [netty-codec-4.1.119.Final.jar:4.1.119.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) [netty-codec-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868) [netty-transport-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799) [netty-transport-classes-epoll-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.epoll.EpollDomainSocketChannel$EpollDomainUnsafe.epollInReady(EpollDomainSocketChannel.java:138) [netty-transport-classes-epoll-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501) [netty-transport-classes-epoll-4.1.119.Final.jar:4.1.119.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399) [netty-transport-classes-epoll-4.1.119.Final.jar:4.1.119.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) [netty-common-4.1.119.Final.jar:4.1.119.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.119.Final.jar:4.1.119.Final]
at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:ERROR::<bound method PythonAsyncEngine.receive_requests of <djl_python.python_async_engine.PythonAsyncEngine object at 0x7fba4a2d4e60>> failed. Details Connection disconnected
ERROR PyProcess predict[init=false] exception: java.util.concurrent.ExecutionException
INFO PyProcess Stop process: 0:123, failure=true
INFO PyProcess Failure count: 0
ERROR PyProcess Restarting python worker
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:ERROR::<bound method PythonAsyncEngine.send_responses of <djl_python.python_async_engine.PythonAsyncEngine object at 0x7fba4a2d4e60>> failed. Details
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:ERROR::djl async engine terminated with error Traceback (most recent call last):
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/python_async_engine.py", line 121, in catch_all
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: func()
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/python_async_engine.py", line 51, in receive_requests
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: inputs, function_name = self._prepare_inputs()
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/python_sync_engine.py", line 101, in _prepare_inputs
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: inputs.read(self.cl_socket)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/inputs.py", line 221, in read
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: prop_size = retrieve_short(conn)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/inputs.py", line 60, in retrieve_short
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: data = retrieve_buffer(conn, 2)
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: ^^^^^^^^^^^^^^^^^^^^^^^^
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: File "/tmp/.djl.ai/python/0.33.0/djl_python/inputs.py", line 36, in retrieve_buffer
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>: raise ValueError("Connection disconnected")
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:ValueError: Connection disconnected
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:INFO::djl async engine terminated
INFO PyProcess W-123-mymodel-stdout: [1,0]<stdout>:INFO::128 - Python process finished
INFO PyProcess W-123-mymodel-stdout: [2,0]<stdout>:[TensorRT-LLM][INFO] Refreshed the MPI local session
WARN PyProcess W-123-mymodel-stderr: --------------------------------------------------------------------------
WARN PyProcess W-123-mymodel-stderr: Primary job terminated normally, but 1 process returned
WARN PyProcess W-123-mymodel-stderr: a non-zero exit code. Per user-direction, the job has been aborted.
WARN PyProcess W-123-mymodel-stderr: --------------------------------------------------------------------------
WARN PyProcess W-123-mymodel-stderr: --------------------------------------------------------------------------
WARN PyProcess W-123-mymodel-stderr: mpirun detected that one or more processes exited with non-zero status, thus causing
WARN PyProcess W-123-mymodel-stderr: the job to be terminated. The first process to do so was:
WARN PyProcess W-123-mymodel-stderr:
WARN PyProcess W-123-mymodel-stderr: Process name: [[27905,1],0]
WARN PyProcess W-123-mymodel-stderr: Exit code: 1
WARN PyProcess W-123-mymodel-stderr: --------------------------------------------------------------------------
INFO PyProcess ReaderThread(0) stopped - W-123-mymodel-stdout
INFO PyProcess ReaderThread(0) stopped - W-123-mymodel-stderr
WARN InferenceRequestHandler RequestId=[6d0d67cc-fea0-44d1-8af2-d9563c16a490] Chunk reading interrupted
java.lang.IllegalStateException: Read chunk timeout.
at ai.djl.inference.streaming.ChunkedBytesSupplier.next(ChunkedBytesSupplier.java:79) ~[api-0.33.0.jar:?]
at ai.djl.inference.streaming.ChunkedBytesSupplier.nextChunk(ChunkedBytesSupplier.java:93) ~[api-0.33.0.jar:?]
at ai.djl.serving.http.InferenceRequestHandler.sendOutput(InferenceRequestHandler.java:418) ~[serving-0.33.0.jar:?]
at ai.djl.serving.http.InferenceRequestHandler.lambda$runJob$5(InferenceRequestHandler.java:313) ~[serving-0.33.0.jar:?]
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) [?:?]
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) [?:?]
at java.base/java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:483) [?:?]
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373) [?:?]
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182) [?:?]
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655) [?:?]
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622) [?:?]
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165) [?:?]
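The immediate trigger is the pydantic validation failure on model_name; the async engine then treats the broken connection as fatal and restarts the worker. The validation part reproduces in isolation. Below is a minimal sketch; this ChatCompletionRequest is a simplified stand-in for the real class in djl_python, kept only to mirror the extra-fields-forbidden behavior:

from pydantic import BaseModel, ConfigDict

# Simplified stand-in for djl_python's ChatCompletionRequest; the real
# schema has many more fields, but the relevant part is that extra
# inputs are forbidden, so model_name is rejected outright.
class ChatCompletionRequest(BaseModel):
    model_config = ConfigDict(extra="forbid")
    messages: list

# Raises pydantic_core.ValidationError with
#   "Extra inputs are not permitted [type=extra_forbidden, ...]"
# exactly as in the server log above.
ChatCompletionRequest(messages=[], model_name="mymodel")

Dropping model_name from the payload (or, assuming the schema follows the OpenAI chat-completions convention, sending it as model) avoids this particular validation error on the client side. The underlying problem remains, though: a schema violation in a single request should surface as an error response, not terminate the Python worker and reload the model.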