
Commit 9ff8589

ksuma2109 and Suma Kasa authored
Release notes for LMIv17 (#2965)
Co-authored-by: Suma Kasa <[email protected]>
1 parent 7d740dc commit 9ff8589

3 files changed: 23 additions, 13 deletions


serving/docs/lmi/release_notes.md

Lines changed: 17 additions & 1 deletion
@@ -3,10 +3,26 @@
 Below are the release notes for recent Large Model Inference (LMI) images for use on SageMaker.
 For details on historical releases, refer to the [Github Releases page](https://github.com/deepjavalibrary/djl-serving/releases).
 
-## LMI V16 (DJL-Serving 0.34.0)
+## LMI V17 (DJL-Serving 0.35.0)
 
 Meet your brand new image! 💿
 
+#### LMI (vLLM) Image – 9-30-2025
+```
+763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.35.0-lmi17.0.0-cu128
+```
+* vLLM has been upgraded to `0.11.1`
+* Going forward, [async mode](https://github.com/deepjavalibrary/djl-serving/blob/0.35.0-dlc/serving/docs/lmi/user_guides/vllm_user_guide.md#async-mode-configurations) is the default configuration for the vLLM handler
+* New models supported: DeepSeek V3.2, Qwen 3 VL, MiniMax-M2
+* LoRA is supported in async mode for MoE models: Llama 4 Scout, Qwen3, DeepSeek, GPT-OSS
+* EAGLE 3 support added for GPT-OSS models
+* Support for on-host KV cache offloading with [LMCache](https://github.com/deepjavalibrary/djl-serving/blob/0.35.0-dlc/serving/docs/lmi/user_guides/lmcache_user_guide.md) (LMCache v1 is in an experimental phase)
+
+##### Considerations
+* Our benchmarks show improved performance for LMI V17 over V16 across all models benchmarked (DeepSeek R1 Distill Llama, Llama 3.1 8B Instruct, Mistral 7B Instruct v0.3), except for the Qwen3 Coder 30B A3B model at a concurrency of 128. We are working with the vLLM community to understand the root cause and potential fixes.
+
+## LMI V16 (DJL-Serving 0.34.0)
+
 #### LMI (vLLM) Image – 9-30-2025
 ```
 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.34.0-lmi16.0.0-cu128
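
For readers deploying from these notes, here is a minimal sketch of pulling the v17 image into a SageMaker endpoint with the Python SDK. The model ID, instance type, and endpoint name are placeholder assumptions for illustration, not values from this release.

```python
# Hypothetical deployment sketch for the LMI v17 (vLLM) image on SageMaker.
# The model ID, instance type, and endpoint name are illustrative assumptions.
import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()  # assumes an execution role is resolvable here
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.35.0-lmi17.0.0-cu128"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        "TENSOR_PARALLEL_DEGREE": "max",                    # shard across all visible GPUs
    },
)

# Creates a real-time endpoint; size the instance type to your model.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.12xlarge",  # placeholder GPU instance type
    endpoint_name="lmi-v17-demo",    # placeholder endpoint name
)
```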

serving/docs/lmi/user_guides/tool_calling.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
 
 Tool calling is currently supported in LMI through the [vLLM](vllm_user_guide.md) backend only.
 
-Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.10.2/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
+Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.11.1/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
 
 To enable tool calling in LMI, you must set the following environment variable configurations:
 
@@ -12,7 +12,7 @@ OPTION_ENABLE_AUTO_TOOL_CHOICE=true
 OPTION_TOOL_CALL_PARSER=<parser_name>
 ```
 
-You can find built-in tool call parsers [here](https://docs.vllm.ai/en/v0.7.3/features/tool_calling.html#automatic-function-calling).
+You can find built-in tool call parsers [here](https://docs.vllm.ai/en/v0.11.1/features/tool_calling.html#automatic-function-calling).
 
 Additionally, you must provide a chat template that supports tool parsing.
 You can specify a specific chat template using the `OPTION_CHAT_TEMPLATE=<path/to/template>` environment variable.
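
Taking the tool-calling guide above as a reference, the sketch below shows how the documented environment variables and an OpenAI-style request might fit together. The parser name, endpoint name, tool schema, and the exact request payload shape are assumptions for illustration; consult the linked vLLM docs for the parsers that match your model.

```python
# Sketch: enabling tool calling via the documented LMI environment variables,
# then invoking the endpoint with an OpenAI-style chat payload that declares a tool.
# Parser name, endpoint name, and the tool schema are illustrative assumptions.
import json
import boto3

# Pass these (plus HF_MODEL_ID, etc.) as the container env when creating the
# Model, as in the deployment sketch earlier.
tool_calling_env = {
    "OPTION_ENABLE_AUTO_TOOL_CHOICE": "true",
    "OPTION_TOOL_CALL_PARSER": "hermes",  # assumed parser; pick one matching your model
    # "OPTION_CHAT_TEMPLATE": "/opt/ml/model/tool_chat_template.jinja",  # if needed
}

payload = {
    "messages": [{"role": "user", "content": "What's the weather in Seattle?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="lmi-v17-demo",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))
```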

serving/docs/lmi/user_guides/vllm_user_guide.md

Lines changed: 4 additions & 10 deletions
@@ -8,11 +8,11 @@ vLLM expects the model artifacts to be in the [standard HuggingFace format](../d
 
 **Text Generation Models**
 
-Here is the list of text generation models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).
+Here is the list of text generation models supported in [vLLM 0.11.1](https://docs.vllm.ai/en/v0.11.1/models/supported_models.html#decoder-only-language-models).
 
 **Multi Modal Models**
 
-Here is the list of multi-modal models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).
+Here is the list of multi-modal models supported in [vLLM 0.11.1](https://docs.vllm.ai/en/v0.11.1/models/supported_models.html#decoder-only-language-models).
 
 ### Model Coverage in CI
 
@@ -34,7 +34,7 @@ The following set of models are tested in our nightly tests
 
 ## Quantization Support
 
-The quantization techniques supported in vLLM 0.10.2 are listed [here](https://docs.vllm.ai/en/v0.10.2/quantization/supported_hardware.html).
+The quantization techniques supported in vLLM 0.11.1 are listed [here](https://docs.vllm.ai/en/latest/features/quantization/).
 
 We recommend that regardless of which quantization technique you are using that you pre-quantize the model.
 Runtime quantization adds additional overhead to the endpoint startup time.
@@ -68,7 +68,7 @@ If you omit the `option.quantize` configuration, then the engine will determine
 
 ## Quick Start Configurations
 
-Starting with LMI v15, the recommended mode for running vLLM is async mode.
+Starting with LMI v17, the default mode for running vLLM is async mode.
 Async mode integrates with the vLLM Async Engine via the OpenAI modules.
 This ensures that LMI's vLLM support is always in parity with upstream vLLM with respect to both engine-configurations and API schemas.
 Async mode will become the default, and only supported mode, in an upcoming release.
@@ -79,19 +79,13 @@ Async mode will become the default, and only supported mode, in an upcoming rele
 
 ```
 engine=Python
-option.async_mode=true
-option.rolling_batch=disable
-option.entryPoint=djl_python.lmi_vllm.vllm_async_service
 option.tensor_parallel_degree=max
 ```
 
 **environment variables**
 
 ```
 HF_MODEL_ID=<model id or model path>
-OPTION_ASYNC_MODE=true
-OPTION_ROLLING_BATCH=disable
-OPTION_ENTRYPOINT=djl_python.lmi_vllm.vllm_async_service
 TENSOR_PARALLEL_DEGREE=max
 ```
 
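To make the simplified quick start above concrete, here is a hedged sketch of the environment map a v17 deployment might use now that async mode is the default. The model ID and the quantization value are assumptions, and `OPTION_QUANTIZE` is shown only as the environment-variable form of the `option.quantize` setting referenced in the hunk above.

```python
# Sketch: environment configuration for the async-default quick start (LMI v17).
# OPTION_ASYNC_MODE, OPTION_ROLLING_BATCH, and OPTION_ENTRYPOINT no longer need
# to be set explicitly, since async mode is now the default for the vLLM handler.
# The model id and quantization value are placeholder assumptions.
lmi_env = {
    "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3",  # placeholder model id
    "TENSOR_PARALLEL_DEGREE": "max",                      # use all GPUs on the host
    # Optional: assumed env form of option.quantize; prefer a pre-quantized
    # checkpoint to avoid runtime-quantization startup overhead.
    # "OPTION_QUANTIZE": "fp8",
}
```

Pass this dict as `env=` when constructing the SageMaker `Model`, as in the earlier deployment sketch.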