
Commit 9ff8589

ksuma2109 and Suma Kasa authored
Release notes for LMIv17 (#2965)
Co-authored-by: Suma Kasa <[email protected]>
1 parent 7d740dc commit 9ff8589

3 files changed: 23 additions, 13 deletions


serving/docs/lmi/release_notes.md

Lines changed: 17 additions & 1 deletion
@@ -3,10 +3,26 @@
 Below are the release notes for recent Large Model Inference (LMI) images for use on SageMaker.
 For details on historical releases, refer to the [Github Releases page](https://github.com/deepjavalibrary/djl-serving/releases).
 
-## LMI V16 (DJL-Serving 0.34.0)
+## LMI V17 (DJL-Serving 0.35.0)
 
 Meet your brand new image! 💿
 
+#### LMI (vLLM) Image – 9-30-2025
+```
+763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.35.0-lmi17.0.0-cu128
+```
+* vLLM has been upgraded to `0.11.1`
+* Going forward, [async mode](https://github.com/deepjavalibrary/djl-serving/blob/0.35.0-dlc/serving/docs/lmi/user_guides/vllm_user_guide.md#async-mode-configurations) is the default configuration for the vLLM handler
+* New models supported: DeepSeek V3.2, Qwen 3 VL, MiniMax-M2
+* LoRA is supported in async mode for MoE models: Llama 4 Scout, Qwen3, DeepSeek, GPT-OSS
+* EAGLE 3 support added for GPT-OSS models
+* Support for on-host KV cache offloading with [LMCache](https://github.com/deepjavalibrary/djl-serving/blob/0.35.0-dlc/serving/docs/lmi/user_guides/lmcache_user_guide.md) (LMCache v1 is in an experimental phase)
+
+##### Considerations
+* Our benchmarks show improved performance for LMI V17 over V16 across all models benchmarked (DeepSeek R1 Distill Llama, Llama 3.1 8B Instruct, Mistral 7B Instruct v0.3), except for the Qwen3 Coder 30B A3B model at a concurrency of 128. We are working with the vLLM community to understand the root cause and potential fixes.
+
+## LMI V16 (DJL-Serving 0.34.0)
+
 #### LMI (vLLM) Image – 9-30-2025
 ```
 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.34.0-lmi16.0.0-cu128
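
For readers deploying from these notes, here is a minimal sketch of pulling the v17 image into a SageMaker endpoint with the Python SDK. The model ID, instance type, and endpoint name are placeholder assumptions for illustration, not values from this release.

```python
# Hypothetical deployment sketch for the LMI v17 (vLLM) image on SageMaker.
# The model ID, instance type, and endpoint name are illustrative assumptions.
import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()  # assumes an execution role is resolvable here
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.35.0-lmi17.0.0-cu128"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        "TENSOR_PARALLEL_DEGREE": "max",                    # shard across all visible GPUs
    },
)

# Creates a real-time endpoint; size the instance type to your model.
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.12xlarge",  # placeholder GPU instance type
    endpoint_name="lmi-v17-demo",    # placeholder endpoint name
)
```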

serving/docs/lmi/user_guides/tool_calling.md

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
 
 Tool calling is currently supported in LMI through the [vLLM](vllm_user_guide.md) backend only.
 
-Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.10.2/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
+Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.11.1/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
 
 To enable tool calling in LMI, you must set the following environment variable configurations:
 
@@ -12,7 +12,7 @@ OPTION_ENABLE_AUTO_TOOL_CHOICE=true
 OPTION_TOOL_CALL_PARSER=<parser_name>
 ```
 
-You can find built-in tool call parsers [here](https://docs.vllm.ai/en/v0.7.3/features/tool_calling.html#automatic-function-calling).
+You can find built-in tool call parsers [here](https://docs.vllm.ai/en/v0.11.1/features/tool_calling.html#automatic-function-calling).
 
 Additionally, you must provide a chat template that supports tool parsing.
 You can specify a specific chat template using the `OPTION_CHAT_TEMPLATE=<path/to/template>` environment variable.
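
Taking the tool-calling guide above as a reference, the sketch below shows how the documented environment variables and an OpenAI-style request might fit together. The parser name, endpoint name, tool schema, and the exact request payload shape are assumptions for illustration; consult the linked vLLM docs for the parsers that match your model.

```python
# Sketch: enabling tool calling via the documented LMI environment variables,
# then invoking the endpoint with an OpenAI-style chat payload that declares a tool.
# Parser name, endpoint name, and the tool schema are illustrative assumptions.
import json
import boto3

# Pass these (plus HF_MODEL_ID, etc.) as the container env when creating the
# Model, as in the deployment sketch earlier.
tool_calling_env = {
    "OPTION_ENABLE_AUTO_TOOL_CHOICE": "true",
    "OPTION_TOOL_CALL_PARSER": "hermes",  # assumed parser; pick one matching your model
    # "OPTION_CHAT_TEMPLATE": "/opt/ml/model/tool_chat_template.jinja",  # if needed
}

payload = {
    "messages": [{"role": "user", "content": "What's the weather in Seattle?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="lmi-v17-demo",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))
```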

serving/docs/lmi/user_guides/vllm_user_guide.md

Lines changed: 4 additions & 10 deletions
@@ -8,11 +8,11 @@ vLLM expects the model artifacts to be in the [standard HuggingFace format](../d
 
 **Text Generation Models**
 
-Here is the list of text generation models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).
+Here is the list of text generation models supported in [vLLM 0.11.1](https://docs.vllm.ai/en/v0.11.1/models/supported_models.html#decoder-only-language-models).
 
 **Multi Modal Models**
 
-Here is the list of multi-modal models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).
+Here is the list of multi-modal models supported in [vLLM 0.11.1](https://docs.vllm.ai/en/v0.11.1/models/supported_models.html#decoder-only-language-models).
 
 ### Model Coverage in CI
 
@@ -34,7 +34,7 @@ The following set of models are tested in our nightly tests
 
 ## Quantization Support
 
-The quantization techniques supported in vLLM 0.10.2 are listed [here](https://docs.vllm.ai/en/v0.10.2/quantization/supported_hardware.html).
+The quantization techniques supported in vLLM 0.11.1 are listed [here](https://docs.vllm.ai/en/latest/features/quantization/).
 
 We recommend that regardless of which quantization technique you are using that you pre-quantize the model.
 Runtime quantization adds additional overhead to the endpoint startup time.
@@ -68,7 +68,7 @@ If you omit the `option.quantize` configuration, then the engine will determine
 
 ## Quick Start Configurations
 
-Starting with LMI v15, the recommended mode for running vLLM is async mode.
+Starting with LMI v17, the default mode for running vLLM is async mode.
 Async mode integrates with the vLLM Async Engine via the OpenAI modules.
 This ensures that LMI's vLLM support is always in parity with upstream vLLM with respect to both engine-configurations and API schemas.
 Async mode will become the default, and only supported mode, in an upcoming release.
@@ -79,19 +79,13 @@ Async mode will become the default, and only supported mode, in an upcoming rele
 
 ```
 engine=Python
-option.async_mode=true
-option.rolling_batch=disable
-option.entryPoint=djl_python.lmi_vllm.vllm_async_service
 option.tensor_parallel_degree=max
 ```
 
 **environment variables**
 
 ```
 HF_MODEL_ID=<model id or model path>
-OPTION_ASYNC_MODE=true
-OPTION_ROLLING_BATCH=disable
-OPTION_ENTRYPOINT=djl_python.lmi_vllm.vllm_async_service
 TENSOR_PARALLEL_DEGREE=max
 ```
 
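To make the simplified quick start above concrete, here is a hedged sketch of the environment map a v17 deployment might use now that async mode is the default. The model ID and the quantization value are assumptions, and `OPTION_QUANTIZE` is shown only as the environment-variable form of the `option.quantize` setting referenced in the hunk above.

```python
# Sketch: environment configuration for the async-default quick start (LMI v17).
# OPTION_ASYNC_MODE, OPTION_ROLLING_BATCH, and OPTION_ENTRYPOINT no longer need
# to be set explicitly, since async mode is now the default for the vLLM handler.
# The model id and quantization value are placeholder assumptions.
lmi_env = {
    "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.3",  # placeholder model id
    "TENSOR_PARALLEL_DEGREE": "max",                      # use all GPUs on the host
    # Optional: assumed env form of option.quantize; prefer a pre-quantized
    # checkpoint to avoid runtime-quantization startup overhead.
    # "OPTION_QUANTIZE": "fp8",
}
```

Pass this dict as `env=` when constructing the SageMaker `Model`, as in the earlier deployment sketch.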