* Going forward, [async mode](https://github.com/deepjavalibrary/djl-serving/blob/0.35.0-dlc/serving/docs/lmi/user_guides/vllm_user_guide.md#async-mode-configurations) is the default configuration for the vLLM handler
* LoRA is supported in async mode for MoE models: Llama 4 Scout, Qwen3, DeepSeek, GPT-OSS
* EAGLE 3 support added for GPT-OSS Models
* Support for on-host KV cache offloading with [LMCache](https://github.com/deepjavalibrary/djl-serving/blob/0.35.0-dlc/serving/docs/lmi/user_guides/lmcache_user_guide.md) (LMCache v1 is in an experimental phase); a rough configuration sketch follows this list.
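
The sketch below only illustrates the general shape of the upstream vLLM-plus-LMCache integration that on-host KV cache offloading builds on. The `KVTransferConfig` fields and the `LMCACHE_*` environment variables follow upstream LMCache examples rather than LMI-specific configuration and may differ by version, so treat every name here as an assumption and consult the linked LMCache user guide for the options LMI actually exposes.

```python
import os

# Assumed LMCache settings (names follow upstream LMCache examples; verify them
# against the LMCache user guide before relying on them).
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # offload KV blocks to host (CPU) memory
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # host-memory budget in GiB

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Attach the LMCache connector to the vLLM engine; the connector and role values
# mirror upstream LMCache examples and are assumptions here.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```
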
##### Considerations
* Our benchmarks show that LMI V17 outperforms V16 for all benchmarked models (DeepSeek R1 Distill Llama, Llama 3.1 8B Instruct, Mistral 7B Instruct v0.3), with one exception: the Qwen3 Coder 30B A3B model at a concurrency of 128. We are working with the vLLM community to understand the root cause and potential fixes.

**serving/docs/lmi/user_guides/tool_calling.md** (2 additions, 2 deletions)
@@ -2,7 +2,7 @@
Tool calling is currently supported in LMI through the [vLLM](vllm_user_guide.md) backend only.
- Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.10.2/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
+ Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.11.1/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
To enable tool calling in LMI, you must set the following environment variable configurations:
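
As a complement to the server-side environment variables, here is a rough client-side sketch of a tool-calling request against an OpenAI-compatible LMI endpoint. The endpoint URL, model id, and `get_weather` tool are placeholders, and the exact route depends on how the container is deployed, so treat them all as assumptions.

```python
import requests

# Placeholder endpoint: point this at wherever your deployment exposes the
# OpenAI-compatible chat completions route (deployment-specific, assumed here).
URL = "http://localhost:8080/v1/chat/completions"

# A hypothetical tool definition in the OpenAI tools schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "my-model",  # placeholder model id
    "messages": [{"role": "user", "content": "What is the weather in Seattle?"}],
    "tools": tools,
    "tool_choice": "auto",
}

response = requests.post(URL, json=payload, timeout=60).json()

# If the model chose to call a tool, the parsed call appears under tool_calls.
message = response["choices"][0]["message"]
for call in message.get("tool_calls") or []:
    print(call["function"]["name"], call["function"]["arguments"])
```
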

**serving/docs/lmi/user_guides/vllm_user_guide.md** (4 additions, 10 deletions)
@@ -8,11 +8,11 @@ vLLM expects the model artifacts to be in the [standard HuggingFace format](../d
**Text Generation Models**
- Here is the list of text generation models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).
+ Here is the list of text generation models supported in [vLLM 0.11.1](https://docs.vllm.ai/en/v0.11.1/models/supported_models.html#decoder-only-language-models).
**Multi Modal Models**
- Here is the list of multi-modal models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).
+ Here is the list of multi-modal models supported in [vLLM 0.11.1](https://docs.vllm.ai/en/v0.11.1/models/supported_models.html#decoder-only-language-models).
### Model Coverage in CI
@@ -34,7 +34,7 @@ The following set of models are tested in our nightly tests
## Quantization Support
- The quantization techniques supported in vLLM 0.10.2 are listed [here](https://docs.vllm.ai/en/v0.10.2/quantization/supported_hardware.html).
+ The quantization techniques supported in vLLM 0.11.1 are listed [here](https://docs.vllm.ai/en/latest/features/quantization/).
We recommend that, regardless of which quantization technique you use, you pre-quantize the model.
Runtime quantization adds additional overhead to the endpoint startup time.
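
To make the pre-quantization recommendation concrete, the sketch below uses AutoAWQ as one example of an offline quantization toolkit. AutoAWQ is not part of LMI, the model id, output path, and quantization settings are placeholders, and the calls follow AutoAWQ's published examples, so treat this as an illustration of the workflow rather than a prescribed tool. The resulting checkpoint can then be uploaded (for example to S3) and served with the matching `option.quantize` value.

```python
# Illustrative offline AWQ quantization (pip install autoawq); all names below
# are placeholders, not LMI-mandated values.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
quant_path = "mistral-7b-instruct-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Run AWQ calibration and write the quantized checkpoint next to its tokenizer,
# so the serving configuration can point at a ready-to-load artifact.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
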
@@ -68,7 +68,7 @@ If you omit the `option.quantize` configuration, then the engine will determine
## Quick Start Configurations
- Starting with LMI v15, the recommended mode for running vLLM is async mode.
+ Starting with LMI v17, the default mode for running vLLM is async mode.
Async mode integrates with the vLLM Async Engine via the OpenAI modules.
This ensures that LMI's vLLM support is always in parity with upstream vLLM with respect to both engine configurations and API schemas.
Async mode will become the default, and only supported mode, in an upcoming release.
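
Because async mode speaks the OpenAI chat completions schema, a standard OpenAI client can typically be pointed at the endpoint. The sketch below is a minimal illustration: the base URL and model id are placeholders that depend on how the endpoint is deployed, so adjust them to your setup.

```python
# Minimal sketch using the OpenAI Python SDK against an async-mode LMI endpoint.
# base_url and model are placeholders; api_key is unused here but required by the SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")

response = client.chat.completions.create(
    model="my-model",  # placeholder model id
    messages=[{"role": "user", "content": "Give me one sentence about asynchronous serving."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```
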
@@ -79,19 +79,13 @@ Async mode will become the default, and only supported mode, in an upcoming rele