Commit c1ac412

Authored by ksuma2109 (Suma Kasa)
Update vLLM user guide with Sample configs (#2966)
Co-authored-by: Suma Kasa <[email protected]>
1 parent 6b14682 commit c1ac412

1 file changed

serving/docs/lmi/user_guides/vllm_user_guide.md

Lines changed: 102 additions & 1 deletion
@@ -236,4 +236,105 @@ async def handle(inputs: Input) -> Output:
- Use `create_non_stream_output()` or `handle_streaming_response()` from `djl_python.async_utils` to format the response
- Access model properties via `inputs.get_properties()`
- Parse request data using `decode()` from `djl_python.encode_decode`
- If the custom handler fails or is not found, the system will automatically fall back to the default vLLM handler

### Sample Model Configurations

These model configurations have been tested manually; use them as a guide. A sample container launch command is sketched after the first configuration below.

#### Qwen3 VL 32B Instruct
```
on p4d

SPECIAL REQUIREMENT:
VLLM_ATTENTION_BACKEND=TORCH_SDPA

Constants:
-e HF_MODEL_ID=Qwen/Qwen3-VL-32B-Instruct
-e OPTION_TENSOR_PARALLEL_DEGREE=max
-e VLLM_ATTENTION_BACKEND=TORCH_SDPA
-e OPTION_LIMIT_MM_PER_PROMPT="{\"image\": 4, \"video\": 0}"

Tested varying values for:
-e OPTION_MAX_ROLLING_BATCH_SIZE=128
-e OPTION_MAX_MODEL_LEN=16384
-e OPTION_GPU_MEMORY_UTILIZATION=0.9
```
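
Each `-e` flag in these blocks is an environment variable passed to the LMI container at launch. Below is a minimal sketch of how the Qwen3 VL values above could be plugged into a `docker run` command; the image tag, port mapping, and shared-memory size are illustrative assumptions rather than part of the tested configuration, so substitute the LMI image and hardware settings you actually deploy with.

```
# Minimal launch sketch (assumptions: placeholder image tag, port 8080, shm size).
docker run -it --runtime=nvidia --gpus all --shm-size 12g \
  -p 8080:8080 \
  -e HF_MODEL_ID=Qwen/Qwen3-VL-32B-Instruct \
  -e OPTION_TENSOR_PARALLEL_DEGREE=max \
  -e VLLM_ATTENTION_BACKEND=TORCH_SDPA \
  -e OPTION_LIMIT_MM_PER_PROMPT="{\"image\": 4, \"video\": 0}" \
  -e OPTION_MAX_ROLLING_BATCH_SIZE=128 \
  -e OPTION_MAX_MODEL_LEN=16384 \
  -e OPTION_GPU_MEMORY_UTILIZATION=0.9 \
  deepjavalibrary/djl-serving:lmi   # placeholder tag; use your LMI container image
```

Once the container reports that the model is loaded, send requests the same way as for any other LMI deployment.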

#### DeepSeek V3.2 Exp Base
```
on p5e

Constants:
-e HF_MODEL_ID=deepseek-ai/DeepSeek-V3.2-Exp-Base
-e OPTION_TENSOR_PARALLEL_DEGREE=8
```

#### Minimax M2
```
on p5.48xl

SPECIAL REQUIREMENT:
OPTION_ENABLE_EXPERT_PARALLEL=true

Constants:
-e HF_MODEL_ID=MiniMaxAI/MiniMax-M2
-e OPTION_TENSOR_PARALLEL_DEGREE=max
-e OPTION_ENABLE_EXPERT_PARALLEL=true

Tested varying values for:
-e OPTION_MAX_ROLLING_BATCH_SIZE=128
-e OPTION_MAX_MODEL_LEN=16384
-e OPTION_GPU_MEMORY_UTILIZATION=0.9
```

#### EAGLE3 Speculative Decoding for GPT-OSS 20B
```
-e HF_MODEL_ID=openai/gpt-oss-20b
-e OPTION_SPECULATIVE_CONFIG='{\"method\": \"eagle3\", \"model\": \"zhuyksir/EAGLE3-gpt-oss-20b-bf16\", \"num_speculative_tokens\": 4}'
-e OPTION_TENSOR_PARALLEL_DEGREE=1
-e OPTION_MAX_ROLLING_BATCH_SIZE=4
```

#### Llama 4 Scout with LoRA Adapters
```
Constants:
option.model_id=meta-llama/Llama-4-Scout-17B-16E-Instruct
option.tensor_parallel_degree=max
option.enable_lora=True
option.max_loras=2
option.max_lora_rank=64
option.long_lora_scaling_factors=4.0
option.gpu_memory_utilization=0.9
option.max_model_len=16384
```
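
The LoRA configurations in this and the following sections are written as `option.*` properties rather than `-e OPTION_*` environment variables. In LMI the two forms are equivalent (each `option.x` property corresponds to an `OPTION_X` environment variable), and the properties form is what goes into a `serving.properties` file packaged with the model. A minimal sketch is shown below; the `/opt/ml/model` directory is an illustrative assumption, so adjust it to wherever your model package lives.

```
# Minimal sketch: writing the Llama 4 Scout settings into a serving.properties file.
# /opt/ml/model is an assumed model directory; adjust it to your model package location.
mkdir -p /opt/ml/model
cat > /opt/ml/model/serving.properties <<'EOF'
option.model_id=meta-llama/Llama-4-Scout-17B-16E-Instruct
option.tensor_parallel_degree=max
option.enable_lora=True
option.max_loras=2
option.max_lora_rank=64
option.long_lora_scaling_factors=4.0
option.gpu_memory_utilization=0.9
option.max_model_len=16384
EOF
```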

#### Qwen3 Coder with LoRA Adapters
Adapter used: Krish356/qwen3-coder-react-lora-final
```
Constants:
option.model_id=Qwen/Qwen3-Coder-30B-A3B-Instruct
option.tensor_parallel_degree=max
option.enable_lora=True
option.max_loras=2
option.max_lora_rank=64
option.long_lora_scaling_factors=4.0
option.gpu_memory_utilization=0.9
option.max_model_len=16384
```

#### GPT-OSS 20B with LoRA Adapters
Adapters used:
1. waliboii/gpt-oss-20b-promptinj-lora
2. jworks/gpt-oss-20b-uncensored-lora

```
Constants:
option.model_id=openai/gpt-oss-20b
option.tensor_parallel_degree=max
option.enable_lora=True
option.max_loras=2
option.max_lora_rank=64
option.long_lora_scaling_factors=4.0
option.gpu_memory_utilization=0.9
```
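
The adapter repositories listed in the LoRA sections above have to be available to the container alongside the base model. The sketch below assumes a layout with an `adapters/` subdirectory inside the model directory and one subfolder per adapter name; the paths and folder names are illustrative, so check the LMI LoRA adapter documentation for the exact layout and request schema supported by your container version.

```
# Minimal sketch (assumption: adapters live under <model dir>/adapters/, one folder
# per adapter name). Paths and folder names below are illustrative.
pip install -U "huggingface_hub[cli]"
mkdir -p /opt/ml/model/adapters
huggingface-cli download waliboii/gpt-oss-20b-promptinj-lora \
  --local-dir /opt/ml/model/adapters/promptinj
huggingface-cli download jworks/gpt-oss-20b-uncensored-lora \
  --local-dir /opt/ml/model/adapters/uncensored
```

At request time an adapter is then referenced by its registered name (the folder name in this layout); see the LMI LoRA adapter guide for the request format.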
