- Use `create_non_stream_output()` or `handle_streaming_response()` from `djl_python.async_utils` to format the response
- Access model properties via `inputs.get_properties()`
- Parse request data using `decode()` from `djl_python.encode_decode`
- If the custom handler fails or is not found, the system will automatically fall back to the default vLLM handler
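As a rough sketch of the handler contract described above (the `StubInput` class and the inline JSON parsing are simplified stand-ins for `djl_python`'s real `Input` class and `decode()` helper, and a real handler would format its result with `create_non_stream_output()` rather than returning a raw dict):

```python
import asyncio
import json

class StubInput:
    """Simplified stand-in for djl_python.inputs.Input."""
    def __init__(self, body: bytes, properties: dict):
        self._body = body
        self._properties = properties

    def get_properties(self) -> dict:
        # Mirrors inputs.get_properties() from the real API
        return self._properties

    def get_as_bytes(self) -> bytes:
        return self._body

async def handle(inputs: StubInput) -> dict:
    # A real handler would parse with decode() from djl_python.encode_decode
    request = json.loads(inputs.get_as_bytes())
    tp_degree = inputs.get_properties().get("tensor_parallel_degree")
    # ... run inference here; if this raises, or handle() is missing,
    # the server falls back to the default vLLM handler ...
    return {"generated_text": request["inputs"].upper(),
            "tensor_parallel_degree": tp_degree}

result = asyncio.run(handle(StubInput(
    b'{"inputs": "hello"}', {"tensor_parallel_degree": "max"})))
print(result)
```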

### Sample Model Configurations

The following model configurations were tested manually; use them as a guide.
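Each block below lists either container environment variables (`-e` flags) or `serving.properties` keys (`option.*`). A typical launch with the `-e` form looks like the following sketch; the image tag, port mapping, and shared-memory size here are placeholders, so substitute the LMI container you actually use:

```shell
# Placeholder image tag; the -e flags come straight from the blocks below.
docker run --runtime=nvidia --gpus all --shm-size 12g -p 8080:8080 \
  -e HF_MODEL_ID=Qwen/Qwen3-VL-32B-Instruct \
  -e OPTION_TENSOR_PARALLEL_DEGREE=max \
  deepjavalibrary/djl-serving:lmi-latest
```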

#### Qwen3 VL 32B Instruct
```
Instance type: p4d

SPECIAL REQUIREMENT:
VLLM_ATTENTION_BACKEND=TORCH_SDPA

Constants:
-e HF_MODEL_ID=Qwen/Qwen3-VL-32B-Instruct
-e OPTION_TENSOR_PARALLEL_DEGREE=max
-e VLLM_ATTENTION_BACKEND=TORCH_SDPA
-e OPTION_LIMIT_MM_PER_PROMPT="{\"image\": 4, \"video\": 0}"

Tested varying values for:
-e OPTION_MAX_ROLLING_BATCH_SIZE=128
-e OPTION_MAX_MODEL_LEN=16384
-e OPTION_GPU_MEMORY_UTILIZATION=0.9
```
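With `OPTION_LIMIT_MM_PER_PROMPT` set as above, a single request can carry at most 4 images and no video. A small sketch of building an OpenAI-style multimodal chat payload that enforces that limit client-side (the field names follow the OpenAI chat schema, not anything specific to this repo; the limit constant is copied from the config above):

```python
import json

IMAGE_LIMIT = 4  # matches OPTION_LIMIT_MM_PER_PROMPT={"image": 4, "video": 0}

def build_payload(prompt: str, image_urls: list[str]) -> str:
    # Reject over-limit requests before they reach the server
    if len(image_urls) > IMAGE_LIMIT:
        raise ValueError(f"at most {IMAGE_LIMIT} images per prompt")
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}}
                for u in image_urls]
    return json.dumps({"messages": [{"role": "user", "content": content}]})

payload = build_payload("Describe the image.", ["https://example.com/cat.png"])
print(payload)
```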

#### DeepSeek V3.2 Exp Base
```
Instance type: p5e

Constants:
-e HF_MODEL_ID=deepseek-ai/DeepSeek-V3.2-Exp-Base
-e OPTION_TENSOR_PARALLEL_DEGREE=8
```

#### MiniMax M2
```
Instance type: p5.48xl

SPECIAL REQUIREMENT:
OPTION_ENABLE_EXPERT_PARALLEL=true

Constants:
-e HF_MODEL_ID=MiniMaxAI/MiniMax-M2
-e OPTION_TENSOR_PARALLEL_DEGREE=max
-e OPTION_ENABLE_EXPERT_PARALLEL=true

Tested varying values for:
-e OPTION_MAX_ROLLING_BATCH_SIZE=128
-e OPTION_MAX_MODEL_LEN=16384
-e OPTION_GPU_MEMORY_UTILIZATION=0.9
```

#### EAGLE3 Speculative Decoding for GPT-OSS 20B
```
-e HF_MODEL_ID=openai/gpt-oss-20b
-e OPTION_SPECULATIVE_CONFIG='{\"method\": \"eagle3\", \"model\": \"zhuyksir/EAGLE3-gpt-oss-20b-bf16\", \"num_speculative_tokens\": 4}'
-e OPTION_TENSOR_PARALLEL_DEGREE=1
-e OPTION_MAX_ROLLING_BATCH_SIZE=4
```
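The escaped quotes in `OPTION_SPECULATIVE_CONFIG` exist so the value survives shell quoting and still reaches vLLM as valid JSON. A quick way to sanity-check the unescaped form before deploying (the draft model id here is a placeholder; use the EAGLE3 draft model from the config above):

```python
import json

# Placeholder draft model id, not a real repo name.
spec = ('{"method": "eagle3", "model": "org/eagle3-draft-model", '
        '"num_speculative_tokens": 4}')

# Raises json.JSONDecodeError if the quoting/escaping is broken
cfg = json.loads(spec)
print(cfg["method"], cfg["num_speculative_tokens"])
```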

#### Llama 4 Scout with LoRA Adapters
```
Constants:
option.model_id=meta-llama/Llama-4-Scout-17B-16E-Instruct
option.tensor_parallel_degree=max
option.enable_lora=True
option.max_loras=2
option.max_lora_rank=64
option.long_lora_scaling_factors=4.0
option.gpu_memory_utilization=0.9
option.max_model_len=16384
```
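Unlike the `-e` environment-variable form used earlier, the `option.*` keys above belong in a `serving.properties` file. A trivial sketch that renders the constants into that format (keys and values copied from the block above):

```python
options = {
    "option.model_id": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "option.tensor_parallel_degree": "max",
    "option.enable_lora": "True",
    "option.max_loras": "2",
    "option.max_lora_rank": "64",
    "option.long_lora_scaling_factors": "4.0",
    "option.gpu_memory_utilization": "0.9",
    "option.max_model_len": "16384",
}

# Render key=value lines as they would appear in serving.properties
properties = "\n".join(f"{k}={v}" for k, v in options.items())
print(properties)
```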

#### Qwen3 Coder with LoRA Adapters
Adapter used: Krish356/qwen3-coder-react-lora-final
```
Constants:
option.model_id=Qwen/Qwen3-Coder-30B-A3B-Instruct
option.tensor_parallel_degree=max
option.enable_lora=True
option.max_loras=2
option.max_lora_rank=64
option.long_lora_scaling_factors=4.0
option.gpu_memory_utilization=0.9
option.max_model_len=16384
```

#### GPT-OSS 20B with LoRA Adapters
Adapters used:
1. waliboii/gpt-oss-20b-promptinj-lora
2. jworks/gpt-oss-20b-uncensored-lora

```
Constants:
option.model_id=openai/gpt-oss-20b
option.tensor_parallel_degree=max
option.enable_lora=True
option.max_loras=2
option.max_lora_rank=64
option.long_lora_scaling_factors=4.0
option.gpu_memory_utilization=0.9
```