Commit 94d145a

docs: Add multimodal documentation vllm, sglang, and trtllm backends (#4510)
Signed-off-by: Indrajit Bhosale <[email protected]>
Co-authored-by: krishung5 <[email protected]>
1 parent 09f2314 commit 94d145a

6 files changed: +895 −164
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# SGLang Multimodal Guide

This document provides a comprehensive guide to multimodal inference using the SGLang backend in Dynamo. For more details on the multimodal examples, see the [Multimodal Examples Documentation](./multimodal_epd.md).

## Multimodal Support Matrix

| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | ✅ Yes | ✅ Yes | Vision encoder generates embeddings |
| **Image** | Data URL (Base64) | ❌ No | ❌ No | Not supported |
| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
## Architecture Comparison

SGLang multimodal supports two deployment patterns:

```text
AGGREGATED (E->PD):
Client → Frontend (Rust) → Processor → Encoder [NIXL] → PD Worker → Response
• 3 components • Vision encoder in Python • NIXL embeddings transfer

DISAGGREGATED (E->P->D):
Client → Frontend → Processor → Encoder [NIXL] → Prefill [bootstrap] → Decode → Response
• 4 components • Vision encoder in Python • KV cache transfer via bootstrap mechanism
```
## Aggregated Mode (E->PD)

In aggregated mode, encoding happens in a separate worker, but prefill and decode share the same engine.

### Architecture

```text
HTTP Frontend (Rust)
        ↓
Processor (Python - ModelInput.Text - REGISTERED)
        ↓ tokenizes with chat template, extracts image URL
Encode Worker (Python - NOT registered)
        ↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
PD Worker (Python - NOT registered)
        ↓ receives embeddings via NIXL, prefill + decode
Response → Processor → Frontend
```

### Components

| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
|-----------|------|-----------|------------|-------------------|---------|
| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
| PD Worker | `--multimodal-worker` | N/A | ❌ No | ✅ Yes | Prefill + Decode with embeddings |

### Key Characteristics

- **Vision Encoder in Python**: The encode worker loads the vision model (AutoModel) and image processor (AutoImageProcessor)
- **Token Expansion**: A single `<|image_pad|>` token is replaced with N tokens based on the embedding shape
- **NIXL Transfer**: Embeddings are transferred from the Encoder to the PD Worker using NIXL
- **No Rust Processing**: All tokenization and image handling happens in Python
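To make the aggregated path concrete, here is a minimal processor-side sketch. It is not the actual handler: `handle_request`, `build_prompt`, `encode_worker_client`, and the streaming shape of `round_robin` are assumptions; the chat-template tokenization and the client pattern mirror the snippets later in this guide.

```python
# Hypothetical processor-side sketch (names are illustrative, not the real API)
async def handle_request(self, raw_request: dict):
    # Apply the SGLang chat template and tokenize (see "Chat Template Processing")
    prompt, image_url = self.build_prompt(raw_request)  # assumed helper
    input_ids = self.tokenizer(text=prompt, return_tensors="pt").input_ids[0]

    request = SglangMultimodalRequest(  # schema sketched under "Detailed Message Flow"
        token_ids=input_ids.tolist(),
        image_url=image_url,
        sampling_params=raw_request.get("sampling_params", {}),
    )
    # Forward to an encode worker over NATS; assumes round_robin yields a stream
    async for response in await self.encode_worker_client.round_robin(
        request.model_dump_json()
    ):
        yield response
```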
## Disaggregated Mode (E->P->D)

In disaggregated mode, encoding, prefill, and decode are handled by separate workers using SGLang's bootstrap coordination.

### Architecture

```text
HTTP Frontend (Rust)
        ↓
Processor (Python - ModelInput.Text - REGISTERED)
        ↓ tokenizes with chat template, extracts image URL
Encode Worker (Python - NOT registered)
        ↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
Prefill Worker (Python - NOT registered)
        ↓ receives embeddings via NIXL, prefill only, returns bootstrap info
Decode Worker (Python - NOT registered)
        ↓ uses bootstrap info, decode only, token generation
Response → Processor → Frontend
```

### Components

| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
|-----------|------|-----------|------------|-------------------|---------|
| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
| Decode Worker | `--multimodal-worker --serving-mode=decode` | N/A | ❌ No | ✅ Yes | **Entry point for disaggregation**, calls Prefill |
| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | N/A | ❌ No | ✅ Yes | Called by Decode, bootstrap coordination |
### Bootstrap Coordination

SGLang disaggregation uses a bootstrap mechanism for P->D coordination:

**Request Flow (Important):**
```text
Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
                                              ↑
                              Entry point for disaggregation!
```

**Bootstrap Process:**
1. **Decode Worker** receives the request from the Encode Worker
2. **Decode Worker** calls the Prefill Worker via NATS to request bootstrap info
3. **Prefill Worker** generates `{host, port, room}` and returns immediately
4. **Both workers** connect to the same "room" using the bootstrap coordinates
5. **SGLang internally** transfers KV cache state via the bootstrap connection (not NIXL), as sketched below

**Key Difference from vLLM:**
- vLLM: Frontend → Prefill → Decode (Prefill is the entry point)
- SGLang: Frontend → Processor → Encode → **Decode → Prefill** (Decode is the entry point)
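A rough decode-side sketch of this handshake, under stated assumptions: `run_disaggregated` and the response parsing are hypothetical, and the `bootstrap_*` keyword arguments reflect SGLang's disaggregation API as understood here; the exact call shape depends on the SGLang version, and the real logic lives in `worker_handler.py`.

```python
# Hypothetical decode-worker sketch of the bootstrap handshake
async def run_disaggregated(self, request):
    # Steps 1-3: ask the prefill worker (via NATS) for bootstrap coordinates
    stream = await self.prefill_client.round_robin(request.model_dump_json())
    bootstrap = await anext(stream)  # {"bootstrap_host", "bootstrap_port", "bootstrap_room"}

    # Steps 4-5: hand the coordinates to the local SGLang engine; SGLang joins
    # the same "room" as prefill and pulls KV cache state internally (not NIXL)
    results = await self.engine.async_generate(
        input_ids=request.token_ids,
        sampling_params=request.sampling_params,
        stream=True,
        bootstrap_host=bootstrap["bootstrap_host"],
        bootstrap_port=bootstrap["bootstrap_port"],
        bootstrap_room=bootstrap["bootstrap_room"],
    )
    async for output in results:
        yield output
```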
## ModelInput Types and Registration

**Only the Processor registers with Dynamo Rust.**

### Registration Pattern

```python
# ONLY the Processor registers with Dynamo Rust
await register_llm_with_readiness_gate(
    None,  # No engine for the processor
    generate_endpoint,
    server_args,
    dynamo_args,
    input_type=ModelInput.Text,  # Receives raw OpenAI format
    readiness_gate=ready_event,
)

# Workers do NOT register - they are internal components.
# They communicate via NATS clients created in main.py.
```
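Because the processor registers with `input_type=ModelInput.Text`, the Rust frontend passes the raw OpenAI-format request through untouched; all chat-template handling and tokenization happens in the Python processor.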
### Component Initialization

```python
# Encode Worker - connects to the downstream PD worker
pd_worker_client = (
    await runtime.namespace(dynamo_args.namespace)
    .component("backend")
    .endpoint("generate")
    .client()
)

# PD Worker (decode mode) - connects to the upstream Prefill worker
prefill_client = (
    await runtime.namespace(dynamo_args.namespace)
    .component("prefill")
    .endpoint("generate")
    .client()
)
```
## Inter-Component Communication

### Control Flow (NATS)

All component-to-component communication happens via NATS:

**Aggregated Mode (E→PD):**
```text
Processor → Encode Worker → PD Worker
     (NATS)       (NATS + NIXL embeddings)
```

**Disaggregated Mode (E→P→D):**
```text
Processor → Encode Worker → DECODE Worker → Prefill Worker
     (NATS)        (NATS)         (NATS)
                                    ↓
                      Decode requests bootstrap
                                    ↓
                  Prefill returns {host, port, room}
                                    ↓
                      Both connect via bootstrap
                                    ↓
                 SGLang internal KV cache transfer
```
**Detailed Message Flow:**

```text
Processor → Encode Worker:
  - NATS round_robin with SglangMultimodalRequest
  - Contains: tokenized input_ids, image URL, sampling params

Encode Worker → Decode/PD Worker:
  - NATS round_robin to "backend" component
  - Contains: expanded token_ids, NIXL metadata, embeddings shape
  - NIXL transfer: embeddings tensor

Decode Worker → Prefill Worker (disagg only):
  - NATS call to "prefill" component
  - Decode requests bootstrap coordinates
  - Prefill returns: {bootstrap_host, bootstrap_port, bootstrap_room}

Prefill ↔ Decode (via bootstrap):
  - SGLang internal connection (not NATS)
  - KV cache state shared via bootstrap mechanism
```
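The wire format is defined in `components/src/dynamo/sglang/protocol.py`; the sketch below is only a plausible reconstruction of the fields named above (only `serialized_request` and `embeddings_shape` are taken from the NIXL snippet in the next section; the other field names are assumptions).

```python
from typing import Any, Optional

from pydantic import BaseModel


# Plausible sketch of the inter-component request; see protocol.py for the
# real definition. Field names are partly assumptions.
class SglangMultimodalRequest(BaseModel):
    token_ids: list[int]                          # tokenized (later expanded) prompt
    image_url: Optional[str] = None               # HTTP/HTTPS only
    sampling_params: dict[str, Any] = {}
    serialized_request: Optional[bytes] = None    # NIXL metadata from the encoder
    embeddings_shape: Optional[tuple[int, ...]] = None  # used for torch.empty on read
```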
### Data Transfer (NIXL)

NIXL is used only for embedding transfer:

```python
# Encode worker: expose the computed embeddings for a remote NIXL read
descriptor = connect.Descriptor(precomputed_embeddings)
with connector.create_readable(descriptor) as readable:
    request.serialized_request = readable.metadata()
    # Send the request (with NIXL metadata) downstream, then wait for the read
    await pd_worker_client.round_robin(request)
    await readable.wait_for_completion()

# PD worker: allocate a matching buffer and pull the embeddings via NIXL
embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
descriptor = connect.Descriptor(embeddings)
read_op = await connector.begin_read(request.serialized_request, descriptor)
await read_op.wait_for_completion()
```
## Vision Encoding Details

### Encode Worker Components

The encode worker loads and runs the vision model in Python:

```python
# Vision components loaded in the encode worker
self.image_processor = AutoImageProcessor.from_pretrained(
    model_path, trust_remote_code=True
)
self.vision_model = AutoModel.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
```
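A minimal sketch of how these components might turn an image URL into an embeddings tensor. This is generic Hugging Face usage, not the actual handler (`encode_worker_handler.py`); in particular, reading `last_hidden_state` is an assumption that depends on the model architecture.

```python
import io

import requests
import torch
from PIL import Image

# Download the image (only HTTP/HTTPS URLs are supported)
response = requests.get(image_url, timeout=30)
image = Image.open(io.BytesIO(response.content)).convert("RGB")

# Preprocess and run the vision tower
inputs = self.image_processor(images=image, return_tensors="pt").to(
    self.vision_model.device
)
with torch.no_grad():
    outputs = self.vision_model(**inputs)

# Shape (batch, num_patches, hidden_dim) - this drives the token expansion below
precomputed_embeddings = outputs.last_hidden_state.to(torch.float16)
```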
### Token Expansion Process

1. Processor inserts a single image token (e.g., `<|image_pad|>`)
2. Encode worker generates embeddings: `shape = (batch, num_patches, hidden_dim)`
3. Encode worker replaces the single token with `num_patches` tokens
4. Downstream worker receives the expanded token sequence

Example:
```python
# Before: ["Hello", "<|image_pad|>", "world"]
# After:  ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"]
```
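The expansion itself is a simple list splice; here is a self-contained sketch (the function name is illustrative, and the real logic is model-specific, as noted under Known Limitations):

```python
def expand_image_tokens(
    token_ids: list[int], image_token_id: int, num_patches: int
) -> list[int]:
    """Replace each single image placeholder with num_patches copies."""
    expanded: list[int] = []
    for tok in token_ids:
        if tok == image_token_id:
            # One placeholder becomes num_patches placeholders, matching the
            # (batch, num_patches, hidden_dim) embeddings shape
            expanded.extend([image_token_id] * num_patches)
        else:
            expanded.append(tok)
    return expanded

# num_patches is taken from the embeddings produced by the encode worker
token_ids = expand_image_tokens(token_ids, image_token_id, embeddings.shape[1])
```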
## Chat Template Processing

SGLang uses its own chat template system:

```python
from sglang.srt.parser.conversation import chat_templates

conv = chat_templates["qwen2-vl"].copy()
conv.append_message(conv.roles[0], f"{conv.image_token} Describe this image")
processed = tokenizer(text=conv.get_prompt(), return_tensors="pt")
```

Supported templates include `qwen2-vl`, `llama-3`, `vicuna`, etc.
## NIXL Usage

| Use Case | NIXL Used? | Data Transfer | Notes |
|----------|------------|---------------|-------|
| E→PD Aggregated | ✅ Yes | Encoder → PD (embeddings) | Vision encoder separate |
| E→P→D Disaggregated | ✅ Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |

**Key Difference:** For the P→D hop, SGLang transfers the KV cache via its bootstrap mechanism, whereas vLLM uses NIXL.
## Known Limitations

- **No Data URL support** - Only HTTP/HTTPS URLs are supported; `data:image/...` base64 URLs are not
- **No pre-computed embeddings** - Cannot use `.pt`, `.pth`, or `.bin` embedding files; the vision encoder runs for every request
- **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation
- **Only the Processor registers with Dynamo** - Workers are internal components; the frontend routes to the Processor only
- **Disaggregated routing** - The Decode Worker is the entry point (it calls Prefill); requests cannot be routed directly to Prefill workers
- **Limited model generalization** - Token expansion logic is model-specific; adding new models may require implementation updates
## Supported Models

SGLang multimodal **only supports image-based vision-language models**:

### ✅ Supported (Images Only)
- **Qwen2-VL** / **Qwen2.5-VL** (primary support)
- Models with `AutoImageProcessor` and a vision tower
- Models compatible with SGLang's image embedding format
## Key Files

| File | Description |
|------|-------------|
| `components/src/dynamo/sglang/main.py` | Component initialization; only the Processor registers |
| `components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py` | Processor implementation, OpenAI→SGLang |
| `components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py` | Vision encoder, embeddings generation |
| `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | PD/Prefill/Decode workers, NIXL read |
| `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing |
| `components/src/dynamo/sglang/protocol.py` | Request/response data structures |
| `components/src/dynamo/sglang/register.py` | Registration logic (only called for the Processor) |
