<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# SGLang Multimodal Guide

This document provides a comprehensive guide to multimodal inference using the SGLang backend in Dynamo. For more details on the multimodal examples, see the [Multimodal Examples Documentation](./multimodal_epd.md).

## Multimodal Support Matrix

| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | ✅ Yes | ✅ Yes | Vision encoder generates embeddings |
| **Image** | Data URL (Base64) | ❌ No | ❌ No | Not supported |
| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
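
To illustrate the supported input format, here is a minimal client sketch. The endpoint address, model name, and image URL are illustrative assumptions, not values fixed by this guide:

```python
# Minimal sketch: an OpenAI-compatible chat request carrying an image URL.
# Assumes the Dynamo frontend listens on localhost:8000 and a Qwen2.5-VL
# model is deployed (both illustrative).
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image"},
                    {
                        "type": "image_url",
                        # Only HTTP/HTTPS URLs are supported (no base64 data URLs).
                        "image_url": {"url": "https://example.com/image.jpg"},
                    },
                ],
            }
        ],
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```
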
## Architecture Comparison

SGLang multimodal supports two deployment patterns:

```text
AGGREGATED (E->PD):
  Client → Frontend (Rust) → Processor → Encoder [NIXL] → PD Worker → Response
  • 3 components  • Vision encoder in Python  • NIXL embeddings transfer

DISAGGREGATED (E->P->D):
  Client → Frontend → Processor → Encoder [NIXL] → Prefill [bootstrap] → Decode → Response
  • 4 components  • Vision encoder in Python  • KV cache transfer via bootstrap mechanism
```

## Aggregated Mode (E->PD)

In aggregated mode, encoding happens in a separate worker, while prefill and decode share the same engine.

### Architecture

```text
HTTP Frontend (Rust)
  ↓
Processor (Python - ModelInput.Text - REGISTERED)
  ↓ tokenizes with chat template, extracts image URL
Encode Worker (Python - NOT registered)
  ↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
PD Worker (Python - NOT registered)
  ↓ receives embeddings via NIXL, prefill + decode
Response → Processor → Frontend
```

### Components

| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
|-----------|------|------------|------------|--------------------|---------|
| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
| PD Worker | `--multimodal-worker` | N/A | ❌ No | ✅ Yes | Prefill + decode with embeddings |

### Key Characteristics

- **Vision Encoder in Python**: The encode worker loads the vision model (`AutoModel`) and image processor (`AutoImageProcessor`)
- **Token Expansion**: The single `<|image_pad|>` token is replaced with N tokens based on the embedding shape
- **NIXL Transfer**: Embeddings are transferred from the Encoder to the PD Worker using NIXL
- **No Rust Processing**: All tokenization and image handling happens in Python

## Disaggregated Mode (E->P->D)

In disaggregated mode, encoding, prefill, and decode are handled by separate workers using SGLang's bootstrap coordination.

### Architecture

```text
HTTP Frontend (Rust)
  ↓
Processor (Python - ModelInput.Text - REGISTERED)
  ↓ tokenizes with chat template, extracts image URL
Encode Worker (Python - NOT registered)
  ↓ downloads image, runs vision encoder, generates embeddings, NIXL transfer
Prefill Worker (Python - NOT registered)
  ↓ receives embeddings via NIXL, prefill only, returns bootstrap info
Decode Worker (Python - NOT registered)
  ↓ uses bootstrap info, decode only, token generation
Response → Processor → Frontend
```

### Components

| Component | Flag | ModelInput | Registered | Has SGLang Engine? | Purpose |
|-----------|------|------------|------------|--------------------|---------|
| Processor | `--multimodal-processor` | Text | ✅ Yes | ❌ No | HTTP entry, OpenAI→SGLang conversion |
| Encode Worker | `--multimodal-encode-worker` | N/A | ❌ No | ❌ No | Vision encoder, embeddings generation |
| Decode Worker | `--multimodal-worker --serving-mode=decode` | N/A | ❌ No | ✅ Yes | **Entry point for disaggregation**, calls Prefill |
| Prefill Worker | `--multimodal-worker --serving-mode=prefill` | N/A | ❌ No | ✅ Yes | Called by Decode, bootstrap coordination |

### Bootstrap Coordination

SGLang disaggregation uses a bootstrap mechanism for P->D coordination:

**Request Flow (Important):**
```text
Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
                                               ↑
                                 Entry point for disaggregation!
```

**Bootstrap Process** (a minimal sketch follows this list):
1. **Decode Worker** receives the request from the Encode Worker
2. **Decode Worker** calls the Prefill Worker via NATS to request bootstrap info
3. **Prefill Worker** generates `{host, port, room}` and returns immediately
4. **Both workers** connect to the same "room" using the bootstrap coordinates
5. **SGLang internally** transfers KV cache state via the bootstrap connection (not NIXL)
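
The following sketch shows steps 2–5 from the Decode worker's perspective. `prefill_client` is the NATS client shown under Component Initialization below; the response field names and the keyword arguments to `async_generate` are illustrative assumptions, not the actual handler code:

```python
# Illustrative sketch (not the actual worker_handler code): the Decode
# worker asks Prefill for bootstrap coordinates over NATS, then joins the
# same "room" so SGLang can move the KV cache internally.
async def generate_disaggregated(self, request):
    # Steps 2-3: request bootstrap info from the Prefill worker via NATS.
    prefill_stream = await self.prefill_client.round_robin(request)
    info = None
    async for response in prefill_stream:
        # Assumed shape: {"bootstrap_host": ..., "bootstrap_port": ..., "bootstrap_room": ...}
        info = response.data()
        break

    # Steps 4-5: decode in the same room; SGLang's bootstrap connection
    # (not NIXL) transfers the KV cache from Prefill to Decode.
    async for output in self.engine.async_generate(
        input_ids=request.token_ids,
        sampling_params=request.sampling_params,
        stream=True,
        bootstrap_host=info["bootstrap_host"],
        bootstrap_port=info["bootstrap_port"],
        bootstrap_room=info["bootstrap_room"],
    ):
        yield output
```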

**Key Difference from vLLM:**
- vLLM: Frontend → Prefill → Decode (Prefill is the entry point)
- SGLang: Frontend → Processor → Encode → **Decode → Prefill** (Decode is the entry point)

## ModelInput Types and Registration

**Only the Processor registers with Dynamo Rust.**

### Registration Pattern

```python
# ONLY the Processor registers with Dynamo Rust
await register_llm_with_readiness_gate(
    None,  # No engine for the processor
    generate_endpoint,
    server_args,
    dynamo_args,
    input_type=ModelInput.Text,  # Receives raw OpenAI format
    readiness_gate=ready_event,
)

# Workers do NOT register - they are internal components.
# They communicate via NATS clients created in main.py.
```
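
Because only the Processor registers, the Rust frontend discovers a single model endpoint and always routes requests there; the workers find each other through the namespace/component/endpoint clients shown next.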

### Component Initialization

```python
# Encode Worker - connects to the downstream PD worker
pd_worker_client = (
    await runtime.namespace(dynamo_args.namespace)
    .component("backend")
    .endpoint("generate")
    .client()
)

# PD Worker (decode mode) - connects to the upstream Prefill worker
prefill_client = (
    await runtime.namespace(dynamo_args.namespace)
    .component("prefill")
    .endpoint("generate")
    .client()
)
```

## Inter-Component Communication

### Control Flow (NATS)

All component-to-component communication happens via NATS:

**Aggregated Mode (E→PD):**
```text
Processor → Encode Worker → PD Worker
     (NATS)         (NATS + NIXL embeddings)
```

**Disaggregated Mode (E→P→D):**
```text
Processor → Encode Worker → DECODE Worker → Prefill Worker
     (NATS)         (NATS)           (NATS)
                                        ↓
                          Decode requests bootstrap
                                        ↓
                     Prefill returns {host, port, room}
                                        ↓
                          Both connect via bootstrap
                                        ↓
                     SGLang internal KV cache transfer
```

**Detailed Message Flow:**

```text
Processor → Encode Worker:
  - NATS round_robin with SglangMultimodalRequest
  - Contains: tokenized input_ids, image URL, sampling params

Encode Worker → Decode/PD Worker:
  - NATS round_robin to "backend" component
  - Contains: expanded token_ids, NIXL metadata, embeddings shape
  - NIXL transfer: embeddings tensor

Decode Worker → Prefill Worker (disagg only):
  - NATS call to "prefill" component
  - Decode requests bootstrap coordinates
  - Prefill returns: {bootstrap_host, bootstrap_port, bootstrap_room}

Prefill ↔ Decode (via bootstrap):
  - SGLang internal connection (not NATS)
  - KV cache state shared via bootstrap mechanism
```
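
The actual wire format lives in `components/src/dynamo/sglang/protocol.py`; the following is only a condensed illustration of the payload fields named above, with assumed class and field names:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative only - see protocol.py for the real definitions.
@dataclass
class MultimodalRequestSketch:
    token_ids: list[int]                        # tokenized (later expanded) prompt
    image_url: Optional[str] = None             # HTTP/HTTPS URL extracted by the Processor
    sampling_params: dict = field(default_factory=dict)
    serialized_request: Optional[bytes] = None  # NIXL readable metadata, set by the Encoder
    embeddings_shape: Optional[tuple] = None    # lets the worker allocate the read buffer
```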

### Data Transfer (NIXL)

NIXL is used only for the embeddings transfer:

```python
# Encode Worker: expose the embeddings as a NIXL-readable buffer
descriptor = connect.Descriptor(precomputed_embeddings)
with connector.create_readable(descriptor) as readable:
    request.serialized_request = readable.metadata()
    # Send the request (with NIXL metadata) downstream, then block
    # until the worker has finished reading the embeddings.
    await pd_worker_client.round_robin(request)
    await readable.wait_for_completion()

# PD Worker: allocate a matching buffer and pull the embeddings via NIXL
embeddings = torch.empty(request.embeddings_shape, dtype=torch.float16)
descriptor = connect.Descriptor(embeddings)
read_op = await connector.begin_read(request.serialized_request, descriptor)
await read_op.wait_for_completion()
```
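
This is a one-sided pull: the encoder publishes a readable descriptor and blocks on `wait_for_completion()`, while the downstream worker, having learned the buffer's metadata from the NATS message, allocates a matching tensor and pulls the embeddings with `begin_read`. The embedding bytes themselves never travel over NATS.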

## Vision Encoding Details

### Encode Worker Components

The encode worker loads and runs the vision model in Python:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

# Vision components loaded in the encode worker
self.image_processor = AutoImageProcessor.from_pretrained(
    model_path, trust_remote_code=True
)
self.vision_model = AutoModel.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
```
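
The exact forward pass depends on the architecture that `AutoModel` resolves to, so the following is only a hedged sketch of how the embeddings might be produced; `last_hidden_state` stands in for whatever output the real handler reads:

```python
import torch

# Hypothetical encode step: preprocess the downloaded image and run the
# vision model to get embeddings of shape (batch, num_patches, hidden_dim).
inputs = self.image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = self.vision_model(**inputs)
precomputed_embeddings = outputs.last_hidden_state.to(torch.float16)
```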

### Token Expansion Process

1. Processor inserts a single image token (e.g., `<|image_pad|>`)
2. Encode worker generates embeddings: `shape = (batch, num_patches, hidden_dim)`
3. Encode worker replaces the single token with `num_patches` tokens
4. Downstream worker receives the expanded token sequence

Example:
```python
# Before: ["Hello", "<|image_pad|>", "world"]
# After:  ["Hello", "<|image_pad|>", "<|image_pad|>", ...(576 tokens), "world"]
```
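
As a sketch of steps 2–3 (the helper name and arguments are illustrative, not the actual handler API):

```python
# Replace the single image placeholder with one token per vision patch so
# that token positions line up one-to-one with the embedding rows.
def expand_image_tokens(
    input_ids: list[int], image_token_id: int, num_patches: int
) -> list[int]:
    expanded: list[int] = []
    for token in input_ids:
        if token == image_token_id:
            expanded.extend([image_token_id] * num_patches)
        else:
            expanded.append(token)
    return expanded

# For a 576-patch image: a 3-token prompt becomes 2 + 576 tokens.
```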

## Chat Template Processing

SGLang uses its own chat template system:

```python
from sglang.srt.parser.conversation import chat_templates

conv = chat_templates["qwen2-vl"].copy()
conv.append_message(conv.roles[0], f"{conv.image_token} Describe this image")
processed = tokenizer(text=conv.get_prompt(), return_tensors="pt")
```

Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.
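
Here `conv.image_token` is the model-specific image placeholder (the single image token referenced in the token expansion section above), which the encode worker later expands into one token per patch.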

## NIXL Usage

| Use Case | NIXL Used? | Data Transfer | Notes |
|----------|------------|---------------|-------|
| E→PD Aggregated | ✅ Yes | Encoder → PD (embeddings) | Vision encoder separate |
| E→P→D Disaggregated | ✅ Yes | Encoder → Prefill (embeddings) | KV cache via SGLang bootstrap |

**Key Difference:** For the P→D hand-off, SGLang uses its bootstrap mechanism rather than NIXL for the KV cache, unlike vLLM.

## Known Limitations

- **No Data URL support** - Only HTTP/HTTPS URLs are supported; `data:image/...` base64 URLs are not
- **No pre-computed embeddings** - Cannot use `.pt`, `.pth`, or `.bin` embedding files; the vision encoder runs for every request
- **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation
- **Only the Processor registers with Dynamo** - Workers are internal components; the frontend routes to the Processor only
- **Disaggregated routing** - The Decode Worker is the entry point (it calls Prefill); requests cannot be routed directly to Prefill workers
- **Limited model generalization** - Token expansion logic is model-specific; adding new models may require implementation updates

## Supported Models

SGLang multimodal **only supports image-based vision-language models**:

### ✅ Supported (Images Only)
- **Qwen2-VL** / **Qwen2.5-VL** (primary support)
- Models with `AutoImageProcessor` and a vision tower
- Models compatible with SGLang's image embedding format

## Key Files

| File | Description |
|------|-------------|
| `components/src/dynamo/sglang/main.py` | Component initialization; only the Processor registers |
| `components/src/dynamo/sglang/request_handlers/multimodal/processor_handler.py` | Processor implementation, OpenAI→SGLang |
| `components/src/dynamo/sglang/request_handlers/multimodal/encode_worker_handler.py` | Vision encoder, embeddings generation |
| `components/src/dynamo/sglang/request_handlers/multimodal/worker_handler.py` | PD/Prefill/Decode workers, NIXL read |
| `components/src/dynamo/sglang/multimodal_utils/multimodal_chat_processor.py` | Chat template processing |
| `components/src/dynamo/sglang/protocol.py` | Request/response data structures |
| `components/src/dynamo/sglang/register.py` | Registration logic (only called for the Processor) |