Commit 17af7a0

docs: disagg router docs update (#4093)
Signed-off-by: PeaBrane <[email protected]>
1 parent defe5de commit 17af7a0

2 files changed: +16 -15 lines changed

benchmarks/router/README.md

Lines changed: 13 additions & 9 deletions
@@ -118,23 +118,27 @@ python -m dynamo.frontend --help
 
 For detailed explanations of router arguments (especially KV cache routing parameters), see the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md).
 
+> [!Note]
+> If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
+>
+> ```bash
+> python -m dynamo.frontend \
+> --router-mode kv \
+> --http-port 8000 \
+> --no-kv-events
+> ```
+
 #### Disaggregated Serving with Automatic Prefill Routing
 
 When you launch prefill workers using `run_engines.sh --prefill`, the frontend automatically detects them and activates an internal prefill router. This prefill router:
 - Automatically routes initial token processing to dedicated prefill workers
-- Uses KV-aware routing regardless of the frontend's `--router-mode` setting
+- Uses the same routing mode as the frontend's `--router-mode` setting
 - Seamlessly integrates with your decode workers for token generation
 
 No additional configuration is needed - simply launch both decode and prefill workers, and the system handles the rest. See the [KV Cache Routing documentation](../../docs/router/kv_cache_routing.md#disaggregated-serving-prefill-and-decode) for more details.
 
-**Note**: If you're unsure whether your backend engines correctly emit KV events for certain models (e.g., hybrid models like gpt-oss or nemotron nano 2), use the `--no-kv-events` flag to disable KV event tracking and use approximate KV indexing instead:
-
-```bash
-python -m dynamo.frontend \
---router-mode kv \
---http-port 8000 \
---no-kv-events
-```
+> [!Note]
+> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh)
 
 ### Step 3: Verify Setup
 
docs/router/kv_cache_routing.md

Lines changed: 3 additions & 6 deletions
@@ -65,13 +65,14 @@ The prefill router is automatically created when:
 2. A prefill worker is detected with the same model name and `ModelType.Prefill`
 
 **Key characteristics of the prefill router:**
-- **Always uses KV-aware routing** regardless of the frontend's `--router-mode` setting
 - **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers don't perform decode
 - **Seamlessly integrated** into the request pipeline between preprocessing and decode routing
 - **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available
 
 ### Setup Example
 
+When both workers are registered, requests are automatically routed.
+
 ```python
 # Decode worker registration (in your decode worker)
 await register_llm(
@@ -92,12 +93,8 @@ await register_llm(
 )
 ```
 
-When both workers are registered, requests are automatically routed:
-1. **Prefill phase** → Prefill router selects best prefill worker (KV-aware)
-2. **Decode phase** → Decode router selects decode worker (uses frontend's `--router-mode`)
-
 > [!Note]
-> **WIP**: Currently, the prefill router always uses KV routing. Future updates will provide more fine-grained control over prefill routing behavior to match user-specified frontend router modes.
+> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
 
 ## Overview
 
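Reviewer note: the behavior these doc changes describe can be summarized in a small sketch. It assumes only what the updated text states: the prefill router follows the frontend's `--router-mode`, and the pipeline falls back to decode-only when no prefill workers are available. Every name below (`RoutingPlan`, `pick`, `plan_request`) is hypothetical and is not part of the Dynamo API. The modulo selection is a placeholder for whatever policy the configured router mode actually applies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RoutingPlan:
    """Which worker handles each phase of one request (illustrative only)."""
    prefill_worker: Optional[str]  # None means decode-only fallback
    decode_worker: str
    router_mode: str               # shared by the prefill and decode routers


def pick(workers: List[str], request_id: int) -> str:
    """Placeholder policy: modulo over the worker list.

    A real router would apply the configured router mode here.
    """
    return workers[request_id % len(workers)]


def plan_request(
    request_id: int,
    router_mode: str,
    decode_workers: List[str],
    prefill_workers: List[str],
) -> RoutingPlan:
    """Plan one request under the assumptions stated in the note above."""
    if not decode_workers:
        raise ValueError("at least one decode worker is required")
    # No prefill workers registered: skip the prefill phase entirely.
    prefill = pick(prefill_workers, request_id) if prefill_workers else None
    return RoutingPlan(prefill, pick(decode_workers, request_id), router_mode)
```

For example, `plan_request(3, "kv", ["d0", "d1"], [])` yields a plan whose `prefill_worker` is `None`, which corresponds to the decode-only fallback described in the key characteristics list.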