components/src/dynamo/frontend/main.py
Lines changed: 2 additions & 2 deletions
@@ -197,9 +197,9 @@ def parse_args():
     )
     parser.add_argument(
         "--active-prefill-tokens-threshold",
-        type=float,
+        type=int,
         default=None,
-        help="Threshold percentage for determining when a worker is considered busy based on prefill token utilization. Can exceed 1.0 since active prefill tokens include queued tokens. If not set, tokens-based busy detection is disabled.",
+        help="Literal token count threshold for determining when a worker is considered busy based on prefill token utilization. When active prefill tokens exceed this threshold, the worker is marked as busy. If not set, tokens-based busy detection is disabled.",
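One practical effect of moving from `type=float` to `type=int` is that fractional values such as `1.5`, which the old flag accepted, are now rejected by argparse at startup. The sketch below reproduces only this one flag in a stand-in parser to illustrate that. `build_parser`, `parse_threshold`, and `is_rejected` are hypothetical names used for illustration and are not part of the frontend.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the frontend's parse_args(); only the
    # flag touched by this change is reproduced here.
    parser = argparse.ArgumentParser(prog="frontend-sketch")
    parser.add_argument(
        "--active-prefill-tokens-threshold",
        type=int,
        default=None,
        help="Literal token count threshold (sketch).",
    )
    return parser


def parse_threshold(argv):
    """Return the parsed threshold, or None when the flag is omitted."""
    return build_parser().parse_args(argv).active_prefill_tokens_threshold


def is_rejected(argv) -> bool:
    """Return True when argparse refuses argv (it exits on an invalid value)."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        return True
    return False
```

Under these assumptions, `parse_threshold(["--active-prefill-tokens-threshold", "8192"])` yields the integer `8192`, omitting the flag yields `None` (busy detection disabled), and passing `1.5` is rejected. Existing launch scripts that pass a ratio would need updating.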
docs/router/kv_cache_routing.md
Lines changed: 5 additions & 5 deletions
@@ -33,7 +33,7 @@ The main KV-aware routing arguments:

 - `--active-decode-blocks-threshold`: Initial threshold (0.0-1.0) for determining when a worker is considered busy based on KV cache block utilization. When a worker's KV cache active blocks exceed this percentage of total blocks, it will be marked as busy and excluded from routing. If not set, blocks-based busy detection is disabled. This feature works with all routing modes (`--router-mode kv|round-robin|random`) as long as backend engines emit `ForwardPassMetrics`. The threshold can be dynamically updated at runtime via the `/busy_threshold` HTTP endpoint (see [Dynamic Threshold Configuration](#dynamic-threshold-configuration)).

-- `--active-prefill-tokens-threshold`: Threshold for determining when a worker is considered busy based on prefill token utilization. Can exceed 1.0 since active prefill tokens include queued tokens (pending prefill work). If not set, tokens-based busy detection is disabled. When set, the router checks if active prefill tokens exceed `threshold * max_num_batch_tokens`. Generally, set this higher than 1.0 to account for queued requests.
+- `--active-prefill-tokens-threshold`: Literal token count threshold for determining when a worker is considered busy based on prefill token utilization. When active prefill tokens exceed this threshold, the worker is marked as busy. If not set, tokens-based busy detection is disabled.

 - `--router-ttl`: Time-to-live in seconds for blocks in the router's local cache predictions. Blocks older than this duration will be automatically expired and removed from the router's radix tree. Defaults to 120.0 seconds when `--no-kv-events` is used. This helps manage memory usage by removing stale cache predictions that are unlikely to be accurate.

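The semantic change described in this hunk can be sketched as two predicates. This is a minimal illustration based only on the removed and added doc text (`threshold * max_num_batch_tokens` before, a literal token count after). The router's actual implementation is not shown in this diff, and `is_busy_ratio`, `is_busy_literal`, and `ratio_to_token_count` are hypothetical names.

```python
import math
from typing import Optional


def is_busy_ratio(active_prefill_tokens: int,
                  threshold: Optional[float],
                  max_num_batch_tokens: int) -> bool:
    """Old semantics per the removed doc line: threshold scales max_num_batch_tokens."""
    if threshold is None:
        return False  # tokens-based busy detection disabled
    return active_prefill_tokens > threshold * max_num_batch_tokens


def is_busy_literal(active_prefill_tokens: int,
                    threshold: Optional[int]) -> bool:
    """New semantics per the added doc line: threshold is a literal token count."""
    if threshold is None:
        return False  # tokens-based busy detection disabled
    return active_prefill_tokens > threshold


def ratio_to_token_count(old_ratio: float, max_num_batch_tokens: int) -> int:
    """Assumed migration helper: convert an old ratio to an equivalent token count.

    For integer token counts and a non-negative product, `tokens > x` and
    `tokens > floor(x)` agree, so flooring preserves the old cutoff.
    """
    return math.floor(old_ratio * max_num_batch_tokens)
```

For instance, a deployment that previously used a ratio of `1.5` with `max_num_batch_tokens` of 8192 would, under this sketch, pass `12288` to keep the same cutoff.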
@@ -594,8 +594,8 @@ The busy thresholds can be updated at runtime without restarting the frontend. T
 # Set both thresholds for a model
 curl -X POST http://localhost:8000/busy_threshold \