feat: KV aware LoRA request routing for vllm #4810
Conversation
Force-pushed 31b1a20 to 8409a5f
Walkthrough
The PR removes a LoRA-specific KV events guard in the args configuration, updates LoRA model references across deployment scripts from one model to another, increases the LoRA max rank from 32 to 64, introduces a new orchestration script for a multi-worker LoRA-enabled vLLM deployment, and adds cache cleanup in the setup scripts.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
examples/backends/vllm/launch/lora/agg_lora_router.sh (1)
82-94: Add an explicit -X POST flag for consistency with the other curl commands.
The curl command at line 82 is missing the explicit -X POST flag. While POST is implied for payloads sent with -d, the other curl invocations in this file (lines 46, 71, 76) use an explicit -X POST for clarity and consistency. Apply this diff:
-# Test LoRA inference
-curl localhost:8000/v1/chat/completions \
+# Test LoRA inference
+curl -X POST localhost:8000/v1/chat/completions \
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- components/src/dynamo/vllm/args.py (0 hunks)
- examples/backends/vllm/launch/lora/agg_lora.sh (2 hunks)
- examples/backends/vllm/launch/lora/agg_lora_router.sh (1 hunks)
- examples/backends/vllm/launch/lora/setup_minio.sh (3 hunks)
💤 Files with no reviewable changes (1)
- components/src/dynamo/vllm/args.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 4644
File: lib/llm/src/lora/source.rs:313-323
Timestamp: 2025-11-26T22:06:18.789Z
Learning: In the Dynamo LoRA caching system (lib/llm/src/lora/source.rs), the local LoRA cache directory is considered ephemeral with S3 being the authoritative upstream source. If the local cache is removed or corrupted, it will be automatically rehydrated from S3, making complex three-step atomic swap patterns unnecessary for cache updates.
📚 Learning: 2025-11-26T22:06:18.789Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 4644
File: lib/llm/src/lora/source.rs:313-323
Timestamp: 2025-11-26T22:06:18.789Z
Learning: In the Dynamo LoRA caching system (lib/llm/src/lora/source.rs), the local LoRA cache directory is considered ephemeral with S3 being the authoritative upstream source. If the local cache is removed or corrupted, it will be automatically rehydrated from S3, making complex three-step atomic swap patterns unnecessary for cache updates.
Applied to files:
- examples/backends/vllm/launch/lora/setup_minio.sh
- examples/backends/vllm/launch/lora/agg_lora_router.sh
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.
Applied to files:
- examples/backends/vllm/launch/lora/agg_lora_router.sh
🪛 Shellcheck (0.11.0)
examples/backends/vllm/launch/lora/agg_lora_router.sh
[error] 99-99: Couldn't parse this brace group. Fix to allow more checks.
(SC1073)
[warning] 104-104: This { is literal. Check expression (missing ;/\n?) or quote it.
(SC1083)
[error] 108-108: Expected a '}'. If you have one, try a ; or \n in front of it.
(SC1056)
[error] 108-108: Missing '}'. Fix any mentioned problems and try again.
(SC1072)
[error] 108-108: Unexpected tokens after compound command. Bad redirection or missing ;/&&/||/|?
(SC1141)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: sglang (arm64)
- GitHub Check: operator (amd64)
🔇 Additional comments (4)
examples/backends/vllm/launch/lora/setup_minio.sh (1)
23-24: ✓ Model references consistently updated and cache cleanup correctly placed.
The LoRA model updates from Neural-Hacker/Qwen3-Math-Reasoning-LoRA to codelion/Qwen3-0.6B-accuracy-recovery-lora are applied uniformly across variable assignments, help text, and upload paths. The addition of rm -rf "${TEMP_DIR}/.cache" aligns with the ephemeral cache design where S3 is the authoritative source, eliminating unnecessary local artifacts after download.
Also applies to: 66-67, 176-176
examples/backends/vllm/launch/lora/agg_lora.sh (1)
38-38: ✓ LoRA rank increase and model references are consistent.
The max-lora-rank increase from 32 to 64 and the model path updates to codelion/Qwen3-0.6B-accuracy-recovery-lora are consistently applied across the load, inference, and unload operations. The updated test message (deep learning) aligns with the model's new purpose.
Also applies to: 46-49, 56-56, 72-72
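For reference, a minimal sketch of exercising the updated adapter through the OpenAI-compatible /v1/chat/completions endpoint used by the curl examples above. The host/port, adapter name, and prompt wording are assumptions taken from this review, not the script itself:

```python
# Minimal sketch (assumptions: frontend on localhost:8000, adapter already loaded
# under the name below). Sends one chat completion request to the LoRA adapter.
import json
import urllib.request

payload = {
    "model": "codelion/Qwen3-0.6B-accuracy-recovery-lora",
    "messages": [{"role": "user", "content": "Explain deep learning in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```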
examples/backends/vllm/launch/lora/agg_lora_router.sh (2)
99-130: Shellcheck parsing errors in example JSON output are false positives.The reported shellcheck errors (SC1073, SC1083, SC1056, SC1072, SC1141) occur in lines 99–130, which is a commented-out sample JSON response. These are false positives because shellcheck is incorrectly attempting to parse the JSON object structure as bash syntax. The JSON itself is properly commented and non-executable, so no fix is needed.
28-28: ✓ KV routing and dual-worker configuration are set up correctly.
The introduction of PYTHONHASHSEED=0 (line 28) ensures deterministic KV event ID generation for proper routing. The dual-worker setup (lines 42–61), with explicit KV events configuration on both workers, enables the KV-aware LoRA routing this PR addresses. The JSON configurations are properly quoted and will be parsed correctly at runtime. Process management (first worker in the background, second in the foreground) is handled appropriately with the EXIT trap.
Also applies to: 42-61
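A small illustration (not Dynamo or vLLM code) of why pinning PYTHONHASHSEED matters here: Python randomizes str hashes per interpreter process by default, so any hash-derived identifier would differ between the two workers unless the seed is fixed.

```python
# Illustration only: per-process str hash randomization vs. a pinned seed.
import os
import subprocess
import sys

snippet = 'print(hash("kv-block-prefix"))'

def run(extra_env=None):
    # Run a fresh interpreter and return the hash it computes for the same string.
    env = {**os.environ, **(extra_env or {})}
    out = subprocess.run([sys.executable, "-c", snippet],
                         capture_output=True, text=True, env=env)
    return out.stdout.strip()

# Default: each fresh interpreter picks a random hash seed, so values usually differ.
print(run(), run())

# With PYTHONHASHSEED=0 every interpreter computes the same hash, which is the
# property the launch script relies on for consistent KV event IDs across workers.
print(run({"PYTHONHASHSEED": "0"}), run({"PYTHONHASHSEED": "0"}))
```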
Force-pushed 8409a5f to 27a1b4e
Force-pushed 27a1b4e to 657702a
Merged 238 commits from main branch to bring the feature branch up to date.
Key conflicts resolved:
- Removed lib/kvbm-kernels references (deleted in main)
- Kept nova/nova-backend/kvbm workspace members from feature branch
- Maintained v2 module API refactoring from feature branch
- Updated Cargo.lock files to reflect new dependencies
Major updates from main include:
- LoRA support for vLLM (#4810)
- Multimodal documentation (#4510)
- Scaling adapter features (#4699, #4825)
- Tool calling support (#4822, #4722)
- NIXL connect improvements (#4433)
Signed-off-by: Ryan Olson <[email protected]>
Overview:
This PR enables KV-aware routing for LoRA models in vLLM.
Context
Previously, KV events were disabled in vLLM when LoRA was enabled (#4128).
There was a vLLM bug in the KV cache block storage system where the code incorrectly tried to access request.lora_request.id instead of the correct request.lora_request.adapter_id property.
The bug is fixed in vllm-project/vllm#27728, and the fix is available in the vLLM version currently used in Dynamo.
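As an illustration of that upstream bug (using a simplified stand-in class, not vLLM's actual LoRARequest definition), the cache code read an attribute that does not exist instead of the adapter_id property:

```python
# Simplified stand-in for vLLM's LoRARequest (not the real class) to illustrate
# the bug described above: the KV cache block code read `.id`, which does not
# exist, instead of the `.adapter_id` property.
from dataclasses import dataclass

@dataclass
class LoRARequestStandIn:
    lora_name: str
    lora_int_id: int
    lora_path: str

    @property
    def adapter_id(self) -> int:
        return self.lora_int_id

req = LoRARequestStandIn("accuracy-recovery", 1, "/tmp/lora")

print(req.adapter_id)        # correct access, as used by the upstream fix
try:
    print(req.id)            # the buggy access path raises AttributeError
except AttributeError as exc:
    print(f"broken lookup: {exc}")
```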
closes: DEP-546
related: DEP-588