feat: KV aware LoRA request routing for vllm #4810
Conversation
Force-pushed 31b1a20 to 8409a5f
Walkthrough
The PR removes a LoRA-specific KV events guard in the args configuration, updates LoRA model references across deployment scripts from one model to another, increases the LoRA max rank from 32 to 64, introduces a new orchestration script for a multi-worker LoRA-enabled vLLM deployment, and adds cache cleanup in the setup scripts.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
examples/backends/vllm/launch/lora/agg_lora_router.sh (1)
82-94: Add an explicit -X POST flag for consistency with the other curl commands.
The curl command at line 82 is missing the explicit -X POST flag. While POST is implied for payloads sent with -d, the other curl invocations in this file (lines 46, 71, 76) use an explicit -X POST for clarity and consistency. Apply this diff:
-# Test LoRA inference
-curl localhost:8000/v1/chat/completions \
+# Test LoRA inference
+curl -X POST localhost:8000/v1/chat/completions \
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- components/src/dynamo/vllm/args.py (0 hunks)
- examples/backends/vllm/launch/lora/agg_lora.sh (2 hunks)
- examples/backends/vllm/launch/lora/agg_lora_router.sh (1 hunks)
- examples/backends/vllm/launch/lora/setup_minio.sh (3 hunks)
💤 Files with no reviewable changes (1)
- components/src/dynamo/vllm/args.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 4644
File: lib/llm/src/lora/source.rs:313-323
Timestamp: 2025-11-26T22:06:18.789Z
Learning: In the Dynamo LoRA caching system (lib/llm/src/lora/source.rs), the local LoRA cache directory is considered ephemeral with S3 being the authoritative upstream source. If the local cache is removed or corrupted, it will be automatically rehydrated from S3, making complex three-step atomic swap patterns unnecessary for cache updates.
📚 Learning: 2025-11-26T22:06:18.789Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 4644
File: lib/llm/src/lora/source.rs:313-323
Timestamp: 2025-11-26T22:06:18.789Z
Learning: In the Dynamo LoRA caching system (lib/llm/src/lora/source.rs), the local LoRA cache directory is considered ephemeral with S3 being the authoritative upstream source. If the local cache is removed or corrupted, it will be automatically rehydrated from S3, making complex three-step atomic swap patterns unnecessary for cache updates.
Applied to files:
- examples/backends/vllm/launch/lora/setup_minio.sh
- examples/backends/vllm/launch/lora/agg_lora_router.sh
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.
Applied to files:
- examples/backends/vllm/launch/lora/agg_lora_router.sh
🪛 Shellcheck (0.11.0)
examples/backends/vllm/launch/lora/agg_lora_router.sh
[error] 99-99: Couldn't parse this brace group. Fix to allow more checks.
(SC1073)
[warning] 104-104: This { is literal. Check expression (missing ;/\n?) or quote it.
(SC1083)
[error] 108-108: Expected a '}'. If you have one, try a ; or \n in front of it.
(SC1056)
[error] 108-108: Missing '}'. Fix any mentioned problems and try again.
(SC1072)
[error] 108-108: Unexpected tokens after compound command. Bad redirection or missing ;/&&/||/|?
(SC1141)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: trtllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: operator (arm64)
- GitHub Check: sglang (arm64)
- GitHub Check: operator (amd64)
🔇 Additional comments (4)
examples/backends/vllm/launch/lora/setup_minio.sh (1)
23-24: ✓ Model references consistently updated and cache cleanup correctly placed.
The LoRA model updates from Neural-Hacker/Qwen3-Math-Reasoning-LoRA to codelion/Qwen3-0.6B-accuracy-recovery-lora are applied uniformly across variable assignments, help text, and upload paths. The addition of rm -rf "${TEMP_DIR}/.cache" aligns with the ephemeral cache design where S3 is the authoritative source, eliminating unnecessary local artifacts after download.
Also applies to: 66-67, 176-176
examples/backends/vllm/launch/lora/agg_lora.sh (1)
38-38: ✓ LoRA rank increase and model references are consistent.
The max-lora-rank increase from 32 to 64 and the model path updates to codelion/Qwen3-0.6B-accuracy-recovery-lora are consistently applied across the load, inference, and unload operations. The updated test message (deep learning) aligns with the model's new purpose.
Also applies to: 46-49, 56-56, 72-72
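For reference, a minimal sketch of exercising the updated adapter through the OpenAI-compatible /v1/chat/completions endpoint used by the curl examples above. The host/port, adapter name, and prompt wording are assumptions taken from this review, not the script itself:

```python
# Minimal sketch (assumptions: frontend on localhost:8000, adapter already loaded
# under the name below). Sends one chat completion request to the LoRA adapter.
import json
import urllib.request

payload = {
    "model": "codelion/Qwen3-0.6B-accuracy-recovery-lora",
    "messages": [{"role": "user", "content": "Explain deep learning in one sentence."}],
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```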
examples/backends/vllm/launch/lora/agg_lora_router.sh (2)
99-130: Shellcheck parsing errors in example JSON output are false positives.The reported shellcheck errors (SC1073, SC1083, SC1056, SC1072, SC1141) occur in lines 99–130, which is a commented-out sample JSON response. These are false positives because shellcheck is incorrectly attempting to parse the JSON object structure as bash syntax. The JSON itself is properly commented and non-executable, so no fix is needed.
28-28: ✓ KV routing and dual-worker configuration are set up correctly.
The introduction of PYTHONHASHSEED=0 (line 28) ensures deterministic KV event ID generation for proper routing. The dual-worker setup (lines 42–61), with explicit KV events configuration on both workers, enables the KV-aware LoRA routing this PR addresses. The JSON configurations are properly quoted and will be parsed correctly at runtime. Process management (first worker in the background, second in the foreground) is handled appropriately with the EXIT trap.
Also applies to: 42-61
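A small illustration (not Dynamo or vLLM code) of why pinning PYTHONHASHSEED matters here: Python randomizes str hashes per interpreter process by default, so any hash-derived identifier would differ between the two workers unless the seed is fixed.

```python
# Illustration only: per-process str hash randomization vs. a pinned seed.
import os
import subprocess
import sys

snippet = 'print(hash("kv-block-prefix"))'

def run(extra_env=None):
    # Run a fresh interpreter and return the hash it computes for the same string.
    env = {**os.environ, **(extra_env or {})}
    out = subprocess.run([sys.executable, "-c", snippet],
                         capture_output=True, text=True, env=env)
    return out.stdout.strip()

# Default: each fresh interpreter picks a random hash seed, so values usually differ.
print(run(), run())

# With PYTHONHASHSEED=0 every interpreter computes the same hash, which is the
# property the launch script relies on for consistent KV event IDs across workers.
print(run({"PYTHONHASHSEED": "0"}), run({"PYTHONHASHSEED": "0"}))
```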
Force-pushed 8409a5f to 27a1b4e
Force-pushed 27a1b4e to 657702a
Merged 238 commits from main branch to bring the feature branch up to date.
Key conflicts resolved:
- Removed lib/kvbm-kernels references (deleted in main)
- Kept nova/nova-backend/kvbm workspace members from feature branch
- Maintained v2 module API refactoring from feature branch
- Updated Cargo.lock files to reflect new dependencies
Major updates from main include:
- LoRA support for vLLM (#4810)
- Multimodal documentation (#4510)
- Scaling adapter features (#4699, #4825)
- Tool calling support (#4822, #4722)
- NIXL connect improvements (#4433)
Signed-off-by: Ryan Olson <[email protected]>
Overview:
This PR enables KV-aware routing for LoRA models in vLLM.
Context
Previously, KV events were disabled in vLLM when LoRA was enabled (#4128).
There was a vLLM bug in the KV cache block storage system where the code incorrectly tried to access request.lora_request.id instead of the correct request.lora_request.adapter_id property.
The bug is fixed in vllm-project/vllm#27728, and the fix is available in the vLLM version currently used in Dynamo.
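As an illustration of that upstream bug (using a simplified stand-in class, not vLLM's actual LoRARequest definition), the cache code read an attribute that does not exist instead of the adapter_id property:

```python
# Simplified stand-in for vLLM's LoRARequest (not the real class) to illustrate
# the bug described above: the KV cache block code read `.id`, which does not
# exist, instead of the `.adapter_id` property.
from dataclasses import dataclass

@dataclass
class LoRARequestStandIn:
    lora_name: str
    lora_int_id: int
    lora_path: str

    @property
    def adapter_id(self) -> int:
        return self.lora_int_id

req = LoRARequestStandIn("accuracy-recovery", 1, "/tmp/lora")

print(req.adapter_id)        # correct access, as used by the upstream fix
try:
    print(req.id)            # the buggy access path raises AttributeError
except AttributeError as exc:
    print(f"broken lookup: {exc}")
```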
closes: DEP-546
related: DEP-588