Conversation


@biswapanda biswapanda commented Dec 8, 2025

Overview:

This PR enables KV-aware routing for LoRA models in vLLM.

Context

Previously, KV events were disabled in vLLM when LoRA was enabled (#4128). There was a vLLM bug in the KV cache block storage system: the code incorrectly tried to access `request.lora_request.id` instead of the correct `request.lora_request.adapter_id` property.

The bug is fixed in vllm-project/vllm#27728, and the fix is available in the vLLM version currently used by Dynamo.
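
For illustration, a minimal sketch of the failure mode (the `LoRARequest` fields below follow vLLM's public dataclass; the `extra_keys` lines are paraphrased for illustration, not the actual vLLM diff):

```python
from vllm.lora.request import LoRARequest

lora = LoRARequest(
    lora_name="my-adapter",      # hypothetical adapter name
    lora_int_id=1,
    lora_path="/tmp/my-adapter",  # hypothetical local path
)

# `adapter_id` is a property returning `lora_int_id`; there is no `.id`.
assert lora.adapter_id == 1

# Buggy block-hash code (paraphrased): raised AttributeError for any
# request carrying a LoRA adapter, so KV events could not be emitted.
#   extra_keys = (request.lora_request.id,)
# Fix in vllm-project/vllm#27728 (paraphrased): read the real property.
#   extra_keys = (request.lora_request.adapter_id,)
```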

closes: DEP-546
related: DEP-588

Summary by CodeRabbit

  • New Features

    • Added a new deployment script for orchestrating LoRA-enabled vLLM in router mode.
    • LoRA is now compatible with KV events functionality.
  • Chores

    • Updated example LoRA configurations with increased rank capacity and reference models.
    • Enhanced MinIO setup script with improved cleanup handling.


@biswapanda biswapanda self-assigned this Dec 8, 2025
@biswapanda biswapanda requested review from a team as code owners December 8, 2025 22:37

copy-pr-bot bot commented Dec 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the feat label Dec 8, 2025
@biswapanda biswapanda force-pushed the bis/dep-546-lora-aware-kv-routing-support branch from 31b1a20 to 8409a5f Compare December 8, 2025 22:38

coderabbitai bot commented Dec 8, 2025

Walkthrough

The PR removes a LoRA-specific KV events guard in the args configuration, updates LoRA model references across deployment scripts from one model to another, increases the LoRA max rank from 32 to 64, introduces a new orchestration script for multi-worker LoRA-enabled vLLM deployment, and adds cache cleanup in setup scripts.

Changes

Cohort / File(s) Summary
Core configuration
components/src/dynamo/vllm/args.py
Removed a multi-block guard and error path that prevented using KV events when LoRA was enabled, allowing KV events configuration to proceed without LoRA-specific blocking logic (an illustrative sketch of such a guard follows this table).
LoRA deployment scripts
examples/backends/vllm/launch/lora/agg_lora.sh, examples/backends/vllm/launch/lora/setup_minio.sh
Updated default LoRA model references from Neural-Hacker/Qwen3-Math-Reasoning-LoRA to codelion/Qwen3-0.6B-accuracy-recovery-lora; increased LoRA max rank from 32 to 64; updated inference test prompts; added cache directory cleanup.
New orchestration script
examples/backends/vllm/launch/lora/agg_lora_router.sh
Introduced a new bash script that orchestrates a local LoRA-enabled vLLM deployment with a router-mode frontend, two worker instances on separate CUDA devices, KV events configuration, and example curl commands for model listing, LoRA loading, and chat completion requests.
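
For context, the removed guard in args.py would have had roughly this shape (a hypothetical reconstruction; the parameter names and error message are illustrative, not the actual code):

```python
def validate_kv_events(enable_lora: bool, kv_events_enabled: bool) -> None:
    """Hypothetical reconstruction of the guard removed by this PR."""
    if enable_lora and kv_events_enabled:
        raise ValueError(
            "KV events are not supported when LoRA is enabled; "
            "disable one of the two."
        )
```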

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • The new orchestration script (agg_lora_router.sh) warrants attention to verify correct process orchestration, environment variable setup, and CUDA device assignment
  • The args.py guard removal should be cross-checked against the cited vLLM upstream bug to confirm resolution

Poem

🐰 LoRA guards gone, now KV events flow free,
Scripts multiplied, two workers dancing in harmony,
Qwen's accuracy blooms across the garden,
No cache clutter left—a tidy warren! ✨

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The title clearly summarizes the main change (enabling KV-aware routing for LoRA models in vLLM), which aligns with the primary objective of the PR.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Description check — ✅ Passed: The pull request description follows the required template structure, with Overview, Context, and issue references provided.


@biswapanda biswapanda enabled auto-merge (squash) December 8, 2025 22:40
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
examples/backends/vllm/launch/lora/agg_lora_router.sh (1)

82-94: Add explicit -X POST flag for consistency with other curl commands.

The curl command at line 82 is missing an explicit -X POST flag. While POST is implied when a payload is sent with -d, the other curl invocations in this file (lines 46, 71, 76) use an explicit -X POST for clarity and consistency.

Apply this diff:

```diff
- # Test LoRA inference
-curl localhost:8000/v1/chat/completions \
+# Test LoRA inference
+curl -X POST localhost:8000/v1/chat/completions \
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3fea2e1 and 8409a5f.

📒 Files selected for processing (4)
  • components/src/dynamo/vllm/args.py (0 hunks)
  • examples/backends/vllm/launch/lora/agg_lora.sh (2 hunks)
  • examples/backends/vllm/launch/lora/agg_lora_router.sh (1 hunks)
  • examples/backends/vllm/launch/lora/setup_minio.sh (3 hunks)
💤 Files with no reviewable changes (1)
  • components/src/dynamo/vllm/args.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 4644
File: lib/llm/src/lora/source.rs:313-323
Timestamp: 2025-11-26T22:06:18.789Z
Learning: In the Dynamo LoRA caching system (lib/llm/src/lora/source.rs), the local LoRA cache directory is considered ephemeral with S3 being the authoritative upstream source. If the local cache is removed or corrupted, it will be automatically rehydrated from S3, making complex three-step atomic swap patterns unnecessary for cache updates.
📚 Learning: 2025-11-26T22:06:18.789Z
Learnt from: biswapanda
Repo: ai-dynamo/dynamo PR: 4644
File: lib/llm/src/lora/source.rs:313-323
Timestamp: 2025-11-26T22:06:18.789Z
Learning: In the Dynamo LoRA caching system (lib/llm/src/lora/source.rs), the local LoRA cache directory is considered ephemeral with S3 being the authoritative upstream source. If the local cache is removed or corrupted, it will be automatically rehydrated from S3, making complex three-step atomic swap patterns unnecessary for cache updates.

Applied to files:

  • examples/backends/vllm/launch/lora/setup_minio.sh
  • examples/backends/vllm/launch/lora/agg_lora_router.sh
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.

Applied to files:

  • examples/backends/vllm/launch/lora/agg_lora_router.sh
🪛 Shellcheck (0.11.0)
examples/backends/vllm/launch/lora/agg_lora_router.sh

[error] 99-99: Couldn't parse this brace group. Fix to allow more checks.

(SC1073)


[warning] 104-104: This { is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)


[error] 108-108: Expected a '}'. If you have one, try a ; or \n in front of it.

(SC1056)


[error] 108-108: Missing '}'. Fix any mentioned problems and try again.

(SC1072)


[error] 108-108: Unexpected tokens after compound command. Bad redirection or missing ;/&&/||/|?

(SC1141)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (amd64)
🔇 Additional comments (4)
examples/backends/vllm/launch/lora/setup_minio.sh (1)

23-24: ✓ Model references consistently updated and cache cleanup correctly placed.

The LoRA model updates from Neural-Hacker/Qwen3-Math-Reasoning-LoRA to codelion/Qwen3-0.6B-accuracy-recovery-lora are applied uniformly across variable assignments, help text, and upload paths. The addition of rm -rf "${TEMP_DIR}/.cache" aligns with the ephemeral cache design where S3 is the authoritative source, eliminating unnecessary local artifacts after download.

Also applies to: 66-67, 176-176
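
As a sketch of the pattern this describes (the bucket name and destination path are hypothetical; `huggingface-cli download --local-dir` leaves a `.cache` directory inside the target dir, which is what gets removed):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Download the adapter into a temp dir.
TEMP_DIR="$(mktemp -d)"
huggingface-cli download codelion/Qwen3-0.6B-accuracy-recovery-lora \
  --local-dir "${TEMP_DIR}"

# Drop the HF cache artifacts before upload; MinIO/S3 stays authoritative.
rm -rf "${TEMP_DIR}/.cache"

# Upload the clean adapter directory to MinIO (hypothetical alias/bucket).
mc cp --recursive "${TEMP_DIR}/" \
  "myminio/lora-bucket/codelion/Qwen3-0.6B-accuracy-recovery-lora/"
```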

examples/backends/vllm/launch/lora/agg_lora.sh (1)

38-38: ✓ LoRA rank increase and model references are consistent.

The max-lora-rank increase from 32 to 64 and the model path updates to codelion/Qwen3-0.6B-accuracy-recovery-lora are applied consistently across the load, inference, and unload operations. The updated test prompt (about deep learning) matches the new model's focus.

Also applies to: 46-49, 56-56, 72-72
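
For reference, the load/infer/unload cycle looks roughly like this (the endpoints follow vLLM's dynamic LoRA adapter API; the exact payloads and prompt in the script may differ):

```bash
# Load the adapter by name and path.
curl -X POST localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "codelion/Qwen3-0.6B-accuracy-recovery-lora",
       "lora_path": "codelion/Qwen3-0.6B-accuracy-recovery-lora"}'

# Run inference against the adapter by naming it as the model.
curl -X POST localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "codelion/Qwen3-0.6B-accuracy-recovery-lora",
       "messages": [{"role": "user", "content": "What is deep learning?"}]}'

# Unload the adapter.
curl -X POST localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "codelion/Qwen3-0.6B-accuracy-recovery-lora"}'
```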

examples/backends/vllm/launch/lora/agg_lora_router.sh (2)

99-130: Shellcheck parsing errors in example JSON output are false positives.

The reported shellcheck errors (SC1073, SC1083, SC1056, SC1072, SC1141) occur in lines 99–130, which contain a commented-out sample JSON response. These are false positives: shellcheck is attempting to parse the JSON object structure as bash syntax. The JSON is properly commented out and non-executable, so no fix is needed.


28-28: ✓ KV routing setup and dual-worker configuration correctly configured.

The introduction of PYTHONHASHSEED=0 (line 28) ensures deterministic KV event ID generation for proper routing. The dual-worker setup (lines 42–61) with explicit KV events configuration on both workers enables the KV-aware LoRA routing that this PR addresses. The JSON configurations are properly quoted and will be correctly parsed at runtime. Process management (first worker in background, second in foreground) is appropriately handled with the EXIT trap.

Also applies to: 42-61
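
A stripped-down sketch of that background/foreground pattern (the worker command is a placeholder, not the actual script):

```bash
#!/usr/bin/env bash
set -euo pipefail

export PYTHONHASHSEED=0   # deterministic hashing so KV event IDs agree across workers

# Kill any background workers when the script exits, however it exits.
trap 'kill $(jobs -p) 2>/dev/null || true' EXIT

run_worker() {   # placeholder for the real dynamo/vLLM worker launch
  CUDA_VISIBLE_DEVICES="$1" python3 -c "import time; time.sleep(3600)"
}

run_worker 0 &   # first worker: background, GPU 0
run_worker 1     # second worker: foreground, GPU 1; keeps the script (and trap) alive
```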

@biswapanda biswapanda force-pushed the bis/dep-546-lora-aware-kv-routing-support branch from 8409a5f to 27a1b4e Compare December 8, 2025 22:45
@biswapanda biswapanda force-pushed the bis/dep-546-lora-aware-kv-routing-support branch from 27a1b4e to 657702a Compare December 9, 2025 04:58
@biswapanda biswapanda removed the request for review from mohammedabdulwahhab December 9, 2025 20:30
@biswapanda biswapanda merged commit 8559c2b into main Dec 9, 2025
31 of 34 checks passed
@biswapanda biswapanda deleted the bis/dep-546-lora-aware-kv-routing-support branch December 9, 2025 21:04
ryanolson added a commit that referenced this pull request Dec 10, 2025
Merged 238 commits from main branch to bring the feature branch up to date.

Key conflicts resolved:
- Removed lib/kvbm-kernels references (deleted in main)
- Kept nova/nova-backend/kvbm workspace members from feature branch
- Maintained v2 module API refactoring from feature branch
- Updated Cargo.lock files to reflect new dependencies

Major updates from main include:
- LoRA support for vLLM (#4810)
- Multimodal documentation (#4510)
- Scaling adapter features (#4699, #4825)
- Tool calling support (#4822, #4722)
- NIXL connect improvements (#4433)

Signed-off-by: Ryan Olson <[email protected]>
zxue2 pushed a commit to zxue2/dynamo that referenced this pull request Dec 11, 2025
