[FEAT][RFC] Add router-side request queuing proposal#876

Open
ardecode wants to merge 1 commit into vllm-project:main from ardecode:feat/router-side-request-queuing-rfc

Conversation

@ardecode
Contributor

@ardecode ardecode commented Mar 7, 2026

Refs #855

Summary

This PR adds an RFC for router-side request queuing in the vLLM router.

The proposal introduces router-side admission control in front of backend replicas to:

  • smooth bursty traffic
  • keep backend vllm:num_requests_waiting shallow
  • provide bounded queueing behavior and explicit 429 overload responses

This PR is proposal-only.

What is included

  • Add proposals/router-side-request-queuing.md
  • Define the architecture for router-side request queueing
  • Scope Phase 1 to roundrobin routing only
  • Describe follow-up phases for session, then kvaware / prefixaware

Key design points in the RFC

  • Per-model FIFO queueing
  • Per-endpoint reservations / leases to avoid thundering herd behavior
  • Admission based primarily on backend num_requests_waiting
  • Lease release on first response chunk, or on pre-first-token error/cancel
  • 429 for queue full, queue timeout, and pinned-endpoint overload
  • ?id= requests are immediate admit/reject only in Phase 1

Why Phase 1 is limited to roundrobin

The RFC intentionally starts with roundrobin support only so the queue core can land without simultaneously deciding:

  • session stickiness vs fallback policy
  • LMCache locality vs FIFO fairness for kvaware / prefixaware

Those policy questions are deferred to later phases after the core queueing mechanism is validated.



@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a formal Request for Comments (RFC) to enhance the vLLM router's resilience and traffic management capabilities. The core idea is to implement an admission control mechanism that queues incoming requests at the router level, preventing backends from becoming overwhelmed during traffic spikes. This approach aims to provide more predictable performance and clearer error signals to clients, laying the groundwork for more sophisticated routing policies in the future.

Highlights

  • New RFC Document: Added a new RFC document, proposals/router-side-request-queuing.md, outlining a proposal for router-side request queuing in the vLLM router.
  • Admission Control Layer: The proposal introduces an AdmissionController to act as a router-side admission control layer, smoothing bursty traffic and preventing backend overload.
  • Bounded Queuing and Overload Responses: The system will provide bounded queueing behavior and explicit 429 HTTP responses for queue full, queue timeout, and pinned-endpoint overload conditions.
  • Endpoint Leases: A key design point is the use of AdmissionLease bound to specific endpoints, which are released upon the first response chunk or on error/cancellation, to manage reservations and prevent thundering herd issues.
  • Phased Implementation: The initial implementation (Phase 1) is intentionally scoped to roundrobin routing only, with future phases planned for session, kvaware, and prefixaware routing.


Changelog
  • proposals/router-side-request-queuing.md
    • Added a new proposal document detailing the architecture, motivation, goals, and implementation plan for router-side request queuing.
Activity
  • Referenced GitHub Issue [Roadmap] vLLM Production Stack 2026 Roadmap #855, indicating this PR addresses the need for router-side request queuing support.
  • Explicitly stated that this PR is proposal-only, meaning it introduces documentation rather than code changes.
  • Provided a detailed summary of the proposal, including key design points and a phased implementation timeline.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-written and thorough RFC for router-side request queuing. The proposal is clear, well-scoped, and considers important aspects like performance, testing, and phased implementation. My feedback includes a couple of suggestions to further clarify details around client cancellation handling for queued requests and the expected reasons for the new rejection metric, which will help ensure the implementation aligns perfectly with the design goals.

Note: Security Review has been skipped due to the limited scope of the PR.

Comment on lines +86 to +89
- Queueing is enabled only for `roundrobin` routing.
- If router queueing is enabled with any other routing mode, router startup should fail with a validation error.
- Requests that target a specific endpoint via `?id=` are not queued in V1. They are admitted only if that endpoint is below the overload threshold; otherwise they are rejected immediately with `429`.
- If queueing is disabled, existing router behavior remains unchanged.


medium

The proposal mentions request cancellation in the context of releasing an endpoint lease (line 82) and in the test plan (line 171), but it's not explicitly described how cancellations are handled for requests that are waiting in the queue (i.e., before a lease is granted). For completeness, I suggest adding a bullet point to this section to clarify this behavior. This will ensure the implementation covers this important edge case.

Suggested change
- Queueing is enabled only for `roundrobin` routing.
- If router queueing is enabled with any other routing mode, router startup should fail with a validation error.
- Requests that target a specific endpoint via `?id=` are not queued in V1. They are admitted only if that endpoint is below the overload threshold; otherwise they are rejected immediately with `429`.
- If queueing is disabled, existing router behavior remains unchanged.
- Queueing is enabled only for `roundrobin` routing.
- If router queueing is enabled with any other routing mode, router startup should fail with a validation error.
- Requests that target a specific endpoint via `?id=` are not queued in V1. They are admitted only if that endpoint is below the overload threshold; otherwise they are rejected immediately with `429`.
- Queued requests that are cancelled by the client before being admitted will be removed from the queue.
- If queueing is disabled, existing router behavior remains unchanged.

- `vllm_router:queued_requests`
- `vllm_router:queue_wait_seconds`
- `vllm_router:admissions_total`
- `vllm_router:rejections_total{reason}`


medium

To make the vllm_router:rejections_total{reason} metric more concrete, it would be helpful to list the expected values for the reason label. The proposal mentions several rejection scenarios (queue full, queue timeout, pinned-endpoint overload). Explicitly listing them here would improve clarity for implementation and for users setting up monitoring.

Suggested change
- `vllm_router:rejections_total{reason}`
- `vllm_router:rejections_total{reason}` (e.g., reason="queue_full", "queue_timeout", "pinned_overload")
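One way to pin those label values down in the implementation is a small enum. The sketch below just mirrors the review's suggested values; `RejectReason` and `rejection_response` are illustrative names, not part of the RFC.

```python
from enum import Enum


class RejectReason(str, Enum):
    # The three rejection paths the RFC names, as metric label values.
    QUEUE_FULL = "queue_full"
    QUEUE_TIMEOUT = "queue_timeout"
    PINNED_OVERLOAD = "pinned_overload"


def rejection_response(reason: RejectReason) -> tuple[int, dict]:
    # Every rejection path maps to HTTP 429; the reason string is surfaced
    # in the response body and reused as the {reason} label on the
    # rejections counter, so monitoring and client errors stay consistent.
    return 429, {"error": "router_overloaded", "reason": reason.value}
```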

@ardecode
Contributor Author

@ruizhang0101 can you kindly take a look at this PR? TIA!

Collaborator

@ruizhang0101 ruizhang0101 left a comment


Hi! Thanks a lot for the great proposal! From my perspective, this is very strong and solid.

I have some questions and comments regarding this RFC; I hope they are helpful.

  1. For the metrics scraper, will you reuse the current one or introduce a new one? The current scraper is designed to run at a longer interval, so it may need some modification to perform well when scraping at a faster interval.
  2. There is a failover mechanism that re-routes a request to a different endpoint; the lease design should account for, or at least be compatible with, this scenario.
  3. It would be great to describe the concurrency control around the lock in the admission controller, and the queue wake-up design.
  4. What will trigger a dequeue? Is it event-driven, or will it poll?
  5. What will happen to queued requests when the service shuts down?

Again, nice proposal :))

@ardecode
Contributor Author


  1. We plan to reuse the current EngineStatsScraper rather than introduce a second scraper service. The implementation will extend it with a queueing-specific fast path when router queueing is enabled. This lets us keep the existing full scrape interval for the broader metric set while adding a faster, admission-oriented refresh focused on num_requests_waiting. That keeps backend metrics under one owner and avoids duplicated lifecycle management.

  2. If I am not wrong, the current failover logic only reroutes when process_request() raises before the stream is established, so leases only need to be failover-compatible for the pre-first-token failure path. The lease will be endpoint-scoped: if a request fails before the first token and the router decides to reroute, the implementation will release the old endpoint lease first and then acquire a fresh lease for the next endpoint.

  3. The admission controller will use a single asyncio.Lock to protect queue state and per-endpoint reservation counts. It will not use a wake-all design. Each queued request will wait on its own Future, and dequeue/drain will run under the lock, grant leases, update reservations, and resolve only the futures that were actually admitted. Routing and network I/O will happen outside the lock.

  4. The queue will be mainly event-driven. We will try to dequeue when capacity changes locally, such as on first-token lease release, pre-first-token error, cancellation, or timeout cleanup. In addition, the fast admission scrape will provide a lightweight periodic refresh so queued requests can still move forward when backend capacity changes independently of the router. So the design is event-driven with a small periodic refresh, not pure polling.

  5. On graceful shutdown, the router will stop admitting new queued requests, fail any requests still waiting in the queue, and release any undispatched leases. Requests that were already sent to a backend will follow the normal shutdown behavior. Since the queue is in memory, queued requests are not persisted across restarts. We expect rejected queued requests to return a clear error, likely 503.
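Points 3 and 4 together can be sketched as follows. This is a minimal illustration of the single-lock, per-request-Future, event-driven design described above; `QueueCore` and its methods are assumed names, not the planned implementation.

```python
import asyncio
from collections import deque


class QueueCore:
    """Sketch of a single-lock admission queue: each waiter gets its own
    Future, and a capacity-change event wakes only the waiters that are
    actually admitted (no wake-all broadcast)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity          # admitted requests allowed in flight
        self.in_flight = 0
        self.lock = asyncio.Lock()
        self.waiters: deque[asyncio.Future] = deque()

    async def admit(self) -> None:
        async with self.lock:
            if self.in_flight < self.capacity:
                self.in_flight += 1
                return
            fut = asyncio.get_running_loop().create_future()
            self.waiters.append(fut)
        await fut                         # resolved by a later release()

    async def release(self) -> None:
        # Event-driven trigger: first-token lease release, pre-first-token
        # error, cancellation, or timeout cleanup would all land here.
        async with self.lock:
            self.in_flight -= 1
            while self.waiters and self.in_flight < self.capacity:
                fut = self.waiters.popleft()
                if not fut.cancelled():   # skip requests cancelled in queue
                    self.in_flight += 1
                    fut.set_result(None)
```

A periodic fast metrics refresh would call the same drain path under the lock, which is the "small periodic refresh" complementing the event-driven wakeups.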

@ruizhang0101
Collaborator

I see. Could you add these to the RFC doc as well? Also, could you add a system diagram for this feature? And don't forget to sign off the commit :).
