@jiashaokun-1 jiashaokun-1 commented Nov 13, 2025

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @jiashaokun-1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the router's ability to manage and distribute requests across data parallel workers. By introducing a new scheduling strategy that considers the minimum number of tokens on individual data parallel ranks, it aims to optimize resource utilization and improve the efficiency of prefill and decode operations. The changes involve extending API parameters, updating internal data structures to track per-rank loads, and modifying the core routing logic to leverage this new load information for more intelligent request distribution.

Highlights

  • Data Parallel Rank Routing: Introduced a new mechanism to route requests to specific data parallel (DP) ranks within workers based on their current token load. This allows for more granular load balancing in data-parallel setups.
  • Configuration Options: Added a new router configuration option, dp_minimum_tokens_scheduler, to enable this new minimum token-based scheduling for data parallel groups. Also introduced worker_load_check_interval to control how frequently worker loads are checked.
  • API and Internal Data Structures Update: Extended various request and protocol structures (e.g., GenerateReqInput, ChatCompletionRequest, GenerateRequest) with a decode_dp_rank field, alongside the existing data_parallel_rank, to differentiate routing for prefill and decode stages.
  • Worker Load Management: The WorkerManager and LoadBalancingPolicy implementations (CacheAware, PowerOfTwo, Random, RoundRobin) were updated to fetch, store, and utilize per-DP-rank load information, rather than just aggregate worker load. A new DPLoadManager was introduced to handle this DP-specific load caching and incrementing.
  • Prefill-Decode Router Integration: The PDRouter now incorporates the logic to select the appropriate data_parallel_rank and decode_dp_rank for requests by querying the load balancing policies for the ranks with the lowest token count, when the dp_minimum_tokens_scheduler is enabled.
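
The minimum-token selection described in the highlights above can be sketched roughly as follows. This is only an illustration, not the code from this pull request: the `DPLoadCache` class, its method names, and the worker URL are hypothetical, and it assumes that per-rank loads arrive as a plain list of token counts and that the cache is incremented optimistically between load checks.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DPLoadCache:
    """Hypothetical per-worker cache of token loads, indexed by DP rank."""

    loads: Dict[str, List[int]] = field(default_factory=dict)

    def update(self, worker_url: str, rank_loads: List[int]) -> None:
        # Replace the cached loads with a fresh snapshot from a periodic load check.
        self.loads[worker_url] = list(rank_loads)

    def pick_min_rank(self, worker_url: str, request_tokens: int) -> int:
        # Choose the rank with the fewest tokens; ties go to the lowest rank index.
        rank_loads = self.loads[worker_url]
        rank = min(range(len(rank_loads)), key=rank_loads.__getitem__)
        # Add the request's tokens right away so the next pick sees the new load
        # even before the next load check refreshes the snapshot.
        rank_loads[rank] += request_tokens
        return rank


cache = DPLoadCache()
cache.update("http://prefill-0", [120, 40, 75])
first = cache.pick_min_rank("http://prefill-0", request_tokens=50)
second = cache.pick_min_rank("http://prefill-0", request_tokens=10)
```

Incrementing the cached value on each pick is one way to avoid sending a burst of requests to the same rank between two load checks; whether the actual DPLoadManager behaves this way should be confirmed against the diff.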


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new router scheduling strategy to select a data parallel (DP) group based on the one with the minimum number of tokens. This is a significant feature for load balancing in distributed setups. The changes are extensive, correctly propagating new configuration options and logic through both the Python and Rust codebases. The introduction of a reusable DPLoadManager in Rust is a good design choice for managing DP loads. I've identified a minor bug where a statement was likely intended to be a log message, and some unreachable code that can be simplified. Overall, this is a solid and valuable contribution.

self.workers[req.decode_dp_rank].send_pyobj(req)
return True
if req.data_parallel_rank is not None:
(f"Decode direct routing to DP rank {req.decode_dp_rank}, by data parallel rank")

medium

This f-string is a statement with no effect and seems to be a bug. It was likely intended to be a logger.debug call. Additionally, it references req.decode_dp_rank while this code block routes based on req.data_parallel_rank, which could be confusing. I've corrected it to log req.data_parallel_rank to match the routing logic.

Suggested change
- (f"Decode direct routing to DP rank {req.decode_dp_rank}, by data parallel rank")
+ logger.debug(f"Decode direct routing to DP rank {req.data_parallel_rank}, by data parallel rank")
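
To illustrate why the original line has no effect, here is a small standalone sketch. The logger name and the `data_parallel_rank` variable are placeholders for illustration only; the project's real logger setup may differ.

```python
import io
import logging

# Capture log output in memory so the difference is observable.
stream = io.StringIO()
logger = logging.getLogger("dp_routing_example")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False

data_parallel_rank = 3  # placeholder for req.data_parallel_rank

# A parenthesized f-string on its own line is evaluated and then discarded.
(f"Decode direct routing to DP rank {data_parallel_rank}")
after_bare_statement = stream.getvalue()

# Passing the same string to logger.debug actually emits a record.
logger.debug(f"Decode direct routing to DP rank {data_parallel_rank}")
after_debug_call = stream.getvalue()
```

After the bare statement the captured stream is still empty; only the `logger.debug` call writes the message.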

Comment on lines +430 to +434
if let Some(policy_registry) = &self.policy_registry {
policy_registry.enable_dp_minimum_tokens_scheduler();
} else {
info!("policy_registry is None")
}

medium

Since self.policy_registry is assigned a Some(...) value on line 428, the if let Some(...) on line 430 will always succeed. This makes the else block unreachable dead code. You can simplify this block to a direct call.

            self.policy_registry.as_ref().unwrap().enable_dp_minimum_tokens_scheduler();
