Receive requests in a separate Scheduler thread #13986
Conversation
Co-authored-by: Feng Su <[email protected]>
Summary of Changes
Hello @vipwangerxiao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the SGLang Scheduler's ability to handle large multimodal data and numerous requests by introducing a dedicated background thread for receiving incoming requests. This change decouples request reception from the main scheduler loop, preventing it from being blocked by I/O operations and allowing the main thread to focus on core scheduling tasks. The modification aims to improve overall system responsiveness and has shown measurable gains in throughput and latency.
Code Review
This pull request introduces a separate thread for receiving requests in the scheduler, which is a good approach to improve performance by offloading I/O work from the main scheduler loop. The implementation using a queue.Queue for communication between the threads is sound.
My review focuses on improving the robustness of the new receiver thread and general code quality. I've identified a few areas for improvement:
- The exception handling in the new receiver thread is too broad, which could cause the thread to terminate silently on unexpected errors, making the server unresponsive. I've suggested more specific exception handling to make it more robust.
- There are a couple of instances of bare `except:` clauses, which can hide bugs. I've recommended replacing them with specific exceptions.
These changes should make the new implementation more resilient and easier to debug.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Motivation
When transmitting larger multimodal data and handling numerous requests, the Scheduler spends more time receiving requests. By using a separate thread for data reception, the Scheduler's main thread does not need to handle all the requests at once, similar to what was mentioned in #6189.
In specific scenarios, this achieves a 4% to 5% improvement in throughput and TTFT/E2E latency.
Modifications
Use a separate thread to handle request reception in the Scheduler.
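The pattern described above can be sketched as follows. This is an illustrative outline, not the PR's actual implementation; the class and method names (`SchedulerSketch`, `drain_requests`, `recv_fn`) are hypothetical. A daemon thread blocks on I/O and enqueues requests, while the main scheduler loop drains the queue without blocking:

```python
import queue
import threading

class SchedulerSketch:
    """Minimal sketch: a background thread receives requests and hands
    them to the main scheduler loop through a thread-safe Queue."""

    def __init__(self, recv_fn):
        self._recv_fn = recv_fn  # hypothetical blocking receive callable
        self._req_queue = queue.Queue()
        self._thread = threading.Thread(target=self._recv_loop, daemon=True)
        self._thread.start()

    def _recv_loop(self):
        # Blocking I/O stays off the main scheduler thread.
        while True:
            self._req_queue.put(self._recv_fn())

    def drain_requests(self):
        # Called once per scheduler iteration: grab everything queued so
        # far without blocking, so scheduling never stalls on I/O.
        reqs = []
        while True:
            try:
                reqs.append(self._req_queue.get_nowait())
            except queue.Empty:
                return reqs
```

`queue.Queue` handles the locking between the two threads, so the main loop's only added cost is a non-blocking drain at the top of each iteration.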
Accuracy Tests
python3 -m sglang.test.few_shot_gsm8k --num-questions 200
Benchmarking and Profiling
Launch server command on NVIDIA 5090
python3 -m sglang.launch_server --model-path Qwen/Qwen3-VL-4B-Instruct-FP8 --enable-multimodal --cuda-graph-max-bs 128 --context-length 2560 --page-size 16 --stream-interval 300 --mem-fraction-static 0.7 --port 30260 --base-gpu-id 0 --disable-radix-cache --mm-attention-backend sdpa --kv-cache-dtype fp8_e4m3
Launch client command
python3 -m sglang.bench_serving --backend sglang --dataset-name image --model Qwen/Qwen3-VL-4B-Instruct-FP8 --port 30260 --random-input-len 100 --random-output-len 20 --image-count 1 --image-resolution 1200x1200 --num-prompts 96 --request-rate inf --max-concurrency 128 --random-range-ratio 1
Before
After
Checklist