Conversation

@vipwangerxiao commented Nov 26, 2025

Motivation

When transmitting large multimodal data and handling many concurrent requests, the Scheduler spends more time receiving requests. By using a separate thread for data reception, the Scheduler's main thread no longer needs to handle all incoming requests itself. This follows the same idea mentioned in #6189.

In specific scenarios, this yields a 4% to 5% improvement in throughput and in TTFT/E2E latency.


Modifications

Use a separate thread to handle request reception in the Scheduler.
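
A minimal sketch of the pattern, not the PR's exact code (names such as `start_recv_thread`, `recv_socket`, and `recv_queue` are illustrative): a daemon thread blocks on the ZeroMQ socket and pushes deserialized requests into a thread-safe `queue.Queue`, so the scheduler's main loop never blocks on socket I/O.

```python
import queue
import threading

import zmq


def start_recv_thread(recv_socket: zmq.Socket, recv_queue: queue.Queue) -> threading.Thread:
    """Spawn a daemon thread that blocks on the socket and enqueues incoming requests."""

    def _loop() -> None:
        while True:
            try:
                # Block here instead of in the scheduler's main loop.
                req = recv_socket.recv_pyobj()
            except zmq.ZMQError:
                # Socket closed or context terminated: exit the thread.
                break
            recv_queue.put(req)

    t = threading.Thread(target=_loop, daemon=True, name="scheduler-recv")
    t.start()
    return t
```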

Accuracy Tests

python3 -m sglang.test.few_shot_gsm8k --num-questions 200

Accuracy: 0.910
Invalid: 0.000
Latency: 35.407 s
Output throughput: 829.346 token/s

Benchmarking and Profiling

| Metric | Before | After | Performance gain |
| --- | --- | --- | --- |
| Request throughput (req/s) | 3.86 | 4.04 | +4.7% |
| P99 TTFT (ms) | 24407.08 | 23275.55 | -4.6% |
| P99 ITL (ms) | 880.34 | 875.64 | -0.6% |
| Mean E2E Latency (ms) | 24109.25 | 23202.49 | -3.8% |

Launch server command on NVIDIA 5090

python3 -m sglang.launch_server --model-path Qwen/Qwen3-VL-4B-Instruct-FP8 --enable-multimodal --cuda-graph-max-bs 128 --context-length 2560 --page-size 16 --stream-interval 300 --mem-fraction-static 0.7 --port 30260 --base-gpu-id 0 --disable-radix-cache --mm-attention-backend sdpa --kv-cache-dtype fp8_e4m3

Launch client command

python3 -m sglang.bench_serving --backend sglang --dataset-name image --model Qwen/Qwen3-VL-4B-Instruct-FP8 --port 30260 --random-input-len 100 --random-output-len 20 --image-count 1 --image-resolution 1200x1200 --num-prompts 96 --request-rate inf --max-concurrency 128 --random-range-ratio 1

Before

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 128
Successful requests: 96
Benchmark duration (s): 24.86
Total input tokens: 149781
Total input text tokens: 10965
Total input vision tokens: 138816
Total generated tokens: 1920
Total generated tokens (retokenized): 1913
Request throughput (req/s): 3.86
Input token throughput (tok/s): 6025.07
Output token throughput (tok/s): 77.23
Total token throughput (tok/s): 6102.31
Concurrency: 93.10
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 24109.25
Median E2E Latency (ms): 24676.35
---------------Time to First Token----------------
Mean TTFT (ms): 16298.84
Median TTFT (ms): 16341.09
P99 TTFT (ms): 24407.08
---------------Inter-Token Latency----------------
Mean ITL (ms): 415.40
Median ITL (ms): 413.84
P95 ITL (ms): 836.74
P99 ITL (ms): 880.34
Max ITL (ms): 880.36
==================================================

After

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 128
Successful requests: 96
Benchmark duration (s): 23.76
Total input tokens: 149624
Total input text tokens: 10808
Total input vision tokens: 138816
Total generated tokens: 1920
Total generated tokens (retokenized): 1920
Request throughput (req/s): 4.04
Input token throughput (tok/s): 6297.53
Output token throughput (tok/s): 80.81
Total token throughput (tok/s): 6378.34
Concurrency: 93.75
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 23202.49
Median E2E Latency (ms): 23566.99
---------------Time to First Token----------------
Mean TTFT (ms): 15375.16
Median TTFT (ms): 15518.77
P99 TTFT (ms): 23275.55
---------------Inter-Token Latency----------------
Mean ITL (ms): 411.96
Median ITL (ms): 411.09
P95 ITL (ms): 799.90
P99 ITL (ms): 875.64
Max ITL (ms): 896.44
==================================================

@gemini-code-assist

Summary of Changes

Hello @vipwangerxiao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the SGLang Scheduler's ability to handle large multimodal data and numerous requests by introducing a dedicated background thread for receiving incoming requests. This change decouples the request reception process from the main scheduler loop, preventing it from being blocked by I/O operations and allowing the main thread to focus on core scheduling tasks. The modification aims to improve overall system responsiveness and has shown measurable gains in throughput and latency.

Highlights

  • Offloaded Request Reception: The main scheduler thread no longer directly handles receiving requests from tokenizer and RPC sockets, improving its responsiveness.
  • Dedicated Receiving Thread: A new background thread has been introduced to continuously poll and receive incoming requests from tokenizer and RPC sockets.
  • Performance Improvement: Benchmarking results demonstrate a 4-5% improvement in request throughput and a reduction in Time-to-First-Token (TTFT) and End-to-End (E2E) latency in specific scenarios.
  • Queue-Based Communication: A queue.Queue is utilized for safe and efficient communication, allowing the dedicated receiving thread to pass requests to the main scheduler thread.
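
The consumer side of such a queue-based handoff could look roughly like the sketch below (illustrative only; `drain_recv_queue` is a hypothetical helper, not the PR's actual function): the main scheduler loop drains whatever has arrived so far with `get_nowait()` and never blocks.

```python
import queue


def drain_recv_queue(recv_queue: queue.Queue) -> list:
    """Called once per scheduler iteration; returns everything received so far without blocking."""
    reqs = []
    while True:
        try:
            reqs.append(recv_queue.get_nowait())
        except queue.Empty:
            break
    return reqs
```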

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a separate thread for receiving requests in the scheduler, which is a good approach to improve performance by offloading I/O work from the main scheduler loop. The implementation using a queue.Queue for communication between the threads is sound.

My review focuses on improving the robustness of the new receiver thread and general code quality. I've identified a few areas for improvement:

  1. The exception handling in the new receiver thread is too broad, which could cause the thread to terminate silently on unexpected errors, making the server unresponsive. I've suggested more specific exception handling to make it more robust.
  2. There are a couple of instances of bare except: clauses, which can hide bugs. I've recommended replacing them with specific exceptions.

These changes should make the new implementation more resilient and easier to debug.
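
As an illustration of the suggested direction (a hedged sketch, not the PR's final code; `recv_loop`, `recv_socket`, and `recv_queue` are hypothetical names), the receiver loop could distinguish expected shutdown errors from unexpected ones and log the latter instead of letting the thread die silently:

```python
import logging
import queue

import zmq

logger = logging.getLogger(__name__)


def recv_loop(recv_socket: zmq.Socket, recv_queue: queue.Queue) -> None:
    while True:
        try:
            req = recv_socket.recv_pyobj()
        except zmq.ContextTerminated:
            # Expected during shutdown: exit the thread cleanly.
            break
        except zmq.ZMQError as e:
            logger.error("Receiver thread socket error: %s", e)
            break
        except Exception:
            # Log and continue rather than terminating silently, which would
            # leave the server unable to accept new requests.
            logger.exception("Unexpected error in receiver thread")
            continue
        recv_queue.put(req)
```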

vipwangerxiao and others added 3 commits November 27, 2025 09:45
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>