Conversation

@vipwangerxiao commented Nov 26, 2025

Motivation

When transmitting large multimodal data and handling many concurrent requests, the Scheduler spends more time receiving requests. By using a separate thread for data reception, the Scheduler's main thread no longer needs to handle all incoming requests itself. This follows the same idea mentioned in #6189.

In specific scenarios, this yields a 4% to 5% improvement in throughput and in TTFT/E2E latency.


Modifications

Use a separate thread to handle request reception in the Scheduler.
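
A minimal sketch of the pattern, not the PR's exact code (names such as `start_recv_thread`, `recv_socket`, and `recv_queue` are illustrative): a daemon thread blocks on the ZeroMQ socket and pushes deserialized requests into a thread-safe `queue.Queue`, so the scheduler's main loop never blocks on socket I/O.

```python
import queue
import threading

import zmq


def start_recv_thread(recv_socket: zmq.Socket, recv_queue: queue.Queue) -> threading.Thread:
    """Spawn a daemon thread that blocks on the socket and enqueues incoming requests."""

    def _loop() -> None:
        while True:
            try:
                # Block here instead of in the scheduler's main loop.
                req = recv_socket.recv_pyobj()
            except zmq.ZMQError:
                # Socket closed or context terminated: exit the thread.
                break
            recv_queue.put(req)

    t = threading.Thread(target=_loop, daemon=True, name="scheduler-recv")
    t.start()
    return t
```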

Accuracy Tests

python3 -m sglang.test.few_shot_gsm8k --num-questions 200

Accuracy: 0.910
Invalid: 0.000
Latency: 35.407 s
Output throughput: 829.346 token/s

Benchmarking and Profiling

| Metric | Before | After | Performance gain |
| --- | --- | --- | --- |
| Request throughput (req/s) | 3.86 | 4.04 | +4.7% |
| P99 TTFT (ms) | 24407.08 | 23275.55 | -4.6% |
| P99 ITL (ms) | 880.34 | 875.64 | -0.6% |
| Mean E2E Latency (ms) | 24109.25 | 23202.49 | -3.8% |

Launch server command on NVIDIA 5090

python3 -m sglang.launch_server --model-path Qwen/Qwen3-VL-4B-Instruct-FP8 --enable-multimodal --cuda-graph-max-bs 128 --context-length 2560 --page-size 16 --stream-interval 300 --mem-fraction-static 0.7 --port 30260 --base-gpu-id 0 --disable-radix-cache --mm-attention-backend sdpa --kv-cache-dtype fp8_e4m3

Launch client command

python3 -m sglang.bench_serving --backend sglang --dataset-name image --model Qwen/Qwen3-VL-4B-Instruct-FP8 --port 30260 --random-input-len 100 --random-output-len 20 --image-count 1 --image-resolution 1200x1200 --num-prompts 96 --request-rate inf --max-concurrency 128 --random-range-ratio 1

Before

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 128
Successful requests: 96
Benchmark duration (s): 24.86
Total input tokens: 149781
Total input text tokens: 10965
Total input vision tokens: 138816
Total generated tokens: 1920
Total generated tokens (retokenized): 1913
Request throughput (req/s): 3.86
Input token throughput (tok/s): 6025.07
Output token throughput (tok/s): 77.23
Total token throughput (tok/s): 6102.31
Concurrency: 93.10
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 24109.25
Median E2E Latency (ms): 24676.35
---------------Time to First Token----------------
Mean TTFT (ms): 16298.84
Median TTFT (ms): 16341.09
P99 TTFT (ms): 24407.08
---------------Inter-Token Latency----------------
Mean ITL (ms): 415.40
Median ITL (ms): 413.84
P95 ITL (ms): 836.74
P99 ITL (ms): 880.34
Max ITL (ms): 880.36
==================================================

After

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 128
Successful requests: 96
Benchmark duration (s): 23.76
Total input tokens: 149624
Total input text tokens: 10808
Total input vision tokens: 138816
Total generated tokens: 1920
Total generated tokens (retokenized): 1920
Request throughput (req/s): 4.04
Input token throughput (tok/s): 6297.53
Output token throughput (tok/s): 80.81
Total token throughput (tok/s): 6378.34
Concurrency: 93.75
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 23202.49
Median E2E Latency (ms): 23566.99
---------------Time to First Token----------------
Mean TTFT (ms): 15375.16
Median TTFT (ms): 15518.77
P99 TTFT (ms): 23275.55
---------------Inter-Token Latency----------------
Mean ITL (ms): 411.96
Median ITL (ms): 411.09
P95 ITL (ms): 799.90
P99 ITL (ms): 875.64
Max ITL (ms): 896.44
==================================================

@gemini-code-assist

Summary of Changes

Hello @vipwangerxiao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the SGLang Scheduler's ability to handle large multimodal data and numerous requests by introducing a dedicated background thread for receiving incoming requests. This change decouples the request reception process from the main scheduler loop, preventing it from being blocked by I/O operations and allowing the main thread to focus on core scheduling tasks. The modification aims to improve overall system responsiveness and has shown measurable gains in throughput and latency.

Highlights

  • Offloaded Request Reception: The main scheduler thread no longer directly handles receiving requests from tokenizer and RPC sockets, improving its responsiveness.
  • Dedicated Receiving Thread: A new background thread has been introduced to continuously poll and receive incoming requests from tokenizer and RPC sockets.
  • Performance Improvement: Benchmarking results demonstrate a 4-5% improvement in request throughput and a reduction in Time-to-First-Token (TTFT) and End-to-End (E2E) latency in specific scenarios.
  • Queue-Based Communication: A queue.Queue is utilized for safe and efficient communication, allowing the dedicated receiving thread to pass requests to the main scheduler thread.
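
The consumer side of such a queue-based handoff could look roughly like the sketch below (illustrative only; `drain_recv_queue` is a hypothetical helper, not the PR's actual function): the main scheduler loop drains whatever has arrived so far with `get_nowait()` and never blocks.

```python
import queue


def drain_recv_queue(recv_queue: queue.Queue) -> list:
    """Called once per scheduler iteration; returns everything received so far without blocking."""
    reqs = []
    while True:
        try:
            reqs.append(recv_queue.get_nowait())
        except queue.Empty:
            break
    return reqs
```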

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a separate thread for receiving requests in the scheduler, which is a good approach to improve performance by offloading I/O work from the main scheduler loop. The implementation using a queue.Queue for communication between the threads is sound.

My review focuses on improving the robustness of the new receiver thread and general code quality. I've identified a few areas for improvement:

  1. The exception handling in the new receiver thread is too broad, which could cause the thread to terminate silently on unexpected errors, making the server unresponsive. I've suggested more specific exception handling to make it more robust.
  2. There are a couple of instances of bare except: clauses, which can hide bugs. I've recommended replacing them with specific exceptions.

These changes should make the new implementation more resilient and easier to debug.
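
As an illustration of the suggested direction (a hedged sketch, not the PR's final code; `recv_loop`, `recv_socket`, and `recv_queue` are hypothetical names), the receiver loop could distinguish expected shutdown errors from unexpected ones and log the latter instead of letting the thread die silently:

```python
import logging
import queue

import zmq

logger = logging.getLogger(__name__)


def recv_loop(recv_socket: zmq.Socket, recv_queue: queue.Queue) -> None:
    while True:
        try:
            req = recv_socket.recv_pyobj()
        except zmq.ContextTerminated:
            # Expected during shutdown: exit the thread cleanly.
            break
        except zmq.ZMQError as e:
            logger.error("Receiver thread socket error: %s", e)
            break
        except Exception:
            # Log and continue rather than terminating silently, which would
            # leave the server unable to accept new requests.
            logger.exception("Unexpected error in receiver thread")
            continue
        recv_queue.put(req)
```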

vipwangerxiao and others added 3 commits November 27, 2025 09:45
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>