feat: unified tracing across dynamo + sglang #4248

ishandhanani · 2025-11-12T03:27:14Z

Summary

Implements distributed trace propagation from Dynamo to SGLang, enabling end-to-end request tracing across the stack: dynamo-frontend → dynamo-worker → sglang{agg,disagg}.

Changes

Core Implementation

handler_base.py

Added _propagate_trace_context_to_sglang() method to BaseWorkerHandler
Extracts Dynamo's trace context (trace_id, span_id) from the Context object
Builds W3C traceparent and propagates to SGLang via trace_set_remote_propagate_context()

decode_handler.py & prefill_handler.py

Extract trace_id from context.trace_id
Call _propagate_trace_context_to_sglang(context, bootstrap_room) before engine.async_generate()
Pass rid=trace_id to SGLang engine (requires feat(engine): add rid parameter to methods in Engine class sgl-project/sglang#13095 to be released)

How It Works

SGLang Trace Context Propagation

SGLang supports trace propagation via trace_set_remote_propagate_context(), which stores trace context keyed by bootstrap_room:

# Build trace context in SGLang's expected format
trace_context = {
    str(bootstrap_room): {
        "root_span": {"traceparent": f"00-{trace_id}-{span_id}-01"},
        "prev_span": {
            "span_id": int(span_id, 16),
            "trace_id": int(trace_id, 16),
        }
    }
}

When SGLang's trace_req_start() runs, it checks remote_trace_contexts[str(bootstrap_room)] and uses our context as the parent span.

Trace Flow

Python handler receives Context, extracts trace_id/span_id
Before calling engine.async_generate(), propagate trace context to SGLang
SGLang's trace_req_start() finds context in remote_trace_contexts
SGLang creates child spans under Dynamo's parent span
All spans appear under same trace_id in Jaeger/Tempo

Result

  dynamo-frontend (trace_id: abc123)
  └─ dynamo-worker (same trace_id)
     └─ sglang Req abc1...
        ├─ TOKENIZER_DISPATCH
        ├─ SCHEDULER_INPUT_PROCESSING
        └─ DECODE

Example (disagg)

Adds `--enable-otel` command-line flag to four SGLang launch scripts for easy OpenTelemetry tracing enablement. When enabled, the flag: - Exports DYN_LOGGING_JSONL=true for JSON logging - Exports OTEL_EXPORT_ENABLED=1 to enable trace export - Sets OTEL_EXPORTER_OTLP_TRACES_ENDPOINT (defaults to http://localhost:4317) - Sets DYN_SYSTEM_PORT=8081 to enable system metrics server - Prefixes each process with OTEL_SERVICE_NAME for trace attribution Modified scripts: - examples/backends/sglang/launch/agg.sh - examples/backends/sglang/launch/disagg.sh - examples/backends/sglang/launch/multimodal_agg.sh - examples/backends/sglang/launch/multimodal_disagg.sh Usage: ./agg.sh --enable-otel Note: Uses DYN_SYSTEM_PORT instead of deprecated DYN_SYSTEM_ENABLED per #4082

Adds --enable-metrics flag to agg.sh and disagg.sh with automatic port allocation to avoid conflicts when multiple processes run on the same node. Changes: - Added --enable-metrics flag to control system metrics server - Automatic port allocation: - agg.sh: frontend (8080), worker (8081) - disagg.sh: frontend (8080), prefill (8081), decode (8082) - OTEL tracing automatically enables metrics on same ports when used - Ports only assigned when flags are enabled (empty otherwise) Usage: ./agg.sh --enable-metrics # Enable metrics only ./agg.sh --enable-otel # Enable tracing + metrics ./agg.sh --enable-metrics --enable-otel # Both explicit Note: Each worker process gets a unique DYN_SYSTEM_PORT when using CUDA_VISIBLE_DEVICES on the same node.

Changes metrics to be enabled by default rather than opt-in. This provides better observability out-of-the-box for all deployments. Changes: - Set ENABLE_METRICS=true by default in agg.sh and disagg.sh - Removed --enable-metrics flag (no longer needed) - Updated help text to document default port assignments - Port allocation remains the same: - agg.sh: frontend (8080), worker (8081) - disagg.sh: frontend (8080), prefill (8081), decode (8082) Usage: ./agg.sh # Metrics enabled by default ./agg.sh --enable-otel # Metrics + tracing enabled

Simplified the scripts by removing conditional metrics logic since metrics are now always enabled by default: Changes: - Removed ENABLE_METRICS variable (no longer needed) - Removed METRICS_ARGS variable and all conditionals - Hardcoded DYN_SYSTEM_PORT values directly (8080, 8081, 8082) - Always pass --enable-metrics to sglang commands - Cleaner, simpler code without unnecessary variables

- Add --enable-otel flag to agg_embed.sh, agg_router.sh, disagg_router.sh, and disagg_same_gpu.sh - Enable system metrics by default with proper port allocation for all workers - Add OTEL_SERVICE_NAME to each process for proper trace attribution - Remove deprecated disagg_dp_attn.sh launch script - Update help text to document system metrics ports 🤖 Generated with [Claude Code](https://claude.com/claude-code)

- Remove DYN_SYSTEM_PORT=8080 from all frontend commands across launch scripts - Update help text to remove frontend port from metrics documentation - Revert all OTEL changes to multimodal_agg.sh and multimodal_disagg.sh Frontend does not need DYN_SYSTEM_PORT as it doesn't expose system metrics. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

Revert disagg_same_gpu.sh to its original state before OTEL flag additions. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

coderabbitai · 2025-11-12T03:29:15Z

Walkthrough

The pull request introduces trace identifier extraction and propagation across request handlers, and enables tracing configuration in the aggregation deployment script with OTLP endpoint specification for observability.

Changes

Cohort / File(s)	Summary
Request Handler Tracing Instrumentation `components/src/dynamo/sglang/request_handlers/llm/decode_handler.py`, `components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py`	Extract `trace_id` from context in handler methods; decode handler additionally propagates trace_id to engine via `rid` parameter in non-DECODE mode
Deployment Script Tracing Configuration `examples/backends/sglang/launch/agg.sh`	Adds three command-line arguments to Python module invocation: `--enable-trace`, `--enable-metrics`, and `--otlp-traces-endpoint localhost:4317`

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Verify rid parameter naming consistency and correct propagation to engine.async_generate
Confirm OTLP traces endpoint format (localhost:4317) is valid for target infrastructure
Ensure trace_id extraction pattern is consistent across both handler files and handles null/missing trace_id gracefully

Poem

🐰 A trace hops through the handlers true,
With IDs passed in morning dew,
To OTLP endpoints they shall flow,
Observability's grand show!

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	No pull request description was provided; the required template sections (Overview, Details, Where to start, Related Issues) are all missing.	Add a comprehensive description following the template: include an overview of the tracing implementation, details of changes in decode_handler, prefill_handler, and agg.sh, guidance on where reviewers should focus, and any related GitHub issues.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: implementing unified tracing across dynamo and sglang components.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1)
67-87: Critical: Trace ID extracted but not propagated to the engine.

The trace_id is extracted from the context but never passed to the SGLang engine, breaking the unified tracing objective of this PR. In decode_handler.py (line 178), the trace_id is correctly passed as rid=trace_id to engine.async_generate.

Apply this diff to propagate the trace_id:
         trace_id = context.trace_id
         bootstrap_room = self._generate_bootstrap_room()
 
         bootstrap_info = {
             "bootstrap_host": self.bootstrap_host,
             "bootstrap_port": self.bootstrap_port,
             "bootstrap_room": bootstrap_room,
         }
 
         yield bootstrap_info
 
         input_param = self._get_input_param(request["request"])
 
         results = await self.engine.async_generate(
             **input_param,
             sampling_params=request["sampling_params"],
             stream=True,
             bootstrap_host=self.bootstrap_host,
             bootstrap_port=self.bootstrap_port,
             bootstrap_room=bootstrap_room,
+            rid=trace_id,
         )

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fd96a2 and c3cf1a4.

📒 Files selected for processing (3)

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (2 hunks)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1 hunks)
examples/backends/sglang/launch/agg.sh (1 hunks)

🧰 Additional context used

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4248/merge) by ishandhanani.

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py

[error] 67-67: Ruff: 'trace_id' assigned to but never used. (F841)

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py

[warning] 1-1: Black formatting: file reformatted by Black during pre-commit.

🪛 Ruff (0.14.4)

components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py

67-67: Local variable trace_id is assigned to but never used

Remove assignment to unused variable trace_id

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: trtllm (arm64)
GitHub Check: vllm (arm64)
GitHub Check: sglang (amd64)
GitHub Check: sglang (arm64)
GitHub Check: operator (arm64)
GitHub Check: trtllm (amd64)
GitHub Check: operator (amd64)
GitHub Check: vllm (amd64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (3)

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (3)

115-115: Trace ID extracted correctly.

The trace_id is properly extracted and will be used for trace propagation.

174-179: Trace ID correctly propagated in aggregated mode.

The trace_id is correctly passed as rid=trace_id to the engine, enabling unified tracing for aggregated serving mode.

158-165: I've examined the full decode_handler.py file. The review comment's observation is factually correct:

DECODE mode (lines 158-165): Does NOT pass rid=trace_id to engine.async_generate

Aggregated mode (line 178): DOES pass rid=trace_id

However, this pattern is consistent throughout the codebase—other bootstrap-related calls in multimodal/worker_handler.py, main.py, and prefill_handler.py also omit the rid parameter.

The inconsistency between the two modes in the same file (aggregated passes rid, DECODE does not) suggests either: (1) the rid parameter is unsupported/ineffective in bootstrap context, or (2) trace propagation in disaggregated mode is intentionally handled differently (via the context object passed to prefill_client). Without access to SGLang engine internals or design documentation, this cannot be definitively resolved.

examples/backends/sglang/launch/agg.sh

…ishan/unify-trace

examples/backends/sglang/launch/agg.sh

…ishan/unify-trace

biswapanda

LGTM

Signed-off-by: ishandhanani <[email protected]>

ishandhanani added 8 commits November 11, 2025 13:26

revert: remove --enable-otel changes from disagg_same_gpu.sh

b92cbb2

Revert disagg_same_gpu.sh to its original state before OTEL flag additions. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

feat(sglang): add trace ID handling and tracing endpoint configuration

ab16195

ishandhanani requested review from a team as code owners November 12, 2025 03:27

pull-request-size bot added the size/L label Nov 12, 2025

github-actions bot added the feat label Nov 12, 2025

Merge branch 'main' into ishan/unify-trace

c3cf1a4

pull-request-size bot added size/XS and removed size/L labels Nov 12, 2025

coderabbitai bot reviewed Nov 12, 2025

View reviewed changes

examples/backends/sglang/launch/agg.sh Outdated Show resolved Hide resolved

initial approach

49c7af9

pull-request-size bot added size/M and removed size/XS labels Nov 12, 2025

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 03:59 Inactive

bump

2781adf

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 04:01 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 04:07 Inactive

bump

fe4f656

pull-request-size bot added size/L and removed size/M labels Nov 12, 2025

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 04:09 Inactive

copy-pr-bot bot temporarily deployed to GITLAB November 12, 2025 04:12 Inactive

copy-pr-bot bot had a problem deploying to GITLAB December 4, 2025 17:47 Failure

go

fe3c2a3

copy-pr-bot bot had a problem deploying to GITLAB December 4, 2025 17:52 Failure

ishandhanani added 2 commits December 4, 2025 17:52

Merge branch 'ishan/unify-trace' of github.com:ai-dynamo/dynamo into …

473ef21

…ishan/unify-trace

trace

7331118

pull-request-size bot added size/L and removed size/M labels Dec 4, 2025

copy-pr-bot bot had a problem deploying to GITLAB December 4, 2025 18:16 Failure

Revert even though tempo and grafana are terrible for viewing traces

bacbc59

copy-pr-bot bot had a problem deploying to GITLAB December 4, 2025 18:18 Failure

Merge branch 'main' into ishan/unify-trace

4c7b2c0

copy-pr-bot bot had a problem deploying to GITLAB December 4, 2025 18:27 Failure

biswapanda reviewed Dec 4, 2025

View reviewed changes

examples/backends/sglang/launch/agg.sh Show resolved Hide resolved

ishandhanani enabled auto-merge (squash) December 4, 2025 18:32

ishandhanani added 2 commits December 4, 2025 18:34

fix

bda625f

Merge branch 'ishan/unify-trace' of github.com:ai-dynamo/dynamo into …

7f89bae

…ishan/unify-trace

copy-pr-bot bot had a problem deploying to GITLAB December 4, 2025 18:35 Failure

tmonty12 approved these changes Dec 4, 2025

View reviewed changes

biswapanda approved these changes Dec 4, 2025

View reviewed changes

copy-pr-bot bot had a problem deploying to GITLAB December 4, 2025 18:44 Failure

Merge branch 'main' into ishan/unify-trace

87dd9f1

copy-pr-bot bot temporarily deployed to GITLAB December 5, 2025 08:56 Inactive

keivenchang requested a review from jh-nv December 5, 2025 18:32

keivenchang approved these changes Dec 6, 2025

View reviewed changes

Merge branch 'main' into ishan/unify-trace

fb44cd0

copy-pr-bot bot temporarily deployed to GITLAB December 8, 2025 21:25 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 8, 2025 21:26 Inactive

Merge branch 'main' into ishan/unify-trace

3868e6b

Signed-off-by: ishandhanani <[email protected]>

copy-pr-bot bot had a problem deploying to GITLAB December 9, 2025 02:03 Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: unified tracing across dynamo + sglang #4248

feat: unified tracing across dynamo + sglang #4248

ishandhanani commented Nov 12, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Nov 12, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

biswapanda left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

feat: unified tracing across dynamo + sglang #4248

Are you sure you want to change the base?

feat: unified tracing across dynamo + sglang #4248

Conversation

ishandhanani commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Core Implementation

How It Works

SGLang Trace Context Propagation

Example (disagg)

Uh oh!

coderabbitai bot commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

biswapanda left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

ishandhanani commented Nov 12, 2025 •

edited

Loading

coderabbitai bot commented Nov 12, 2025 •

edited

Loading