Skip to content

Conversation

@ishandhanani
Copy link
Contributor

@ishandhanani ishandhanani commented Nov 12, 2025

Summary

Implements distributed trace propagation from Dynamo to SGLang, enabling end-to-end request tracing across the stack: dynamo-frontend → dynamo-worker → sglang{agg,disagg}.

Changes

Core Implementation

handler_base.py

  • Added _propagate_trace_context_to_sglang() method to BaseWorkerHandler
  • Extracts Dynamo's trace context (trace_id, span_id) from the Context object
  • Builds W3C traceparent and propagates to SGLang via trace_set_remote_propagate_context()

decode_handler.py & prefill_handler.py

How It Works

SGLang Trace Context Propagation

SGLang supports trace propagation via trace_set_remote_propagate_context(), which stores trace context keyed by bootstrap_room:

# Build trace context in SGLang's expected format
trace_context = {
    str(bootstrap_room): {
        "root_span": {"traceparent": f"00-{trace_id}-{span_id}-01"},
        "prev_span": {
            "span_id": int(span_id, 16),
            "trace_id": int(trace_id, 16),
        }
    }
}

When SGLang's trace_req_start() runs, it checks remote_trace_contexts[str(bootstrap_room)] and uses our context as the parent span.

Trace Flow

  1. Python handler receives Context, extracts trace_id/span_id
  2. Before calling engine.async_generate(), propagate trace context to SGLang
  3. SGLang's trace_req_start() finds context in remote_trace_contexts
  4. SGLang creates child spans under Dynamo's parent span
  5. All spans appear under same trace_id in Jaeger/Tempo

Result

  dynamo-frontend (trace_id: abc123)
  └─ dynamo-worker (same trace_id)
     └─ sglang Req abc1...
        ├─ TOKENIZER_DISPATCH
        ├─ SCHEDULER_INPUT_PROCESSING
        └─ DECODE

Example (disagg)

Screenshot 2025-11-11 at 9 01 16 PM

Adds `--enable-otel` command-line flag to four SGLang launch scripts for easy
OpenTelemetry tracing enablement. When enabled, the flag:

- Exports DYN_LOGGING_JSONL=true for JSON logging
- Exports OTEL_EXPORT_ENABLED=1 to enable trace export
- Sets OTEL_EXPORTER_OTLP_TRACES_ENDPOINT (defaults to http://localhost:4317)
- Sets DYN_SYSTEM_PORT=8081 to enable system metrics server
- Prefixes each process with OTEL_SERVICE_NAME for trace attribution

Modified scripts:
- examples/backends/sglang/launch/agg.sh
- examples/backends/sglang/launch/disagg.sh
- examples/backends/sglang/launch/multimodal_agg.sh
- examples/backends/sglang/launch/multimodal_disagg.sh

Usage:
  ./agg.sh --enable-otel

Note: Uses DYN_SYSTEM_PORT instead of deprecated DYN_SYSTEM_ENABLED per #4082
Adds --enable-metrics flag to agg.sh and disagg.sh with automatic port
allocation to avoid conflicts when multiple processes run on the same node.

Changes:
- Added --enable-metrics flag to control system metrics server
- Automatic port allocation:
  - agg.sh: frontend (8080), worker (8081)
  - disagg.sh: frontend (8080), prefill (8081), decode (8082)
- OTEL tracing automatically enables metrics on same ports when used
- Ports only assigned when flags are enabled (empty otherwise)

Usage:
  ./agg.sh --enable-metrics              # Enable metrics only
  ./agg.sh --enable-otel                 # Enable tracing + metrics
  ./agg.sh --enable-metrics --enable-otel  # Both explicit

Note: Each worker process gets a unique DYN_SYSTEM_PORT when using
CUDA_VISIBLE_DEVICES on the same node.
Changes metrics to be enabled by default rather than opt-in. This provides
better observability out-of-the-box for all deployments.

Changes:
- Set ENABLE_METRICS=true by default in agg.sh and disagg.sh
- Removed --enable-metrics flag (no longer needed)
- Updated help text to document default port assignments
- Port allocation remains the same:
  - agg.sh: frontend (8080), worker (8081)
  - disagg.sh: frontend (8080), prefill (8081), decode (8082)

Usage:
  ./agg.sh                  # Metrics enabled by default
  ./agg.sh --enable-otel    # Metrics + tracing enabled
Simplified the scripts by removing conditional metrics logic since metrics
are now always enabled by default:

Changes:
- Removed ENABLE_METRICS variable (no longer needed)
- Removed METRICS_ARGS variable and all conditionals
- Hardcoded DYN_SYSTEM_PORT values directly (8080, 8081, 8082)
- Always pass --enable-metrics to sglang commands
- Cleaner, simpler code without unnecessary variables
- Add --enable-otel flag to agg_embed.sh, agg_router.sh, disagg_router.sh, and disagg_same_gpu.sh
- Enable system metrics by default with proper port allocation for all workers
- Add OTEL_SERVICE_NAME to each process for proper trace attribution
- Remove deprecated disagg_dp_attn.sh launch script
- Update help text to document system metrics ports

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Remove DYN_SYSTEM_PORT=8080 from all frontend commands across launch scripts
- Update help text to remove frontend port from metrics documentation
- Revert all OTEL changes to multimodal_agg.sh and multimodal_disagg.sh

Frontend does not need DYN_SYSTEM_PORT as it doesn't expose system metrics.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Revert disagg_same_gpu.sh to its original state before OTEL flag additions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@ishandhanani ishandhanani requested review from a team as code owners November 12, 2025 03:27
@github-actions github-actions bot added the feat label Nov 12, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 12, 2025

Walkthrough

The pull request introduces trace identifier extraction and propagation across request handlers, and enables tracing configuration in the aggregation deployment script with OTLP endpoint specification for observability.

Changes

Cohort / File(s) Summary
Request Handler Tracing Instrumentation
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py, components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py
Extract trace_id from context in handler methods; decode handler additionally propagates trace_id to engine via rid parameter in non-DECODE mode
Deployment Script Tracing Configuration
examples/backends/sglang/launch/agg.sh
Adds three command-line arguments to Python module invocation: --enable-trace, --enable-metrics, and --otlp-traces-endpoint localhost:4317

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • Verify rid parameter naming consistency and correct propagation to engine.async_generate
  • Confirm OTLP traces endpoint format (localhost:4317) is valid for target infrastructure
  • Ensure trace_id extraction pattern is consistent across both handler files and handles null/missing trace_id gracefully

Poem

🐰 A trace hops through the handlers true,
With IDs passed in morning dew,
To OTLP endpoints they shall flow,
Observability's grand show!

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning No pull request description was provided; the required template sections (Overview, Details, Where to start, Related Issues) are all missing. Add a comprehensive description following the template: include an overview of the tracing implementation, details of changes in decode_handler, prefill_handler, and agg.sh, guidance on where reviewers should focus, and any related GitHub issues.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: implementing unified tracing across dynamo and sglang components.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1)

67-87: Critical: Trace ID extracted but not propagated to the engine.

The trace_id is extracted from the context but never passed to the SGLang engine, breaking the unified tracing objective of this PR. In decode_handler.py (line 178), the trace_id is correctly passed as rid=trace_id to engine.async_generate.

Apply this diff to propagate the trace_id:

         trace_id = context.trace_id
         bootstrap_room = self._generate_bootstrap_room()
 
         bootstrap_info = {
             "bootstrap_host": self.bootstrap_host,
             "bootstrap_port": self.bootstrap_port,
             "bootstrap_room": bootstrap_room,
         }
 
         yield bootstrap_info
 
         input_param = self._get_input_param(request["request"])
 
         results = await self.engine.async_generate(
             **input_param,
             sampling_params=request["sampling_params"],
             stream=True,
             bootstrap_host=self.bootstrap_host,
             bootstrap_port=self.bootstrap_port,
             bootstrap_room=bootstrap_room,
+            rid=trace_id,
         )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fd96a2 and c3cf1a4.

📒 Files selected for processing (3)
  • components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (2 hunks)
  • components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py (1 hunks)
  • examples/backends/sglang/launch/agg.sh (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4248/merge) by ishandhanani.
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py

[error] 67-67: Ruff: 'trace_id' assigned to but never used. (F841)

components/src/dynamo/sglang/request_handlers/llm/decode_handler.py

[warning] 1-1: Black formatting: file reformatted by Black during pre-commit.

🪛 Ruff (0.14.4)
components/src/dynamo/sglang/request_handlers/llm/prefill_handler.py

67-67: Local variable trace_id is assigned to but never used

Remove assignment to unused variable trace_id

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
components/src/dynamo/sglang/request_handlers/llm/decode_handler.py (3)

115-115: Trace ID extracted correctly.

The trace_id is properly extracted and will be used for trace propagation.


174-179: Trace ID correctly propagated in aggregated mode.

The trace_id is correctly passed as rid=trace_id to the engine, enabling unified tracing for aggregated serving mode.


158-165: I've examined the full decode_handler.py file. The review comment's observation is factually correct:

  • DECODE mode (lines 158-165): Does NOT pass rid=trace_id to engine.async_generate
  • Aggregated mode (line 178): DOES pass rid=trace_id

However, this pattern is consistent throughout the codebase—other bootstrap-related calls in multimodal/worker_handler.py, main.py, and prefill_handler.py also omit the rid parameter.

The inconsistency between the two modes in the same file (aggregated passes rid, DECODE does not) suggests either: (1) the rid parameter is unsupported/ineffective in bootstrap context, or (2) trace propagation in disaggregated mode is intentionally handled differently (via the context object passed to prefill_client). Without access to SGLang engine internals or design documentation, this cannot be definitively resolved.


@ishandhanani ishandhanani enabled auto-merge (squash) December 4, 2025 18:32
Copy link
Contributor

@biswapanda biswapanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants