Skip to content

Conversation

@PeaBrane
Copy link
Contributor

@PeaBrane PeaBrane commented Dec 11, 2025

Per-request timing metrics via nvext

Changes

  • Added RequestTimingTracker in lib/llm/src/protocols/common/timing.rs
    • Uses OnceLock for set-once semantics (handles disaggregated serving correctly)
    • Returns ttft_ms, total_time_ms, request_received_ms
  • Integrated into DeltaGenerator (chat_completions + completions delta.rs)
    • Created when client requests extra_fields: ["timing"]
    • Timing injected into final chunk's nvext.timing
  • Removed timing from HTTP layer (ResponseMetricCollector)
    • Cleaner separation: HTTP layer handles Prometheus metrics, delta layer handles per-request nvext
  • Added timing verification to _test_router_decisions_disagg test

Why not reuse ResponseMetricCollector?

ResponseMetricCollector lives at the HTTP boundary and won't have access to pipeline Context. Future work requires Context to propagate KV router data (cached tokens, etc.) through the pipeline and into nvext.

For future

  • Use Context registry to propagate KV router-specific data (e.g., num_cached_tokens) and inject into nvext
  • Refactor nvext injection pattern (worker_id comes from disaggregated_params, timing set directly — unify this)

Summary by CodeRabbit

  • New Features
    • Added request timing metrics to streaming API responses, including time-to-first-token, total request duration, and request received timestamp for performance monitoring.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: PeaBrane <[email protected]>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 11, 2025

Walkthrough

A per-request timing tracker module is introduced to capture request lifecycle metrics (received time, first token, finish), then integrated into OpenAI chat completions and completions protocol handlers, with corresponding test validation for timing data accuracy.

Changes

Cohort / File(s) Summary
Timing module infrastructure
lib/llm/src/protocols/common.rs, lib/llm/src/protocols/common/timing.rs
Introduces RequestTimingTracker and TimingInfo structs for monotonic time tracking (request received, first token, finish) with optional serialization and query methods.
NvExtResponse extension
lib/llm/src/protocols/openai/nvext.rs
Adds TimingInfo re-export and new optional timing field to NvExtResponse for serialization.
Chat completions timing integration
lib/llm/src/protocols/openai/chat_completions/delta.rs
Computes enable_timing flag in response_generator, adds timing_tracker to DeltaGenerator, records first-token and finish events, injects timing into nvext responses.
Completions timing integration
lib/llm/src/protocols/openai/completions/delta.rs
Computes enable_timing flag from extra_fields, adds timing_tracker to DeltaGenerator, records timing events, introduces enable_usage_for_nonstreaming() method, injects timing into nvext responses.
Test validation
tests/router/common.py
Adds verify_response_timing() function to validate ttft_ms and total_time_ms fields, integrates timing extraction and verification into progressive request handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • RequestTimingTracker logic: Verify monotonic time capture, OnceLock usage, and calculation correctness for ttft_ms() and total_time_ms()
  • Timing injection conditions: Ensure timing is correctly injected only when data is present in both chat_completions and completions handlers
  • Integration consistency: Confirm enable_timing flag propagation through DeltaGeneratorOptions in both protocol variants follows the same pattern
  • Test coverage: Validate that timing assertion and verification logic properly validates the timing constraints (total_time_ms ≥ ttft_ms)

Poem

⏰ A tracker hops through request streams,
Catching moments, timing dreams,
First token darts, the finish calls,
Milliseconds in rabbit walls,
Speed revealed in every ball! 🐰✨

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title accurately describes the main change: adding per-request timing metrics (ttft and total request time) to nvext responses in an engine-agnostic way.
Description check ✅ Passed The description covers key changes, implementation details, rationale, and future work. However, it deviates from the template structure by omitting explicit sections for Overview, Details, Where should the reviewer start, and Related Issues.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@PeaBrane PeaBrane requested a review from keivenchang December 11, 2025 01:31
@PeaBrane PeaBrane changed the title feat: include ttft and total request time in nvext feat: include ttft and total request time in nvext (engine-agnostic) Dec 11, 2025
Co-authored-by: Keiven C <[email protected]>
Signed-off-by: Yan Ru Pei <[email protected]>
@PeaBrane PeaBrane enabled auto-merge (squash) December 11, 2025 02:36
@PeaBrane PeaBrane merged commit 2845aa1 into main Dec 11, 2025
39 of 41 checks passed
@PeaBrane PeaBrane deleted the rupei/timing-metrics branch December 11, 2025 04:59
zxue2 pushed a commit to zxue2/dynamo that referenced this pull request Dec 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants