feat: include ttft and total request time in nvext (engine-agnostic) #4880

PeaBrane · 2025-12-11T01:17:50Z

Per-request timing metrics via nvext

Changes

Added RequestTimingTracker in lib/llm/src/protocols/common/timing.rs
- Uses OnceLock for set-once semantics (handles disaggregated serving correctly)
- Returns ttft_ms, total_time_ms, request_received_ms
Integrated into DeltaGenerator (chat_completions + completions delta.rs)
- Created when client requests extra_fields: ["timing"]
- Timing injected into final chunk's nvext.timing
Removed timing from HTTP layer (ResponseMetricCollector)
- Cleaner separation: HTTP layer handles Prometheus metrics, delta layer handles per-request nvext
Added timing verification to _test_router_decisions_disagg test

Why not reuse `ResponseMetricCollector`?

ResponseMetricCollector lives at the HTTP boundary and won't have access to pipeline Context. Future work requires Context to propagate KV router data (cached tokens, etc.) through the pipeline and into nvext.

For future

Use Context registry to propagate KV router-specific data (e.g., num_cached_tokens) and inject into nvext
Refactor nvext injection pattern (worker_id comes from disaggregated_params, timing set directly — unify this)

Summary by CodeRabbit

New Features
- Added request timing metrics to streaming API responses, including time-to-first-token, total request duration, and request received timestamp for performance monitoring.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: PeaBrane <[email protected]>

coderabbitai · 2025-12-11T01:21:02Z

Walkthrough

A per-request timing tracker module is introduced to capture request lifecycle metrics (received time, first token, finish), then integrated into OpenAI chat completions and completions protocol handlers, with corresponding test validation for timing data accuracy.

Changes

Cohort / File(s)	Summary
Timing module infrastructure `lib/llm/src/protocols/common.rs`, `lib/llm/src/protocols/common/timing.rs`	Introduces RequestTimingTracker and TimingInfo structs for monotonic time tracking (request received, first token, finish) with optional serialization and query methods.
NvExtResponse extension `lib/llm/src/protocols/openai/nvext.rs`	Adds TimingInfo re-export and new optional timing field to NvExtResponse for serialization.
Chat completions timing integration `lib/llm/src/protocols/openai/chat_completions/delta.rs`	Computes enable_timing flag in response_generator, adds timing_tracker to DeltaGenerator, records first-token and finish events, injects timing into nvext responses.
Completions timing integration `lib/llm/src/protocols/openai/completions/delta.rs`	Computes enable_timing flag from extra_fields, adds timing_tracker to DeltaGenerator, records timing events, introduces enable_usage_for_nonstreaming() method, injects timing into nvext responses.
Test validation `tests/router/common.py`	Adds verify_response_timing() function to validate ttft_ms and total_time_ms fields, integrates timing extraction and verification into progressive request handling.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

RequestTimingTracker logic: Verify monotonic time capture, OnceLock usage, and calculation correctness for ttft_ms() and total_time_ms()
Timing injection conditions: Ensure timing is correctly injected only when data is present in both chat_completions and completions handlers
Integration consistency: Confirm enable_timing flag propagation through DeltaGeneratorOptions in both protocol variants follows the same pattern
Test coverage: Validate that timing assertion and verification logic properly validates the timing constraints (total_time_ms ≥ ttft_ms)

Poem

⏰ A tracker hops through request streams,
Catching moments, timing dreams,
First token darts, the finish calls,
Milliseconds in rabbit walls,
Speed revealed in every ball! 🐰✨

Pre-merge checks

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check	✅ Passed	The title accurately describes the main change: adding per-request timing metrics (ttft and total request time) to nvext responses in an engine-agnostic way.
Description check	✅ Passed	The description covers key changes, implementation details, rationale, and future work. However, it deviates from the template structure by omitting explicit sections for Overview, Details, Where should the reviewer start, and Related Issues.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

lib/llm/src/protocols/openai/nvext.rs

Co-authored-by: Keiven C <[email protected]> Signed-off-by: Yan Ru Pei <[email protected]>

…i-dynamo#4880) Signed-off-by: PeaBrane <[email protected]> Signed-off-by: Yan Ru Pei <[email protected]> Co-authored-by: Keiven C <[email protected]>

first commit

86faf02

Signed-off-by: PeaBrane <[email protected]>

PeaBrane requested review from a team as code owners December 11, 2025 01:17

pull-request-size bot added the size/L label Dec 11, 2025

github-actions bot added the feat label Dec 11, 2025

PeaBrane requested review from rmccorm4 and zhongxuanwang-nv December 11, 2025 01:18

PeaBrane requested a review from keivenchang December 11, 2025 01:31

PeaBrane changed the title ~~feat: include ttft and total request time in nvext~~ feat: include ttft and total request time in nvext (engine-agnostic) Dec 11, 2025

PeaBrane mentioned this pull request Dec 11, 2025

feat: timing metrics reporting through nvext for vLLM backend #4661

Open

keivenchang approved these changes Dec 11, 2025

View reviewed changes

lib/llm/src/protocols/openai/nvext.rs Outdated Show resolved Hide resolved

Apply suggestion from @keivenchang

fcc894f

Co-authored-by: Keiven C <[email protected]> Signed-off-by: Yan Ru Pei <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB December 11, 2025 02:36 Inactive

PeaBrane enabled auto-merge (squash) December 11, 2025 02:36

copy-pr-bot bot temporarily deployed to GITLAB December 11, 2025 02:39 Inactive

PeaBrane merged commit 2845aa1 into main Dec 11, 2025
39 of 41 checks passed

PeaBrane deleted the rupei/timing-metrics branch December 11, 2025 04:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: include ttft and total request time in nvext (engine-agnostic) #4880

feat: include ttft and total request time in nvext (engine-agnostic) #4880

Uh oh!

PeaBrane commented Dec 11, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 11, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: include ttft and total request time in nvext (engine-agnostic) #4880

feat: include ttft and total request time in nvext (engine-agnostic) #4880

Uh oh!

Conversation

PeaBrane commented Dec 11, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Per-request timing metrics via nvext

Changes

Why not reuse ResponseMetricCollector?

For future

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

PeaBrane commented Dec 11, 2025 •

edited by coderabbitai bot

Loading

Why not reuse `ResponseMetricCollector`?

coderabbitai bot commented Dec 11, 2025 •

edited

Loading