
Conversation

@tzulingk tzulingk commented Dec 6, 2025

Overview:

Fixes backend error handling so clients receive actual TensorRT-LLM error messages instead of cryptic deserialization errors. When backends encounter errors (e.g., prompt exceeds max_seq_len), clients now see the real error message with proper HTTP status codes.

Details:

Problem:

  • Clients received "invalid type: unit variant, expected newtype variant" instead of actual backend error messages
  • Root cause: Python was sending {"finish_reason": "error"} (unit variant), but Rust expects {"finish_reason": {"error": "msg"}} (newtype variant) for FinishReason::Error(String); see the sketch after this list
  • Streaming requests had no error detection at all
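
For context, here is a minimal serde sketch of the mismatch. It is illustrative only: the variant set and attributes are not the actual FinishReason definition in the codebase, and it assumes serde/serde_json as dependencies.

use serde::Deserialize;

// Illustrative only; the real FinishReason lives in the dynamo protocol types.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "snake_case")]
enum FinishReason {
    Stop,
    Length,
    Error(String), // newtype variant: JSON form is {"error": "<message>"}
}

fn main() {
    // Old Python payload: a bare string, i.e. a unit-variant encoding.
    let old: Result<FinishReason, _> = serde_json::from_str(r#""error""#);
    assert!(old.is_err()); // "invalid type: unit variant, expected newtype variant"

    // New Python payload: an externally tagged map carrying the message.
    let new: FinishReason =
        serde_json::from_str(r#"{"error": "prompt exceeds max_seq_len"}"#).unwrap();
    println!("{:?}", new); // prints: Error("prompt exceeds max_seq_len")
}

With the map form, serde can deserialize the backend message into the newtype variant, so the frontend has an actual error string to return to the client.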

Changes:

  1. Python Handler (components/src/dynamo/trtllm/request_handlers/handler_base.py):

    • Changed error responses from {"finish_reason": "error"} to {"finish_reason": {"error": error_msg}}
    • Properly serializes into Rust's FinishReason::Error(String) enum variant
    • Applied to both RequestError and general Exception catch blocks
  2. Rust Error Detection (lib/llm/src/http/service/openai.rs):

    • Added extract_backend_error_if_present() function that detects errors via multiple methods (see the sketch after this list):
      • event: "error" field (from postprocessor)
      • code >= 400 in data/comment payloads
      • Comments with no data (legacy)
    • Added check_for_backend_error() call for streaming requests (previously only non-streaming)
    • Extracts HTTP status codes (400, 404, 500, etc.) from error payloads
    • Added 5 unit tests covering different error scenarios
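
To make the detection strategies concrete, here is a rough, self-contained sketch. The AnnotatedEvent struct and its fields are hypothetical stand-ins; the real function operates on Annotated<NvCreateChatCompletionStreamResponse> events and maps the extracted code onto the service's HTTP error types.

use serde_json::Value;

// Hypothetical stand-in for the real Annotated event wrapper.
struct AnnotatedEvent {
    event: Option<String>,  // SSE event name, e.g. "error"
    data: Option<Value>,    // data payload, if any
    comments: Vec<String>,  // SSE comment lines
}

// Returns (http_status, message) if the event looks like a backend error.
fn extract_backend_error_if_present(ev: &AnnotatedEvent) -> Option<(u16, String)> {
    // Strategy 1: an explicit `event: "error"` emitted by the postprocessor.
    if ev.event.as_deref() == Some("error") {
        let msg = ev
            .data
            .as_ref()
            .and_then(|d| d.get("message").and_then(Value::as_str))
            .unwrap_or("backend error")
            .to_string();
        let code = ev
            .data
            .as_ref()
            .and_then(|d| d.get("code").and_then(Value::as_u64))
            .unwrap_or(500) as u16;
        return Some((code, msg));
    }
    // Strategy 2: a `code >= 400` embedded in the data or comment payloads.
    let payloads = ev
        .data
        .iter()
        .cloned()
        .chain(ev.comments.iter().filter_map(|c| serde_json::from_str(c.as_str()).ok()));
    for payload in payloads {
        if let Some(code) = payload.get("code").and_then(Value::as_u64) {
            if code >= 400 {
                let msg = payload
                    .get("message")
                    .and_then(Value::as_str)
                    .unwrap_or("backend error")
                    .to_string();
                return Some((code as u16, msg));
            }
        }
    }
    // Strategy 3 (legacy): comments present but no data at all.
    if ev.data.is_none() && !ev.comments.is_empty() {
        return Some((500, ev.comments.join(" ")));
    }
    None
}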

Testing:

# All tests pass
cargo test --package dynamo-llm --lib http::service::openai::tests::test_check_for_backend_error

Run the frontend and worker

# run frontend
python3 -m dynamo.frontend --router-mode kv --http-port 8080 >& tzuling_frontend.log &
# run worker
python3 -m dynamo.trtllm \
  --model-path "$MODEL_PATH" \
  --served-model-name "$SERVED_MODEL_NAME" \
  --extra-engine-args "$AGG_ENGINE_ARGS" \
  --max-seq-len 20 \
  --max-num-tokens 100 \
  --publish-events-and-metrics >& tzuling_backend.log &

Send a request whose prompt exceeds max_seq_len:
curl -s localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [
    {
      "role": "developer",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello! It’s a wonderful day. How are you today? Have you had breakfast?"
    }
  ],
  "stream": false,
  "max_tokens": 300
}'

Before: {"message":"invalid type: unit variant...","code":500}
After: {"message":"The sum of prompt length (79.0)... should not exceed max_num_tokens (23)","code":500}

Where should the reviewer start?

  1. handler_base.py lines 371-378, 390-395: Python error response format changes - verify the {"finish_reason": {"error": error_msg}} format is correct
  2. openai.rs lines 723-797: The extract_backend_error_if_present() function - verify the error detection logic covers all cases
  3. openai.rs lines 919-923: Error checking added for streaming requests - ensure this doesn't impact performance
  4. openai.rs lines 2139-2221: Unit tests - verify they cover the main error scenarios

Related Issues:

DIS-967

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced error response structure to include detailed error messages instead of generic error labels.
    • Improved backend error detection in streaming chat completions, enabling early error identification and response handling.
    • Strengthened error handling for OpenAI chat completion streams to prevent incomplete or malformed responses.


@tzulingk tzulingk requested review from a team as code owners December 6, 2025 07:05
@github-actions github-actions bot added the fix label Dec 6, 2025
@tzulingk tzulingk enabled auto-merge (squash) December 6, 2025 07:07

coderabbitai bot commented Dec 6, 2025

Walkthrough

Changes error handling behavior in TRT-LLM request handler to structure error messages as dictionaries with an "error" key, and adds backend error detection to the OpenAI service's streaming chat completion handler to catch and propagate errors early from upstream services.

Changes

  • TRT-LLM Error Response Restructuring (components/src/dynamo/trtllm/request_handlers/handler_base.py): Error response payloads now emit finish_reason as an {"error": <error_msg>} dictionary instead of the plain string "error". The exception message is captured as a string in an error_msg variable and propagated through both the per-request and final error response paths.
  • OpenAI Service Backend Error Detection (lib/llm/src/http/service/openai.rs): Adds an extract_backend_error_if_present() function to detect errors from Annotated events, and a check_for_backend_error() async function that inspects the stream's first event and reconstructs the stream if it is valid (see the sketch below). The pre-check is integrated into the chat completions handler before streaming, the NvCreateChatCompletionStreamResponse type is imported, and a test suite covers the error detection variants.
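
The stream handling described above follows a peek-then-rechain pattern. The sketch below is a simplified, generic illustration; the actual function works on Annotated<NvCreateChatCompletionStreamResponse> items and returns an HTTP status with an ErrorMessage body on failure.

use futures::{stream, Stream, StreamExt};

// Hypothetical, simplified signature; the extractor closure stands in for
// extract_backend_error_if_present().
async fn check_for_backend_error<T, E>(
    mut upstream: impl Stream<Item = T> + Unpin,
    extract_error: impl Fn(&T) -> Option<E>,
) -> Result<impl Stream<Item = T>, E> {
    // Peek at the first event only; later events stream through untouched.
    if let Some(first) = upstream.next().await {
        if let Some(err) = extract_error(&first) {
            return Err(err);
        }
        // Re-attach the consumed event so downstream sees the full stream.
        Ok(stream::iter(vec![first]).chain(upstream))
    } else {
        // Empty stream: passed through unchanged, mirroring the PR's
        // "handle gracefully" behavior.
        Ok(stream::iter(vec![]).chain(upstream))
    }
}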

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Stream reconstruction logic — Verify that check_for_backend_error() correctly preserves all stream events while enabling early error detection
  • Error extraction across event types — Confirm extract_backend_error_if_present() handles Annotated event wrappers, JSON-encoded payloads, and status code mappings consistently
  • Test coverage — Ensure tests adequately exercise error event detection, JSON parsing, empty stream edge cases, and the integrated flow in chat completions handler
  • Cross-language impact — Error format change in Python handler may affect downstream consumers expecting the original string format

Poem

🐰 Errors once hidden in plain sight,
Now wrapped in dictionaries, structured tight!
Streams flow safer, faults detected fast,
Backend whispers caught before they pass.
Hop, hop! We've caught them at the gate! ✨

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title accurately describes the main changes: fixing backend error serialization in Python handlers and adding HTTP status code detection in Rust error handling.
  • Description check: ✅ Passed. The PR description is comprehensive and well-structured, covering the overview, detailed problem statement, changes made to both Python and Rust components, testing instructions, and specific file references for reviewers.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/llm/src/http/service/openai.rs (3)

811-811: Remove redundant import.

StreamExt is already imported at file level (line 28: use futures::{StreamExt, stream};).

-    use futures::stream::StreamExt;
-
     // Peek at the first event

830-833: Consider returning an error for empty streams instead of silently succeeding.

An empty stream from the backend likely indicates a problem (e.g., backend crashed before sending any data). Currently this returns Ok with an empty stream, which may mask issues. The downstream code will eventually fail when trying to fold/aggregate an empty response, but the error message won't indicate the backend returned nothing.

     } else {
-        // Empty stream - this shouldn't happen but handle gracefully
-        Ok(futures::stream::iter(vec![]).chain(stream))
+        // Empty stream - this indicates a backend issue
+        Err((
+            StatusCode::BAD_GATEWAY,
+            Json(ErrorMessage {
+                message: "Backend returned empty response stream".to_string(),
+                error_type: map_error_code_to_error_type(StatusCode::BAD_GATEWAY),
+                code: StatusCode::BAD_GATEWAY.as_u16(),
+            }),
+        ))
     }

2183-2312: Good test coverage for core error detection scenarios.

The tests cover the main paths through check_for_backend_error and extract_backend_error_if_present.

Consider adding tests for:

  • Data payload containing {"code": 400, "message": "..."} (strategy 2 detection)
  • Comment containing valid JSON with code < 400 (should NOT be treated as error)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 046229f and 4d52397.

📒 Files selected for processing (2)
  • components/src/dynamo/trtllm/request_handlers/handler_base.py (2 hunks)
  • lib/llm/src/http/service/openai.rs (6 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

  • lib/llm/src/http/service/openai.rs
📚 Learning: 2025-08-25T22:04:45.205Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2700
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:19-28
Timestamp: 2025-08-25T22:04:45.205Z
Learning: The response_generator() method exists on multiple request types in the codebase: NvCreateChatCompletionRequest (for chat completions) and NvCreateCompletionRequest (for text completions). When making signature changes, it's important to distinguish between these different object types as they have separate implementations and call sites.

Applied to files:

  • lib/llm/src/http/service/openai.rs
📚 Learning: 2025-08-22T19:55:41.608Z
Learnt from: nachiketb-nvidia
Repo: ai-dynamo/dynamo PR: 2656
File: lib/llm/src/protocols/openai/chat_completions/delta.rs:320-327
Timestamp: 2025-08-22T19:55:41.608Z
Learning: There are two separate DeltaGenerator classes in the codebase: one for chat completions (lib/llm/src/protocols/openai/chat_completions/delta.rs with object "chat.completion.chunk") and one for text completions (lib/llm/src/protocols/openai/completions/delta.rs with object "text_completion"). They have different create_choice method signatures and serve different OpenAI API endpoints. The reasoning parsing functionality is only relevant to the chat completions DeltaGenerator.

Applied to files:

  • lib/llm/src/http/service/openai.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (arm64)
🔇 Additional comments (6)
components/src/dynamo/trtllm/request_handlers/handler_base.py (2)

371-377: LGTM! Error serialization format correctly updated for Rust compatibility.

The change from "finish_reason": "error" to "finish_reason": {"error": error_msg} properly serializes into Rust's FinishReason::Error(String) newtype variant, resolving the deserialization error described in the PR objectives.


388-395: LGTM! Consistent error format applied to general exception handler.

The error structure matches the RequestError path, ensuring uniform error serialization across all error scenarios.

lib/llm/src/http/service/openai.rs (4)

44-47: LGTM! Import added for stream type used in error checking.

The NvCreateChatCompletionStreamResponse import is necessary for the new check_for_backend_error function signature.


940-944: LGTM! Backend error check properly integrated before streaming begins.

The early error detection ensures clients receive meaningful error messages rather than deserialization failures. Logging includes request_id for traceability.


978-993: LGTM! Error message updated for clarity.

The change from "fold" to "parse" better describes the operation and aligns with the function name from_annotated_stream.


791-794: Remove the concern about overly aggressive comment-only fallback.

The fallback at lines 791-794 is well-guarded and intentional. It only triggers after confirming: (1) comments exist and are non-empty, (2) the attempt to parse comments as error JSON failed, and (3) event.data is None. This combination—comments without data on the first event—legitimately indicates an error condition rather than a keep-alive comment. Legitimate SSE implementations would send data alongside keep-alive or metadata comments.
