Skip to content

Conversation

@biswapanda
Copy link
Contributor

@biswapanda biswapanda commented Dec 11, 2025

Overview:

Add rust unit tests for graceful shutdown of TcpServer (request plane)

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • Tests
    • Added unit tests to verify graceful shutdown behavior for TCP endpoint handling with concurrent in-flight requests.

✏️ Tip: You can customize this high-level summary in your review settings.

@biswapanda biswapanda requested a review from a team as a code owner December 11, 2025 04:16
@copy-pr-bot
Copy link

copy-pr-bot bot commented Dec 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@biswapanda biswapanda self-assigned this Dec 11, 2025
@github-actions github-actions bot added the test label Dec 11, 2025
@biswapanda biswapanda force-pushed the bis/graceful-shutdown-test-2 branch from 1f712c0 to b1d15bb Compare December 11, 2025 04:17
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 11, 2025

Walkthrough

Adds comprehensive unit test module for SharedTcpServer with SlowMockHandler to simulate in-flight TCP requests and verify graceful shutdown behavior. Tests registration, inflight request simulation, endpoint unregistration, and synchronization without modifying production code.

Changes

Cohort / File(s) Summary
SharedTcpServer test infrastructure
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
Introduces unit test module with SlowMockHandler to simulate delayed TCP request processing. Tests endpoint registration, inflight request handling during unregistration, and synchronization via Notify and AtomicBool primitives with timing assertions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Synchronization logic: Verify Notify and AtomicBool usage correctly prevents race conditions between inflight request completion and unregister operation
  • Timing assertions: Confirm expected control flow timing is appropriate and doesn't create flaky test behavior
  • Mock handler behavior: Ensure SlowMockHandler accurately simulates real in-flight request scenarios

Poem

🐰 A test hopped in with handlers so slow,
To catch when requests need time to go,
Graceful shutdown, no requests denied—
The TCP endpoint now tested with pride! 🌟

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description check ❓ Inconclusive The description follows the required template structure but has incomplete sections: Details and Where should the reviewer start are empty, and no specific GitHub issue is referenced. Complete the Details section explaining the test implementation, specify where reviewers should focus, and provide the actual GitHub issue number instead of #xxx placeholder.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding tests for graceful shutdown of the TCP request plane server.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (1)

457-607: Time-based assertion on unregister_duration can be flaky; rely on event ordering instead

The test already proves the important ordering properties:

  • Line 555–559: unregister_task has not finished shortly after starting, so it’s actually blocking while the request is in flight.
  • Lines 563–575: you wait for the request to complete, then decrement inflight and call notify_one, mirroring production behavior.
  • Lines 577–582: you then wait for unregister_task to complete, ensuring it unblocks after the inflight request finishes.

The additional check that unregister_endpoint took at least 1 second (Lines 583–592) relies on real-time scheduling. Under heavy load, it’s possible for the test to resume late after request_started.notified().await, which could make unregister_duration < 1s even though the logic is correct, leading to spurious CI failures.

You can make the test more robust by dropping the Instant bookkeeping and duration comparison, and simply asserting that unregister_endpoint:

  1. Isn’t finished shortly after starting (which you already check), and
  2. Does finish once the inflight counter is decremented and the notify is fired (which the timeout around unregister_task already enforces).

Concretely, something like this keeps the semantics but removes the fragile wall-clock assertion:

-    use tokio::time::Instant;
+    // no need for Instant once we stop asserting on wall-clock duration

@@
-        let unregister_start = Instant::now();
-        tracing::debug!("Starting unregister_endpoint with inflight request");
+        tracing::debug!("Starting unregister_endpoint with inflight request");

@@
-        // Now wait for unregister to complete
-        let unregister_end = tokio::time::timeout(Duration::from_secs(2), unregister_task)
-            .await
-            .expect("unregister_endpoint should complete after inflight request finishes")
-            .expect("unregister task should not panic");
-
-        let unregister_duration = unregister_end - unregister_start;
-
-        tracing::debug!("unregister_endpoint completed in {:?}", unregister_duration);
-
-        // Verify unregister_endpoint waited for the inflight request
-        assert!(
-            unregister_duration >= Duration::from_secs(1),
-            "unregister_endpoint should have waited ~1s for inflight request, but only took {:?}",
-            unregister_duration
-        );
+        // Now wait for unregister to complete; at this point we know the inflight
+        // request has completed and we've decremented the inflight counter, so
+        // unregister_endpoint should return instead of hanging.
+        tokio::time::timeout(Duration::from_secs(2), unregister_task)
+            .await
+            .expect("unregister_endpoint should complete after inflight request finishes")
+            .expect("unregister task should not panic");

This keeps the behavioral guarantee (unregister blocks while a request is in flight and returns after it’s done) while avoiding time-sensitive flakiness.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5250303 and b1d15bb.

📒 Files selected for processing (1)
  • lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.

Applied to files:

  • lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

Applied to files:

  • lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (.)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: sglang (amd64)
🔇 Additional comments (1)
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (1)

404-455: SlowMockHandler is a good, idiomatic test double for PushWorkHandler

Using AtomicBool + Notify to signal start/completion and implementing PushWorkHandler directly keeps the test close to production behavior without adding unnecessary complexity. I don’t see issues with this design.

@biswapanda biswapanda enabled auto-merge (squash) December 11, 2025 19:13
Copy link
Contributor

@tedzhouhk tedzhouhk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this unit test!

@biswapanda biswapanda merged commit 5e96c9a into main Dec 11, 2025
38 checks passed
@biswapanda biswapanda deleted the bis/graceful-shutdown-test-2 branch December 11, 2025 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants