-
Notifications
You must be signed in to change notification settings - Fork 739
test: graceful shutdown for tcp req plane #4883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1f712c0 to
b1d15bb
Compare
WalkthroughAdds comprehensive unit test module for SharedTcpServer with SlowMockHandler to simulate in-flight TCP requests and verify graceful shutdown behavior. Tests registration, inflight request simulation, endpoint unregistration, and synchronization without modifying production code. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (1)
457-607: Time-based assertion onunregister_durationcan be flaky; rely on event ordering insteadThe test already proves the important ordering properties:
- Line 555–559:
unregister_taskhas not finished shortly after starting, so it’s actually blocking while the request is in flight.- Lines 563–575: you wait for the request to complete, then decrement
inflightand callnotify_one, mirroring production behavior.- Lines 577–582: you then wait for
unregister_taskto complete, ensuring it unblocks after the inflight request finishes.The additional check that
unregister_endpointtook at least 1 second (Lines 583–592) relies on real-time scheduling. Under heavy load, it’s possible for the test to resume late afterrequest_started.notified().await, which could makeunregister_duration< 1s even though the logic is correct, leading to spurious CI failures.You can make the test more robust by dropping the
Instantbookkeeping and duration comparison, and simply asserting thatunregister_endpoint:
- Isn’t finished shortly after starting (which you already check), and
- Does finish once the inflight counter is decremented and the notify is fired (which the
timeoutaroundunregister_taskalready enforces).Concretely, something like this keeps the semantics but removes the fragile wall-clock assertion:
- use tokio::time::Instant; + // no need for Instant once we stop asserting on wall-clock duration @@ - let unregister_start = Instant::now(); - tracing::debug!("Starting unregister_endpoint with inflight request"); + tracing::debug!("Starting unregister_endpoint with inflight request"); @@ - // Now wait for unregister to complete - let unregister_end = tokio::time::timeout(Duration::from_secs(2), unregister_task) - .await - .expect("unregister_endpoint should complete after inflight request finishes") - .expect("unregister task should not panic"); - - let unregister_duration = unregister_end - unregister_start; - - tracing::debug!("unregister_endpoint completed in {:?}", unregister_duration); - - // Verify unregister_endpoint waited for the inflight request - assert!( - unregister_duration >= Duration::from_secs(1), - "unregister_endpoint should have waited ~1s for inflight request, but only took {:?}", - unregister_duration - ); + // Now wait for unregister to complete; at this point we know the inflight + // request has completed and we've decremented the inflight counter, so + // unregister_endpoint should return instead of hanging. + tokio::time::timeout(Duration::from_secs(2), unregister_task) + .await + .expect("unregister_endpoint should complete after inflight request finishes") + .expect("unregister task should not panic");This keeps the behavioral guarantee (unregister blocks while a request is in flight and returns after it’s done) while avoiding time-sensitive flakiness.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-11T03:24:47.820Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3004
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:271-277
Timestamp: 2025-09-11T03:24:47.820Z
Learning: In lib/runtime/src/pipeline/network/ingress/push_handler.rs, the maintainer prefers to keep the existing error comparison logic using format!("{:?}", err) == STREAM_ERR_MSG unchanged until proper error types are implemented, even though it has technical debt. Avoid suggesting changes to working legacy code that will be refactored later.
Applied to files:
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Applied to files:
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: Build and Test - dynamo
- GitHub Check: tests (.)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (.)
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: sglang (amd64)
🔇 Additional comments (1)
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (1)
404-455:SlowMockHandleris a good, idiomatic test double forPushWorkHandlerUsing
AtomicBool+Notifyto signal start/completion and implementingPushWorkHandlerdirectly keeps the test close to production behavior without adding unnecessary complexity. I don’t see issues with this design.
tedzhouhk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this unit test!
Overview:
Add rust unit tests for graceful shutdown of TcpServer (request plane)
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.