-
Notifications
You must be signed in to change notification settings - Fork 738
feat: use os assigned free port for tcp rpc #4891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis PR makes DYN_TCP_RPC_PORT optional and enables OS-assigned TCP ports. It refactors TCP server startup to bind and record the actual address/port, exposes a global accessor for the bound port, updates endpoint transport construction to be async and endpoint-aware, and updates docs and tests to reflect the changes. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes
Poem
Pre-merge checks❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro 📒 Files selected for processing (4)
🧰 Additional context used🧠 Learnings (7)📓 Common learnings📚 Learning: 2025-09-11T03:24:47.820ZApplied to files:
📚 Learning: 2025-08-21T17:23:02.836ZApplied to files:
📚 Learning: 2025-07-14T21:25:56.930ZApplied to files:
📚 Learning: 2025-07-16T12:41:12.543ZApplied to files:
📚 Learning: 2025-06-17T00:50:44.845ZApplied to files:
📚 Learning: 2025-09-02T16:46:54.015ZApplied to files:
🧬 Code graph analysis (2)lib/runtime/src/pipeline/network/manager.rs (2)
lib/llm/src/discovery/model_manager.rs (4)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
🔇 Additional comments (17)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/runtime/src/pipeline/network/manager.rs (2)
25-39: Consider logging a warning ifset_actual_tcp_rpc_portis called multiple times.The current implementation silently ignores subsequent calls to
set_actual_tcp_rpc_port. While the comment notes this "shouldn't happen in normal operation," silently ignoring could mask configuration issues or unexpected behavior in edge cases (e.g., tests, service restarts)./// Set the actual TCP RPC port (called internally after server binds). fn set_actual_tcp_rpc_port(port: u16) { - // Ignore if already set (shouldn't happen in normal operation) - let _ = ACTUAL_TCP_RPC_PORT.set(port); + if let Err(existing) = ACTUAL_TCP_RPC_PORT.set(port) { + tracing::warn!( + existing_port = existing, + new_port = port, + "TCP RPC port already set, ignoring new value" + ); + } }
29-33: Returning 0 as a sentinel value could cause subtle bugs.
get_actual_tcp_rpc_port()returns0if the TCP server hasn't started yet. Port 0 is a valid value (meaning "OS assigns port"), but returning it here could lead to invalid transport addresses being constructed (e.g.,host:0/endpoint) if called prematurely.Consider documenting this behavior clearly or returning an
Option<u16>to make the "not yet set" state explicit.-/// Get the actual TCP RPC port that the server is listening on. -/// Returns 0 if the TCP server hasn't been started yet. -pub fn get_actual_tcp_rpc_port() -> u16 { - ACTUAL_TCP_RPC_PORT.get().copied().unwrap_or(0) -} +/// Get the actual TCP RPC port that the server is listening on. +/// +/// # Returns +/// - `Some(port)` if the TCP server has bound and the port is known +/// - `None` if the TCP server hasn't been started yet +/// +/// # Panics +/// Callers should ensure the TCP server is started before calling this in TCP mode. +pub fn get_actual_tcp_rpc_port() -> Option<u16> { + ACTUAL_TCP_RPC_PORT.get().copied() +}Note: This would require updating
endpoint.rsto handleNoneappropriately, potentially with a clearer error message.lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (1)
86-88: Consider handling potential panic fromunwrap()on RwLock.While
unwrap()on astd::sync::RwLockwrite guard only panics if the lock is poisoned (another thread panicked while holding it), this is a single-writer scenario and poisoning is extremely unlikely. However, for robustness in a server context:- *self.actual_addr.write().unwrap() = Some(actual_addr); + match self.actual_addr.write() { + Ok(mut guard) => *guard = Some(actual_addr), + Err(poisoned) => { + tracing::warn!("RwLock poisoned, recovering"); + *poisoned.into_inner() = Some(actual_addr); + } + }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
docs/guides/request_plane.md(4 hunks)lib/runtime/src/component/endpoint.rs(1 hunks)lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs(5 hunks)lib/runtime/src/pipeline/network/manager.rs(5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-21T17:23:02.836Z
Learnt from: michaelfeil
Repo: ai-dynamo/dynamo PR: 2591
File: lib/bindings/python/rust/http.rs:0-0
Timestamp: 2025-08-21T17:23:02.836Z
Learning: In lib/bindings/python/rust/http.rs, the enable_endpoint method uses EndpointType::all() to dynamically support all available endpoint types with case-insensitive matching, which is more maintainable than hardcoded match statements for endpoint type mappings.
Applied to files:
lib/runtime/src/component/endpoint.rslib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.
Applied to files:
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs
🧬 Code graph analysis (2)
lib/runtime/src/pipeline/network/manager.rs (1)
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (2)
new(58-65)new(472-479)
lib/runtime/src/component/endpoint.rs (1)
lib/runtime/src/pipeline/network/manager.rs (1)
get_actual_tcp_rpc_port(31-33)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: trtllm (amd64)
- GitHub Check: operator (amd64)
- GitHub Check: sglang (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: Build and Test - dynamo
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (.)
- GitHub Check: clippy (.)
🔇 Additional comments (8)
docs/guides/request_plane.md (2)
111-119: Documentation accurately reflects the new TCP port behavior.The documentation clearly explains that
DYN_TCP_RPC_PORTis now optional and that OS-assigned ports are recommended for most deployments. The guidance on when to set a fixed port (e.g., firewall rules) is helpful.
129-134: Good addition of new TCP configuration options.The documentation now includes
DYN_TCP_CONNECT_TIMEOUTandDYN_TCP_CHANNEL_BUFFERwhich provides comprehensive coverage of TCP configuration options.lib/runtime/src/pipeline/network/manager.rs (1)
268-297: TCP server creation flow looks correct.The implementation properly:
- Binds with port 0 when
tcp_portisNone- Awaits
bind_and_start()to get the actual address- Stores the actual port globally before returning
- Logs both the source and actual values
This ensures the port is available before any endpoint registration occurs.
lib/runtime/src/pipeline/network/ingress/shared_tcp_endpoint.rs (4)
41-43: Usingstd::sync::RwLockin async context is acceptable here.While
std::sync::RwLockcan block the async runtime, in this case it's used for very brief read/write operations on a singleOption<SocketAddr>. The lock is never held across await points, so this is safe and avoids the overhead oftokio::sync::RwLock.
74-96: Clean implementation ofbind_and_start.The method properly:
- Binds first, capturing the actual address
- Stores the address before spawning the accept loop
- Returns the actual address for the caller to use
This ensures the address is available immediately after the method returns.
205-215: Legacystart()method properly delegates to new implementation.The refactored
start()method maintains backward compatibility while using the newbind_and_start()internally. It correctly waits on the cancellation token after starting.
512-662: Comprehensive test for graceful shutdown behavior.The test validates that
unregister_endpointproperly waits for inflight requests to complete. A few observations:
The test manually increments/decrements the inflight counter (lines 573, 629-630) to simulate tracking, which mirrors the real code flow but could diverge if implementation changes.
The test uses reasonable timeouts and assertions to verify the waiting behavior.
Overall, this is a valuable addition for ensuring graceful shutdown correctness.
lib/runtime/src/component/endpoint.rs (1)
257-268: Proper ordering confirmed: TCP port is set before endpoint registration.The implementation correctly retrieves the actual bound port via
get_actual_tcp_rpc_port()after the TCP server has completed binding. The call ordering is guaranteed becauserequest_plane_server().await(line 141) is awaited before anybuild_transport_type()calls. The server initialization usesget_or_try_init()to ensure single, synchronous binding, andcreate_tcp_server()callsset_actual_tcp_rpc_port()before returning control to the endpoint registration code.
kthui
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding config parity between TCP and NATS quickly!
I think there could be a race condition that will cause endpoints to incorrectly advertise port 0:
In
EndpointConfigBuilder::start(line 95), we callbuild_transport_type(...)which internally callsmanager::get_actual_tcp_rpc_port(). At this exact point in execution, the request plane server has not been started yet, so the globalACTUAL_TCP_RPC_PORTis unset, and the function returns default0.The server is only initialized later at line 141 (
endpoint.drt().request_plane_server().await?). While that call eventually sets the global port variable, it's too late that thetransportvariable has already been created with port0.
(summarized by AI)
Fixed issue was handled in e02bb11 Latest flow ~
|
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
kthui
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Overview:
New Features
Documentation
closes: DYN-1583
Summary by CodeRabbit
New Features
DYN_TCP_RPC_PORTis unset; actual bound port retrievable at runtime.DYN_TCP_CONNECT_TIMEOUT,DYN_TCP_CHANNEL_BUFFER).Documentation
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.