test: add dynamic port allocation for fault_tolerant test execution #4835

keivenchang · 2025-12-10T00:12:57Z

Overview:

Test isolation via dynamic port allocation. Tests now allocate unique ports at runtime instead of using hardcoded values, eliminating port conflicts to avoid port collisions (when previous tests don't terminate properly).

Details:

Add tests/utils/port_utils.py for thread-safe port allocation with file locking
Update conftest.py: EtcdServer/NatsServer support dynamic ports (default 2379/4222, dynamic if port=None/0)
Update vLLM, SGLang, TRT-LLM cancellation/migration tests to use dynamic ports
Add standardized timing documentation to all fault tolerance tests

Where should the reviewer start?

tests/utils/port_utils.py - Port allocation logic using File Lock
tests/conftest.py - Dynamic port support for EtcdServer/NatsServer
tests/fault_tolerance/cancellation/test_vllm.py - Example port allocation pattern

Related Issues:

Relates to OPS-2436

/coderabbit profile chill

Summary by CodeRabbit

Tests

Enhanced test infrastructure with dynamic port allocation to support parallel test execution without conflicts
Improved port management and cleanup mechanisms for test services and worker processes
Updated fault tolerance and migration test suites to use dynamically assigned ports instead of fixed values, enabling more reliable concurrent testing

_{✏️ Tip: You can customize this high-level summary in your review settings.}

kthui

Looks good to me overall! This is a great enhancement to stop port conflicts between test cases.

Left a few minor comments, but the accidental 16384 -> 8192 prompt length changes would need to be fixed before merging.

tests/conftest.py

tests/utils/port_utils.py

tests/fault_tolerance/migration/utils.py

tests/fault_tolerance/cancellation/utils.py

tests/fault_tolerance/cancellation/test_sglang.py

kthui · 2025-12-10T02:20:17Z

FYI, all the Cancellation / Migration tests are not ran as a part of pre_merge, we might need a "special" PR that ran all those tests on the CI to verify.

coderabbitai · 2025-12-10T02:50:45Z

Walkthrough

This pull request introduces dynamic port allocation for parallel test execution by adding flock-based port utilities, updating core test fixtures and process management, and refactoring multiple test components to accept explicit port parameters instead of hardcoded values.

Changes

Cohort / File(s)	Summary
Port Allocation Infrastructure `tests/utils/port_utils.py`	New module providing flock-based port allocation with stale-entry cleanup, supporting `allocate_free_port(s)` and `free_port(s)` functions for parallel-safe port management within 30000–32767 range.
Test Fixtures & Configuration `tests/conftest.py`	Added `runtime_services_dynamic_ports` fixture provisioning NATS and Etcd servers with port=0; added `__exit__` cleanup methods to `EtcdServer` and `NatsServer` classes to release allocated ports.
Base Process Management `tests/utils/managed_process.py`	Added `get_pid()` method to base `ManagedProcess` class; removed redundant override from `DynamoFrontendProcess`.
Cancellation Test Utilities `tests/fault_tolerance/cancellation/utils.py`	Extended `DynamoFrontendProcess.__init__` and request helpers (`send_completion_request`, `send_chat_completion_request`, `send_cancellable_request`, `read_streaming_responses`) to accept `frontend_port` parameter; updated request URLs and health checks to use dynamic ports.
Migration Test Utilities `tests/fault_tolerance/migration/utils.py`	Extended `DynamoFrontendProcess.__init__` to accept `frontend_port` and configure --http-port flag; updated `start_completion_request` signature; changed `terminate_existing` to False; added return-type annotation to `determine_request_receiving_worker`.
Cancellation Tests `tests/fault_tolerance/cancellation/test_sglang.py`, `test_trtllm.py`, `test_vllm.py`	Updated `DynamoWorkerProcess.__init__` to accept `system_port` and `frontend_port`; added `__exit__` for port cleanup; switched test functions to `runtime_services_dynamic_ports` fixture; updated health checks and request invocation to use allocated ports.
Migration Tests `tests/fault_tolerance/migration/test_sglang.py`, `test_trtllm.py`, `test_vllm.py`	Updated `DynamoWorkerProcess.__init__` to accept `system_port` and `frontend_port`; added `__exit__` for port cleanup; removed `get_pid` method; switched tests to `runtime_services_dynamic_ports` fixture and dynamic port allocation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

Port lifecycle management: Review port allocation/deallocation logic in port_utils.py, especially flock-based synchronization and stale-entry cleanup to ensure thread-safety and prevent port leaks.
Fixture teardown: Verify __exit__ methods in EtcdServer, NatsServer, and DynamoWorkerProcess properly release ports before parent cleanup, with exception handling.
API consistency: Ensure all DynamoWorkerProcess and DynamoFrontendProcess instantiations across six test files consistently pass system_port and frontend_port parameters; verify health-check URLs and environment variables (DYN_SYSTEM_PORT, DYN_HTTP_PORT) are correctly wired.
Test orchestration: Validate that port allocation and cleanup flow correctly in each test scenario (aggregated, decode, prefill, KV-transfer, migration) without leaving orphaned processes or leaked ports.

Poem

🐰 Hark! Dynamic ports now flow through tests so fair,
No collisions, no conflicts—parallel's in the air!
With locks and with cleanup, each port finds its home,
A symphony of sockets—no more will they roam!
Eight-eighty-oh-eight is a memory now,
Thirty thousand and beyond—rabbits take a bow! 🎉

Pre-merge checks

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: introducing dynamic port allocation for fault tolerance tests.
Description check	✅ Passed	The description covers the key sections: Overview, Details, Where to start, and Related Issues. All required template elements are present and substantively filled.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tests/fault_tolerance/cancellation/test_vllm.py (1)

96-97: Hardcoded ports are inconsistent with dynamic port allocation strategy used elsewhere in the test.

DYN_VLLM_KV_EVENT_PORT (20082) and VLLM_NIXL_SIDE_CHANNEL_PORT (5601) are hardcoded, while frontend_port, prefill_system_port, and decode_system_port are dynamically allocated with comments stating "Allocate ports to avoid conflicts with parallel tests." This inconsistency means if parallel execution is enabled in the future (e.g., via pytest.mark.parallel), these hardcoded ports could cause conflicts between concurrent test runs. Align the port allocation strategy by making these ports dynamic as well, using allocate_free_port() to match the pattern used for other ports.

tests/fault_tolerance/migration/test_vllm.py (1)

129-144: Consider using runtime_services_dynamic_ports for consistency with TRT-LLM and SGLang migration tests.

The TRT-LLM and SGLang migration tests use runtime_services_dynamic_ports, which automatically sets NATS_SERVER and ETCD_ENDPOINTS environment variables for dynamic service discovery. The vLLM migration tests use the older runtime_services fixture, which does not configure these environment variables. Since worker processes inherit os.environ and may depend on these variables for NATS/Etcd connectivity, updating vLLM to use runtime_services_dynamic_ports would align with the other frameworks and ensure consistent behavior across all migration tests.

♻️ Duplicate comments (3)

tests/utils/port_utils.py (1)

75-78: Validation uses 1024 instead of _PORT_MIN.

The docstring and comments indicate the port range is 30000-32767, but the validation allows any port >= 1024. This is inconsistent with the documented i16 range constraint.

This was already flagged in a past review comment by kthui suggesting to use _PORT_MIN instead of 1024.

tests/fault_tolerance/cancellation/utils.py (1)

271-275: max_tokens reduced from 16384 to 8192.

Past review comments from kthui indicate that max_tokens values were intentionally set to 16384 for faster machines and should not be changed.

This was already flagged in past review comments. Verify if this reduction was intentional or if it should be reverted to 16384.

tests/fault_tolerance/cancellation/test_sglang.py (1)

309-312: Past review addressed: port allocation pattern.

A previous reviewer suggested consolidating the three allocate_free_port calls into one allocate_free_ports(3, 30000) call. This is a valid simplification if the team prefers it.

🧹 Nitpick comments (8)

tests/conftest.py (4)
303-319: return inside finally can silently swallow exceptions.

The return super().__exit__(...) inside the finally block will suppress any exception raised in the try block. If free_ports() raises an unexpected error that isn't caught (e.g., a KeyboardInterrupt), it will be silently discarded.

Consider storing the result and returning after the finally block completes:
     def __exit__(self, exc_type, exc_val, exc_tb):
         """Release allocated ports when server exits."""
         ports_to_release = []
         try:
             # Release allocated ports BEFORE calling parent __exit__
             if hasattr(self, "port") and self.port is not None:
                 ports_to_release.append(self.port)
             if hasattr(self, "peer_port") and self.peer_port is not None:
                 ports_to_release.append(self.peer_port)
 
             if ports_to_release:
                 free_ports(ports_to_release)
         except Exception as e:
             logging.warning(f"Failed to release EtcdServer port: {e}")
-        finally:
-            # Always call parent __exit__ to terminate the process
-            return super().__exit__(exc_type, exc_val, exc_tb)
+        # Always call parent __exit__ to terminate the process
+        return super().__exit__(exc_type, exc_val, exc_tb)
The finally block is unnecessary here since the except clause handles the only expected failure path, and moving the parent call outside ensures exceptions propagate correctly.

351-361: Same return inside finally issue as EtcdServer.__exit__.

Apply the same fix here for consistency:
     def __exit__(self, exc_type, exc_val, exc_tb):
         """Release allocated port when server exits."""
         try:
             # Release allocated port BEFORE calling parent __exit__
             if hasattr(self, "port") and self.port is not None:
                 free_port(self.port)
         except Exception as e:
             logging.warning(f"Failed to release NatsServer port: {e}")
-        finally:
-            # Always call parent __exit__ to terminate the process
-            return super().__exit__(exc_type, exc_val, exc_tb)
+        # Always call parent __exit__ to terminate the process
+        return super().__exit__(exc_type, exc_val, exc_tb)
513-513: Redundant import of os.

os is already imported at the top of the file (line 5). This local import is unnecessary.
-        import os
-
         # Port cleanup is now handled in NatsServer and EtcdServer __exit__ methods
351-361: Port is released even when not dynamically allocated.

When using the default port (e.g., NatsServer(request, port=4222)), free_port(4222) is called even though port 4222 was never registered in the allocation registry. While this is benign (the registry lookup is a no-op for unregistered ports), it's semantically cleaner to only free ports that were dynamically allocated.

Consider tracking whether the port was dynamically allocated:
     def __init__(self, request, port=4222, timeout=300):
         # Allocate a free port if port is None or 0
         use_random_port = port == 0
         if use_random_port:
             # Start from 4223 (nats-server default 4222 + 1)
             port = allocate_free_port(4223)
 
         self.port = port
+        self._dynamically_allocated = use_random_port
         ...

     def __exit__(self, exc_type, exc_val, exc_tb):
         """Release allocated port when server exits."""
         try:
-            if hasattr(self, "port") and self.port is not None:
+            if getattr(self, "_dynamically_allocated", False) and self.port is not None:
                 free_port(self.port)
         ...
The same pattern could apply to EtcdServer.
tests/utils/port_utils.py (2)
122-129: Socket file descriptor may leak on bind failure.

When sock.bind() raises OSError, the socket is not closed before continuing to the next iteration. While Python's garbage collector will eventually reclaim it, explicitly closing the socket ensures deterministic cleanup, especially during rapid port scanning.
                 # Try to bind to verify it's actually free
                 try:
                     sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
-                    sock.bind(("", port))
-                    sock.close()
-                    ports.append(port)
-                    registry[str(port)] = time.time()
+                    try:
+                        sock.bind(("", port))
+                        ports.append(port)
+                        registry[str(port)] = time.time()
+                    finally:
+                        sock.close()
                 except OSError:
                     continue
49-56: Consider documenting the stale allocation timeout.

The 15-minute (max_age=900.0) stale allocation cleanup is a reasonable default, but tests that run longer or crash unexpectedly may hit port exhaustion. Consider adding this detail to the module docstring or function docstring for visibility.
tests/fault_tolerance/migration/test_vllm.py (1)
44-112: Missing port cleanup in __exit__ — inconsistent with SGLang tests.

The SGLang worker processes (test_sglang.py in both cancellation and migration) implement __exit__ to release allocated ports via free_port(). This class lacks that cleanup, which could cause port leaks if the allocation tracking relies on explicit release.

If allocate_free_port tracks allocated ports (via file lock or registry), add cleanup:
+from tests.utils.port_utils import free_port
+
 class DynamoWorkerProcess(ManagedProcess):
     ...
+    def __init__(self, ...):
+        ...
+        self.system_port = system_port
+        ...
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        """Release allocated port when worker exits."""
+        if hasattr(self, "system_port") and self.system_port is not None:
+            try:
+                free_port(self.system_port)
+            except Exception:
+                pass
+        return super().__exit__(exc_type, exc_val, exc_tb)
Alternatively, if port cleanup is handled elsewhere (e.g., fixture teardown), this can be skipped — but consistency across test files would improve maintainability.
tests/fault_tolerance/migration/test_trtllm.py (1)

44-107: Missing port cleanup in __exit__ — same issue as vLLM tests.

For consistency with SGLang tests that release ports in __exit__, consider adding cleanup here as well.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 96fe63f and 909fb97.

📒 Files selected for processing (11)

tests/conftest.py (3 hunks)
tests/fault_tolerance/cancellation/test_sglang.py (11 hunks)
tests/fault_tolerance/cancellation/test_trtllm.py (14 hunks)
tests/fault_tolerance/cancellation/test_vllm.py (11 hunks)
tests/fault_tolerance/cancellation/utils.py (9 hunks)
tests/fault_tolerance/migration/test_sglang.py (11 hunks)
tests/fault_tolerance/migration/test_trtllm.py (10 hunks)
tests/fault_tolerance/migration/test_vllm.py (8 hunks)
tests/fault_tolerance/migration/utils.py (5 hunks)
tests/utils/managed_process.py (1 hunks)
tests/utils/port_utils.py (1 hunks)

🧰 Additional context used

🧠 Learnings (4)

📓 Common learnings

Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:620-656
Timestamp: 2025-09-24T17:45:54.735Z
Learning: In the fault tolerance test infrastructure for Dynamo, port-forwards created by the `port_forward` method in `ManagedDeployment` are intended to run in the background and should not be stopped immediately after use. They are designed for persistent connections that can be reused across multiple metrics collections during test execution.

Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.

📚 Learning: 2025-09-24T17:45:54.735Z

Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:620-656
Timestamp: 2025-09-24T17:45:54.735Z
Learning: In the fault tolerance test infrastructure for Dynamo, port-forwards created by the `port_forward` method in `ManagedDeployment` are intended to run in the background and should not be stopped immediately after use. They are designed for persistent connections that can be reused across multiple metrics collections during test execution.

Applied to files:

tests/fault_tolerance/cancellation/utils.py
tests/fault_tolerance/cancellation/test_trtllm.py
tests/fault_tolerance/migration/test_trtllm.py
tests/fault_tolerance/migration/test_sglang.py
tests/fault_tolerance/cancellation/test_sglang.py
tests/fault_tolerance/migration/utils.py
tests/fault_tolerance/migration/test_vllm.py
tests/fault_tolerance/cancellation/test_vllm.py

📚 Learning: 2025-10-03T01:53:15.023Z

Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3391
File: tests/fault_tolerance/cancellation/utils.py:323-390
Timestamp: 2025-10-03T01:53:15.023Z
Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.

Applied to files:

tests/fault_tolerance/cancellation/test_trtllm.py

📚 Learning: 2025-09-24T17:26:17.225Z

Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/fault_tolerance/deploy/client.py:196-216
Timestamp: 2025-09-24T17:26:17.225Z
Learning: In tests/fault_tolerance/deploy/client.py, when no pods are ready, the port defaults to 0, creating URL "http://localhost:0/{endpoint}". The requests.post() call will raise ConnectionError or ConnectTimeout, which are caught by requests.RequestException in _single_request function.

Applied to files:

tests/fault_tolerance/cancellation/test_trtllm.py
tests/fault_tolerance/cancellation/test_sglang.py
tests/fault_tolerance/migration/utils.py
tests/fault_tolerance/cancellation/test_vllm.py

🧬 Code graph analysis (6)

tests/fault_tolerance/cancellation/utils.py (2)

tests/fault_tolerance/test_vllm_health_check.py (1)

send_completion_request (128-158)

tests/kvbm_integration/test_cuda_graph.py (1)

send_completion_request (111-141)

tests/conftest.py (1)

tests/utils/port_utils.py (4)

allocate_free_port (146-155)

allocate_free_ports (59-143)

free_port (191-197)

free_ports (158-188)

tests/fault_tolerance/migration/test_trtllm.py (4)

tests/utils/port_utils.py (1)

allocate_free_port (146-155)

tests/fault_tolerance/cancellation/test_vllm.py (1)

DynamoWorkerProcess (40-147)

tests/conftest.py (3)

runtime_services_dynamic_ports (504-527)

set_ucx_tls_no_mm (78-89)

logger (190-201)

tests/fault_tolerance/migration/utils.py (2)

DynamoFrontendProcess (19-60)

start_completion_request (63-107)

tests/fault_tolerance/migration/utils.py (1)

tests/utils/managed_process.py (1)

ManagedProcess (71-579)

tests/fault_tolerance/migration/test_vllm.py (3)

tests/utils/managed_process.py (3)

ManagedProcess (71-579)

DynamoFrontendProcess (582-610)

get_pid (562-564)

tests/utils/port_utils.py (1)

allocate_free_port (146-155)

tests/fault_tolerance/migration/utils.py (2)

DynamoFrontendProcess (19-60)

start_completion_request (63-107)

tests/fault_tolerance/cancellation/test_vllm.py (2)

tests/utils/port_utils.py (1)

allocate_free_port (146-155)

tests/fault_tolerance/cancellation/utils.py (2)

DynamoFrontendProcess (22-57)

send_cancellable_request (251-277)

🪛 Ruff (0.14.8)

tests/utils/port_utils.py

76-78: Avoid specifying long messages outside the exception class

(TRY003)

85-87: Avoid specifying long messages outside the exception class

(TRY003)

132-134: Avoid specifying long messages outside the exception class

(TRY003)

tests/fault_tolerance/cancellation/test_trtllm.py

154-154: Unused function argument: runtime_services_dynamic_ports

(ARG001)

245-245: Unused function argument: runtime_services_dynamic_ports

(ARG001)

340-340: Unused function argument: runtime_services_dynamic_ports

(ARG001)

447-447: Unused function argument: runtime_services_dynamic_ports

(ARG001)

tests/conftest.py

315-315: Do not catch blind exception: Exception

(BLE001)

319-319: return inside finally blocks cause exceptions to be silenced

(B012)

357-357: Do not catch blind exception: Exception

(BLE001)

361-361: return inside finally blocks cause exceptions to be silenced

(B012)

tests/fault_tolerance/migration/test_trtllm.py

126-126: Unused function argument: runtime_services_dynamic_ports

(ARG001)

126-126: Unused function argument: set_ucx_tls_no_mm

(ARG001)

190-190: Unused function argument: runtime_services_dynamic_ports

(ARG001)

190-190: Unused function argument: set_ucx_tls_no_mm

(ARG001)

258-258: Unused function argument: runtime_services_dynamic_ports

(ARG001)

258-258: Unused function argument: set_ucx_tls_no_mm

(ARG001)

340-340: Unused function argument: runtime_services_dynamic_ports

(ARG001)

340-340: Unused function argument: set_ucx_tls_no_mm

(ARG001)

tests/fault_tolerance/migration/test_sglang.py

121-121: Do not catch blind exception: Exception

(BLE001)

125-125: return inside finally blocks cause exceptions to be silenced

(B012)

144-144: Unused function argument: runtime_services_dynamic_ports

(ARG001)

144-144: Unused function argument: set_ucx_tls_no_mm

(ARG001)

209-209: Unused function argument: runtime_services_dynamic_ports

(ARG001)

209-209: Unused function argument: set_ucx_tls_no_mm

(ARG001)

277-277: Unused function argument: runtime_services_dynamic_ports

(ARG001)

277-277: Unused function argument: set_ucx_tls_no_mm

(ARG001)

360-360: Unused function argument: runtime_services_dynamic_ports

(ARG001)

360-360: Unused function argument: set_ucx_tls_no_mm

(ARG001)

tests/fault_tolerance/cancellation/test_sglang.py

162-162: Do not catch blind exception: Exception

(BLE001)

166-166: return inside finally blocks cause exceptions to be silenced

(B012)

189-189: Unused function argument: runtime_services_dynamic_ports

(ARG001)

292-292: Unused function argument: runtime_services_dynamic_ports

(ARG001)

🔇 Additional comments (22)

tests/utils/managed_process.py (1)

562-564: LGTM! Clean API addition for PID access.

The implementation correctly handles the case where self.proc may be None, and moving this to the base class promotes code reuse across subclasses.
tests/fault_tolerance/cancellation/test_vllm.py (2)
4-10: Good practice: Timing documentation.

The timing annotations provide valuable context for understanding expected test durations and help identify regressions. This is a helpful pattern for fault tolerance tests.

168-170: Allocated ports are not explicitly released after test completion.

The ports allocated via allocate_free_port() (frontend_port, system_port) are registered in the port registry but never explicitly freed when the test completes. Cleanup relies on the 15-minute stale allocation timeout, which can lead to port exhaustion in parallel test runs.

Consider wrapping port allocation and cleanup in a helper or using a try/finally block:
from tests.utils.port_utils import allocate_free_port, free_ports

frontend_port = allocate_free_port(8100)
system_port = allocate_free_port(9100)
try:
    # ... test body ...
finally:
    free_ports([frontend_port, system_port])
Alternatively, create a pytest fixture that handles port allocation and cleanup automatically.
⛔ Skipped due to learnings
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:620-656
Timestamp: 2025-09-24T17:45:54.735Z
Learning: In the fault tolerance test infrastructure for Dynamo, port-forwards created by the `port_forward` method in `ManagedDeployment` are intended to run in the background and should not be stopped immediately after use. They are designed for persistent connections that can be reused across multiple metrics collections during test execution.
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3193
File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204
Timestamp: 2025-09-25T00:54:01.369Z
Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 4323
File: components/src/dynamo/vllm/main.py:197-205
Timestamp: 2025-11-14T01:09:35.244Z
Learning: In components/src/dynamo/vllm/main.py, keivenchang considers temporary directory cleanup from tempfile.mkdtemp() to be LOW PRIORITY for production because containerized deployment patterns automatically clean up temp directories when containers are destroyed, mitigating resource leak concerns.
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3391
File: tests/fault_tolerance/cancellation/utils.py:323-390
Timestamp: 2025-10-03T01:53:15.023Z
Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.
tests/fault_tolerance/cancellation/utils.py (2)

22-58: LGTM: DynamoFrontendProcess port parameterization.

The frontend process is correctly parameterized with the dynamic port, and terminate_existing=False ensures parallel tests don't interfere with each other.

293-301: LGTM: Improved response validation.

The explicit None check and status code validation before using the response is good defensive programming. The cast() usage appropriately narrows the type after the validation checks.

tests/fault_tolerance/migration/utils.py (3)

171-176: Explicit raise AssertionError after pytest.fail() for type checker.

The raise AssertionError("Unreachable") statements after pytest.fail() are necessary for mypy to understand that the function returns tuple[ManagedProcess, str] in all code paths, since mypy doesn't recognize that pytest.fail() never returns.

This addresses a past review comment suggesting removal, but the explicit raise is the correct approach for satisfying type checkers.

22-61: LGTM: DynamoFrontendProcess port parameterization.

The implementation is consistent with the cancellation utils pattern. The --router-mode round-robin flag is appropriate for migration tests that need deterministic request routing.

63-107: LGTM: start_completion_request port parameterization.

The function correctly accepts and uses the frontend_port parameter for constructing the request URL, enabling dynamic port allocation in tests.

tests/fault_tolerance/migration/test_vllm.py (2)

4-11: Timing documentation is helpful for test maintenance.

The standardized timing annotations provide useful context for test execution expectations and help identify performance regressions.

146-184: Dynamic port allocation and wiring looks correct.

The port allocation pattern using allocate_free_port(start_port) and passing ports to DynamoFrontendProcess and DynamoWorkerProcess is consistent with the PR objectives for test isolation.

tests/fault_tolerance/migration/test_trtllm.py (2)

4-11: Timing documentation with skip annotations is clear.

Good practice to note which tests are skipped in the timing summary.

124-185: Port allocation and wiring implemented correctly.

The dynamic port allocation using allocate_free_port() and explicit port passing to worker/frontend processes aligns with PR objectives. The use of runtime_services_dynamic_ports fixture is correct for this test.

tests/fault_tolerance/cancellation/test_sglang.py (2)

4-9: Clear timing documentation with GPU requirements noted.

The timing annotations help understand resource requirements and expected execution time.

188-224: Test setup with dynamic ports looks correct.

The port allocation and worker initialization pattern is clean and follows the established convention.

tests/fault_tolerance/migration/test_sglang.py (2)

4-11: Timing documentation includes skip status for unenabled tests.

Clear documentation of which tests are skipped and total time for enabled tests.

142-203: Port allocation and test orchestration implemented correctly.

Dynamic port allocation pattern is consistent with other tests. The use of both runtime_services_dynamic_ports and set_ucx_tls_no_mm fixtures is appropriate for SGLang migration tests.

tests/fault_tolerance/cancellation/test_trtllm.py (6)

4-11: Comprehensive timing documentation for multiple test scenarios.

Good coverage of timing for aggregated, decode, prefill, and KV transfer cancellation tests.

42-131: Port wiring and health check logic implemented correctly.

The DynamoWorkerProcess correctly differentiates health checks based on mode:

prefill/decode: Uses system port health endpoint only

prefill_and_decode: Uses frontend endpoints for models and health

The port parameters are properly wired to environment variables and health check URLs.

152-240: Aggregated mode cancellation test properly allocates and uses dynamic ports.

The test correctly allocates frontend and system ports, passes them to processes, and uses them for request dispatch.

243-335: Disaggregated decode cancellation test with proper port isolation.

The test allocates separate ports for frontend, prefill worker, and decode worker, ensuring proper isolation. The worker startup order (frontend → prefill → decode) and health check wiring is correct.

338-438: Prefill cancellation test verifies decode worker never received the cancelled request.

Good test design: validates that cancellation during prefill prevents the request from reaching the decode worker.

441-551: KV transfer cancellation test includes worker functionality verification after cancellation.

Good practice to verify workers remain functional after a cancellation event (lines 538-548).

tests/fault_tolerance/cancellation/test_sglang.py

tests/fault_tolerance/migration/test_sglang.py

tests/fault_tolerance/cancellation/test_sglang.py

tests/utils/port_utils.py

keivenchang · 2025-12-10T23:43:43Z

Updated:

Lower _PORT_MIN from 30000 to 1024 for wider port range
Add random offset (0-100) to start_port to reduce initial collision
Wrap around to _PORT_MIN instead of start_port
Reduce retries from full range scan to 100 attempts

tests/utils/port_utils.py

Add port allocation utilities and update fault tolerance tests (vLLM, SGLang, TRT-LLM) to use dynamically allocated ports, enabling parallel test execution. - Add tests/utils/port_utils.py for dynamic port allocation - Update conftest.py: EtcdServer/NatsServer support dynamic ports - Move get_pid() to ManagedProcess base class - Update vLLM tests with dynamic port allocation and timing docs - Update SGLang tests with dynamic port allocation and timing docs - Update TRT-LLM tests with dynamic port allocation and timing docs Signed-off-by: Keiven Chang <[email protected]> Remove redundant comments and improve port allocation - Remove redundant 'Always call parent __exit__' comments - Fix return in finally blocks to avoid swallowing exceptions - Update port allocation to search entire valid range - Add validated timing benchmarks to port_utils.py Signed-off-by: Keiven Chang <[email protected]> Use runtime_services_dynamic_ports for vLLM tests Update vLLM cancellation and migration tests to use runtime_services_dynamic_ports fixture for fully dynamic NATS/Etcd ports, enabling true parallel test execution. - Restore max_tokens to 16384 in send_cancellable_request - Add TODO to etcd_ha/test_sglang.py for future port update Signed-off-by: Keiven Chang <[email protected]> Rename port functions and simplify cleanup logic Rename for clarity and consistency: - allocate_free_port -> allocate_port - allocate_free_ports -> allocate_ports - free_port -> deallocate_port - free_ports -> deallocate_ports Simplify __exit__ methods with comments explaining why checks are/aren't needed: - self.port always set in __init__ (no None check needed) - self.peer_port may be None (check required) - self.system_port required parameter (no None check needed) Signed-off-by: Keiven Chang <[email protected]> Move frontend port allocation into DynamoFrontendProcess Signed-off-by: Keiven Chang <[email protected]> Improve port allocation with random offset and caller tracking - Lower _PORT_MIN from 30000 to 1024 for wider port range - Add random offset (0-100) to start_port to reduce collision - Wrap around to _PORT_MIN instead of start_port - Reduce retries from full range scan to 100 attempts - Track caller info (file, function, line) in lock file for debugging - Handle both old and new registry formats for compatibility Signed-off-by: Keiven Chang <[email protected]>

Signed-off-by: Keiven Chang <[email protected]>

…i-dynamo#4835) Signed-off-by: Keiven Chang <[email protected]> Co-authored-by: Keiven Chang <[email protected]>

keivenchang requested a review from kthui December 10, 2025 00:12

keivenchang self-assigned this Dec 10, 2025

pull-request-size bot added the size/XXL label Dec 10, 2025

github-actions bot added the test label Dec 10, 2025

keivenchang force-pushed the keivenchang/python-tests-use-dynamo-port-part2 branch from 88570b6 to 909fb97 Compare December 10, 2025 00:20

copy-pr-bot bot had a problem deploying to GITLAB December 10, 2025 00:20 Failure

keivenchang changed the title ~~test: add dynamic port allocation for parallel test execution~~ test: add dynamic port allocation for fault_tolerant test execution Dec 10, 2025

kthui reviewed Dec 10, 2025

View reviewed changes

keivenchang marked this pull request as ready for review December 10, 2025 02:46

keivenchang requested review from a team as code owners December 10, 2025 02:46

coderabbitai bot reviewed Dec 10, 2025

View reviewed changes

tests/fault_tolerance/cancellation/test_sglang.py Outdated Show resolved Hide resolved

tests/fault_tolerance/migration/test_sglang.py Outdated Show resolved Hide resolved

copy-pr-bot bot had a problem deploying to GITLAB December 10, 2025 03:51 Failure

nnshah1 reviewed Dec 10, 2025

View reviewed changes

tests/fault_tolerance/cancellation/test_sglang.py Show resolved Hide resolved

nnshah1 reviewed Dec 10, 2025

View reviewed changes

tests/fault_tolerance/cancellation/test_sglang.py Outdated Show resolved Hide resolved

nnshah1 reviewed Dec 10, 2025

View reviewed changes

tests/fault_tolerance/cancellation/test_sglang.py Outdated Show resolved Hide resolved

nnshah1 reviewed Dec 10, 2025

View reviewed changes

tests/fault_tolerance/cancellation/test_sglang.py Outdated Show resolved Hide resolved

copy-pr-bot bot had a problem deploying to GITLAB December 10, 2025 05:50 Failure

copy-pr-bot bot had a problem deploying to GITLAB December 10, 2025 07:02 Failure

copy-pr-bot bot had a problem deploying to GITLAB December 10, 2025 08:38 Failure

kthui reviewed Dec 10, 2025

View reviewed changes

tests/utils/port_utils.py Outdated Show resolved Hide resolved

copy-pr-bot bot temporarily deployed to GITLAB December 10, 2025 23:43 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 10, 2025 23:44 Inactive

kthui reviewed Dec 11, 2025

View reviewed changes

tests/utils/port_utils.py Show resolved Hide resolved

kthui approved these changes Dec 11, 2025

View reviewed changes

keivenchang added 2 commits December 11, 2025 02:36

Use runtime_services_dynamic_ports in fault tolerance tests

338bbc1

Signed-off-by: Keiven Chang <[email protected]>

keivenchang force-pushed the keivenchang/python-tests-use-dynamo-port-part2 branch from fc4528c to 338bbc1 Compare December 11, 2025 03:59

copy-pr-bot bot had a problem deploying to GITLAB December 11, 2025 03:59 Failure

Fix port deallocation to only deallocate dynamically allocated ports

d3e6d7e

Signed-off-by: Keiven Chang <[email protected]>

copy-pr-bot bot had a problem deploying to GITLAB December 11, 2025 04:02 Failure

mark disaggregated_same_gpu as skip

9b0e0af

copy-pr-bot bot had a problem deploying to GITLAB December 11, 2025 05:42 Failure

keivenchang merged commit c6b440e into main Dec 11, 2025
29 of 30 checks passed

keivenchang deleted the keivenchang/python-tests-use-dynamo-port-part2 branch December 11, 2025 07:11

smatta-star pushed a commit to smatta-star/dynamo that referenced this pull request Dec 19, 2025

test: add dynamic port allocation for fault_tolerant test execution (a…

5ff2e13

…i-dynamo#4835) Signed-off-by: Keiven Chang <[email protected]> Co-authored-by: Keiven Chang <[email protected]>

test: add dynamic port allocation for fault_tolerant test execution #4835

test: add dynamic port allocation for fault_tolerant test execution #4835

Uh oh!

Conversation

keivenchang commented Dec 10, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues:

Summary by CodeRabbit

Tests

Uh oh!

kthui left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kthui commented Dec 10, 2025

Uh oh!

coderabbitai bot commented Dec 10, 2025

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

keivenchang commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

keivenchang commented Dec 10, 2025 •

edited by coderabbitai bot

Loading

keivenchang commented Dec 10, 2025 •

edited

Loading