-
Notifications
You must be signed in to change notification settings - Fork 752
test: add dynamic port allocation for fault_tolerant test execution #4835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: add dynamic port allocation for fault_tolerant test execution #4835
Conversation
88570b6 to
909fb97
Compare
kthui
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall! This is a great enhancement to stop port conflicts between test cases.
Left a few minor comments, but the accidental 16384 -> 8192 prompt length changes would need to be fixed before merging.
|
FYI, all the Cancellation / Migration tests are not ran as a part of |
WalkthroughThis pull request introduces dynamic port allocation for parallel test execution by adding flock-based port utilities, updating core test fixtures and process management, and refactoring multiple test components to accept explicit port parameters instead of hardcoded values. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas requiring extra attention:
Poem
Pre-merge checks✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/fault_tolerance/cancellation/test_vllm.py (1)
96-97: Hardcoded ports are inconsistent with dynamic port allocation strategy used elsewhere in the test.
DYN_VLLM_KV_EVENT_PORT(20082) andVLLM_NIXL_SIDE_CHANNEL_PORT(5601) are hardcoded, whilefrontend_port,prefill_system_port, anddecode_system_portare dynamically allocated with comments stating "Allocate ports to avoid conflicts with parallel tests." This inconsistency means if parallel execution is enabled in the future (e.g., viapytest.mark.parallel), these hardcoded ports could cause conflicts between concurrent test runs. Align the port allocation strategy by making these ports dynamic as well, usingallocate_free_port()to match the pattern used for other ports.tests/fault_tolerance/migration/test_vllm.py (1)
129-144: Consider usingruntime_services_dynamic_portsfor consistency with TRT-LLM and SGLang migration tests.The TRT-LLM and SGLang migration tests use
runtime_services_dynamic_ports, which automatically sets NATS_SERVER and ETCD_ENDPOINTS environment variables for dynamic service discovery. The vLLM migration tests use the olderruntime_servicesfixture, which does not configure these environment variables. Since worker processes inherit os.environ and may depend on these variables for NATS/Etcd connectivity, updating vLLM to useruntime_services_dynamic_portswould align with the other frameworks and ensure consistent behavior across all migration tests.
♻️ Duplicate comments (3)
tests/utils/port_utils.py (1)
75-78: Validation uses1024instead of_PORT_MIN.The docstring and comments indicate the port range is 30000-32767, but the validation allows any port >= 1024. This is inconsistent with the documented i16 range constraint.
This was already flagged in a past review comment by kthui suggesting to use
_PORT_MINinstead of1024.tests/fault_tolerance/cancellation/utils.py (1)
271-275: max_tokens reduced from 16384 to 8192.Past review comments from kthui indicate that max_tokens values were intentionally set to 16384 for faster machines and should not be changed.
This was already flagged in past review comments. Verify if this reduction was intentional or if it should be reverted to 16384.
tests/fault_tolerance/cancellation/test_sglang.py (1)
309-312: Past review addressed: port allocation pattern.A previous reviewer suggested consolidating the three
allocate_free_portcalls into oneallocate_free_ports(3, 30000)call. This is a valid simplification if the team prefers it.
🧹 Nitpick comments (8)
tests/conftest.py (4)
303-319:returninsidefinallycan silently swallow exceptions.The
return super().__exit__(...)inside thefinallyblock will suppress any exception raised in thetryblock. Iffree_ports()raises an unexpected error that isn't caught (e.g., aKeyboardInterrupt), it will be silently discarded.Consider storing the result and returning after the
finallyblock completes:def __exit__(self, exc_type, exc_val, exc_tb): """Release allocated ports when server exits.""" ports_to_release = [] try: # Release allocated ports BEFORE calling parent __exit__ if hasattr(self, "port") and self.port is not None: ports_to_release.append(self.port) if hasattr(self, "peer_port") and self.peer_port is not None: ports_to_release.append(self.peer_port) if ports_to_release: free_ports(ports_to_release) except Exception as e: logging.warning(f"Failed to release EtcdServer port: {e}") - finally: - # Always call parent __exit__ to terminate the process - return super().__exit__(exc_type, exc_val, exc_tb) + # Always call parent __exit__ to terminate the process + return super().__exit__(exc_type, exc_val, exc_tb)The
finallyblock is unnecessary here since theexceptclause handles the only expected failure path, and moving the parent call outside ensures exceptions propagate correctly.
351-361: Samereturninsidefinallyissue asEtcdServer.__exit__.Apply the same fix here for consistency:
def __exit__(self, exc_type, exc_val, exc_tb): """Release allocated port when server exits.""" try: # Release allocated port BEFORE calling parent __exit__ if hasattr(self, "port") and self.port is not None: free_port(self.port) except Exception as e: logging.warning(f"Failed to release NatsServer port: {e}") - finally: - # Always call parent __exit__ to terminate the process - return super().__exit__(exc_type, exc_val, exc_tb) + # Always call parent __exit__ to terminate the process + return super().__exit__(exc_type, exc_val, exc_tb)
513-513: Redundant import ofos.
osis already imported at the top of the file (line 5). This local import is unnecessary.- import os - # Port cleanup is now handled in NatsServer and EtcdServer __exit__ methods
351-361: Port is released even when not dynamically allocated.When using the default port (e.g.,
NatsServer(request, port=4222)),free_port(4222)is called even though port 4222 was never registered in the allocation registry. While this is benign (the registry lookup is a no-op for unregistered ports), it's semantically cleaner to only free ports that were dynamically allocated.Consider tracking whether the port was dynamically allocated:
def __init__(self, request, port=4222, timeout=300): # Allocate a free port if port is None or 0 use_random_port = port == 0 if use_random_port: # Start from 4223 (nats-server default 4222 + 1) port = allocate_free_port(4223) self.port = port + self._dynamically_allocated = use_random_port ... def __exit__(self, exc_type, exc_val, exc_tb): """Release allocated port when server exits.""" try: - if hasattr(self, "port") and self.port is not None: + if getattr(self, "_dynamically_allocated", False) and self.port is not None: free_port(self.port) ...The same pattern could apply to
EtcdServer.tests/utils/port_utils.py (2)
122-129: Socket file descriptor may leak on bind failure.When
sock.bind()raisesOSError, the socket is not closed before continuing to the next iteration. While Python's garbage collector will eventually reclaim it, explicitly closing the socket ensures deterministic cleanup, especially during rapid port scanning.# Try to bind to verify it's actually free try: sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) - sock.bind(("", port)) - sock.close() - ports.append(port) - registry[str(port)] = time.time() + try: + sock.bind(("", port)) + ports.append(port) + registry[str(port)] = time.time() + finally: + sock.close() except OSError: continue
49-56: Consider documenting the stale allocation timeout.The 15-minute (
max_age=900.0) stale allocation cleanup is a reasonable default, but tests that run longer or crash unexpectedly may hit port exhaustion. Consider adding this detail to the module docstring or function docstring for visibility.tests/fault_tolerance/migration/test_vllm.py (1)
44-112: Missing port cleanup in__exit__— inconsistent with SGLang tests.The SGLang worker processes (
test_sglang.pyin both cancellation and migration) implement__exit__to release allocated ports viafree_port(). This class lacks that cleanup, which could cause port leaks if the allocation tracking relies on explicit release.If
allocate_free_porttracks allocated ports (via file lock or registry), add cleanup:+from tests.utils.port_utils import free_port + class DynamoWorkerProcess(ManagedProcess): ... + def __init__(self, ...): + ... + self.system_port = system_port + ... + + def __exit__(self, exc_type, exc_val, exc_tb): + """Release allocated port when worker exits.""" + if hasattr(self, "system_port") and self.system_port is not None: + try: + free_port(self.system_port) + except Exception: + pass + return super().__exit__(exc_type, exc_val, exc_tb)Alternatively, if port cleanup is handled elsewhere (e.g., fixture teardown), this can be skipped — but consistency across test files would improve maintainability.
tests/fault_tolerance/migration/test_trtllm.py (1)
44-107: Missing port cleanup in__exit__— same issue as vLLM tests.For consistency with SGLang tests that release ports in
__exit__, consider adding cleanup here as well.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
tests/conftest.py(3 hunks)tests/fault_tolerance/cancellation/test_sglang.py(11 hunks)tests/fault_tolerance/cancellation/test_trtllm.py(14 hunks)tests/fault_tolerance/cancellation/test_vllm.py(11 hunks)tests/fault_tolerance/cancellation/utils.py(9 hunks)tests/fault_tolerance/migration/test_sglang.py(11 hunks)tests/fault_tolerance/migration/test_trtllm.py(10 hunks)tests/fault_tolerance/migration/test_vllm.py(8 hunks)tests/fault_tolerance/migration/utils.py(5 hunks)tests/utils/managed_process.py(1 hunks)tests/utils/port_utils.py(1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:620-656
Timestamp: 2025-09-24T17:45:54.735Z
Learning: In the fault tolerance test infrastructure for Dynamo, port-forwards created by the `port_forward` method in `ManagedDeployment` are intended to run in the background and should not be stopped immediately after use. They are designed for persistent connections that can be reused across multiple metrics collections during test execution.
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-09-24T17:45:54.735Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:620-656
Timestamp: 2025-09-24T17:45:54.735Z
Learning: In the fault tolerance test infrastructure for Dynamo, port-forwards created by the `port_forward` method in `ManagedDeployment` are intended to run in the background and should not be stopped immediately after use. They are designed for persistent connections that can be reused across multiple metrics collections during test execution.
Applied to files:
tests/fault_tolerance/cancellation/utils.pytests/fault_tolerance/cancellation/test_trtllm.pytests/fault_tolerance/migration/test_trtllm.pytests/fault_tolerance/migration/test_sglang.pytests/fault_tolerance/cancellation/test_sglang.pytests/fault_tolerance/migration/utils.pytests/fault_tolerance/migration/test_vllm.pytests/fault_tolerance/cancellation/test_vllm.py
📚 Learning: 2025-10-03T01:53:15.023Z
Learnt from: kthui
Repo: ai-dynamo/dynamo PR: 3391
File: tests/fault_tolerance/cancellation/utils.py:323-390
Timestamp: 2025-10-03T01:53:15.023Z
Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.
Applied to files:
tests/fault_tolerance/cancellation/test_trtllm.py
📚 Learning: 2025-09-24T17:26:17.225Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/fault_tolerance/deploy/client.py:196-216
Timestamp: 2025-09-24T17:26:17.225Z
Learning: In tests/fault_tolerance/deploy/client.py, when no pods are ready, the port defaults to 0, creating URL "http://localhost:0/{endpoint}". The requests.post() call will raise ConnectionError or ConnectTimeout, which are caught by requests.RequestException in _single_request function.
Applied to files:
tests/fault_tolerance/cancellation/test_trtllm.pytests/fault_tolerance/cancellation/test_sglang.pytests/fault_tolerance/migration/utils.pytests/fault_tolerance/cancellation/test_vllm.py
🧬 Code graph analysis (6)
tests/fault_tolerance/cancellation/utils.py (2)
tests/fault_tolerance/test_vllm_health_check.py (1)
send_completion_request(128-158)tests/kvbm_integration/test_cuda_graph.py (1)
send_completion_request(111-141)
tests/conftest.py (1)
tests/utils/port_utils.py (4)
allocate_free_port(146-155)allocate_free_ports(59-143)free_port(191-197)free_ports(158-188)
tests/fault_tolerance/migration/test_trtllm.py (4)
tests/utils/port_utils.py (1)
allocate_free_port(146-155)tests/fault_tolerance/cancellation/test_vllm.py (1)
DynamoWorkerProcess(40-147)tests/conftest.py (3)
runtime_services_dynamic_ports(504-527)set_ucx_tls_no_mm(78-89)logger(190-201)tests/fault_tolerance/migration/utils.py (2)
DynamoFrontendProcess(19-60)start_completion_request(63-107)
tests/fault_tolerance/migration/utils.py (1)
tests/utils/managed_process.py (1)
ManagedProcess(71-579)
tests/fault_tolerance/migration/test_vllm.py (3)
tests/utils/managed_process.py (3)
ManagedProcess(71-579)DynamoFrontendProcess(582-610)get_pid(562-564)tests/utils/port_utils.py (1)
allocate_free_port(146-155)tests/fault_tolerance/migration/utils.py (2)
DynamoFrontendProcess(19-60)start_completion_request(63-107)
tests/fault_tolerance/cancellation/test_vllm.py (2)
tests/utils/port_utils.py (1)
allocate_free_port(146-155)tests/fault_tolerance/cancellation/utils.py (2)
DynamoFrontendProcess(22-57)send_cancellable_request(251-277)
🪛 Ruff (0.14.8)
tests/utils/port_utils.py
76-78: Avoid specifying long messages outside the exception class
(TRY003)
85-87: Avoid specifying long messages outside the exception class
(TRY003)
132-134: Avoid specifying long messages outside the exception class
(TRY003)
tests/fault_tolerance/cancellation/test_trtllm.py
154-154: Unused function argument: runtime_services_dynamic_ports
(ARG001)
245-245: Unused function argument: runtime_services_dynamic_ports
(ARG001)
340-340: Unused function argument: runtime_services_dynamic_ports
(ARG001)
447-447: Unused function argument: runtime_services_dynamic_ports
(ARG001)
tests/conftest.py
315-315: Do not catch blind exception: Exception
(BLE001)
319-319: return inside finally blocks cause exceptions to be silenced
(B012)
357-357: Do not catch blind exception: Exception
(BLE001)
361-361: return inside finally blocks cause exceptions to be silenced
(B012)
tests/fault_tolerance/migration/test_trtllm.py
126-126: Unused function argument: runtime_services_dynamic_ports
(ARG001)
126-126: Unused function argument: set_ucx_tls_no_mm
(ARG001)
190-190: Unused function argument: runtime_services_dynamic_ports
(ARG001)
190-190: Unused function argument: set_ucx_tls_no_mm
(ARG001)
258-258: Unused function argument: runtime_services_dynamic_ports
(ARG001)
258-258: Unused function argument: set_ucx_tls_no_mm
(ARG001)
340-340: Unused function argument: runtime_services_dynamic_ports
(ARG001)
340-340: Unused function argument: set_ucx_tls_no_mm
(ARG001)
tests/fault_tolerance/migration/test_sglang.py
121-121: Do not catch blind exception: Exception
(BLE001)
125-125: return inside finally blocks cause exceptions to be silenced
(B012)
144-144: Unused function argument: runtime_services_dynamic_ports
(ARG001)
144-144: Unused function argument: set_ucx_tls_no_mm
(ARG001)
209-209: Unused function argument: runtime_services_dynamic_ports
(ARG001)
209-209: Unused function argument: set_ucx_tls_no_mm
(ARG001)
277-277: Unused function argument: runtime_services_dynamic_ports
(ARG001)
277-277: Unused function argument: set_ucx_tls_no_mm
(ARG001)
360-360: Unused function argument: runtime_services_dynamic_ports
(ARG001)
360-360: Unused function argument: set_ucx_tls_no_mm
(ARG001)
tests/fault_tolerance/cancellation/test_sglang.py
162-162: Do not catch blind exception: Exception
(BLE001)
166-166: return inside finally blocks cause exceptions to be silenced
(B012)
189-189: Unused function argument: runtime_services_dynamic_ports
(ARG001)
292-292: Unused function argument: runtime_services_dynamic_ports
(ARG001)
🔇 Additional comments (22)
tests/utils/managed_process.py (1)
562-564: LGTM! Clean API addition for PID access.The implementation correctly handles the case where
self.procmay beNone, and moving this to the base class promotes code reuse across subclasses.tests/fault_tolerance/cancellation/test_vllm.py (2)
4-10: Good practice: Timing documentation.The timing annotations provide valuable context for understanding expected test durations and help identify regressions. This is a helpful pattern for fault tolerance tests.
168-170: Allocated ports are not explicitly released after test completion.The ports allocated via
allocate_free_port()(frontend_port, system_port) are registered in the port registry but never explicitly freed when the test completes. Cleanup relies on the 15-minute stale allocation timeout, which can lead to port exhaustion in parallel test runs.Consider wrapping port allocation and cleanup in a helper or using a try/finally block:
from tests.utils.port_utils import allocate_free_port, free_ports frontend_port = allocate_free_port(8100) system_port = allocate_free_port(9100) try: # ... test body ... finally: free_ports([frontend_port, system_port])Alternatively, create a pytest fixture that handles port allocation and cleanup automatically.
⛔ Skipped due to learnings
Learnt from: tzulingk Repo: ai-dynamo/dynamo PR: 3194 File: tests/utils/managed_deployment.py:620-656 Timestamp: 2025-09-24T17:45:54.735Z Learning: In the fault tolerance test infrastructure for Dynamo, port-forwards created by the `port_forward` method in `ManagedDeployment` are intended to run in the background and should not be stopped immediately after use. They are designed for persistent connections that can be reused across multiple metrics collections during test execution.Learnt from: kthui Repo: ai-dynamo/dynamo PR: 3193 File: tests/fault_tolerance/cancellation/test_trtllm.py:4-204 Timestamp: 2025-09-25T00:54:01.369Z Learning: The fault tolerance tests in tests/fault_tolerance/cancellation/ run in a controlled container environment where files written to /workspace are automatically cleaned up after test completion, and tests execute sequentially without concurrency concerns, so temporary file management for config files is not necessary.Learnt from: keivenchang Repo: ai-dynamo/dynamo PR: 4323 File: components/src/dynamo/vllm/main.py:197-205 Timestamp: 2025-11-14T01:09:35.244Z Learning: In components/src/dynamo/vllm/main.py, keivenchang considers temporary directory cleanup from tempfile.mkdtemp() to be LOW PRIORITY for production because containerized deployment patterns automatically clean up temp directories when containers are destroyed, mitigating resource leak concerns.Learnt from: kthui Repo: ai-dynamo/dynamo PR: 3391 File: tests/fault_tolerance/cancellation/utils.py:323-390 Timestamp: 2025-10-03T01:53:15.023Z Learning: In `tests/fault_tolerance/cancellation/utils.py`, the `poll_for_pattern` function's default `max_wait_ms` of 500ms is intentionally set to detect failures in cancellation signal propagation to TRT-LLM. This timeout covers only the time for the cancellation signal to reach TRT-LLM (not any generation routine), and if cancellation takes longer than 0.5s to propagate, it should be considered a test failure.tests/fault_tolerance/cancellation/utils.py (2)
22-58: LGTM: DynamoFrontendProcess port parameterization.The frontend process is correctly parameterized with the dynamic port, and
terminate_existing=Falseensures parallel tests don't interfere with each other.
293-301: LGTM: Improved response validation.The explicit None check and status code validation before using the response is good defensive programming. The
cast()usage appropriately narrows the type after the validation checks.tests/fault_tolerance/migration/utils.py (3)
171-176: Explicitraise AssertionErrorafterpytest.fail()for type checker.The
raise AssertionError("Unreachable")statements afterpytest.fail()are necessary for mypy to understand that the function returnstuple[ManagedProcess, str]in all code paths, since mypy doesn't recognize thatpytest.fail()never returns.This addresses a past review comment suggesting removal, but the explicit raise is the correct approach for satisfying type checkers.
22-61: LGTM: DynamoFrontendProcess port parameterization.The implementation is consistent with the cancellation utils pattern. The
--router-mode round-robinflag is appropriate for migration tests that need deterministic request routing.
63-107: LGTM: start_completion_request port parameterization.The function correctly accepts and uses the
frontend_portparameter for constructing the request URL, enabling dynamic port allocation in tests.tests/fault_tolerance/migration/test_vllm.py (2)
4-11: Timing documentation is helpful for test maintenance.The standardized timing annotations provide useful context for test execution expectations and help identify performance regressions.
146-184: Dynamic port allocation and wiring looks correct.The port allocation pattern using
allocate_free_port(start_port)and passing ports toDynamoFrontendProcessandDynamoWorkerProcessis consistent with the PR objectives for test isolation.tests/fault_tolerance/migration/test_trtllm.py (2)
4-11: Timing documentation with skip annotations is clear.Good practice to note which tests are skipped in the timing summary.
124-185: Port allocation and wiring implemented correctly.The dynamic port allocation using
allocate_free_port()and explicit port passing to worker/frontend processes aligns with PR objectives. The use ofruntime_services_dynamic_portsfixture is correct for this test.tests/fault_tolerance/cancellation/test_sglang.py (2)
4-9: Clear timing documentation with GPU requirements noted.The timing annotations help understand resource requirements and expected execution time.
188-224: Test setup with dynamic ports looks correct.The port allocation and worker initialization pattern is clean and follows the established convention.
tests/fault_tolerance/migration/test_sglang.py (2)
4-11: Timing documentation includes skip status for unenabled tests.Clear documentation of which tests are skipped and total time for enabled tests.
142-203: Port allocation and test orchestration implemented correctly.Dynamic port allocation pattern is consistent with other tests. The use of both
runtime_services_dynamic_portsandset_ucx_tls_no_mmfixtures is appropriate for SGLang migration tests.tests/fault_tolerance/cancellation/test_trtllm.py (6)
4-11: Comprehensive timing documentation for multiple test scenarios.Good coverage of timing for aggregated, decode, prefill, and KV transfer cancellation tests.
42-131: Port wiring and health check logic implemented correctly.The
DynamoWorkerProcesscorrectly differentiates health checks based on mode:
prefill/decode: Uses system port health endpoint onlyprefill_and_decode: Uses frontend endpoints for models and healthThe port parameters are properly wired to environment variables and health check URLs.
152-240: Aggregated mode cancellation test properly allocates and uses dynamic ports.The test correctly allocates frontend and system ports, passes them to processes, and uses them for request dispatch.
243-335: Disaggregated decode cancellation test with proper port isolation.The test allocates separate ports for frontend, prefill worker, and decode worker, ensuring proper isolation. The worker startup order (frontend → prefill → decode) and health check wiring is correct.
338-438: Prefill cancellation test verifies decode worker never received the cancelled request.Good test design: validates that cancellation during prefill prevents the request from reaching the decode worker.
441-551: KV transfer cancellation test includes worker functionality verification after cancellation.Good practice to verify workers remain functional after a cancellation event (lines 538-548).
|
Updated:
|
Add port allocation utilities and update fault tolerance tests (vLLM, SGLang, TRT-LLM) to use dynamically allocated ports, enabling parallel test execution. - Add tests/utils/port_utils.py for dynamic port allocation - Update conftest.py: EtcdServer/NatsServer support dynamic ports - Move get_pid() to ManagedProcess base class - Update vLLM tests with dynamic port allocation and timing docs - Update SGLang tests with dynamic port allocation and timing docs - Update TRT-LLM tests with dynamic port allocation and timing docs Signed-off-by: Keiven Chang <[email protected]> Remove redundant comments and improve port allocation - Remove redundant 'Always call parent __exit__' comments - Fix return in finally blocks to avoid swallowing exceptions - Update port allocation to search entire valid range - Add validated timing benchmarks to port_utils.py Signed-off-by: Keiven Chang <[email protected]> Use runtime_services_dynamic_ports for vLLM tests Update vLLM cancellation and migration tests to use runtime_services_dynamic_ports fixture for fully dynamic NATS/Etcd ports, enabling true parallel test execution. - Restore max_tokens to 16384 in send_cancellable_request - Add TODO to etcd_ha/test_sglang.py for future port update Signed-off-by: Keiven Chang <[email protected]> Rename port functions and simplify cleanup logic Rename for clarity and consistency: - allocate_free_port -> allocate_port - allocate_free_ports -> allocate_ports - free_port -> deallocate_port - free_ports -> deallocate_ports Simplify __exit__ methods with comments explaining why checks are/aren't needed: - self.port always set in __init__ (no None check needed) - self.peer_port may be None (check required) - self.system_port required parameter (no None check needed) Signed-off-by: Keiven Chang <[email protected]> Move frontend port allocation into DynamoFrontendProcess Signed-off-by: Keiven Chang <[email protected]> Improve port allocation with random offset and caller tracking - Lower _PORT_MIN from 30000 to 1024 for wider port range - Add random offset (0-100) to start_port to reduce collision - Wrap around to _PORT_MIN instead of start_port - Reduce retries from full range scan to 100 attempts - Track caller info (file, function, line) in lock file for debugging - Handle both old and new registry formats for compatibility Signed-off-by: Keiven Chang <[email protected]>
Signed-off-by: Keiven Chang <[email protected]>
fc4528c to
338bbc1
Compare
Signed-off-by: Keiven Chang <[email protected]>
…i-dynamo#4835) Signed-off-by: Keiven Chang <[email protected]> Co-authored-by: Keiven Chang <[email protected]>
Overview:
Test isolation via dynamic port allocation. Tests now allocate unique ports at runtime instead of using hardcoded values, eliminating port conflicts to avoid port collisions (when previous tests don't terminate properly).
Details:
tests/utils/port_utils.pyfor thread-safe port allocation with file lockingconftest.py:EtcdServer/NatsServersupport dynamic ports (default 2379/4222, dynamic if port=None/0)Where should the reviewer start?
tests/utils/port_utils.py- Port allocation logic using File Locktests/conftest.py- Dynamic port support for EtcdServer/NatsServertests/fault_tolerance/cancellation/test_vllm.py- Example port allocation patternRelated Issues:
Relates to OPS-2436
/coderabbit profile chill
Summary by CodeRabbit
Tests
✏️ Tip: You can customize this high-level summary in your review settings.