-
Notifications
You must be signed in to change notification settings - Fork 728
feat: add dynamic port allocation to prevent test conflicts (part 2) #4119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThis change introduces dynamic port configuration across launch scripts and test infrastructure. Shell scripts for sglang, trtllm, and vllm backends now parameterize frontend and worker ports using environment variables (DYN_FRONTEND_PORT, DYN_SYSTEM_PORT) with defaults. Test utilities and a new port allocation module enable explicit port passing for test execution. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes
Areas requiring extra attention:
Poem
Pre-merge checks✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (39)
examples/backends/sglang/launch/agg.sh(1 hunks)examples/backends/sglang/launch/agg_embed.sh(1 hunks)examples/backends/sglang/launch/agg_router.sh(1 hunks)examples/backends/sglang/launch/disagg.sh(1 hunks)examples/backends/sglang/launch/disagg_dp_attn.sh(1 hunks)examples/backends/sglang/launch/disagg_router.sh(1 hunks)examples/backends/sglang/launch/disagg_same_gpu.sh(2 hunks)examples/backends/sglang/launch/multimodal_agg.sh(1 hunks)examples/backends/sglang/launch/multimodal_disagg.sh(1 hunks)examples/backends/trtllm/launch/agg.sh(1 hunks)examples/backends/trtllm/launch/agg_metrics.sh(1 hunks)examples/backends/trtllm/launch/agg_router.sh(1 hunks)examples/backends/trtllm/launch/disagg.sh(1 hunks)examples/backends/trtllm/launch/disagg_router.sh(1 hunks)examples/backends/trtllm/launch/disagg_same_gpu.sh(2 hunks)examples/backends/trtllm/launch/epd_disagg.sh(1 hunks)examples/backends/trtllm/launch/gpt_oss_disagg.sh(1 hunks)examples/backends/trtllm/performance_sweeps/scripts/start_frontend.sh(1 hunks)examples/backends/vllm/launch/agg.sh(1 hunks)examples/backends/vllm/launch/agg_kvbm.sh(1 hunks)examples/backends/vllm/launch/agg_kvbm_router.sh(1 hunks)examples/backends/vllm/launch/agg_lmcache.sh(1 hunks)examples/backends/vllm/launch/agg_multimodal.sh(1 hunks)examples/backends/vllm/launch/agg_multimodal_llama.sh(1 hunks)examples/backends/vllm/launch/agg_router.sh(1 hunks)examples/backends/vllm/launch/dep.sh(1 hunks)examples/backends/vllm/launch/disagg.sh(1 hunks)examples/backends/vllm/launch/disagg_kvbm.sh(1 hunks)examples/backends/vllm/launch/disagg_kvbm_2p2d.sh(1 hunks)examples/backends/vllm/launch/disagg_kvbm_router.sh(1 hunks)examples/backends/vllm/launch/disagg_lmcache.sh(1 hunks)examples/backends/vllm/launch/disagg_router.sh(1 hunks)examples/backends/vllm/launch/dsr1_dep.sh(1 hunks)tests/fault_tolerance/cancellation/test_sglang.py(7 hunks)tests/fault_tolerance/cancellation/test_trtllm.py(12 hunks)tests/fault_tolerance/cancellation/test_vllm.py(8 hunks)tests/fault_tolerance/cancellation/utils.py(9 hunks)tests/serve/launch/template_verifier.sh(1 hunks)tests/utils/port_utils.py(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:620-656
Timestamp: 2025-09-24T17:45:54.735Z
Learning: In the fault tolerance test infrastructure for Dynamo, port-forwards created by the `port_forward` method in `ManagedDeployment` are intended to run in the background and should not be stopped immediately after use. They are designed for persistent connections that can be reused across multiple metrics collections during test execution.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-09-24T17:45:54.735Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:620-656
Timestamp: 2025-09-24T17:45:54.735Z
Learning: In the fault tolerance test infrastructure for Dynamo, port-forwards created by the `port_forward` method in `ManagedDeployment` are intended to run in the background and should not be stopped immediately after use. They are designed for persistent connections that can be reused across multiple metrics collections during test execution.
Applied to files:
tests/fault_tolerance/cancellation/test_vllm.pytests/fault_tolerance/cancellation/test_sglang.pytests/fault_tolerance/cancellation/utils.pytests/fault_tolerance/cancellation/test_trtllm.py
🧬 Code graph analysis (3)
tests/fault_tolerance/cancellation/test_vllm.py (2)
tests/utils/port_utils.py (1)
get_free_port(32-46)tests/fault_tolerance/cancellation/utils.py (2)
DynamoFrontendProcess(22-50)send_cancellable_request(244-270)
tests/fault_tolerance/cancellation/test_sglang.py (3)
tests/utils/port_utils.py (1)
get_free_port(32-46)tests/utils/managed_process.py (1)
ManagedProcess(71-568)tests/fault_tolerance/cancellation/utils.py (2)
DynamoFrontendProcess(22-50)send_cancellable_request(244-270)
tests/fault_tolerance/cancellation/test_trtllm.py (2)
tests/utils/port_utils.py (1)
get_free_port(32-46)tests/fault_tolerance/cancellation/utils.py (2)
DynamoFrontendProcess(22-50)send_cancellable_request(244-270)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: operator (amd64)
- GitHub Check: trtllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (35)
examples/backends/vllm/launch/disagg_router.sh (1)
18-18: ✓ Correct parameterization of frontend port.The change from hard-coded
8000to${DYN_FRONTEND_PORT:-8000}is correct and properly uses bash parameter expansion with a sensible default. This maintains backward compatibility while enabling dynamic port allocation for tests.examples/backends/vllm/launch/agg_multimodal_llama.sh (1)
11-11: ✓ Dynamic port allocation correctly implemented.The parameter expansion syntax is correct and maintains backward compatibility by defaulting to port 8000 if
DYN_FRONTEND_PORTis not set. This aligns with the PR's pattern for enabling dynamic port allocation across launch scripts.examples/backends/vllm/launch/disagg_kvbm_router.sh (1)
16-16: Dynamic port allocation properly implemented.The change correctly parameterizes the frontend HTTP port using the
DYN_FRONTEND_PORTenvironment variable with a sensible default fallback to8000. This enables the script to work in both isolated and dynamic port scenarios while maintaining backward compatibility.examples/backends/vllm/launch/dsr1_dep.sh (1)
86-86: Dynamic port allocation correctly implemented with proper fallback.The use of
${DYN_FRONTEND_PORT:-8000}properly enables dynamic port allocation for tests while maintaining backward compatibility. Bash variable expansion syntax is correct, and the fallback to the original hard-coded port (8000) ensures existing scripts continue to work unchanged.examples/backends/vllm/launch/disagg.sh (1)
8-8: Correct bash parameter expansion for dynamic port allocation.The change properly uses bash parameter expansion (
${DYN_FRONTEND_PORT:-8000}) to support dynamic frontend port allocation while maintaining backward compatibility with the hardcoded default of 8000. This aligns with the PR objectives.Verify that the vLLM worker processes on lines 11 and 13–16 can properly discover or connect to the frontend using the dynamically allocated port. If workers don't auto-discover the frontend port, ensure the
DYN_FRONTEND_PORTenvironment variable is propagated to them or passed explicitly to their invocations.examples/backends/vllm/launch/disagg_lmcache.sh (1)
8-8: Verify consistency of port configuration across backend launch scripts.The change correctly parameterizes the frontend port using
${DYN_FRONTEND_PORT:-8000}with proper backward compatibility. However, the PR objectives mention updating both frontend (DYN_FRONTEND_PORT) and system/worker ports (DYN_SYSTEM_PORT) across 33 backend launch scripts. This file updates the frontend port but does not explicitly configure worker ports (lines 11, 22).Please verify:
- Whether worker processes in this script (decode and prefill workers on lines 11 and 22) require explicit port configuration via environment variables (e.g., DYN_SYSTEM_PORT or similar).
- Whether this approach is consistent with how other backend launch scripts (sglang, trtllm) in this PR handle worker port parameterization.
If workers do require dynamic port configuration, apply the same parameterization pattern to worker launch commands.
examples/backends/trtllm/launch/disagg_same_gpu.sh (1)
52-52: Frontend port parameterization looks correct.The dynamic frontend port allocation via
${DYN_FRONTEND_PORT:-8000}is properly configured and maintains backward compatibility.examples/backends/vllm/launch/disagg_kvbm.sh (2)
8-8: Correct implementation of dynamic frontend port.The change correctly uses bash parameter expansion to read
DYN_FRONTEND_PORTwith a fallback to8000, maintaining backward compatibility while enabling dynamic port allocation for test scenarios.
12-12: Verify scope: are worker processes expected to use dynamic ports?The worker processes on lines 12 and 18-22 don't have explicit port configuration. Per the PR objectives, both
DYN_FRONTEND_PORTandDYN_SYSTEM_PORTshould be available across backend scripts. If these workers expose HTTP ports or are health-checked dynamically by tests, they may also needDYN_SYSTEM_PORThandling.Can you confirm whether the worker processes in vLLM backends require dynamic port configuration, or if they use port configuration through alternative mechanisms?
Also applies to: 18-22
examples/backends/sglang/launch/disagg.sh (1)
15-17: Frontend port parameterization looks good.Using
${DYN_FRONTEND_PORT:-8000}with a sensible default maintains backward compatibility while enabling dynamic port configuration. The change aligns with the PR objective and introduces no breaking changes.examples/backends/vllm/launch/agg_kvbm_router.sh (1)
14-17: Frontend port parameterization correct.The multi-line formatting with space-separated
--http-portmaintains the existing style while introducing environment variable control. Default of 8000 preserves prior behavior.examples/backends/vllm/launch/agg_multimodal.sh (1)
55-57: Frontend port parameterization correct.Environment variable with fallback to 8000 is properly applied. No impact on multimodal processor or worker configurations.
examples/backends/sglang/launch/agg_embed.sh (1)
15-17: Frontend port parameterization correct.Consistent parameterization with other backend scripts. Default of 8000 maintains prior behavior while enabling dynamic allocation.
examples/backends/vllm/launch/agg_kvbm.sh (1)
7-9: Frontend port parameterization correct.Environment variable substitution with default maintains backward compatibility. KVBM worker configuration unaffected.
examples/backends/vllm/launch/agg_lmcache.sh (1)
7-9: Frontend port parameterization correct.Standard environment variable pattern with fallback. LMCache-specific configuration remains unchanged.
examples/backends/vllm/launch/disagg_kvbm_2p2d.sh (1)
7-9: Frontend port parameterization correct.Environment variable substitution properly integrated with router mode flag. Worker KVBM and GPU configurations unaffected.
examples/backends/sglang/launch/multimodal_agg.sh (1)
62-64: Frontend port parameterization correct.Standard environment variable pattern applied consistently. Multimodal processor and worker configurations remain unaffected.
examples/backends/sglang/launch/disagg_dp_attn.sh (1)
16-16: Dynamic port configuration aligns with PR objectives.The change correctly parametrizes the frontend port using idiomatic bash parameter expansion with sensible defaults. This enables tests to allocate non-conflicting ports dynamically while maintaining backward compatibility.
examples/backends/sglang/launch/disagg_router.sh (1)
15-19: Port parametrization preserves multi-line command structure.The frontend port is correctly extracted from the environment with appropriate fallback. The line continuation and formatting are maintained properly.
examples/backends/sglang/launch/agg_router.sh (1)
15-19: LGTM.Port configuration correctly uses environment variable with fallback. Command structure is preserved.
examples/backends/trtllm/performance_sweeps/scripts/start_frontend.sh (1)
22-24: Dynamic port configuration works correctly.The environment variable fallback is properly applied. The space before the parameter expansion (e.g.,
--http-port ${VAR}) is valid bash, though note that other launch scripts use--http-port=${VAR}without spacing for consistency.examples/backends/trtllm/launch/agg_metrics.sh (1)
22-26: Both frontend and system ports correctly parametrized.Environment variables with sensible defaults (8000, 8081) enable dynamic port allocation. The inline environment variable assignment for DYN_SYSTEM_PORT is idiomatic and correctly structured for the subsequent python command.
examples/backends/vllm/launch/agg_router.sh (1)
15-19: LGTM.Port is correctly extracted from environment variable with proper fallback. Line continuation is properly formatted.
examples/backends/vllm/launch/agg.sh (1)
8-12: Dynamic port allocation correctly configured for both frontend and system.Both ports are properly parametrized with environment variables and sensible defaults. Inline environment variable assignment for DYN_SYSTEM_PORT is correctly positioned before the python command.
examples/backends/sglang/launch/agg.sh (1)
16-20: Dynamic port allocation correctly implemented for both frontend and system.Environment variables with defaults (8000, 8081) are properly configured. Inline environment variable assignment for DYN_SYSTEM_PORT maintains command structure.
examples/backends/vllm/launch/dep.sh (1)
8-8: LGTM!The frontend port parameterization is correct and maintains backward compatibility.
examples/backends/sglang/launch/disagg_same_gpu.sh (2)
40-40: LGTM!Frontend port parameterization is correct with sensible default.
44-44: LGTM!Worker port parameterization using numbered environment variables (DYN_SYSTEM_PORT1, DYN_SYSTEM_PORT2) is appropriate for disaggregated multi-worker setup. Syntax is correct.
Also applies to: 74-74
examples/backends/sglang/launch/multimodal_disagg.sh (1)
63-63: LGTM!Frontend port parameterization correctly applied with sensible default.
examples/backends/trtllm/launch/disagg_router.sh (1)
26-26: LGTM!Frontend port parameterization is correct. Note: space syntax (
--http-port ${...}) is functionally equivalent to equals syntax used in other backends.examples/backends/trtllm/launch/disagg.sh (1)
29-29: LGTM!Frontend port parameterization correctly applied with default 8000.
examples/backends/trtllm/launch/agg_router.sh (1)
22-22: LGTM!Frontend port parameterization correctly applied for aggregated mode.
examples/backends/trtllm/launch/gpt_oss_disagg.sh (1)
18-18: LGTM!Frontend port parameterization correctly applied with sensible default.
examples/backends/trtllm/launch/agg.sh (1)
25-25: LGTM!Frontend port parameterization correctly applied for aggregated mode with sensible default.
tests/fault_tolerance/cancellation/utils.py (1)
286-295: Nice guardrail on streaming response handlingAppreciate the explicit None/status checks before iterating; this should surface transient startup issues with a clear pytest failure instead of a cryptic attribute error.
b0a5e58 to
a4ce182
Compare
8f7a4b2 to
dc410c8
Compare
f15a967 to
43a2925
Compare
1d96991 to
4b41801
Compare
4b41801 to
2d58074
Compare
2d58074 to
24bf929
Compare
Move get_pid() to ManagedProcess base class Signed-off-by: Keiven Chang <[email protected]>
24bf929 to
a01696b
Compare
Overview:
Enable parallel test execution in CI with pytest-xdist. Router and frontend tests now run with auto-detected workers, reducing execution time from 336s to 75s (4.5x speedup). Implements dynamic port allocation to prevent conflicts between parallel workers.
Details:
_test_router_indexers_sync,check_nats_consumers)runtime_servicesfixture toconftest.pyfor reuse across test modulestests/utils/port_utils.pyfor dynamic port allocation with file-based lockingcontainer-validation-dynamo.ymlto use-n autoinstead of-n 4Where should the reviewer start?
tests/utils/port_utils.py- dynamic port allocation mechanismtests/conftest.py- shared fixtures for NATS/Etcd servicestests/router/common.py- port parameter updates in test helpers.github/workflows/container-validation-dynamo.yml- CI configuration changeRelated Issues:
DIS-1022
/coderabbit profile chill