Skip to content

Conversation

@keivenchang
Copy link
Contributor

@keivenchang keivenchang commented Nov 5, 2025

Overview:

Enable parallel test execution in CI with pytest-xdist. Router and frontend tests now run with auto-detected workers, reducing execution time from 336s to 75s (4.5x speedup). Implements dynamic port allocation to prevent conflicts between parallel workers.

Details:

  • Fix hardcoded NATS ports in router test helpers (_test_router_indexers_sync, check_nats_consumers)
  • Refactor runtime_services fixture to conftest.py for reuse across test modules
  • Create tests/utils/port_utils.py for dynamic port allocation with file-based locking
  • Update container-validation-dynamo.yml to use -n auto instead of -n 4

Where should the reviewer start?

  • tests/utils/port_utils.py - dynamic port allocation mechanism
  • tests/conftest.py - shared fixtures for NATS/Etcd services
  • tests/router/common.py - port parameter updates in test helpers
  • .github/workflows/container-validation-dynamo.yml - CI configuration change

Related Issues:

DIS-1022

/coderabbit profile chill

@keivenchang keivenchang self-assigned this Nov 5, 2025
@keivenchang keivenchang requested review from a team as code owners November 5, 2025 01:16
@github-actions github-actions bot added the feat label Nov 5, 2025
@keivenchang keivenchang requested a review from kthui November 5, 2025 01:16
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 5, 2025

Walkthrough

This change introduces dynamic port configuration across launch scripts and test infrastructure. Shell scripts for sglang, trtllm, and vllm backends now parameterize frontend and worker ports using environment variables (DYN_FRONTEND_PORT, DYN_SYSTEM_PORT) with defaults. Test utilities and a new port allocation module enable explicit port passing for test execution.

Changes

Cohort / File(s) Summary
sglang Launch Scripts
examples/backends/sglang/launch/agg.sh, agg_embed.sh, agg_router.sh, disagg.sh, disagg_dp_attn.sh, disagg_router.sh, disagg_same_gpu.sh, multimodal_agg.sh, multimodal_disagg.sh
Replace fixed frontend port 8000 with ${DYN_FRONTEND_PORT:-8000} and fixed worker ports with ${DYN_SYSTEM_PORT*:-port} environment variables
trtllm Launch Scripts
examples/backends/trtllm/launch/agg.sh, agg_metrics.sh, agg_router.sh, disagg.sh, disagg_router.sh, disagg_same_gpu.sh, epd_disagg.sh, gpt_oss_disagg.sh, performance_sweeps/scripts/start_frontend.sh
Parameterize frontend and worker ports using ${DYN_FRONTEND_PORT:-8000} and ${DYN_SYSTEM_PORT*:-port} environment variables
vllm Launch Scripts
examples/backends/vllm/launch/agg.sh, agg_kvbm.sh, agg_kvbm_router.sh, agg_lmcache.sh, agg_multimodal.sh, agg_multimodal_llama.sh, agg_router.sh, dep.sh, disagg.sh, disagg_kvbm.sh, disagg_kvbm_2p2d.sh, disagg_kvbm_router.sh, disagg_lmcache.sh, disagg_router.sh, dsr1_dep.sh
Replace hardcoded frontend port 8000 with ${DYN_FRONTEND_PORT:-8000} parameter across all launch configurations
Test Cancellation Utilities
tests/fault_tolerance/cancellation/utils.py
Expand DynamoFrontendProcess, send_completion_request, send_chat_completion_request, and send_cancellable_request signatures to accept explicit frontend_port parameter; update all URL construction to use provided port instead of hardcoded constant
Test Cancellation - sglang
tests/fault_tolerance/cancellation/test_sglang.py
Refactor DynamoWorkerProcess constructor to require system_port and frontend_port; update health checks and environment setup to use dynamic ports; adjust all test calls to pass allocated ports
Test Cancellation - trtllm
tests/fault_tolerance/cancellation/test_trtllm.py
Expand DynamoWorkerProcess to accept system_port and frontend_port; add get_pid() and is_ready() public methods; update health URL construction and test orchestration to use dynamic ports
Test Cancellation - vllm
tests/fault_tolerance/cancellation/test_vllm.py
Refactor DynamoWorkerProcess and DynamoFrontendProcess to require explicit ports; remove hardcoded port logic; update all test setups to allocate and pass frontend_port and system_port
Test Launch Verification
tests/serve/launch/template_verifier.sh
Replace hardcoded frontend port 8000 with ${DYN_FRONTEND_PORT:-8000} environment variable
Port Allocation Utility
tests/utils/port_utils.py
New module providing get_free_port(), get_free_ports(count), set_frontend_port_env(), set_worker_ports_env(), and set_dynamic_ports_env() functions for test port management

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

  • Shell scripts (38+ files): Highly repetitive parameterization pattern applied uniformly across backends; low individual complexity but large aggregate scope
  • Test utilities refactoring: More substantive changes with signature modifications to core test helper functions; requires verification of port passing consistency across multiple test scenarios
  • New utility module: Straightforward socket-based port allocation with sensible defaults; moderate complexity in concurrent port handling

Areas requiring extra attention:

  • tests/fault_tolerance/cancellation/utils.py — Verify all function signatures have been consistently updated and all internal URL constructions reference the new frontend_port parameter
  • tests/fault_tolerance/cancellation/test_sglang.py, test_trtllm.py, test_vllm.py — Confirm all test cases correctly pass allocated ports to worker and frontend process constructors, and that health check endpoints reference the correct dynamic ports
  • tests/utils/port_utils.py — Validate port allocation logic handles race conditions between socket closure and port binding in test execution

Poem

🐰 Ports once fixed, now dance and flow,
Environment whispers where they'll go,
Dynamic config, flexible and bright,
Tests allocate freely through the night,
Launch scripts hop with defaults true,
Eight thousand, eighty-one, eighty-two— 🏃‍♂️✨

Pre-merge checks

✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: introducing dynamic port allocation for tests to prevent conflicts.
Description check ✅ Passed PR description provides detailed overview, specific changes, and clear reviewer entry points, matching the required template structure well.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5528f3b and de716ff.

📒 Files selected for processing (39)
  • examples/backends/sglang/launch/agg.sh (1 hunks)
  • examples/backends/sglang/launch/agg_embed.sh (1 hunks)
  • examples/backends/sglang/launch/agg_router.sh (1 hunks)
  • examples/backends/sglang/launch/disagg.sh (1 hunks)
  • examples/backends/sglang/launch/disagg_dp_attn.sh (1 hunks)
  • examples/backends/sglang/launch/disagg_router.sh (1 hunks)
  • examples/backends/sglang/launch/disagg_same_gpu.sh (2 hunks)
  • examples/backends/sglang/launch/multimodal_agg.sh (1 hunks)
  • examples/backends/sglang/launch/multimodal_disagg.sh (1 hunks)
  • examples/backends/trtllm/launch/agg.sh (1 hunks)
  • examples/backends/trtllm/launch/agg_metrics.sh (1 hunks)
  • examples/backends/trtllm/launch/agg_router.sh (1 hunks)
  • examples/backends/trtllm/launch/disagg.sh (1 hunks)
  • examples/backends/trtllm/launch/disagg_router.sh (1 hunks)
  • examples/backends/trtllm/launch/disagg_same_gpu.sh (2 hunks)
  • examples/backends/trtllm/launch/epd_disagg.sh (1 hunks)
  • examples/backends/trtllm/launch/gpt_oss_disagg.sh (1 hunks)
  • examples/backends/trtllm/performance_sweeps/scripts/start_frontend.sh (1 hunks)
  • examples/backends/vllm/launch/agg.sh (1 hunks)
  • examples/backends/vllm/launch/agg_kvbm.sh (1 hunks)
  • examples/backends/vllm/launch/agg_kvbm_router.sh (1 hunks)
  • examples/backends/vllm/launch/agg_lmcache.sh (1 hunks)
  • examples/backends/vllm/launch/agg_multimodal.sh (1 hunks)
  • examples/backends/vllm/launch/agg_multimodal_llama.sh (1 hunks)
  • examples/backends/vllm/launch/agg_router.sh (1 hunks)
  • examples/backends/vllm/launch/dep.sh (1 hunks)
  • examples/backends/vllm/launch/disagg.sh (1 hunks)
  • examples/backends/vllm/launch/disagg_kvbm.sh (1 hunks)
  • examples/backends/vllm/launch/disagg_kvbm_2p2d.sh (1 hunks)
  • examples/backends/vllm/launch/disagg_kvbm_router.sh (1 hunks)
  • examples/backends/vllm/launch/disagg_lmcache.sh (1 hunks)
  • examples/backends/vllm/launch/disagg_router.sh (1 hunks)
  • examples/backends/vllm/launch/dsr1_dep.sh (1 hunks)
  • tests/fault_tolerance/cancellation/test_sglang.py (7 hunks)
  • tests/fault_tolerance/cancellation/test_trtllm.py (12 hunks)
  • tests/fault_tolerance/cancellation/test_vllm.py (8 hunks)
  • tests/fault_tolerance/cancellation/utils.py (9 hunks)
  • tests/serve/launch/template_verifier.sh (1 hunks)
  • tests/utils/port_utils.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:620-656
Timestamp: 2025-09-24T17:45:54.735Z
Learning: In the fault tolerance test infrastructure for Dynamo, port-forwards created by the `port_forward` method in `ManagedDeployment` are intended to run in the background and should not be stopped immediately after use. They are designed for persistent connections that can be reused across multiple metrics collections during test execution.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-09-24T17:45:54.735Z
Learnt from: tzulingk
Repo: ai-dynamo/dynamo PR: 3194
File: tests/utils/managed_deployment.py:620-656
Timestamp: 2025-09-24T17:45:54.735Z
Learning: In the fault tolerance test infrastructure for Dynamo, port-forwards created by the `port_forward` method in `ManagedDeployment` are intended to run in the background and should not be stopped immediately after use. They are designed for persistent connections that can be reused across multiple metrics collections during test execution.

Applied to files:

  • tests/fault_tolerance/cancellation/test_vllm.py
  • tests/fault_tolerance/cancellation/test_sglang.py
  • tests/fault_tolerance/cancellation/utils.py
  • tests/fault_tolerance/cancellation/test_trtllm.py
🧬 Code graph analysis (3)
tests/fault_tolerance/cancellation/test_vllm.py (2)
tests/utils/port_utils.py (1)
  • get_free_port (32-46)
tests/fault_tolerance/cancellation/utils.py (2)
  • DynamoFrontendProcess (22-50)
  • send_cancellable_request (244-270)
tests/fault_tolerance/cancellation/test_sglang.py (3)
tests/utils/port_utils.py (1)
  • get_free_port (32-46)
tests/utils/managed_process.py (1)
  • ManagedProcess (71-568)
tests/fault_tolerance/cancellation/utils.py (2)
  • DynamoFrontendProcess (22-50)
  • send_cancellable_request (244-270)
tests/fault_tolerance/cancellation/test_trtllm.py (2)
tests/utils/port_utils.py (1)
  • get_free_port (32-46)
tests/fault_tolerance/cancellation/utils.py (2)
  • DynamoFrontendProcess (22-50)
  • send_cancellable_request (244-270)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: operator (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (35)
examples/backends/vllm/launch/disagg_router.sh (1)

18-18: ✓ Correct parameterization of frontend port.

The change from hard-coded 8000 to ${DYN_FRONTEND_PORT:-8000} is correct and properly uses bash parameter expansion with a sensible default. This maintains backward compatibility while enabling dynamic port allocation for tests.

examples/backends/vllm/launch/agg_multimodal_llama.sh (1)

11-11: ✓ Dynamic port allocation correctly implemented.

The parameter expansion syntax is correct and maintains backward compatibility by defaulting to port 8000 if DYN_FRONTEND_PORT is not set. This aligns with the PR's pattern for enabling dynamic port allocation across launch scripts.

examples/backends/vllm/launch/disagg_kvbm_router.sh (1)

16-16: Dynamic port allocation properly implemented.

The change correctly parameterizes the frontend HTTP port using the DYN_FRONTEND_PORT environment variable with a sensible default fallback to 8000. This enables the script to work in both isolated and dynamic port scenarios while maintaining backward compatibility.

examples/backends/vllm/launch/dsr1_dep.sh (1)

86-86: Dynamic port allocation correctly implemented with proper fallback.

The use of ${DYN_FRONTEND_PORT:-8000} properly enables dynamic port allocation for tests while maintaining backward compatibility. Bash variable expansion syntax is correct, and the fallback to the original hard-coded port (8000) ensures existing scripts continue to work unchanged.

examples/backends/vllm/launch/disagg.sh (1)

8-8: Correct bash parameter expansion for dynamic port allocation.

The change properly uses bash parameter expansion (${DYN_FRONTEND_PORT:-8000}) to support dynamic frontend port allocation while maintaining backward compatibility with the hardcoded default of 8000. This aligns with the PR objectives.

Verify that the vLLM worker processes on lines 11 and 13–16 can properly discover or connect to the frontend using the dynamically allocated port. If workers don't auto-discover the frontend port, ensure the DYN_FRONTEND_PORT environment variable is propagated to them or passed explicitly to their invocations.

examples/backends/vllm/launch/disagg_lmcache.sh (1)

8-8: Verify consistency of port configuration across backend launch scripts.

The change correctly parameterizes the frontend port using ${DYN_FRONTEND_PORT:-8000} with proper backward compatibility. However, the PR objectives mention updating both frontend (DYN_FRONTEND_PORT) and system/worker ports (DYN_SYSTEM_PORT) across 33 backend launch scripts. This file updates the frontend port but does not explicitly configure worker ports (lines 11, 22).

Please verify:

  1. Whether worker processes in this script (decode and prefill workers on lines 11 and 22) require explicit port configuration via environment variables (e.g., DYN_SYSTEM_PORT or similar).
  2. Whether this approach is consistent with how other backend launch scripts (sglang, trtllm) in this PR handle worker port parameterization.

If workers do require dynamic port configuration, apply the same parameterization pattern to worker launch commands.

examples/backends/trtllm/launch/disagg_same_gpu.sh (1)

52-52: Frontend port parameterization looks correct.

The dynamic frontend port allocation via ${DYN_FRONTEND_PORT:-8000} is properly configured and maintains backward compatibility.

examples/backends/vllm/launch/disagg_kvbm.sh (2)

8-8: Correct implementation of dynamic frontend port.

The change correctly uses bash parameter expansion to read DYN_FRONTEND_PORT with a fallback to 8000, maintaining backward compatibility while enabling dynamic port allocation for test scenarios.


12-12: Verify scope: are worker processes expected to use dynamic ports?

The worker processes on lines 12 and 18-22 don't have explicit port configuration. Per the PR objectives, both DYN_FRONTEND_PORT and DYN_SYSTEM_PORT should be available across backend scripts. If these workers expose HTTP ports or are health-checked dynamically by tests, they may also need DYN_SYSTEM_PORT handling.

Can you confirm whether the worker processes in vLLM backends require dynamic port configuration, or if they use port configuration through alternative mechanisms?

Also applies to: 18-22

examples/backends/sglang/launch/disagg.sh (1)

15-17: Frontend port parameterization looks good.

Using ${DYN_FRONTEND_PORT:-8000} with a sensible default maintains backward compatibility while enabling dynamic port configuration. The change aligns with the PR objective and introduces no breaking changes.

examples/backends/vllm/launch/agg_kvbm_router.sh (1)

14-17: Frontend port parameterization correct.

The multi-line formatting with space-separated --http-port maintains the existing style while introducing environment variable control. Default of 8000 preserves prior behavior.

examples/backends/vllm/launch/agg_multimodal.sh (1)

55-57: Frontend port parameterization correct.

Environment variable with fallback to 8000 is properly applied. No impact on multimodal processor or worker configurations.

examples/backends/sglang/launch/agg_embed.sh (1)

15-17: Frontend port parameterization correct.

Consistent parameterization with other backend scripts. Default of 8000 maintains prior behavior while enabling dynamic allocation.

examples/backends/vllm/launch/agg_kvbm.sh (1)

7-9: Frontend port parameterization correct.

Environment variable substitution with default maintains backward compatibility. KVBM worker configuration unaffected.

examples/backends/vllm/launch/agg_lmcache.sh (1)

7-9: Frontend port parameterization correct.

Standard environment variable pattern with fallback. LMCache-specific configuration remains unchanged.

examples/backends/vllm/launch/disagg_kvbm_2p2d.sh (1)

7-9: Frontend port parameterization correct.

Environment variable substitution properly integrated with router mode flag. Worker KVBM and GPU configurations unaffected.

examples/backends/sglang/launch/multimodal_agg.sh (1)

62-64: Frontend port parameterization correct.

Standard environment variable pattern applied consistently. Multimodal processor and worker configurations remain unaffected.

examples/backends/sglang/launch/disagg_dp_attn.sh (1)

16-16: Dynamic port configuration aligns with PR objectives.

The change correctly parametrizes the frontend port using idiomatic bash parameter expansion with sensible defaults. This enables tests to allocate non-conflicting ports dynamically while maintaining backward compatibility.

examples/backends/sglang/launch/disagg_router.sh (1)

15-19: Port parametrization preserves multi-line command structure.

The frontend port is correctly extracted from the environment with appropriate fallback. The line continuation and formatting are maintained properly.

examples/backends/sglang/launch/agg_router.sh (1)

15-19: LGTM.

Port configuration correctly uses environment variable with fallback. Command structure is preserved.

examples/backends/trtllm/performance_sweeps/scripts/start_frontend.sh (1)

22-24: Dynamic port configuration works correctly.

The environment variable fallback is properly applied. The space before the parameter expansion (e.g., --http-port ${VAR}) is valid bash, though note that other launch scripts use --http-port=${VAR} without spacing for consistency.

examples/backends/trtllm/launch/agg_metrics.sh (1)

22-26: Both frontend and system ports correctly parametrized.

Environment variables with sensible defaults (8000, 8081) enable dynamic port allocation. The inline environment variable assignment for DYN_SYSTEM_PORT is idiomatic and correctly structured for the subsequent python command.

examples/backends/vllm/launch/agg_router.sh (1)

15-19: LGTM.

Port is correctly extracted from environment variable with proper fallback. Line continuation is properly formatted.

examples/backends/vllm/launch/agg.sh (1)

8-12: Dynamic port allocation correctly configured for both frontend and system.

Both ports are properly parametrized with environment variables and sensible defaults. Inline environment variable assignment for DYN_SYSTEM_PORT is correctly positioned before the python command.

examples/backends/sglang/launch/agg.sh (1)

16-20: Dynamic port allocation correctly implemented for both frontend and system.

Environment variables with defaults (8000, 8081) are properly configured. Inline environment variable assignment for DYN_SYSTEM_PORT maintains command structure.

examples/backends/vllm/launch/dep.sh (1)

8-8: LGTM!

The frontend port parameterization is correct and maintains backward compatibility.

examples/backends/sglang/launch/disagg_same_gpu.sh (2)

40-40: LGTM!

Frontend port parameterization is correct with sensible default.


44-44: LGTM!

Worker port parameterization using numbered environment variables (DYN_SYSTEM_PORT1, DYN_SYSTEM_PORT2) is appropriate for disaggregated multi-worker setup. Syntax is correct.

Also applies to: 74-74

examples/backends/sglang/launch/multimodal_disagg.sh (1)

63-63: LGTM!

Frontend port parameterization correctly applied with sensible default.

examples/backends/trtllm/launch/disagg_router.sh (1)

26-26: LGTM!

Frontend port parameterization is correct. Note: space syntax (--http-port ${...}) is functionally equivalent to equals syntax used in other backends.

examples/backends/trtllm/launch/disagg.sh (1)

29-29: LGTM!

Frontend port parameterization correctly applied with default 8000.

examples/backends/trtllm/launch/agg_router.sh (1)

22-22: LGTM!

Frontend port parameterization correctly applied for aggregated mode.

examples/backends/trtllm/launch/gpt_oss_disagg.sh (1)

18-18: LGTM!

Frontend port parameterization correctly applied with sensible default.

examples/backends/trtllm/launch/agg.sh (1)

25-25: LGTM!

Frontend port parameterization correctly applied for aggregated mode with sensible default.

tests/fault_tolerance/cancellation/utils.py (1)

286-295: Nice guardrail on streaming response handling

Appreciate the explicit None/status checks before iterating; this should surface transient startup issues with a clear pytest failure instead of a cryptic attribute error.

@keivenchang keivenchang force-pushed the keivenchang/python-tests-to-use-available-port branch from f15a967 to 43a2925 Compare December 4, 2025 23:47
@keivenchang keivenchang force-pushed the keivenchang/python-tests-to-use-available-port branch from 1d96991 to 4b41801 Compare December 5, 2025 22:27
@keivenchang keivenchang force-pushed the keivenchang/python-tests-to-use-available-port branch from 4b41801 to 2d58074 Compare December 8, 2025 20:32
@keivenchang keivenchang force-pushed the keivenchang/python-tests-to-use-available-port branch from 2d58074 to 24bf929 Compare December 9, 2025 01:57
Move get_pid() to ManagedProcess base class

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang force-pushed the keivenchang/python-tests-to-use-available-port branch from 24bf929 to a01696b Compare December 9, 2025 07:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants