Skip to content

[CI] Add ep tests to CI#1404

Merged
SumanthRH merged 5 commits intoNovaSky-AI:mainfrom
hao-aaron:ep-support
Mar 31, 2026
Merged

[CI] Add ep tests to CI#1404
SumanthRH merged 5 commits intoNovaSky-AI:mainfrom
hao-aaron:ep-support

Conversation

@hao-aaron
Copy link
Copy Markdown
Contributor

@hao-aaron hao-aaron commented Mar 28, 2026

modify ep tests to run with new inference, and add to skyrl train CI.

  • verify that ep tests pass with old and new inference on L4
  • run full Ci

Open with Devin

SumanthRH and others added 4 commits March 27, 2026 22:26
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@hao-aaron hao-aaron marked this pull request as ready for review March 28, 2026 01:40
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enables expert parallel (EP) inference support by adding a new CI test, updating the vLLM CLI argument builder to include the enable_expert_parallel flag, and refactoring EP inference tests to use the InferenceEngineState utility. The test suite now uses a smaller MoE model and ensures environment variables are correctly propagated during Ray initialization. A critical concern was raised regarding the commented-out connector cleanup logic in the remote inference client, which could lead to file descriptor leaks if the event loop changes.

Comment on lines +189 to +190
# TODO (aaron): fix cleanup
# _force_close_connector(self._session.connector)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Commenting out _force_close_connector could lead to file descriptor leaks if the event loop changes. The TODO indicates you're aware of this, but it's a potentially serious issue for a long-running service. Could you please address the cleanup now or explain why disabling this is safe? If it's causing issues in tests, perhaps there's a better way to handle session cleanup.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

@SumanthRH SumanthRH self-assigned this Mar 30, 2026
Comment on lines +189 to +190
# TODO (aaron): fix cleanup
# _force_close_connector(self._session.connector)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be reverted, I am addressing cleanup issues in #1387

_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra fsdp pytest -s tests/backends/skyrl_train/gpu/gpu_ci/test_engine_generation.py
_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra fsdp pytest -s tests/backends/skyrl_train/gpu/gpu_ci/test_skyrl_gym_generator.py
_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra fsdp pytest -s tests/backends/skyrl_train/gpu/gpu_ci/test_lora.py
_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra fsdp pytest -s tests/backends/skyrl_train/gpu/gpu_ci/test_expert_parallel_inference.py
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you timed how long this takes on L4s?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, but I don't remember it taking more than 10 mins. The model is pretty small

x
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
@SumanthRH SumanthRH merged commit 2d9cb02 into NovaSky-AI:main Mar 31, 2026
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants