
Add Megatron-FSDP E2E integration test to TE CI/CD (L1).#2845

Open
cspades wants to merge 7 commits into NVIDIA:main from cspades:cye/mfsdp-te-e2e-test

Conversation

@cspades (Member) commented Apr 7, 2026

Description

  • Adds Megatron-FSDP E2E integration tests to the TransformerEngine CI/CD.

Details

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Cory Ye <cye@nvidia.com>
@greptile-apps (Contributor, bot) commented Apr 7, 2026

Greptile Summary

This PR adds a new L1 CI test (qa/L1_pytorch_mcore_fsdp_integration/) that runs a short Megatron-FSDP GPT pre-training job to catch TE regressions in areas such as FSDP2 checkpointing, CPU offloading, and NCCL UBR. The previously flagged shell correctness issues (unset swallowing python3, multiline bash -c splitting) are resolved in the current version — environment variables are set with proper export/unset statements and python3 is invoked directly with backslash continuations.
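The resolved pattern described above can be sketched as follows. This is a minimal, hypothetical excerpt, not the actual test.sh; NVTE_EXAMPLE_FLAG is a placeholder name standing in for the real NVTE_* variables.

```shell
# Minimal sketch of the fixed shell pattern (hypothetical, not the real test.sh).
# `unset` and `export` run as standalone statements, so they can no longer
# swallow the python3 invocation, and a backslash continuation replaces the
# fragile multiline `bash -c "..."` string.
set -e
unset CUDA_DEVICE_MAX_CONNECTIONS
export NVTE_EXAMPLE_FLAG=1  # placeholder for the real NVTE_* variables
python3 -c \
    "import os; print(os.environ.get('CUDA_DEVICE_MAX_CONNECTIONS', 'unset'))"
# prints "unset"
```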

Confidence Score: 5/5

Safe to merge; all previously flagged blocking issues are resolved and remaining findings are minor P2 suggestions.

The three prior P0/P1 issues (unset-before-python3 swallowing the training command, multiline bash -c splitting, FP8 arch guard) are all addressed in the current revision. Only P2 items remain: the missing --save that would exercise the fsdp_dtensor checkpoint path, and a cosmetic missing newline in .gitignore. Neither blocks correctness of the training run.

qa/L1_pytorch_mcore_fsdp_integration/test.sh — consider adding --save to exercise the DCP checkpoint path called out in the PR description.
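That suggestion might look like the sketch below. CHECKPOINT_DIR and the surrounding flag values are assumptions for illustration, not the actual contents of test.sh.

```shell
# Hypothetical sketch of the reviewer's P2 suggestion: supply a --save
# directory so the fsdp_dtensor DCP checkpoint path is actually exercised
# instead of only being configured.
CHECKPOINT_DIR=$(mktemp -d)
CKPT_ARGS="--ckpt-format fsdp_dtensor --save ${CHECKPOINT_DIR} --save-interval 10"
echo "${CKPT_ARGS}"  # appended to the pretrain_gpt.py argument list
```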

Vulnerabilities

No security concerns identified. The script clones a pinned, publicly known Megatron-LM commit over HTTPS and writes only to temporary/local paths within the TE QA directory.

Important Files Changed

  • qa/L1_pytorch_mcore_fsdp_integration/test.sh: New E2E CI test for Megatron-FSDP. Previous thread issues (unset-before-python3, bash -c multiline split) are resolved; the DCP checkpointing flag is present, but without a --save directory that specific path is not exercised.
  • qa/L1_pytorch_mcore_fsdp_integration/.gitignore: Correctly ignores the cloned Megatron-LM directory and the generated vocab.json; missing trailing newline (cosmetic).
  • qa/L1_pytorch_mcore_fsdp_integration/merges.txt: Stub BPE merges file containing only the version header; safe for mock-data training, which does not invoke the tokenizer.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[test.sh entry] --> B{Megatron-LM cloned?}
    B -- No --> C[git clone NVIDIA/Megatron-LM\ngit checkout MCORE_REF]
    C --> D
    B -- Yes --> D[Create mock vocab.json\n4096-token BPE stub]
    D --> E[unset CUDA_DEVICE_MAX_CONNECTIONS\nexport NVTE_* env vars]
    E --> F[python3 -m torch.distributed.launch\nnproc = nvidia-smi GPU count]
    F --> G[pretrain_gpt.py\n--use-megatron-fsdp\n--fp8-recipe mxfp8\n--cpu-offloading-num-layers 1\n--ckpt-format fsdp_dtensor\n10 train iters]
    G --> H{exit 0?}
    H -- Yes --> I[CI pass]
    H -- No --> J[CI fail / set -e abort]
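The launch step in the flowchart (one rank per GPU reported by nvidia-smi) can be approximated with a sketch like the following; the fallback branch and the elided pretrain_gpt.py arguments are additions for illustration, not part of the real script.

```shell
# Hedged sketch of the launch node in the flowchart: derive the process count
# from the visible GPUs and launch one rank per GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
    NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
else
    NUM_GPUS=1  # fallback so the sketch also runs without NVIDIA drivers
fi
echo "launching ${NUM_GPUS} ranks"
# python3 -m torch.distributed.launch --nproc_per_node="${NUM_GPUS}" pretrain_gpt.py ...
```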

Reviews (10). Last reviewed commit: "Expose MCore hash/tag as an argument to ..."

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
timmoon10 (Collaborator) previously approved these changes Apr 7, 2026

@timmoon10 left a comment:

LGTM, pending CI

@timmoon10 (Collaborator) commented:

Pipeline 47956532

Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
@cspades force-pushed the cye/mfsdp-te-e2e-test branch from 5fb4871 to fce5369 on April 8, 2026 at 00:51
@cspades (Member, Author) commented Apr 8, 2026

Depends on this: NVIDIA/Megatron-LM#4133

This PR correctly selects decoupled_grad based on the FP8 recipe, matching the distributed optimizer logic in Megatron-Core.
