
Add Megatron-FSDP E2E integration test to TE CI/CD (L1).#2845

Open
cspades wants to merge 7 commits into NVIDIA:main from cspades:cye/mfsdp-te-e2e-test

Conversation

@cspades (Member) commented Apr 7, 2026

Description

  • Adds Megatron-FSDP E2E integration tests to the TransformerEngine CI/CD.

Details

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Cory Ye <cye@nvidia.com>
@greptile-apps (Contributor, bot) commented Apr 7, 2026

Greptile Summary

This PR adds a new L1 CI test (qa/L1_pytorch_mcore_fsdp_integration/) that runs a short Megatron-FSDP GPT pre-training job to catch TE regressions in areas such as FSDP2 checkpointing, CPU offloading, and NCCL UBR. The previously flagged shell correctness issues (unset swallowing python3, multiline bash -c splitting) are resolved in the current version — environment variables are set with proper export/unset statements and python3 is invoked directly with backslash continuations.
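The resolved pattern described above can be sketched as follows. This is a minimal, hypothetical excerpt, not the actual test.sh; NVTE_EXAMPLE_FLAG is a placeholder name standing in for the real NVTE_* variables.

```shell
# Minimal sketch of the fixed shell pattern (hypothetical, not the real test.sh).
# `unset` and `export` run as standalone statements, so they can no longer
# swallow the python3 invocation, and a backslash continuation replaces the
# fragile multiline `bash -c "..."` string.
set -e
unset CUDA_DEVICE_MAX_CONNECTIONS
export NVTE_EXAMPLE_FLAG=1  # placeholder for the real NVTE_* variables
python3 -c \
    "import os; print(os.environ.get('CUDA_DEVICE_MAX_CONNECTIONS', 'unset'))"
# prints "unset"
```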

Confidence Score: 5/5

Safe to merge; all previously flagged blocking issues are resolved and remaining findings are minor P2 suggestions.

The three prior P0/P1 issues (unset-before-python3 swallowing the training command, multiline bash -c splitting, FP8 arch guard) are all addressed in the current revision. Only P2 items remain: the missing --save that would exercise the fsdp_dtensor checkpoint path, and a cosmetic missing newline in .gitignore. Neither blocks correctness of the training run.

qa/L1_pytorch_mcore_fsdp_integration/test.sh — consider adding --save to exercise the DCP checkpoint path called out in the PR description.
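That suggestion might look like the sketch below. CHECKPOINT_DIR and the surrounding flag values are assumptions for illustration, not the actual contents of test.sh.

```shell
# Hypothetical sketch of the reviewer's P2 suggestion: supply a --save
# directory so the fsdp_dtensor DCP checkpoint path is actually exercised
# instead of only being configured.
CHECKPOINT_DIR=$(mktemp -d)
CKPT_ARGS="--ckpt-format fsdp_dtensor --save ${CHECKPOINT_DIR} --save-interval 10"
echo "${CKPT_ARGS}"  # appended to the pretrain_gpt.py argument list
```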

Vulnerabilities

No security concerns identified. The script clones a pinned, publicly known Megatron-LM commit over HTTPS and writes only to temporary/local paths within the TE QA directory.

Important Files Changed

  • qa/L1_pytorch_mcore_fsdp_integration/test.sh: New E2E CI test for Megatron-FSDP. Previous thread issues (unset-before-python3, bash -c multiline split) are resolved; the DCP checkpointing flag is present, but without a --save directory that specific path is not exercised.
  • qa/L1_pytorch_mcore_fsdp_integration/.gitignore: Correctly ignores the cloned Megatron-LM directory and the generated vocab.json; missing trailing newline (cosmetic).
  • qa/L1_pytorch_mcore_fsdp_integration/merges.txt: Stub BPE merges file containing only the version header; safe for mock-data training, which does not invoke the tokenizer.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[test.sh entry] --> B{Megatron-LM cloned?}
    B -- No --> C[git clone NVIDIA/Megatron-LM\ngit checkout MCORE_REF]
    C --> D
    B -- Yes --> D[Create mock vocab.json\n4096-token BPE stub]
    D --> E[unset CUDA_DEVICE_MAX_CONNECTIONS\nexport NVTE_* env vars]
    E --> F[python3 -m torch.distributed.launch\nnproc = nvidia-smi GPU count]
    F --> G[pretrain_gpt.py\n--use-megatron-fsdp\n--fp8-recipe mxfp8\n--cpu-offloading-num-layers 1\n--ckpt-format fsdp_dtensor\n10 train iters]
    G --> H{exit 0?}
    H -- Yes --> I[CI pass]
    H -- No --> J[CI fail / set -e abort]
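The launch step in the flowchart (one rank per GPU reported by nvidia-smi) can be approximated with a sketch like the following; the fallback branch and the elided pretrain_gpt.py arguments are additions for illustration, not part of the real script.

```shell
# Hedged sketch of the launch node in the flowchart: derive the process count
# from the visible GPUs and launch one rank per GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
    NUM_GPUS=$(nvidia-smi --list-gpus | wc -l)
else
    NUM_GPUS=1  # fallback so the sketch also runs without NVIDIA drivers
fi
echo "launching ${NUM_GPUS} ranks"
# python3 -m torch.distributed.launch --nproc_per_node="${NUM_GPUS}" pretrain_gpt.py ...
```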

Reviews (10). Last reviewed commit: "Expose MCore hash/tag as an argument to ..."

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
timmoon10 (Collaborator) previously approved these changes Apr 7, 2026

@timmoon10 left a comment:

LGTM, pending CI

@timmoon10 (Collaborator) commented:

Pipeline 47956532

Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
@cspades force-pushed the cye/mfsdp-te-e2e-test branch from 5fb4871 to fce5369 on April 8, 2026 at 00:51
@cspades (Member, Author) commented Apr 8, 2026

Depends on this: NVIDIA/Megatron-LM#4133

This PR correctly selects decoupled_grad based on the FP8 recipe, matching the distributed optimizer logic in Megatron-Core.
