Add Megatron-FSDP E2E integration test to TE CI/CD (L1).#2845
cspades wants to merge 7 commits into NVIDIA:main
Conversation
Signed-off-by: Cory Ye <cye@nvidia.com>
Greptile Summary

This PR adds a new L1 CI test (…).

Confidence Score: 5/5 — safe to merge; all previously flagged blocking issues are resolved, and the remaining findings are minor P2 suggestions. The three prior P0/P1 issues (unset-before-python3 swallowing the training command, multiline `bash -c` splitting, FP8 arch guard) are all addressed in the current revision. Only P2 items remain: the missing `--save` that would exercise the `fsdp_dtensor` checkpoint path, and a cosmetic missing newline in `.gitignore`. Neither blocks correctness of the training run.

`qa/L1_pytorch_mcore_fsdp_integration/test.sh` — consider adding `--save` to exercise the DCP checkpoint path called out in the PR description.
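The "FP8 arch guard" fix mentioned above gates the FP8 recipe on GPU architecture. A hedged sketch of one way such a guard could look — the function name `choose_fp8_args` is illustrative, the actual logic in `test.sh` may differ, and the sm100 (Blackwell) threshold for the mxfp8 recipe is an assumption based on its hardware requirements:

```shell
#!/usr/bin/env bash

# choose_fp8_args: given a compute capability string like "9.0" or "10.0",
# emit the mxfp8 recipe flag only on Blackwell-class (sm100+) GPUs.
# (Hypothetical helper; threshold is an assumption, not taken from test.sh.)
choose_fp8_args() {
  local major="${1%%.*}"          # integer major version, e.g. "10" from "10.0"
  if [ "$major" -ge 10 ]; then
    printf '%s\n' "--fp8-recipe mxfp8"
  fi
}

# In the real script, the input would come from something like:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader -i 0
```

On a Hopper (9.0) runner this emits nothing, so the training command silently falls back to non-FP8 execution instead of crashing on an unsupported recipe.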
| Filename | Overview |
|---|---|
| qa/L1_pytorch_mcore_fsdp_integration/test.sh | New E2E CI test for Megatron-FSDP; previous thread issues (unset-before-python3, bash -c multiline split) are resolved; DCP checkpointing flag present but no --save dir means that specific path isn't exercised |
| qa/L1_pytorch_mcore_fsdp_integration/.gitignore | Correctly ignores cloned Megatron-LM dir and generated vocab.json; missing trailing newline (cosmetic) |
| qa/L1_pytorch_mcore_fsdp_integration/merges.txt | Stub BPE merges file with only the version header; safe for mock-data training which does not invoke the tokenizer |
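The table above notes that `merges.txt` is a version-header-only stub and that `vocab.json` is generated as a 4096-token BPE mock; since the test trains on mock data, the tokenizer files only need to be structurally valid. A sketch of how such stubs could be generated — the `tokenN` naming scheme is a placeholder of mine, not necessarily what the test emits:

```shell
#!/usr/bin/env bash
set -e

# Stub merges.txt: only the BPE version header; mock-data training never
# actually consults the merge rules.
printf '#version: 0.2\n' > merges.txt

# Mock vocab.json mapping 4096 placeholder tokens to their ids.
{
  printf '{'
  for i in $(seq 0 4094); do
    printf '"token%s": %s, ' "$i" "$i"
  done
  printf '"token4095": 4095}\n'
} > vocab.json
```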
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[test.sh entry] --> B{Megatron-LM cloned?}
    B -- No --> C[git clone NVIDIA/Megatron-LM\ngit checkout MCORE_REF]
    C --> D
    B -- Yes --> D[Create mock vocab.json\n4096-token BPE stub]
    D --> E[unset CUDA_DEVICE_MAX_CONNECTIONS\nexport NVTE_* env vars]
    E --> F[python3 -m torch.distributed.launch\nnproc = nvidia-smi GPU count]
    F --> G[pretrain_gpt.py\n--use-megatron-fsdp\n--fp8-recipe mxfp8\n--cpu-offloading-num-layers 1\n--ckpt-format fsdp_dtensor\n10 train iters]
    G --> H{exit 0?}
    H -- Yes --> I[CI pass]
    H -- No --> J[CI fail / set -e abort]
```
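The `unset CUDA_DEVICE_MAX_CONNECTIONS` step in the flow is also where the previously flagged P0 bug lived: in shell, `unset VAR cmd` treats every following word as another variable name to unset, so a training command joined onto the same `unset` invocation is silently swallowed. A minimal demonstration of the broken versus fixed pattern (the `echo` stands in for the real launch command):

```shell
#!/usr/bin/env bash
set -e

export CUDA_DEVICE_MAX_CONNECTIONS=1

# Broken pattern: "python3" is parsed as just another variable name to unset.
# No command runs, no error is raised, and `set -e` never triggers.
unset CUDA_DEVICE_MAX_CONNECTIONS python3

# Fixed pattern: unset on its own line, then launch the training command.
export CUDA_DEVICE_MAX_CONNECTIONS=1
unset CUDA_DEVICE_MAX_CONNECTIONS
echo "launching: python3 -m torch.distributed.launch ... pretrain_gpt.py"
```

Because `unset` succeeds even for names that are not set, the broken form exits 0 and CI would report a pass without ever training — which is why the review treated it as blocking.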
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Pipeline 47956532
Depends on this: NVIDIA/Megatron-LM#4133. This PR correctly uses …
Signed-off-by: Cory Ye <cye@nvidia.com>
Description
Details
…`decoupled_grad` bugs related to FusedAdam, and other less obvious CPU offloading and Tensor API bugs that are difficult to catch without running Megatron-FSDP. This functional test aims to reduce the frequency of such regressions.

Type of change
Changes
Please list the changes introduced in this PR:
Checklist: