
Commit b9fd68e

feat: support 🔥FLUX.2 context parallel (#492)

* feat: support hybrid cache + tp for flux.2
* feat: enable seq offload for FLUX.2 w/ GPU=1
* feat: support FLUX.2 context parallel

1 parent 4cae178 commit b9fd68e

File tree

8 files changed: +284 -6 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ The comparison between **cache-dit** and other algorithms shows that within a sp

 | 📚Model | Cache | CP | TP | 📚Model | Cache | CP | TP |
 |:---|:---|:---|:---|:---|:---|:---|:---|
-| **🔥[FLUX.2: 56B](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️🔥 | ✖️ | ✔️🔥 | **🎉[FLUX.1 `Q`](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✖️ |
+| **🔥[FLUX.2: 56B](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️🔥 | ✔️🔥 | ✔️🔥 | **🎉[FLUX.1 `Q`](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✖️ |
 | **🎉[FLUX.1](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✔️ | **🎉[FLUX.1-Fill `Q`](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✖️ |
 | **🎉[FLUX.1-Fill](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✔️ | **🎉[Qwen-Image `Q`](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✖️ |
 | **🎉[Qwen-Image](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✔️ | **🎉[Qwen...Edit `Q`](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✖️ |

docs/User_Guide.md

Lines changed: 3 additions & 3 deletions
@@ -75,13 +75,13 @@ Currently, **cache-dit** library supports almost **Any** Diffusion Transformers
 ```

 > [!Tip]
-> One **Model Series** may contain **many** pipelines. cache-dit applies optimizations at the **Transformer** level; thus, any pipelines that include the supported transformer are already supported by cache-dit. ✔️: known work and official supported now; ✖️: unofficial supported now, but maybe support in the future; **[`Q`](https://github.com/nunchaku-tech/nunchaku)**: **4-bits** models w/ [nunchaku](https://github.com/nunchaku-tech/nunchaku) + SVDQ **W4A4**.
+> One **Model Series** may contain **many** pipelines. cache-dit applies optimizations at the **Transformer** level; thus, any pipelines that include the supported transformer are already supported by cache-dit. ✔️: known work and official supported now; ✖️: unofficial supported now, but maybe support in the future; **[`Q`](https://github.com/nunchaku-tech/nunchaku)**: **4-bits** models w/ [nunchaku](https://github.com/nunchaku-tech/nunchaku) + SVDQ **W4A4**; **🔥FLUX.2**: 24B + 32B = 56B.


 <div align="center">

 | 📚Model | Cache | CP | TP | 📚Model | Cache | CP | TP |
 |:---|:---|:---|:---|:---|:---|:---|:---|
-| **🔥🔥[FLUX.2](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | 🔥✔️ | ✖️ | 🔥✔️ | **🎉[FLUX.1 `Q`](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✖️ |
+| **🔥[FLUX.2: 56B](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️🔥 | ✔️🔥 | ✔️🔥 | **🎉[FLUX.1 `Q`](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✖️ |
 | **🎉[FLUX.1](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✔️ | **🎉[FLUX.1-Fill `Q`](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✖️ |
 | **🎉[FLUX.1-Fill](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✔️ | **🎉[Qwen-Image `Q`](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✖️ |
 | **🎉[Qwen-Image](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✔️ | **🎉[Qwen...Edit `Q`](https://github.com/vipshop/cache-dit/blob/main/examples/pipeline)** | ✔️ | ✔️ | ✖️ |

@@ -703,7 +703,7 @@ As we can observe, in the case of **static cache**, the image of `SCM Slow S*` (

 <div id="context-parallelism"></div>

-cache-dit is compatible with context parallelism. Currently, we support the use of `Hybrid Cache` + `Context Parallelism` scheme (via NATIVE_DIFFUSER parallelism backend) in cache-dit. Users can use Context Parallelism to further accelerate the speed of inference! For more details, please refer to [📚examples/parallelism](https://github.com/vipshop/cache-dit/tree/main/examples/parallelism). Currently, cache-dit supported context parallelism for [FLUX.1](https://huggingface.co/black-forest-labs/FLUX.1-dev), [Qwen-Image](https://github.com/QwenLM/Qwen-Image), [Qwen-Image-Lightning](https://github.com/ModelTC/Qwen-Image-Lightning), [LTXVideo](https://huggingface.co/Lightricks/LTX-Video), [Wan 2.1](https://github.com/Wan-Video/Wan2.1), [Wan 2.2](https://github.com/Wan-Video/Wan2.2), [HunyuanImage-2.1](https://huggingface.co/tencent/HunyuanImage-2.1), [HunyuanVideo](https://huggingface.co/hunyuanvideo-community/HunyuanVideo), [CogVideoX 1.0](https://github.com/zai-org/CogVideo), [CogVideoX 1.5](https://github.com/zai-org/CogVideo), [CogView 3/4](https://github.com/zai-org/CogView4) and [VisualCloze](https://github.com/lzyhha/VisualCloze), etc. cache-dit will support more models in the future.
+cache-dit is compatible with context parallelism. Currently, we support the use of `Hybrid Cache` + `Context Parallelism` scheme (via NATIVE_DIFFUSER parallelism backend) in cache-dit. Users can use Context Parallelism to further accelerate the speed of inference! For more details, please refer to [📚examples/parallelism](https://github.com/vipshop/cache-dit/tree/main/examples/parallelism). Currently, cache-dit supported context parallelism for [FLUX.1](https://huggingface.co/black-forest-labs/FLUX.1-dev), 🔥[FLUX.2](https://huggingface.co/black-forest-labs/FLUX.2-dev), [Qwen-Image](https://github.com/QwenLM/Qwen-Image), [Qwen-Image-Lightning](https://github.com/ModelTC/Qwen-Image-Lightning), [LTXVideo](https://huggingface.co/Lightricks/LTX-Video), [Wan 2.1](https://github.com/Wan-Video/Wan2.1), [Wan 2.2](https://github.com/Wan-Video/Wan2.2), [HunyuanImage-2.1](https://huggingface.co/tencent/HunyuanImage-2.1), [HunyuanVideo](https://huggingface.co/hunyuanvideo-community/HunyuanVideo), [CogVideoX 1.0](https://github.com/zai-org/CogVideo), [CogVideoX 1.5](https://github.com/zai-org/CogVideo), [CogView 3/4](https://github.com/zai-org/CogView4) and [VisualCloze](https://github.com/lzyhha/VisualCloze), etc. cache-dit will support more models in the future.

 ```python
 # pip3 install "cache-dit[parallelism]"
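
The guide's code block continues beyond this hunk. As a quick orientation for running the `Hybrid Cache` + `Context Parallelism` scheme, the sketch below shows only the per-rank distributed bootstrap such a run needs before any cache-dit call. It is a minimal sketch under stated assumptions: the script is launched with `torchrun --nproc_per_node=<N>`, and the helper name `init_distributed` is hypothetical (the shipped examples use `maybe_init_distributed` from `examples/utils.py`, whose body is not part of this diff).

```python
# Minimal, hypothetical sketch of the per-rank setup a context-parallel run assumes.
# Assumption: launched via `torchrun --nproc_per_node=<N> script.py`, which provides
# the rendezvous environment variables that init_process_group() reads.
import torch
import torch.distributed as dist


def init_distributed():
    # NCCL backend: context parallelism relies on GPU collectives (all2all, all-gather).
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = rank % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)  # bind each rank to its own GPU
    return rank, torch.device("cuda", local_rank)
```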
Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
import os
import sys

sys.path.append("..")

import time

import torch
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from diffusers.quantizers import PipelineQuantizationConfig

from utils import (
    MemoryTracker,
    GiB,
    cachify,
    get_args,
    maybe_destroy_distributed,
    maybe_init_distributed,
    strify,
)

import cache_dit

args = get_args()
print(args)

rank, device = maybe_init_distributed(args)

if GiB() < 128:
    assert args.quantize, "Quantization is required to fit FLUX.2 in <128GB memory."
    assert args.quantize_type in ["bitsandbytes_4bit", "float8_weight_only"], (
        f"Unsupported quantization type: {args.quantize_type}, only supported"
        "'bitsandbytes_4bit (bnb_4bit)' and 'float8_weight_only'."
    )

pipe: Flux2Pipeline = Flux2Pipeline.from_pretrained(
    (
        args.model_path
        if args.model_path is not None
        else os.environ.get(
            "FLUX_2_DIR",
            "black-forest-labs/FLUX.2-dev",
        )
    ),
    torch_dtype=torch.bfloat16,
    quantization_config=(
        PipelineQuantizationConfig(
            quant_backend="bitsandbytes_4bit",
            quant_kwargs={
                "load_in_4bit": True,
                "bnb_4bit_quant_type": "nf4",
                "bnb_4bit_compute_dtype": torch.bfloat16,
            },
            # 112/4 = 28GB total for text_encoder + transformer in 4-bit
            components_to_quantize=["text_encoder", "transformer"],
        )
        if args.quantize and args.quantize_type in ("bitsandbytes_4bit",)
        else None
    ),
)

if args.quantize and args.quantize_type == "float8_weight_only":
    pipe.transformer = cache_dit.quantize(
        pipe.transformer,
        quant_type=args.quantize_type,
        exclude_layers=[
            "img_in",
            "txt_in",
        ],
    )
    pipe.text_encoder = cache_dit.quantize(
        pipe.text_encoder,
        quant_type=args.quantize_type,
    )

if args.cache or args.parallel_type is not None:
    from cache_dit import DBCacheConfig, ParamsModifier

    cachify(
        args,
        pipe,
        extra_parallel_modules=(
            # Specify extra modules to be parallelized in addition to the main transformer,
            # e.g., text_encoder_2 in FluxPipeline, text_encoder in Flux2Pipeline. Currently,
            # only supported in native pytorch backend (namely, Tensor Parallelism).
            [pipe.text_encoder]
            if args.parallel_type == "tp"
            else []
        ),
        params_modifiers=[
            ParamsModifier(
                # Modified config only for transformer_blocks
                # Must call the `reset` method of DBCacheConfig.
                cache_config=DBCacheConfig().reset(
                    residual_diff_threshold=args.rdt,
                ),
            ),
            ParamsModifier(
                # Modified config only for single_transformer_blocks
                # NOTE: FLUX.2, single_transformer_blocks should have `higher`
                # residual_diff_threshold because of the precision error
                # accumulation from previous transformer_blocks
                cache_config=DBCacheConfig().reset(
                    residual_diff_threshold=args.rdt * 3,
                ),
            ),
        ],
    )

torch.cuda.empty_cache()

if args.quantize_type == "bitsandbytes_4bit":
    pipe.to(device)
else:
    pipe.enable_model_cpu_offload(device=device)

assert isinstance(pipe.transformer, Flux2Transformer2DModel)

pipe.set_progress_bar_config(disable=rank != 0)

prompt = (
    "Realistic macro photograph of a hermit crab using a soda can as its shell, "
    "partially emerging from the can, captured with sharp detail and natural colors, "
    "on a sunlit beach with soft shadows and a shallow depth of field, with blurred ocean "
    "waves in the background. The can has the text `BFL Diffusers` on it and it has a color "
    "gradient that start with #FF5733 at the top and transitions to #33FF57 at the bottom."
)

if args.prompt is not None:
    prompt = args.prompt


def run_pipe(warmup: bool = False):
    generator = torch.Generator("cpu").manual_seed(0)
    image = pipe(
        prompt=prompt,
        # 28 steps can be a good trade-off
        num_inference_steps=5 if warmup else (28 if args.steps is None else args.steps),
        guidance_scale=4,
        generator=generator,
    ).images[0]
    return image


if args.compile:
    cache_dit.set_compile_configs()
    pipe.transformer = torch.compile(pipe.transformer)

# warmup
_ = run_pipe(warmup=True)

memory_tracker = MemoryTracker() if args.track_memory else None
if memory_tracker:
    memory_tracker.__enter__()

start = time.time()
image = run_pipe()
end = time.time()

if memory_tracker:
    memory_tracker.__exit__(None, None, None)
    memory_tracker.report()

if rank == 0:
    cache_dit.summary(pipe)

time_cost = end - start
save_path = f"flux2.{strify(args, pipe)}.png"
print(f"Time cost: {time_cost:.2f}s")
print(f"Saving image to {save_path}")
image.save(save_path)

maybe_destroy_distributed()

examples/utils.py

Lines changed: 6 additions & 1 deletion
@@ -88,6 +88,7 @@ def get_args(
             "int4",
             "int4_weight_only",
             "bitsandbytes_4bit",
+            "bnb_4bit",  # alias for bitsandbytes_4bit
         ],
     )
     parser.add_argument(
@@ -150,7 +151,11 @@
         default=False,
         help="Disable compute-communication overlap during compilation",
     )
-    return parser.parse_args() if parse else parser
+    args_or_parser = parser.parse_args() if parse else parser
+    if parse:
+        if args_or_parser.quantize_type == "bnb_4bit":  # alias
+            args_or_parser.quantize_type = "bitsandbytes_4bit"
+    return args_or_parser


 def cachify(

src/cache_dit/parallelism/backends/native_diffusers/context_parallelism/cp_plan_flux.py

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@
 logger = init_logger(__name__)


-@ContextParallelismPlannerRegister.register("Flux")
+@ContextParallelismPlannerRegister.register("FluxTransformer2DModel")
 class FluxContextParallelismPlanner(ContextParallelismPlanner):
     def apply(
         self,
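
A note on why the registration key above becomes the full class name: with a dedicated FLUX.2 planner added in this commit, a short key such as "Flux" would be ambiguous if the registry matches planners against the transformer's class name by prefix. The snippet below is a self-contained toy, not cache-dit's registry; it only illustrates, under that assumed name-based matching, why the exact class name is the safer key.

```python
# Toy illustration only (NOT the cache-dit registry implementation).
# Assumed behavior: planner keys are matched against the transformer class name.
def matches(key: str, transformer_cls_name: str) -> bool:
    return transformer_cls_name.startswith(key)


# A short "Flux" key is a prefix of both class names, so it would also claim FLUX.2:
assert matches("Flux", "FluxTransformer2DModel")
assert matches("Flux", "Flux2Transformer2DModel")

# The exact class-name key only matches the intended FLUX.1 transformer:
assert matches("FluxTransformer2DModel", "FluxTransformer2DModel")
assert not matches("FluxTransformer2DModel", "Flux2Transformer2DModel")
```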
src/cache_dit/parallelism/backends/native_diffusers/context_parallelism/cp_plan_flux2.py

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
import torch
from typing import Optional
from diffusers.models.modeling_utils import ModelMixin
from diffusers import Flux2Transformer2DModel

try:
    from diffusers.models._modeling_parallel import (
        ContextParallelInput,
        ContextParallelOutput,
        ContextParallelModelPlan,
    )
except ImportError:
    raise ImportError(
        "Context parallelism requires the 'diffusers>=0.36.dev0'."
        "Please install latest version of diffusers from source: \n"
        "pip3 install git+https://github.com/huggingface/diffusers.git"
    )
from .cp_plan_registers import (
    ContextParallelismPlanner,
    ContextParallelismPlannerRegister,
)

from cache_dit.logger import init_logger

logger = init_logger(__name__)


@ContextParallelismPlannerRegister.register("Flux2Transformer2DModel")
class Flux2ContextParallelismPlanner(ContextParallelismPlanner):
    def apply(
        self,
        transformer: Optional[torch.nn.Module | ModelMixin] = None,
        **kwargs,
    ) -> ContextParallelModelPlan:

        # NOTE: Diffusers native CP plan still have bugs for Flux2 now.
        self._cp_planner_preferred_native_diffusers = False

        if transformer is not None and self._cp_planner_preferred_native_diffusers:
            assert isinstance(
                transformer, Flux2Transformer2DModel
            ), "Transformer must be an instance of Flux2Transformer2DModel"
            if hasattr(transformer, "_cp_plan"):
                if transformer._cp_plan is not None:
                    return transformer._cp_plan

        # Otherwise, use the custom CP plan defined here, this maybe
        # a little different from the native diffusers implementation
        # for some models.
        _cp_plan = {
            # Here is a Transformer level CP plan for Flux, which will
            # only apply the only 1 split hook (pre_forward) on the forward
            # of Transformer, and gather the output after Transformer forward.
            # Pattern of transformer forward, split_output=False:
            # un-split input -> splited input (inside transformer)
            # Pattern of the transformer_blocks, single_transformer_blocks:
            # splited input (previous splited output) -> to_qkv/...
            # -> all2all
            # -> attn (local head, full seqlen)
            # -> all2all
            # -> splited output
            # The `hidden_states` and `encoder_hidden_states` will still keep
            # itself splited after block forward (namely, automatic split by
            # the all2all comm op after attn) for the all blocks.
            # img_ids and txt_ids will only be splited once at the very beginning,
            # and keep splited through the whole transformer forward. The all2all
            # comm op only happens on the `out` tensor after local attn not on
            # img_ids and txt_ids.
            "": {
                "hidden_states": ContextParallelInput(
                    split_dim=1, expected_dims=3, split_output=False
                ),
                "encoder_hidden_states": ContextParallelInput(
                    split_dim=1, expected_dims=3, split_output=False
                ),
                "img_ids": ContextParallelInput(split_dim=1, expected_dims=3, split_output=False),
                "txt_ids": ContextParallelInput(split_dim=1, expected_dims=3, split_output=False),
            },
            # Then, the final proj_out will gather the splited output.
            # splited input (previous splited output)
            # -> all gather
            # -> un-split output
            "proj_out": ContextParallelOutput(gather_dim=1, expected_dims=3),
        }
        return _cp_plan


# TODO: Add async Ulysses QKV proj for FLUX2 model
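
The split/gather semantics described in the comments of this plan can be pictured with a plain-tensor toy, not cache-dit code: `ContextParallelInput(split_dim=1, ...)` shards the sequence dimension across ranks, and `ContextParallelOutput(gather_dim=1, ...)` concatenates the shards back after `proj_out`. The snippet below mimics that round trip in a single process; the shapes are made-up examples.

```python
# Single-process toy of the CP plan's split/gather round trip (not cache-dit code).
# Illustrative shape: (batch, seq_len, hidden_dim).
import torch

world_size = 2
hidden_states = torch.randn(1, 8, 64)

# What the pre-forward split hook conceptually does: shard the sequence dimension.
shards = hidden_states.chunk(world_size, dim=1)
assert shards[0].shape == (1, 4, 64)

# Each rank attends over the full sequence via all2all exchanges; the gather hooked
# onto `proj_out` conceptually concatenates the sequence shards back together.
restored = torch.cat(shards, dim=1)
assert torch.equal(restored, hidden_states)
```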

src/cache_dit/parallelism/backends/native_diffusers/context_parallelism/cp_planners.py

Lines changed: 2 additions & 0 deletions
@@ -76,6 +76,7 @@
 from .cp_plan_dit import DiTContextParallelismPlanner
 from .cp_plan_kandinsky import Kandinsky5ContextParallelismPlanner
 from .cp_plan_skyreels import SkyReelsV2ContextParallelismPlanner
+from .cp_plan_flux2 import Flux2ContextParallelismPlanner

 try:
     import nunchaku  # noqa: F401
@@ -112,6 +113,7 @@
     "DiTContextParallelismPlanner",
     "Kandinsky5ContextParallelismPlanner",
     "SkyReelsV2ContextParallelismPlanner",
+    "Flux2ContextParallelismPlanner",
 ]

 if _nunchaku_available:

src/cache_dit/utils.py

Lines changed: 8 additions & 0 deletions
@@ -54,6 +54,14 @@ def print_tensor(
     if disable:
         return

+    if x is None:
+        print(f"{name} is None")
+        return
+
+    if not isinstance(x, torch.Tensor):
+        print(f"{name} is not a tensor, type: {type(x)}")
+        return
+
     x = x.contiguous()
     if torch.distributed.is_initialized():
         # all gather hidden_states and check values mean
