
Conversation

@IPostYellow

Motivation

This PR introduces support for Stable Diffusion 3 Medium (stabilityai/stable-diffusion-3-medium-diffusers) text-to-image (t2i) generation in SGLang.

Run with the CLI:

sglang generate --model-path=/your/path/stabilityai/stable-diffusion-3-medium-diffusers --prompt='A dreamy twilight scene of a small village floating among soft clouds, its rooftops adorned with glowing iridescent tiles that shimmer in hues of pearl and lavender. The winding streets are paved with translucent crystal, reflecting the warm glow of lanterns shaped like hot air balloons drifting gently into the sky. In the distance, layered floating mountains rise into the atmosphere, crowned with an ancient library made of marble and stained glass, where fluttering pages transform into flocks of luminous birds. In the foreground, a massive cherry blossom tree stretches across the frame, its petals falling like stardust, trailing soft light as they drift downward. The art style blends the hand-painted charm of Studio Ghibli with the refined lighting and depth of digital painting—vibrant yet ethereal colors, delicate linework, and a sense of quiet wonder. No people present, evoking serenity, magic, and infinite imagination.' --width=720 --height=720 --save-output --dit-cpu-offload false --text-encoder-cpu-offload false --image-encoder-cpu-offload false --vae-cpu-offload false --pin-cpu-memory false

Output:
A_dreamy_twilight_scene_of_a_small_village_floating_among_soft_clouds_its_rooftops_adorned_with_glo_20251117-165103_5b53fefb

Alternatively, start a model inference server and generate an image via an API call.
Start the server:

sglang serve --model-path /your/path/stabilityai/stable-diffusion-3-medium-diffusers --num-gpus 2 --tp-size 2 --save-output --dit-cpu-offload false --text-encoder-cpu-offload false --image-encoder-cpu-offload false --vae-cpu-offload false --pin-cpu-memory false

Send a generation request:

import requests

url = "http://localhost:3000/v1/images/generations"
data = {
    "prompt": "A curious raccoon",
    "size": "720x720",
    "output_format": "png",
    "response_format": "b64_json",
}
headers = {"Content-Type": "application/json"}

response = requests.post(url, headers=headers, json=data)
print("Status Code:", response.status_code)
print("Response Body:", response.text)

Output:
a57cb381-33f6-4dc0-9d97-525dec08d5d0
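
Since the request sets response_format to b64_json, the image should come back base64-encoded. Here is a minimal sketch for decoding and saving it, assuming an OpenAI-style images payload with a data[0].b64_json field (the exact schema returned by this server is not shown above):

import base64

payload = response.json()
# Assumed OpenAI-style schema: {"data": [{"b64_json": "<base64-encoded PNG>"}]}
image_bytes = base64.b64decode(payload["data"][0]["b64_json"])
with open("output.png", "wb") as f:
    f.write(image_bytes)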

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @IPostYellow, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates Stable Diffusion 3 Medium into SGLang, expanding its multimodal generation capabilities to include state-of-the-art text-to-image synthesis. The changes involve adding new model configurations, implementing a dedicated pipeline, and adapting core runtime components to support SD3's complex multi-text-encoder architecture and VAE processing, ensuring seamless operation and high-quality image generation.

Highlights

  • Stable Diffusion 3 Medium Support: This pull request introduces comprehensive support for Stable Diffusion 3 Medium (SD3) text-to-image generation within the SGLang framework, enabling users to leverage this advanced model for image synthesis.
  • New Configuration Files: Dedicated configuration files have been added for the SD3 Transformer (DiT), VAE, and the overall pipeline, defining their specific architectures, parameters, and operational settings.
  • Multi-Text-Encoder Integration: The system now supports SD3's unique architecture, which utilizes three distinct text encoders (two CLIP and one T5) for processing prompts, with corresponding adjustments in the text encoding and conditioning stages (see the sketch after this list).
  • Runtime Stage Enhancements: Modifications were made to the conditioning, latent_preparation, and text_encoding pipeline stages to correctly handle SD3's multi-encoder outputs, VAE scaling factors, and specific tokenizer settings.
  • Dynamic VAE Loading: The VAE component loader has been updated to intelligently prioritize fp16 or full precision safetensors files when loading the SD3 VAE, optimizing for performance and compatibility.
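
For reviewers unfamiliar with SD3's triple-encoder prompt handling, here is a minimal sketch of how the three encoders' outputs are typically combined, following the upstream diffusers SD3 pipeline; the tensor names and shapes are illustrative and not taken from this PR:

import torch
import torch.nn.functional as F

# Illustrative hidden states from the three encoders (batch size 1):
clip_l_embeds = torch.randn(1, 77, 768)    # CLIP-L sequence embeddings
clip_g_embeds = torch.randn(1, 77, 1280)   # CLIP-G sequence embeddings
t5_embeds = torch.randn(1, 256, 4096)      # T5-XXL sequence embeddings

# Concatenate the two CLIP streams on the feature dim, pad to T5 width,
# then concatenate with T5 along the sequence dim.
clip_embeds = torch.cat([clip_l_embeds, clip_g_embeds], dim=-1)  # (1, 77, 2048)
clip_embeds = F.pad(clip_embeds, (0, t5_embeds.shape[-1] - clip_embeds.shape[-1]))
prompt_embeds = torch.cat([clip_embeds, t5_embeds], dim=-2)      # (1, 333, 4096)

# Pooled conditioning comes from the two CLIP projection outputs only.
pooled_l = torch.randn(1, 768)
pooled_g = torch.randn(1, 1280)
pooled_prompt_embeds = torch.cat([pooled_l, pooled_g], dim=-1)   # (1, 2048)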

@IPostYellow IPostYellow changed the title [MultiModal]Support stable-diffusion-3-medium-diffusers for t2i [MultiModal]Support stable-diffusion-3-medium-diffusers Nov 17, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request adds support for Stable Diffusion 3 Medium for text-to-image generation. The changes are comprehensive, touching configuration, model implementation, and pipeline stages. The implementation correctly handles the three text encoders required by SD3. I've identified a few issues, including a bug in the SD3 transformer's forward pass return value and some brittle file loading logic. Additionally, I've provided suggestions to improve code clarity and maintainability by removing dead code and simplifying some expressions. Overall, this is a great contribution.

Comment on lines +476 to +489
if isinstance(server_args.pipeline_config, StableDiffusion3PipelineConfig):
    precision = server_args.pipeline_config.vae_precision
    base_name = "diffusion_pytorch_model"

    # Priority: fp16 > full precision > any matching file
    if precision == "fp16":
        fp16_path = os.path.join(
            str(model_path), f"{base_name}.fp16.safetensors"
        )
        target_files = [fp16_path] if os.path.exists(fp16_path) else []
    else:
        full_path = os.path.join(str(model_path), f"{base_name}.safetensors")
        target_files = [full_path] if os.path.exists(full_path) else []
    safetensors_list = target_files

high

The current logic for finding the VAE's safetensors file is brittle. If the specific precision file (.fp16.safetensors or .safetensors) is not found, it results in an empty list, which will cause the assertion on line 491 to fail with a generic message. The comment on line 480 suggests a priority-based fallback, which is not fully implemented. I suggest a more robust implementation that correctly applies the priority and provides a better fallback.

        if isinstance(server_args.pipeline_config, StableDiffusion3PipelineConfig):
            precision = server_args.pipeline_config.vae_precision
            base_name = "diffusion_pytorch_model"

            # Priority: fp16 > full precision > any matching file
            fp16_path = os.path.join(str(model_path), f"{base_name}.fp16.safetensors")
            full_path = os.path.join(str(model_path), f"{base_name}.safetensors")

            if precision == "fp16" and os.path.exists(fp16_path):
                safetensors_list = [fp16_path]
            elif os.path.exists(full_path):
                safetensors_list = [full_path]
            elif os.path.exists(fp16_path):
                safetensors_list = [fp16_path]
            else:
                # Fallback to any safetensors file if specific ones are not found
                safetensors_list = glob.glob(os.path.join(str(model_path), f"{base_name}*.safetensors"))
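
Note: this suggestion calls glob.glob, so it assumes the module already imports glob; if it does not, an import glob needs to accompany the change.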

Comment on lines 230 to 233
if not return_dict:
    return (output,)

return output

high

When return_dict is True, the function should return a Transformer2DModelOutput object, but it currently returns a raw tensor. This can lead to AttributeError if the caller expects an object with a .sample attribute. Please wrap the output tensor in Transformer2DModelOutput.

Suggested change

-if not return_dict:
-    return (output,)
-return output
+if not return_dict:
+    return (output,)
+return Transformer2DModelOutput(sample=output)
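
If this module does not already import it, Transformer2DModelOutput is typically available from diffusers (for example, from diffusers.models.modeling_outputs import Transformer2DModelOutput); worth confirming which definition this file actually uses.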

_IMAGE_ENCODER_MODELS: dict[str, tuple] = {
    # "HunyuanVideoTransformer3DModel": ("image_encoder", "hunyuanvideo", "HunyuanVideoImageEncoder"),
    "CLIPVisionModelWithProjection": ("encoders", "clip", "CLIPVisionModel"),
    "CLIPTextModelWithProjection": ("encoders", "clip", "CLIPTextModel"),

medium

The variable _IMAGE_ENCODER_MODELS is misleading as it now contains a text model (CLIPTextModelWithProjection). To improve code clarity and maintainability, consider renaming it to something more generic, such as _ENCODER_MODELS.

Comment on lines 103 to 108
# if batch.do_classifier_free_guidance:
#     prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0)
#     pooled_prompt_embeds = torch.cat([negative_pooled_prompt_embeds, pooled_prompt_embeds], dim=0)
#     batch.prompt_embeds = [prompt_embeds]
#     batch.pooled_embeds = [pooled_prompt_embeds]


medium

This block of commented-out code appears to be dead code. Please remove it to improve code clarity.

Comment on lines +71 to +75
vae_scale_factor = (
    server_args.pipeline_config.vae_config.get_vae_scale_factor()
    if server_args.pipeline_config.vae_config.get_vae_scale_factor()
    else 8
)

medium

This expression is a bit verbose and calls get_vae_scale_factor() twice. It can be simplified for better readability and to avoid the redundant call.

            scale_factor = server_args.pipeline_config.vae_config.get_vae_scale_factor()
            vae_scale_factor = scale_factor or 8
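
Both forms treat a falsy return value (None or 0) as missing and fall back to 8, so the simplification preserves behavior while calling get_vae_scale_factor() only once.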

@mickqian
Collaborator

Awesome job, thanks! We'll get back to this PR once the necessary CI tests and refactors are added.

@mickqian mickqian changed the title [MultiModal]Support stable-diffusion-3-medium-diffusers diffusion model: support stable-diffusion-3-medium-diffusers Nov 18, 2025
IPostYellow and others added 6 commits November 18, 2025 20:55
…on3medium_fn2

# Conflicts:
#	python/sglang/multimodal_gen/configs/pipeline_configs/__init__.py
#	python/sglang/multimodal_gen/configs/pipeline_configs/stablediffusion3.py
#	python/sglang/multimodal_gen/registry.py
#	python/sglang/multimodal_gen/runtime/pipelines_core/stages/conditioning.py
#	python/sglang/multimodal_gen/runtime/pipelines_core/stages/text_encoding.py
@github-actions github-actions bot added the diffusion SGLang Diffusion label Nov 21, 2025
@IPostYellow
Author

> Awesome job, thanks! We'll get back to this PR once the necessary CI tests and refactors are added.

Hi, thanks for the feedback! Just wanted to let you know that:

  • All required CI tests are now passing
  • I've merged the latest architectural changes from main into this PR
  • The branch is ready for review whenever you have time

Let me know if there's anything specific you'd like me to address! Thanks for your time.

logger = init_logger(__name__)


class TestStableDiffusionT2Image(TestGenerateBase):
Collaborator

Sorry, the CLI test is deprecated. Could we add it to test_server_a.py? Thanks.

@IPostYellow IPostYellow force-pushed the support_stablediffusion3medium branch from ca55334 to 8a52f14 on November 27, 2025 at 06:02

Labels

diffusion (SGLang Diffusion), run-ci
