UPSTREAM PR #1274: feat: option to enable vae tiling automatically for large images #60

Open
loci-dev wants to merge 1 commit into main from loci/pr-1274-sd_vae_tiling_threshold


@loci-dev
Note

Source pull request: leejet/stable-diffusion.cpp#1274

This allows controlling the use of VAE tiling according to the requested image size: tiling will be enabled only for images larger than a given threshold.

Koboldcpp has this same feature. It's mainly useful when the size isn't known at launch time, like in the sd-server, and when the size comes from an input image.

The threshold is specified as the side of a square image: --vae-tiling 768 enables tiling for images larger than 768x768. This form was chosen mainly because it's easy to test with an NxN image, and it's more meaningful than a unit like 'megapixels'. Another possibility would be specifying the size with a string like "768x768". A memory threshold would arguably be better, but it would be harder to implement, since the exact usage depends heavily on the specific backend and flags.

To stay compatible with current usage while avoiding an extra flag, the value is optional: --vae-tiling alone is the same as --vae-tiling 1, which tiles effectively every image, and --vae-tiling 0 means no tiling.
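The flag semantics described above can be sketched roughly as follows. This is an illustration, not the code from the PR: the helper names `parse_vae_tiling_threshold` and `should_tile` are hypothetical, and comparing by pixel area (width x height against threshold squared) is an assumption about what "larger than 768x768" means for non-square images.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): find "--vae-tiling" in the argument
// list and return the threshold as a square side in pixels.
// 0 means tiling is disabled; a bare flag is treated as threshold 1.
int parse_vae_tiling_threshold(const std::vector<std::string>& args) {
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] != "--vae-tiling") continue;
        // The value is optional: consume the next token only if it is all digits.
        if (i + 1 < args.size() && !args[i + 1].empty() &&
            args[i + 1].find_first_not_of("0123456789") == std::string::npos) {
            return std::stoi(args[i + 1]);  // throws on out-of-range values
        }
        return 1;  // bare "--vae-tiling": same as "--vae-tiling 1"
    }
    return 0;  // flag absent: no tiling
}

// Hypothetical helper (not from the PR): decide whether to tile an image.
// Assumption: "larger than NxN" is interpreted as a pixel-area comparison.
bool should_tile(int threshold, int width, int height) {
    assert(width > 0 && height > 0);
    if (threshold <= 0) return false;  // 0 disables tiling
    return static_cast<std::int64_t>(width) * height >
           static_cast<std::int64_t>(threshold) * threshold;
}
```

Under these assumptions, `should_tile(768, 1024, 1024)` is true, `should_tile(768, 768, 768)` is false, and a threshold of 1 tiles any image larger than a single pixel, which matches the old boolean behavior in practice.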

@loci-review
loci-review bot commented Feb 18, 2026

Overview

Analysis of 48,340 functions across two Stable Diffusion C++ binaries reveals modest performance changes between versions. Modified functions: 102 (0.21%), new functions: 43, removed: 0, unchanged: 48,195 (99.7%).

Power Consumption:

  • build.bin.sd-server: 515,491 nJ → 518,799 nJ (+0.64%)
  • build.bin.sd-cli: 480,110 nJ → 483,356 nJ (+0.68%)

Function Analysis

The single intentional code change, the VAE tiling threshold feature, shows a significant improvement:

  • SDContextParams::operator() (both binaries): Response time improved 62% (6,757ns → 2,551ns). This is where the change enables adaptive VAE tiling based on image dimensions rather than a fixed boolean flag

Most performance changes occur in C++ standard library functions with no source modifications:

Regressions:

  • std::vector::begin (sd-cli): +289% throughput time (+181ns), likely from compiler inlining changes
  • _M_const_cast (sd-server): +284% throughput time (+182ns), Red-Black tree iterator operations
  • make_move_iterator (sd-cli): +216% throughput time (+169ns), move semantics utility
  • __negate (sd-server): +190% throughput time (+176ns), HTTP validation predicate
  • _M_destroy (sd-cli): +180% throughput time (+189ns), RMSNorm shared_ptr cleanup
  • ggml_time_us (sd-cli): +113% throughput time (+141ns), timing utility overhead doubled

Improvements:

  • std::vector::end (sd-cli): -75% throughput time (-183ns), better iterator optimization
  • make_move_iterator (sd-server): -68% throughput time (-169ns), improved code generation

Other analyzed functions showed minor changes in error handling and JSON deserialization paths.

Additional Findings

Performance changes stem primarily from compiler/toolchain differences rather than application code. The VAE tiling enhancement provides intelligent memory management for large image generation while avoiding overhead on small images, justifying the minimal 0.64-0.68% power increase. RMSNorm destruction regression may accumulate during model cleanup but occurs outside core inference loops. Most regressions affect supporting infrastructure (STL containers, timing utilities) with sub-microsecond absolute impacts.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.
