
UPSTREAM PR #1327: refactor: move all cache parameter defaults to the library #80

Open
loci-dev wants to merge 2 commits into main from loci/pr-1327-sd_cache_defaults_init

Conversation


@loci-dev loci-dev commented Mar 7, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1327

Simplifies cache initialization for library users.

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod on March 7, 2026 04:08 with GitHub Actions (Inactive)

loci-review bot commented Mar 7, 2026

Overview

Analysis of 49,731 functions (98 modified, 2 new, 0 removed) across commit 31f042c ("refactor: move all cache parameter defaults to the library") shows minor overall impact with slight power efficiency improvements.

Binaries analyzed:

  • build.bin.sd-server: -0.017% power consumption (-87.70 nJ)
  • build.bin.sd-cli: -0.024% power consumption (-120.30 nJ)

The refactoring centralizes cache parameter initialization logic, trading minimal overhead in non-critical paths for improved code maintainability.

Function Analysis

Most significant changes:

nearest_int (build.bin.sd-server, ggml/src/ggml-cpu/repack.cpp):

  • Response time: 159.08 ns → 351.22 ns (+192.14 ns, +120.8%)
  • Throughput time: 144.35 ns → 336.49 ns (+192.14 ns, +133.1%)
  • Compiler-induced CFG fragmentation added entry block overhead. Located in ggml-cpu repack operations; potential impact if called frequently during quantization.

std::vector::begin (build.bin.sd-server):

  • Response time: 264.17 ns → 83.36 ns (-180.81 ns, -68.4%)
  • Throughput time: 243.30 ns → 62.49 ns (-180.81 ns, -74.3%)
  • Compiler optimization consolidated entry blocks, eliminating intermediate jumps.

sd_img_gen_params_to_str (both binaries):

  • Response time: +183 ns (sd-server), +161 ns (sd-cli)
  • Intentional overhead from the new get_cache_reuse_threshold() call (202-204 ns), which implements mode-specific defaults. This is a non-critical logging function.

std::vector allocator and string operations showed 40-60% improvements through compiler optimizations (entry block consolidation, reduced branching). Other analyzed functions saw negligible changes or minor regressions in non-critical paths (regex compilation, comparators).

Additional Findings

No changes to GPU kernels, ML inference operations, or performance-critical diffusion sampling loops. The refactoring successfully isolates overhead to initialization/logging functions while compiler optimizations in standard library code provide net performance benefits. The nearest_int regression warrants profiling to confirm it doesn't affect quantization hot paths.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the loci/pr-1327-sd_cache_defaults_init branch from 31f042c to 6950062 on March 10, 2026 04:39
@loci-dev loci-dev deployed to stable-diffusion-cpp-prod on March 10, 2026 04:39 with GitHub Actions (Active)

loci-review bot commented Mar 10, 2026

Overview

Analysis of 49,820 functions across 2 binaries reveals minimal performance impact from the cache parameter refactoring. Modified: 116 functions (0.23%); new: 2; removed: 40; unchanged: 49,662 (99.68%).

Power Consumption:

  • build.bin.sd-server: +0.135% (+712.43 nJ)
  • build.bin.sd-cli: -0.099% (-486.34 nJ)

The changes remove non-functional cache preset support and consolidate defaults into the library layer, across 2 commits touching 6 files.

Function Analysis

Performance variations are primarily compiler code generation artifacts in STL template instantiations, not source code modifications. Core inference operations remain unchanged.

Key Regressions:

  • std::_Rb_tree::end() (sd-server): Response time +231% (+183 ns), throughput time +307% (+183 ns). The entry block was restructured with an additional basic block, increasing overhead roughly 9x. No source changes; this is STL library code.
  • std::vector<ggml_backend_feature>::begin() (sd-server): Response time +217% (+181 ns), throughput time +289% (+181 ns). An unnecessary unconditional branch creates a loop-like structure. The STL template code is unchanged.
  • GGMLRunner::copy_data_to_backend_tensor() (sd-cli): Response time +13% (+198 ns), throughput time +132% (+198 ns). A split entry prologue adds overhead. Source unchanged; called during initialization, not in the inference hot path.

Key Improvements:

  • std::vector<std::pair<string,float>>::begin() (sd-cli): Response time -68% (-181 ns), throughput time -74% (-181 ns). Consolidated entry block eliminates intermediate jumps.
  • std::_Hash_code_base::_M_bucket_index() (sd-cli): Response time -30% (-35 ns), throughput time -37% (-35 ns). Benefits all unordered_map operations used for tensor lookups.
  • std::vector<TensorStorage*>::empty() (sd-server): Response time -41% (-190 ns), throughput time -74% (-190 ns). Eliminated redundant control flow.

Other analyzed functions showed minor variations consistent with compiler optimization trade-offs.

Additional Findings

GPU/ML Operations: No impact on inference hot path. Backend tensor copy overhead (+198 ns) is negligible compared to actual memory transfer time (microseconds to milliseconds). UNet, VAE, and CLIP operations unchanged.

Source Code Context: Cache refactoring successfully simplified configuration management (81 lines removed) without algorithmic changes. Performance variations stem from compiler code generation differences (entry block splitting, instruction scheduling, template instantiation) rather than the refactoring itself. Improvements in hash table operations benefit model parameter and tensor lookups.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

