
UPSTREAM PR #1327: refactor: move all cache parameter defaults to the library #80

Open
loci-dev wants to merge 2 commits into main from loci/pr-1327-sd_cache_defaults_init

Conversation


@loci-dev loci-dev commented Mar 7, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1327

Simplifies cache initialization for library users.

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod on March 7, 2026 04:08 with GitHub Actions (Inactive)

loci-review bot commented Mar 7, 2026

Overview

Analysis of 49,731 functions (98 modified, 2 new, 0 removed) across commit 31f042c ("refactor: move all cache parameter defaults to the library") shows minor overall impact with slight power efficiency improvements.

Binaries analyzed:

  • build.bin.sd-server: -0.017% power consumption (-87.70 nJ)
  • build.bin.sd-cli: -0.024% power consumption (-120.30 nJ)

The refactoring centralizes cache parameter initialization logic, trading minimal overhead in non-critical paths for improved code maintainability.

Function Analysis

Most significant changes:

nearest_int (build.bin.sd-server, ggml/src/ggml-cpu/repack.cpp):

  • Response time: 159.08 ns → 351.22 ns (+192.14 ns, +120.8%)
  • Throughput time: 144.35 ns → 336.49 ns (+192.14 ns, +133.1%)
  • Compiler-induced CFG fragmentation added entry block overhead. Located in ggml-cpu repack operations; potential impact if called frequently during quantization.

std::vector::begin (build.bin.sd-server):

  • Response time: 264.17 ns → 83.36 ns (-180.81 ns, -68.4%)
  • Throughput time: 243.30 ns → 62.49 ns (-180.81 ns, -74.3%)
  • Compiler optimization consolidated entry blocks, eliminating intermediate jumps.

sd_img_gen_params_to_str (both binaries):

  • Response time: +183 ns (sd-server), +161 ns (sd-cli)
  • Intentional overhead from the new get_cache_reuse_threshold() call (202-204 ns), which implements mode-specific defaults. This is a non-critical logging function.

std::vector allocator and string operations showed 40-60% improvements through compiler optimizations (entry block consolidation, reduced branching). Other analyzed functions saw negligible changes or minor regressions in non-critical paths (regex compilation, comparators).

Additional Findings

No changes to GPU kernels, ML inference operations, or performance-critical diffusion sampling loops. The refactoring successfully isolates overhead to initialization/logging functions while compiler optimizations in standard library code provide net performance benefits. The nearest_int regression warrants profiling to confirm it doesn't affect quantization hot paths.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the loci/pr-1327-sd_cache_defaults_init branch from 31f042c to 6950062 on March 10, 2026 04:39
@loci-dev loci-dev deployed to stable-diffusion-cpp-prod on March 10, 2026 04:39 with GitHub Actions (Active)

loci-review bot commented Mar 10, 2026

Overview

Analysis of 49,820 functions across 2 binaries reveals minimal performance impact from the cache parameter refactoring. Modified: 116 functions (0.23%); new: 2; removed: 40; unchanged: 49,662 (99.68%).

Power Consumption:

  • build.bin.sd-server: +0.135% (+712.43 nJ)
  • build.bin.sd-cli: -0.099% (-486.34 nJ)

The changes remove non-functional cache preset support and consolidate defaults into the library layer, across 2 commits touching 6 files.

Function Analysis

Performance variations are primarily compiler code generation artifacts in STL template instantiations, not source code modifications. Core inference operations remain unchanged.

Key Regressions:

  • std::_Rb_tree::end() (sd-server): Response time +231% (+183 ns), throughput time +307% (+183 ns). The entry block was restructured with an additional basic block, increasing overhead roughly 9x. No source changes; this is STL library code.
  • std::vector<ggml_backend_feature>::begin() (sd-server): Response time +217% (+181 ns), throughput time +289% (+181 ns). An unnecessary unconditional branch creates a loop-like structure. The STL template code is unchanged.
  • GGMLRunner::copy_data_to_backend_tensor() (sd-cli): Response time +13% (+198 ns), throughput time +132% (+198 ns). A split entry prologue adds overhead. Source unchanged; called during initialization, not in the inference hot path.

Key Improvements:

  • std::vector<std::pair<string,float>>::begin() (sd-cli): Response time -68% (-181 ns), throughput time -74% (-181 ns). Consolidated entry block eliminates intermediate jumps.
  • std::_Hash_code_base::_M_bucket_index() (sd-cli): Response time -30% (-35 ns), throughput time -37% (-35 ns). Benefits all unordered_map operations used for tensor lookups.
  • std::vector<TensorStorage*>::empty() (sd-server): Response time -41% (-190 ns), throughput time -74% (-190 ns). Eliminated redundant control flow.

Other analyzed functions showed minor variations consistent with compiler optimization trade-offs.

Additional Findings

GPU/ML Operations: No impact on inference hot path. Backend tensor copy overhead (+198 ns) is negligible compared to actual memory transfer time (microseconds to milliseconds). UNet, VAE, and CLIP operations unchanged.

Source Code Context: Cache refactoring successfully simplified configuration management (81 lines removed) without algorithmic changes. Performance variations stem from compiler code generation differences (entry block splitting, instruction scheduling, template instantiation) rather than the refactoring itself. Improvements in hash table operations benefit model parameter and tensor lookups.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

