
automatic utility threshold #14

Merged
peterboncz merged 32 commits into pac_utility_threshold from main
Apr 2, 2026

peterboncz commented Apr 2, 2026

Utility threshold: probabilistically NULL low-SNR aggregate results

  • Adds a pac_utility_threshold setting that probabilistically NULLs aggregate result cells whose signal-to-noise ratio (z-score = |value| / noise_std) is below a configurable threshold. The keep probability P(keep) is a smooth sigmoid of the z-score. Because the NULLing is applied to the already-noised result, it is post-processing and is safe by the data processing inequality.
  • Default is NULL (disabled) — no behavior change unless explicitly enabled (e.g. SET pac_utility_threshold = 4).
  • Supported across all PAC aggregate finalize paths: pac_noised_sum, pac_noised_count, pac_noised_min/max, pac_noised_clip_sum, pac_noised_clip_min/max, and categorical pac_noised/pac_noised_div.
  • Extended PacNoisySampleFrom64Counters with an optional out_noise_variance output parameter so finalize functions can pass the noise variance to PacUtilityNull.
  • Noise variance scaling matches the compensation factor: 4x for sum/count (which use 2x compensation), 1x for min/max (no compensation).
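
The bullets above can be illustrated with a minimal sketch. It is not the extension's PacUtilityNull: the function names, the `steepness` parameter, and the exact sigmoid shape are assumptions made for illustration, since the description only states that P(keep) is a smooth sigmoid of the z-score. The variance helper relies only on the standard identity Var(cX) = c² Var(X), which is consistent with the 4x figure quoted for a 2x compensation factor.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical sketch, NOT the extension's PacUtilityNull.
// z = |value| / noise_std, P(keep) = 1 / (1 + exp(-steepness * (z - threshold))).
// The sigmoid shape and the steepness parameter are assumptions.
inline double UtilityKeepProbability(double value, double noise_variance, double threshold,
                                     double steepness = 1.0) {
    assert(noise_variance >= 0.0 && steepness > 0.0);
    if (noise_variance == 0.0) {
        return 1.0; // no noise at all: treat the cell as fully reliable
    }
    const double z = std::fabs(value) / std::sqrt(noise_variance);
    return 1.0 / (1.0 + std::exp(-steepness * (z - threshold)));
}

// Scaling a noisy value by a compensation factor c scales its variance by c^2,
// so a 2x compensation gives 4x variance and no compensation (1x) leaves it unchanged.
inline double CompensatedVariance(double base_variance, double compensation) {
    return base_variance * compensation * compensation;
}

// Draws the keep/NULL decision; returns true when the cell should be NULLed.
template <class RNG>
bool UtilityShouldNull(double value, double noise_variance, double threshold, RNG &rng) {
    std::bernoulli_distribution keep(UtilityKeepProbability(value, noise_variance, threshold));
    return !keep(rng);
}
```

Under these assumptions a cell whose z-score equals the threshold is kept half of the time, cells far above the threshold are almost always kept, and cells near zero are almost always NULLed.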

New test/sql/pac_utility_threshold.test (49 assertions) verifying:

  • Default off: no NULLs
  • Low threshold: no NULLs
  • Moderate threshold: some NULLs (sum, count, clip_sum)
  • Very high threshold: all NULLed
  • Disable/re-enable cycle works
  • Clip variants (pac_clip_support = 40) work with utility threshold active
  • Clip min/max paths wired up (no crash)

peterboncz and others added 30 commits March 23, 2026 23:36
Change suffix attenuation from soft-clamp (scale by 16^distance) to hard-zero
(skip entirely). Unsupported magnitude levels now contribute nothing to the
result, fully eliminating the variance side-channel.

Attack results with clip_support=2:
- Small filter (3-4 users): 96% → 47% (random)
- 20K small items: 96% → 53% (random)
- Std ratio in/out: 90x → 0.87x

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Finer-grained magnitude levels (2-bit bands, 4x per level) allow the clipping
mechanism to catch moderate outliers that were previously invisible within the
same 16x-wide level. A 10x outlier (50k vs 5k normal) now lands in a different
level and gets hard-zeroed.

Changes:
- PAC2_LEVEL_SHIFT: 4 → 2
- PAC2_NUM_LEVELS: 31 → 32 (covers int64; HUGEINT clamps to level 31)
- GetLevel/GetLevel128: divide by 2 instead of 4, clamp to max level
- Inline optimization threshold: 13 → 14
- All shift extraction: level << 2 → level << 1
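
To make the effect of narrower bands concrete, here is a simplified sketch of a level function. It assumes that a level is the index of the highest set bit divided by the level shift; the real GetLevel/GetLevel128 may apply offsets or other details not described here, and the names `HighestBitIndex` and `MagnitudeLevel` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Index of the highest set bit (0-based). Precondition: v != 0.
inline int HighestBitIndex(uint64_t v) {
    assert(v != 0);
    int idx = 0;
    while (v >>= 1) {
        ++idx;
    }
    return idx;
}

// Hypothetical level function: one level per `level_shift` bits of magnitude,
// clamped to the highest available level. Not the extension's actual GetLevel.
inline int MagnitudeLevel(uint64_t v, int level_shift, int num_levels) {
    assert(level_shift > 0 && num_levels > 0);
    if (v == 0) {
        return 0;
    }
    const int level = HighestBitIndex(v) / level_shift;
    return level < num_levels ? level : num_levels - 1;
}
```

With 4-bit bands, 5,000 (highest bit 12) and 50,000 (highest bit 15) both map to level 3 in this sketch, so the outlier cannot be told apart from normal values. With 2-bit bands they map to levels 6 and 7, which is the separation the commit message describes.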

Memory: +8 bytes per state (256 vs 248 byte pointer array). Negligible.
Performance: no regression on TPCH Q01 SF1 (1.38s → 1.31s).
Security: moderate outlier attack drops from 76.5% to 52.9% (random).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…vior

With hard-zero, unsupported outlier levels contribute nothing, so the
clipped result equals (not exceeds) the no-outlier baseline. Change > to >=.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Increase PAC2_NUM_LEVELS from 32 to 62 to cover the full 128-bit range
without clamping. int64 values naturally use only levels 0-29 (the extra
pointer slots remain NULL, no per-level data is allocated). The inline
optimization threshold moves from 14 to 44 accordingly.

Memory: +240 bytes per state for the pointer array (496 vs 256 bytes).
Per-level data allocations are unchanged for int64 workloads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…i-group

New test cases:
- Level boundary routing (same-level vs cross-level with 4x bands)
- HUGEINT outlier clipping (values at 2^70, beyond int64 range)
- Negative HUGEINT outlier via neg_state
- Over-clipping (clip_support > group size → zero result)
- Multi-group with outlier isolated to one group

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fetched from main and added:
- Development rules: test coverage, no test removal, codebase-first search,
  helper function reuse, duckdb submodule is read-only
- Reference to the PAC paper (arXiv:2603.15023)
- PAC_DEBUG_PRINT usage guidance

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Attack scripts testing the variance side-channel MIA against pac_clip_sum:
- clip_attack_test.sh: main suite (small filter, wide filter, 10K users, etc.)
- clip_multirow_test.sh: 20K small items user (tests pre-aggregation)
- clip_hardzero_stress.sh: stress tests (high trials, composed queries, collusion)
- clip_shift2_stress.sh: tests with 4x magnitude levels (shift=2)
- clipping_experiment.sh: input clipping (Winsorization) baseline
- output_clipping_experiment.sh: post-hoc output clipping baseline
- output_clipping_v2_experiment.sh: output clipping before noise
- clip_attack_results.md: full evaluation with findings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- CLAUDE.md: added code style rules (clang-tidy naming, clang-format style),
  attack evaluation section, development rules
- .claude/settings.json: PostToolUse hook to auto-run make format-fix after edits
- Skills: /run-attacks, /test-clip, /explain-pac, /explain-dp, /explain-pac-ddl

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the pac_metadata JSON sidecar files: naming convention, auto-loading,
save/clear pragmas, and the important note to delete metadata when recreating DBs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
explain-pac: added formal PAC definition, 4-step privatization template,
MI-to-posterior success rate table, composition theorem, PAC vs DP comparison,
and SIMD-PAC-DB implementation details.

explain-dp: added PAC vs DP comparison table, loose bounds insight,
privacy-conscious design (MSE = Bias² + (1/(2B)+1)·Var), and implications
for clipping (reducing variance improves privacy-utility tradeoff).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements level-based clipping for MIN/MAX aggregates (pac_clip_min,
pac_clip_max, pac_noised_clip_min, pac_noised_clip_max) using int8_t
extremes with per-level bitmaps for support estimation. Replaces the
previous alias-only stubs with a real implementation that reuses
UpdateExtremesSIMD from pac_min_max.hpp.

Adds native FLOAT/DOUBLE overloads for pac_clip_sum and pac_clip_min_max
using power-of-2 scale factors (2^20 for float, 2^27 for double) to
convert to int64 before entering the integer-based level machinery.
Removes the lossy BIGINT cast workaround from the expression builder.

Includes BOUNDOPT (per-level bound optimization), AllValid fast paths,
and shared ScaleFloatToInt64 helper with branchless clamping.
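
A float-to-integer conversion with a power-of-2 scale factor and clamping could look roughly like the sketch below. It is not the actual ScaleFloatToInt64 helper: the name `ScaleToInt64Sketch`, the clamp bounds, and the NaN handling are assumptions. The clamp happens before the cast because converting an out-of-range double to int64_t is undefined behavior in C++.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sketch of scaling a floating-point value by 2^scale_bits and
// converting it to int64_t. std::fmin/std::fmax usually compile to min/max
// instructions, so the clamp is typically branch-free.
inline int64_t ScaleToInt64Sketch(double value, int scale_bits) {
    assert(scale_bits >= 0);
    constexpr double kLo = -9223372036854775808.0; // -2^63, exactly representable
    constexpr double kHi = 9223372036854774784.0;  // 2^63 - 1024, largest double below 2^63
    const double scaled = std::ldexp(value, scale_bits); // value * 2^scale_bits
    // fmax returns the non-NaN argument, so a NaN input ends up at kLo here.
    const double clamped = std::fmin(std::fmax(scaled, kLo), kHi);
    return static_cast<int64_t>(clamped); // truncates toward zero
}
```

For example, with the 2^20 scale mentioned for float inputs, 1.5 becomes 1572864, while values too large for int64 saturate at the clamp bounds instead of overflowing.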

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds github.com/ila/duckdb-claude-skills at .claude/skills/shared/
with 7 generic DuckDB extension skills: best-practices, code-review,
plan-feature, project-review, duckdb-internals, write-docs, run-tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… previous

 merge commit already -- apologies).

Refactor pac_clip: shared code, two-sided unsigned min/max, unified outlier clipping

- Factor shared code into pac_clip_aggr.hpp: CLIP_* constants, ScaleFloatToInt64,
  ClipEstimateDistinct, PacClipBindData, PacClipBind functions. Remove duplicates
  from pac_clip_sum.hpp/cpp and pac_clip_min_max.hpp/cpp.

- Convert pac_clip_min_max from signed int8_t to two-sided unsigned uint8_t:
  positive values in pos_state, absolute negatives in neg_state (with !IS_MAX).
  GetLevel threshold 128→256, giving 8-bit precision instead of 7-bit.
  Lazy neg_state allocation: positive-only data never allocates it.

- Unify outlier elimination across sum and min/max using shared
  ClipFindSupportedRange and ClipEffectiveLevel helpers. Both now use
  first/last supported boundary logic (min/max previously did per-level
  independent filtering, missing interior-level preservation).

- Add pac_clip_scale setting (BOOLEAN, default false). When false, unsupported
  prefix/suffix levels are omitted. When true, they are scaled to the nearest
  supported boundary (4^distance). This replaces sum's previous asymmetric
  behavior (prefix scaled, suffix omitted) with a symmetric policy.

- Remove stale clip min/max stub registrations from pac_min_max.cpp
  (superseded by real implementations in pac_clip_min_max.cpp).

- Remove C++17 if constexpr usage from pac_clip_min_max.

- Add tests for negative values, mixed pos/neg, negative-only, and
  neg-outlier clipping in pac_clip_min_max.test.
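
The first/last supported boundary logic described above can be sketched as follows. This is an illustration under stated assumptions, not the code in pac_clip_aggr.hpp: `FindSupportedRange` and `ClippedSum` are hypothetical names, a level is assumed to count as supported when its estimated number of distinct contributors is at least clip_support, and only the default omit policy (pac_clip_scale = false) is modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a helper like ClipFindSupportedRange.
// Returns {first, last} supported level, or {-1, -1} if no level is supported.
inline std::pair<int, int> FindSupportedRange(const std::vector<uint32_t> &support,
                                              uint32_t clip_support) {
    int first = -1;
    int last = -1;
    for (int i = 0; i < static_cast<int>(support.size()); ++i) {
        if (support[i] >= clip_support) { // assumed support criterion
            if (first < 0) {
                first = i;
            }
            last = i;
        }
    }
    return {first, last};
}

// Omit policy: levels outside [first, last] contribute nothing, while interior
// levels are preserved even if they are individually unsupported.
inline int64_t ClippedSum(const std::vector<int64_t> &level_sums,
                          const std::vector<uint32_t> &support, uint32_t clip_support) {
    assert(level_sums.size() == support.size());
    const auto range = FindSupportedRange(support, clip_support);
    if (range.first < 0) {
        return 0; // over-clipping: nothing is supported
    }
    int64_t total = 0;
    for (int i = range.first; i <= range.second; ++i) {
        total += level_sums[i];
    }
    return total;
}
```

With per-level support {5, 0, 7, 1} and clip_support = 2, the supported range is levels 0 to 2: the unsupported interior level 1 is kept, and the suffix level 3 holding the outlier is dropped.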
Memory optimizations for clipping:
- save the second state pointer for unsigned types (which are one-sided)
- only HUGEINT needs 62 levels; int64 needs just 30.
  Templating makes both variants possible in the same code
- we do not reduce the level count for types narrower than int64: if we did,
  inlining would not work and there would be no memory savings anyway
ila and others added 2 commits April 2, 2026 22:46
pac_clip_sum, pac_clip_min, pac_clip_max: Per-User Contribution Clipping for PAC Aggregates
@peterboncz peterboncz merged commit e6ebdf4 into pac_utility_threshold Apr 2, 2026
30 checks passed