automatic utility threshold by peterboncz · Pull Request #14 · cwida/pac

peterboncz · 2026-04-02T22:17:51Z

Utility threshold: probabilistically NULL low-SNR aggregate results

Adds a pac_utility_threshold setting that probabilistically NULLs aggregate result cells whose signal-to-noise ratio (z-score = |value| / noise_std) is below a configurable threshold. It uses a smooth sigmoid P(keep) function, which is safe post-processing by the data processing inequality.
Default is NULL (disabled), so there is no behavior change unless the setting is explicitly enabled (e.g. SET pac_utility_threshold = 4).
Supported across all PAC aggregate finalize paths: pac_noised_sum, pac_noised_count, pac_noised_min/max, pac_noised_clip_sum, pac_noised_clip_min/max, and categorical pac_noised/pac_noised_div.
Extended PacNoisySampleFrom64Counters with an optional out_noise_variance output parameter so finalize functions can pass the noise variance to PacUtilityNull.
Noise variance scaling matches the compensation factor: 4x for sum/count (which use 2x compensation, and variance scales with the square of that factor), 1x for min/max (no compensation).
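The thresholding described above can be sketched as follows. This is a minimal illustration in Python, not the extension's C++ code: the logistic form of the sigmoid, the `steepness` parameter, and the function names `keep_probability`, `utility_null`, and `scaled_noise_variance` are all assumptions made for the example. Only the z-score definition and the 4x/1x variance scaling come from the description.

```python
import math
import random


def keep_probability(value, noise_variance, threshold, steepness=1.0):
    """P(keep) for one result cell.

    Assumed form: a logistic curve centred on the threshold. The PR only
    says the function is a smooth sigmoid of the z-score.
    """
    noise_std = math.sqrt(noise_variance)
    if noise_std == 0.0:
        return 1.0  # no noise, so there is nothing to suppress
    z = abs(value) / noise_std  # z-score = |value| / noise_std
    x = steepness * (z - threshold)
    if x < -700.0:
        return 0.0  # avoid overflow in math.exp for extreme thresholds
    return 1.0 / (1.0 + math.exp(-x))


def utility_null(value, noise_variance, threshold, rng=random):
    """Return the value, or None (NULL) with probability 1 - P(keep)."""
    if threshold is None:
        return value  # setting disabled: never NULL anything
    p = keep_probability(value, noise_variance, threshold)
    return value if rng.random() < p else None


def scaled_noise_variance(base_variance, compensation):
    """Scaling a noisy value by c scales its variance by c squared."""
    return compensation ** 2 * base_variance
```

Under these assumptions a cell whose z-score equals the threshold is kept half of the time, and `scaled_noise_variance(v, 2)` gives the 4x factor quoted for sum/count while `scaled_noise_variance(v, 1)` gives the 1x factor for min/max.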

New test/sql/pac_utility_threshold.test (49 assertions) verifying:

Default off: no NULLs
Low threshold: no NULLs
Moderate threshold: some NULLs (sum, count, clip_sum)
Very high threshold: all NULLed
Disable/re-enable cycle works
Clip variants (pac_clip_support = 40) work with utility threshold active
Clip min/max paths wired up (no crash)

Change suffix attenuation from soft-clamp (scale by 16^distance) to hard-zero (skip entirely). Unsupported magnitude levels now contribute nothing to the result, fully eliminating the variance side-channel.
Attack results with clip_support=2:
- Small filter (3-4 users): 96% → 47% (random)
- 20K small items: 96% → 53% (random)
- Std ratio in/out: 90x → 0.87x
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
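The difference between the two attenuation modes can be illustrated with a small Python sketch. It rests on assumptions that the commit message does not spell out: that "scale by 16^distance" means dividing an unsupported level's partial sum by 16 per level of distance above the highest supported level, and that support is a per-level boolean. The function name `combine_levels` and its arguments are hypothetical.

```python
def combine_levels(level_sums, supported, mode):
    """Combine per-level partial sums into one result.

    level_sums[i]: partial sum of the values at magnitude level i
    supported[i]:  True if level i has enough distinct contributors
    mode:          "soft" (attenuate the suffix) or "hard" (skip it)
    """
    # Highest supported level; everything above it is the unsupported suffix.
    top = max((i for i, ok in enumerate(supported) if ok), default=-1)
    total = 0.0
    for level, partial in enumerate(level_sums):
        if level <= top:
            total += partial  # at or below the supported range: keep as-is
        elif mode == "soft":
            # Assumed soft-clamp: shrink by 16 per level of distance.
            total += partial / (16 ** (level - top))
        elif mode == "hard":
            continue  # hard-zero: the level contributes nothing
        else:
            raise ValueError(f"unknown mode: {mode}")
    return total


sums = [10.0, 20.0, 1600.0]
support = [True, True, False]
soft = combine_levels(sums, support, "soft")  # 30 + 1600/16 = 130.0
hard = combine_levels(sums, support, "hard")  # 30.0
```

In this toy case the soft-clamped result still moves with the outlier level (130.0 vs 30.0), which is the kind of dependence the hard-zero change removes.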

Finer-grained magnitude levels (2-bit bands, 4x per level) allow the clipping mechanism to catch moderate outliers that were previously invisible within the same 16x-wide level. A 10x outlier (50k vs 5k normal) now lands in a different level and gets hard-zeroed.
Changes:
- PAC2_LEVEL_SHIFT: 4 → 2
- PAC2_NUM_LEVELS: 31 → 32 (covers int64; HUGEINT clamps to level 31)
- GetLevel/GetLevel128: divide by 2 instead of 4, clamp to max level
- Inline optimization threshold: 13 → 14
- All shift extraction: level << 2 → level << 1
Memory: +8 bytes per state (256 vs 248 byte pointer array). Negligible.
Performance: no regression on TPCH Q01 SF1 (1.38s → 1.31s).
Security: moderate outlier attack drops from 76.5% to 52.9% (random).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
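The 5k vs 50k example can be checked with a short Python sketch. The mapping below (index of the highest set bit, integer-divided by the level shift, clamped to the last level) is a simplified assumption for illustration; the extension's actual GetLevel/GetLevel128 may apply offsets that are not described here. The name `get_level` and its parameters are hypothetical.

```python
def get_level(value, level_shift, num_levels):
    """Map a value to a magnitude level (simplified, assumed mapping)."""
    magnitude = abs(value)
    if magnitude == 0:
        return 0
    top_bit = magnitude.bit_length() - 1  # floor(log2(magnitude))
    # Each level spans `level_shift` bits; clamp to the highest level.
    return min(top_bit // level_shift, num_levels - 1)


# 4-bit bands (16x per level): 5k and 50k share a level.
old_normal = get_level(5_000, 4, 31)
old_outlier = get_level(50_000, 4, 31)

# 2-bit bands (4x per level): the 10x outlier moves to its own level.
new_normal = get_level(5_000, 2, 32)
new_outlier = get_level(50_000, 2, 32)

# Pointer array size with 8-byte pointers: 31 levels vs 32 levels.
old_bytes, new_bytes = 31 * 8, 32 * 8  # 248 and 256
```

With this mapping both values fall in level 3 under the old shift, and in levels 6 and 7 under the new one, so only the finer bands separate the outlier from the normal values.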

…vior
With hard-zero, unsupported outlier levels contribute nothing, so the clipped result equals (rather than exceeds) the no-outlier baseline. Change > to >=.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Increase PAC2_NUM_LEVELS from 32 to 62 to cover the full 128-bit range without clamping. int64 values naturally use only levels 0-29 (the extra pointer slots remain NULL, no per-level data is allocated). The inline optimization threshold moves from 14 to 44 accordingly.
Memory: +240 bytes per state for the pointer array (496 vs 256 bytes). Per-level data allocations are unchanged for int64 workloads.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
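Why the larger pointer array costs only its own 240 bytes can be seen in a small Python model of lazy per-level allocation. This is a sketch under assumptions: the class `LevelState`, its methods, and the simplified level mapping (highest set bit divided by 2) are hypothetical stand-ins, not the extension's data structures. Only the 8-byte pointer size and the 62/32 level counts follow from the figures in the commit message.

```python
POINTER_BYTES = 8  # one 64-bit pointer per level slot


class LevelState:
    """Toy aggregate state with one lazily allocated bucket per level."""

    def __init__(self, num_levels=62):
        # Every slot starts empty (None plays the role of a NULL pointer).
        self.levels = [None] * num_levels

    def add(self, value):
        magnitude = abs(value)
        top_bit = magnitude.bit_length() - 1 if magnitude else 0
        # Simplified mapping: 2-bit bands, clamped to the last level.
        level = min(top_bit // 2, len(self.levels) - 1)
        if self.levels[level] is None:
            self.levels[level] = []  # allocate per-level data on first use
        self.levels[level].append(value)

    def pointer_array_bytes(self):
        return len(self.levels) * POINTER_BYTES

    def allocated_levels(self):
        return sum(slot is not None for slot in self.levels)


state = LevelState()
for v in (5_000, 6_000, 50_000):
    state.add(v)
```

Here the pointer array grows from 256 to 496 bytes, but only the two levels actually touched by the inserted values hold any data; the remaining slots stay empty.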

…i-group
New test cases:
- Level boundary routing (same-level vs cross-level with 4x bands)
- HUGEINT outlier clipping (values at 2^70, beyond int64 range)
- Negative HUGEINT outlier via neg_state
- Over-clipping (clip_support > group size → zero result)
- Multi-group with outlier isolated to one group
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Fetched from main and added:
- Development rules: test coverage, no test removal, codebase-first search, helper function reuse, duckdb submodule is read-only
- Reference to the PAC paper (arXiv:2603.15023)
- PAC_DEBUG_PRINT usage guidance
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Attack scripts testing the variance side-channel MIA against pac_clip_sum:
- clip_attack_test.sh: main suite (small filter, wide filter, 10K users, etc.)
- clip_multirow_test.sh: 20K small items user (tests pre-aggregation)
- clip_hardzero_stress.sh: stress tests (high trials, composed queries, collusion)
- clip_shift2_stress.sh: tests with 4x magnitude levels (shift=2)
- clipping_experiment.sh: input clipping (Winsorization) baseline
- output_clipping_experiment.sh: post-hoc output clipping baseline
- output_clipping_v2_experiment.sh: output clipping before noise
- clip_attack_results.md: full evaluation with findings
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- CLAUDE.md: added code style rules (clang-tidy naming, clang-format style), attack evaluation section, development rules
- .claude/settings.json: PostToolUse hook to auto-run make format-fix after edits
- Skills: /run-attacks, /test-clip, /explain-pac, /explain-dp, /explain-pac-ddl
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>