feat: Add Bloom Filter Support for Equality Query Optimization #69

suxiaogang223 · 2025-12-06T12:42:16Z

Summary

This PR implements Bloom Filter support for ORC files to optimize equality query performance. Bloom Filters allow the reader to quickly determine if a value might exist in a row group, enabling entire row groups to be skipped when the filter indicates the value is definitely not present.

This builds upon the Row Group Index support implemented in #64 and is part of the ongoing predicate pushdown optimization work.

Changes

Core Implementation

- New module bloom_filter.rs: Implements Bloom Filter parsing and querying
  - Parses Bloom Filter data from BLOOM_FILTER and BLOOM_FILTER_UTF8 streams
  - Implements might_contain() method to check if a value might be present
  - Supports both regular bitset (for numeric types) and UTF8 bitset (for string types)
- Hash functions:
  - murmur3_64(): Murmur3-128 hash function for string and binary types (per ORC spec)
  - thomas_wang_hash64(): Thomas Wang's 64-bit integer hash for numeric types
- Row index integration:
  - Extended RowGroupEntry to include optional BloomFilter
  - Added parse_bloom_filter_index() to parse Bloom Filter streams
  - Updated parse_stripe_row_indexes() to combine row indexes with Bloom Filters
- Predicate pushdown integration:
  - Modified evaluate_comparison() in row_group_filter.rs to check Bloom Filters first for equality queries
  - If Bloom Filter indicates value is definitely not present, the row group is skipped immediately
  - Falls back to statistics-based filtering if Bloom Filter check passes or is unavailable
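
The might_contain() check listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: SketchBloomFilter, add_hash, might_contain_hash, and probe are hypothetical names, and deriving the probe positions by splitting one 64-bit hash into two 32-bit halves is an assumption that should be checked against the ORC specification.

```rust
/// Hypothetical, simplified Bloom Filter over a bitset of 64-bit words.
pub struct SketchBloomFilter {
    bits: Vec<u64>,
    num_hashes: u32,
}

/// Derive the i-th probe position from a single 64-bit hash (double hashing).
/// ASSUMPTION: low/high 32-bit halves are combined as h1 + i * h2.
fn probe(hash64: u64, i: u32, num_bits: u64) -> usize {
    let h1 = hash64 as u32 as i32; // low 32 bits
    let h2 = (hash64 >> 32) as u32 as i32; // high 32 bits
    let mut combined = h1.wrapping_add((i as i32).wrapping_mul(h2));
    if combined < 0 {
        combined = !combined; // flip all bits so the value is non-negative
    }
    (combined as u64 % num_bits) as usize
}

impl SketchBloomFilter {
    pub fn new(num_bits: usize, num_hashes: u32) -> Self {
        // Round up to whole 64-bit words and keep at least one word and one probe.
        let words = ((num_bits + 63) / 64).max(1);
        Self { bits: vec![0u64; words], num_hashes: num_hashes.max(1) }
    }

    fn num_bits(&self) -> u64 {
        self.bits.len() as u64 * 64
    }

    /// Set every probe bit for an already-hashed value.
    pub fn add_hash(&mut self, hash64: u64) {
        let num_bits = self.num_bits();
        for i in 1..=self.num_hashes {
            let pos = probe(hash64, i, num_bits);
            self.bits[pos / 64] |= 1u64 << (pos % 64);
        }
    }

    /// false means "definitely absent"; true means "possibly present".
    pub fn might_contain_hash(&self, hash64: u64) -> bool {
        let num_bits = self.num_bits();
        (1..=self.num_hashes).all(|i| {
            let pos = probe(hash64, i, num_bits);
            self.bits[pos / 64] & (1u64 << (pos % 64)) != 0
        })
    }
}
```

The property that matters for row group skipping is that an added hash is never reported as absent; a true result can still be a false positive.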

Benefits

Performance improvement: Equality queries can skip entire row groups without reading data when Bloom Filters indicate the value is not present
Backward compatible: Works gracefully when Bloom Filters are not present in ORC files
Standards compliant: Follows ORC specification for Bloom Filter implementation

Testing

- Added unit tests for hash functions (test_thomas_wang_hash, test_murmur3_hash)
- Added unit tests for Bloom Filter parsing (test_bloom_filter_empty, test_bloom_filter_int64)
- Added comprehensive integration tests using bloom_filter_test.orc file:
  - Basic read test to verify file structure and schema
  - Equality query tests for integer columns (existing and non-existent values)
  - Equality query tests for string columns (name and email)
  - Equality query tests for age column
  - Backward compatibility test (reading without predicates)
  - Total of 9 integration tests covering various scenarios
- All tests pass successfully

Technical Details

Bloom Filter Structure

According to the ORC specification:

Each row group has its own Bloom Filter
Bloom Filters use Murmur3 64-bit hash for strings/binary
Numeric types use Thomas Wang's 64-bit integer hash
Bloom Filters are stored in BLOOM_FILTER or BLOOM_FILTER_UTF8 streams
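
For the numeric case, one possible rendering of Thomas Wang's 64-bit integer mix is sketched below. The function name thomas_wang_hash64_sketch is hypothetical and this is not the function from this PR. The use of signed (arithmetic) right shifts on i64 is an assumption; whether signed or unsigned shifts are required should be verified against the ORC reference implementation, since the two variants produce different hashes.

```rust
/// Sketch of Thomas Wang's 64-bit integer mix using wrapping arithmetic.
/// ASSUMPTION: `>>` on i64 is an arithmetic (sign-extending) shift here;
/// a variant using logical shifts would yield different values.
pub fn thomas_wang_hash64_sketch(key: i64) -> i64 {
    let mut k = key;
    k = (!k).wrapping_add(k << 21); // k = (k << 21) - k - 1
    k ^= k >> 24;
    k = k.wrapping_add(k << 3).wrapping_add(k << 8); // k * 265
    k ^= k >> 14;
    k = k.wrapping_add(k << 2).wrapping_add(k << 4); // k * 21
    k ^= k >> 28;
    k = k.wrapping_add(k << 31);
    k
}
```

Because the filter bits are set by the writer, the reader must reproduce the writer's hash bit for bit; a mismatch in a detail such as the shift kind can make the filter report existing values as absent.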

Query Optimization Flow

For equality queries (ComparisonOp::Equal), check Bloom Filter first
If Bloom Filter returns false, skip the row group immediately
If Bloom Filter returns true (or is unavailable), continue with statistics-based filtering
This two-stage filtering maximizes performance while maintaining correctness
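
The flow above can be sketched as a single decision function. RowGroupMeta and must_read_for_equality are hypothetical names used only for illustration, and the Bloom Filter is modelled as a plain closure rather than the BloomFilter type from this PR.

```rust
/// Hypothetical per-row-group metadata for an i64 column (illustrative only).
pub struct RowGroupMeta<'a> {
    /// Min/max statistics, if the file provides them.
    pub min_max: Option<(i64, i64)>,
    /// Bloom Filter lookup, if the file provides one; false means "definitely absent".
    pub bloom: Option<&'a dyn Fn(i64) -> bool>,
}

/// Returns false only when the row group provably cannot contain `value`.
pub fn must_read_for_equality(meta: &RowGroupMeta<'_>, value: i64) -> bool {
    // Stage 1: a negative Bloom Filter answer is conclusive.
    if let Some(might_contain) = meta.bloom {
        if !might_contain(value) {
            return false;
        }
    }
    // Stage 2: fall back to min/max statistics.
    if let Some((min, max)) = meta.min_max {
        if value < min || value > max {
            return false;
        }
    }
    // Neither stage could rule the value out, so the row group must be read.
    true
}
```

A row group is skipped if either stage rules the value out, so the order of the two stages affects only the cost of the check, not its result.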

Related Issues

This PR extends the predicate pushdown work from #64 by adding Bloom Filter support for equality queries. While not directly addressing #58, this optimization complements the RowSelection API design by providing another layer of filtering at the row group level.

Checklist

Code follows project style guidelines
Tests added and passing
Backward compatible (gracefully handles files without Bloom Filters)
Follows ORC specification

progval

Added integration tests using existing over1k_bloom.orc file

I don't see them.

Could you also add unit tests on non-empty Bloom Filters?


progval · 2025-12-06T13:05:44Z

src/row_index.rs

+            // Parse Bloom Filter index if available
+            let bloom_filters =
+                parse_bloom_filter_index(stripe_stream_map, column, proto_row_index.entry.len())?;
+
+            // Parse into RowGroupIndex with Bloom Filters
+            let row_group_index = parse_row_index_with_bloom_filters(
+                &proto_row_index,
+                column_id as usize,
+                rows_per_group,
+                bloom_filters,
+            )?;


This is probably fine for now, but let's keep in mind that it may be desirable in the future to only fetch/parse Bloom Filters after checking statistics, because they are relatively large.
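
A minimal sketch of that deferred approach, assuming a hypothetical row_group_may_match helper that receives the Bloom Filter as a loader closure (none of these names exist in the PR): the loader runs only when statistics cannot exclude the row group.

```rust
/// Hypothetical helper: check statistics first and load the Bloom Filter lazily.
/// `load_bloom` stands in for fetching and parsing the Bloom Filter stream.
pub fn row_group_may_match<B>(
    min_max: Option<(i64, i64)>,
    value: i64,
    load_bloom: impl FnOnce() -> Option<B>,
) -> bool
where
    B: Fn(i64) -> bool,
{
    if let Some((min, max)) = min_max {
        if value < min || value > max {
            // Statistics alone exclude the row group; the loader is never called.
            return false;
        }
    }
    match load_bloom() {
        // A loaded filter gives the final answer (false = definitely absent).
        Some(might_contain) => might_contain(value),
        // No Bloom Filter in the file: the row group has to be read.
        None => true,
    }
}
```

With this shape the cost of parsing a Bloom Filter is paid only for row groups that survive the cheaper statistics check.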

suxiaogang223 · 2025-12-06T13:20:30Z

Added integration tests using existing over1k_bloom.orc file

I don't see them.

Could you also add unit tests on non-empty Bloom Filters?

I generated a new ORC file with Bloom Filters using pyarrow and added test cases that use it.

progval

I generated a new ORC file with Bloom Filters using pyarrow and added test cases that use it.

Please add the Python file so we can easily regenerate it if needed.

progval · 2025-12-06T13:32:56Z

tests/integration/main.rs

+    // Should return at least some rows (Bloom Filter allows reading the row group)
+    let total_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
+    assert!(
+        total_rows > 0,
+        "Should return rows when Bloom Filter indicates value might exist"
+    );


Could you also check that some rows were eliminated by the BF? (And not just by statistics)

progval · 2025-12-06T13:34:46Z

tests/integration/main.rs

+    );
+    let f = File::open(path).unwrap();
+
+    // Query for id = 9999 (doesn't exist, ids are 1-1000)


This is not sufficient to check the row group was eliminated by BF because statistics would also eliminate it.
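
One way to build a probe that only the Bloom Filter can reject is to query a value that lies inside the row group's min/max range but is not stored in it. The helper below is a hypothetical test utility (absent_value_within_range is not part of this PR) that finds such a value from the known contents of a row group. It only helps if the test data has gaps, for example a file regenerated with even ids only.

```rust
/// Hypothetical test helper: return a value strictly between the minimum and
/// maximum of `values` that does not occur in `values`, if one exists.
pub fn absent_value_within_range(values: &[i64]) -> Option<i64> {
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    sorted
        .windows(2)
        // A gap exists when neighbours differ by more than 1
        // (an overflowing difference also means a very large gap).
        .find(|w| w[1].checked_sub(w[0]).map_or(true, |d| d > 1))
        .map(|w| w[0] + 1)
}
```

Statistics cannot exclude such a value because it falls between min and max, so if the query still skips the row group, the Bloom Filter must have done the elimination.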

progval · 2025-12-06T13:35:14Z

tests/integration/main.rs

+    );
+    let f = File::open(path).unwrap();
+
+    // Query for name = "nonexistent_user" (doesn't exist)


ditto (probably, depends on values in the file)

progval · 2025-12-06T13:35:35Z

tests/integration/main.rs

+    );
+    let f = File::open(path).unwrap();
+
+    // Query for age = 100 (doesn't exist, ages are 20-69)


codecov-commenter · 2025-12-07T13:24:45Z

Codecov Report

❌ Patch coverage is 88.27930% with 47 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@fdee23b). Learn more about missing BASE report.

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #69   +/-   ##
=======================================
  Coverage        ?   85.41%           
=======================================
  Files           ?       47           
  Lines           ?     7746           
  Branches        ?        0           
=======================================
  Hits            ?     6616           
  Misses          ?     1130           
  Partials        ?        0

suxiaogang223 · 2025-12-12T18:02:47Z

After testing, I found that the Bloom Filter implementation in this PR was inconsistent with the ORC specification :(, so I reimplemented it in a new PR: #72

suxiaogang223 added 3 commits December 6, 2025 19:28

impl bloom_filter

692998d

impl parse_bloom_filter_index

06732fb

fix

707fc60

progval reviewed Dec 6, 2025

src/bloom_filter.rs (outdated, resolved)

progval reviewed Dec 6, 2025

src/row_group_filter.rs (outdated, resolved)

progval reviewed Dec 6, 2025

suxiaogang223 added 2 commits December 6, 2025 21:11

add case

4a254fb

fix

c590c7e

fix

54798d5

progval reviewed Dec 6, 2025

suxiaogang223 added 5 commits December 6, 2025 22:38

filter by statistics first, then bloom filter

0de2e72

fix clippy

8ea01dc

add generate_orc_with_bloom_filter.py

d291ef7

test_expected_file("bloom_filter_test")

e5f04bc

fix license header

6243252

suxiaogang223 closed this Dec 12, 2025

suxiaogang223 deleted the support_bloom_filter branch December 12, 2025 18:11
