Contributor

@suxiaogang223 suxiaogang223 commented Dec 6, 2025

Summary

This PR implements Bloom Filter support for ORC files to optimize equality query performance. Bloom Filters allow the reader to quickly determine if a value might exist in a row group, enabling entire row groups to be skipped when the filter indicates the value is definitely not present.

This builds upon the Row Group Index support implemented in #64 and is part of the ongoing predicate pushdown optimization work.

Changes

Core Implementation

  • New module bloom_filter.rs: Implements Bloom Filter parsing and querying

    • Parses Bloom Filter data from BLOOM_FILTER and BLOOM_FILTER_UTF8 streams
    • Implements might_contain() method to check if a value might be present
    • Supports both regular bitset (for numeric types) and UTF8 bitset (for string types)
  • Hash functions:

    • murmur3_64(): 64-bit variant of Murmur3 (derived from the 128-bit algorithm) for string and binary types (per ORC spec)
    • thomas_wang_hash64(): Thomas Wang's 64-bit integer hash for numeric types
  • Row index integration:

    • Extended RowGroupEntry to include optional BloomFilter
    • Added parse_bloom_filter_index() to parse Bloom Filter streams
    • Updated parse_stripe_row_indexes() to combine row indexes with Bloom Filters
  • Predicate pushdown integration:

    • Modified evaluate_comparison() in row_group_filter.rs to check Bloom Filters first for equality queries
    • If Bloom Filter indicates value is definitely not present, the row group is skipped immediately
    • Falls back to statistics-based filtering if Bloom Filter check passes or is unavailable

Benefits

  • Performance improvement: Equality queries can skip entire row groups without reading data when Bloom Filters indicate the value is not present
  • Backward compatible: Works gracefully when Bloom Filters are not present in ORC files
  • Standards compliant: Follows ORC specification for Bloom Filter implementation

Testing

  • Added unit tests for hash functions (test_thomas_wang_hash, test_murmur3_hash)
  • Added unit tests for Bloom Filter parsing (test_bloom_filter_empty, test_bloom_filter_int64)
  • Added comprehensive integration tests using bloom_filter_test.orc file:
    • Basic read test to verify file structure and schema
    • Equality query tests for integer columns (existing and non-existent values)
    • Equality query tests for string columns (name and email)
    • Equality query tests for age column
    • Backward compatibility test (reading without predicates)
    • Total of 9 integration tests covering various scenarios
  • All tests pass successfully

Technical Details

Bloom Filter Structure

According to the ORC specification:

  • Each row group has its own Bloom Filter
  • Bloom Filters use Murmur3 64-bit hash for strings/binary
  • Numeric types use Thomas Wang's 64-bit integer hash
  • Bloom Filters are stored in BLOOM_FILTER or BLOOM_FILTER_UTF8 streams
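
A minimal sketch of the structure listed above, for numeric keys only. It assumes the 64-bit hash is split into two 32-bit halves that are combined to derive each bit position, and it stores the bits in a `Vec<u64>`. Both choices, along with the names `BloomFilterSketch`, `bit_positions`, `insert_i64`, and `might_contain_i64`, are illustrative assumptions and not the code in this PR. `thomas_wang_hash64` reuses the name from the description, but its body (including arithmetic right shifts on a signed value) is also an assumption that should be checked against the ORC specification and reference implementation before it is used on real files.

```rust
/// Thomas Wang's 64-bit integer hash, written with wrapping arithmetic.
/// Assumption: right shifts are arithmetic (signed), as `>>` is on `i64`.
fn thomas_wang_hash64(key: i64) -> i64 {
    let mut k = key;
    k = (!k).wrapping_add(k << 21);
    k ^= k >> 24;
    k = k.wrapping_add(k << 3).wrapping_add(k << 8);
    k ^= k >> 14;
    k = k.wrapping_add(k << 2).wrapping_add(k << 4);
    k ^= k >> 28;
    k = k.wrapping_add(k << 31);
    k
}

/// Derives one bit position per hash function from a single 64-bit hash.
/// Assumption: positions come from combining the low and high 32-bit halves.
fn bit_positions(hash64: i64, num_bits: usize, num_hash_functions: u32) -> Vec<usize> {
    let hash1 = hash64 as i32;
    let hash2 = ((hash64 as u64) >> 32) as i32;
    (1..=num_hash_functions as i32)
        .map(|i| {
            let mut combined = hash1.wrapping_add(i.wrapping_mul(hash2));
            if combined < 0 {
                // Flip all bits so the value is non-negative before the modulo.
                combined = !combined;
            }
            (combined as usize) % num_bits
        })
        .collect()
}

/// Hypothetical in-memory Bloom Filter for one row group.
struct BloomFilterSketch {
    words: Vec<u64>,
    num_bits: usize,
    num_hash_functions: u32,
}

impl BloomFilterSketch {
    fn new(num_bits: usize, num_hash_functions: u32) -> Self {
        assert!(num_bits > 0 && num_hash_functions > 0);
        Self {
            words: vec![0u64; (num_bits + 63) / 64],
            num_bits,
            num_hash_functions,
        }
    }

    fn insert_i64(&mut self, value: i64) {
        let hash = thomas_wang_hash64(value);
        for pos in bit_positions(hash, self.num_bits, self.num_hash_functions) {
            self.words[pos / 64] |= 1u64 << (pos % 64);
        }
    }

    /// `false` means the value is definitely absent; `true` means it may be present.
    fn might_contain_i64(&self, value: i64) -> bool {
        let hash = thomas_wang_hash64(value);
        bit_positions(hash, self.num_bits, self.num_hash_functions)
            .iter()
            .all(|&pos| (self.words[pos / 64] & (1u64 << (pos % 64))) != 0)
    }
}
```

A filter built this way never returns `false` for a value that was inserted, which is the property the row group skipping relies on. It can return `true` for a value that was never inserted, so a `true` result only means the row group still has to be considered.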

Query Optimization Flow

  1. For equality queries (ComparisonOp::Equal), check Bloom Filter first
  2. If Bloom Filter returns false, skip the row group immediately
  3. If Bloom Filter returns true (or is unavailable), continue with statistics-based filtering
  4. This two-stage filtering skips any row group that either check can rule out, and it stays correct because a Bloom Filter never reports a present value as absent
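
The flow above can be sketched as a single function. This is an illustration under assumptions, not the body of `evaluate_comparison()`: the names `RowGroupDecision`, `IntStats`, and `evaluate_equal` are hypothetical, the Bloom Filter is modelled as an optional callback, and only `i64` min/max statistics are considered.

```rust
/// Outcome of evaluating an equality predicate against one row group.
#[derive(Debug, PartialEq)]
enum RowGroupDecision {
    Skip,
    Read,
}

/// Hypothetical min/max statistics for an integer column in one row group.
struct IntStats {
    min: i64,
    max: i64,
}

/// Two-stage check for `column = value`.
/// Stage 1: Bloom Filter (if present). Stage 2: min/max statistics (if present).
fn evaluate_equal(
    value: i64,
    bloom_might_contain: Option<&dyn Fn(i64) -> bool>,
    stats: Option<&IntStats>,
) -> RowGroupDecision {
    if let Some(might_contain) = bloom_might_contain {
        if !might_contain(value) {
            // Definitely absent: no need to look at statistics.
            return RowGroupDecision::Skip;
        }
    }
    if let Some(s) = stats {
        if value < s.min || value > s.max {
            return RowGroupDecision::Skip;
        }
    }
    // Neither check could rule the value out, so the row group must be read.
    RowGroupDecision::Read
}
```

When both the Bloom Filter and the statistics are missing, the function returns `Read`, which matches the backward-compatible behaviour described for files without Bloom Filters.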

Related Issues

This PR extends the predicate pushdown work from #64 by adding Bloom Filter support for equality queries. While not directly addressing #58, this optimization complements the RowSelection API design by providing another layer of filtering at the row group level.

Checklist

  • Code follows project style guidelines
  • Tests added and passing
  • Backward compatible (gracefully handles files without Bloom Filters)
  • Follows ORC specification

Collaborator

@progval progval left a comment

> Added integration tests using existing over1k_bloom.orc file

I don't see them.

Could you also add unit tests on non-empty Bloom Filters?

Comment on lines +263 to +273
// Parse Bloom Filter index if available
let bloom_filters =
    parse_bloom_filter_index(stripe_stream_map, column, proto_row_index.entry.len())?;

// Parse into RowGroupIndex with Bloom Filters
let row_group_index = parse_row_index_with_bloom_filters(
    &proto_row_index,
    column_id as usize,
    rows_per_group,
    bloom_filters,
)?;
Collaborator

This is probably fine for now, but let's keep in mind that it may be desirable in the future to only fetch/parse Bloom Filters after checking statistics, because they are relatively large.
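
One way to picture that ordering is the sketch below, which checks statistics first and only invokes a loader for the Bloom Filter when statistics cannot prune the row group. The names `RangeStats` and `must_read_lazy` and the loader closure are hypothetical; how the real reader would defer fetching or decoding the Bloom Filter stream is not specified here.

```rust
/// Hypothetical min/max statistics for one row group.
struct RangeStats {
    min: i64,
    max: i64,
}

/// Returns `true` if the row group must be read for `column = value`.
/// `load_bloom` stands in for fetching and parsing the Bloom Filter; it is
/// only called when statistics cannot already rule the value out.
fn must_read_lazy<L, B>(value: i64, stats: Option<&RangeStats>, load_bloom: L) -> bool
where
    L: FnOnce() -> Option<B>,
    B: Fn(i64) -> bool,
{
    if let Some(s) = stats {
        if value < s.min || value > s.max {
            // Pruned by statistics; the Bloom Filter is never loaded.
            return false;
        }
    }
    match load_bloom() {
        Some(might_contain) => might_contain(value),
        // No Bloom Filter available: fall back to reading the row group.
        None => true,
    }
}
```

The trade-off is an extra branch in the evaluation path in exchange for not paying the cost of the larger Bloom Filter data when cheap statistics already eliminate the row group.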

@suxiaogang223
Contributor Author

> > Added integration tests using existing over1k_bloom.orc file
>
> I don't see them.
>
> Could you also add unit tests on non-empty Bloom Filters?

I generated a new ORC file with Bloom Filters using pyarrow and added test cases that use it.

Collaborator

@progval progval left a comment

> I generated a new ORC file with Bloom Filters using pyarrow and added test cases that use it.

Please add the Python file so we can easily regenerate it if needed.

Comment on lines +421 to +426
// Should return at least some rows (Bloom Filter allows reading the row group)
let total_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
assert!(
    total_rows > 0,
    "Should return rows when Bloom Filter indicates value might exist"
);
Collaborator

Could you also check that some rows were eliminated by the BF? (And not just by statistics)

);
let f = File::open(path).unwrap();

// Query for id = 9999 (doesn't exist, ids are 1-1000)
Collaborator

This is not sufficient to check the row group was eliminated by BF because statistics would also eliminate it.
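
To isolate the Bloom Filter's contribution, a test needs a probe value that lies inside the row group's min/max range yet is absent from the data, so that statistics alone cannot prune it. The helper below is a hypothetical sketch for picking such a value from the known fixture contents; `absent_value_within_range` and `stats_may_contain` are illustrative names, not part of the crate. If every value in the range is present (as may be the case for contiguous ids), no such probe exists and the fixture would need gaps.

```rust
use std::collections::HashSet;

/// Returns a value between the minimum and maximum of `values` that does not
/// occur in `values`, or `None` if the range has no gaps (or is empty).
/// Intended for small test fixtures: it scans the whole range.
fn absent_value_within_range(values: &[i64]) -> Option<i64> {
    let min = *values.iter().min()?;
    let max = *values.iter().max()?;
    let present: HashSet<i64> = values.iter().copied().collect();
    (min..=max).find(|v| !present.contains(v))
}

/// What min/max statistics alone can conclude for an equality predicate.
fn stats_may_contain(min: i64, max: i64, value: i64) -> bool {
    min <= value && value <= max
}
```

Because a Bloom Filter can report false positives, a single probe may still pass the filter by chance; checking several absent in-range values and asserting that at least some row groups are skipped is a more robust test.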

);
let f = File::open(path).unwrap();

// Query for name = "nonexistent_user" (doesn't exist)
Collaborator

ditto (probably, depends on values in the file)

);
let f = File::open(path).unwrap();

// Query for age = 100 (doesn't exist, ages are 20-69)
Collaborator

ditto

@codecov-commenter

Codecov Report

❌ Patch coverage is 88.27930% with 47 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@fdee23b). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #69   +/-   ##
=======================================
  Coverage        ?   85.41%           
=======================================
  Files           ?       47           
  Lines           ?     7746           
  Branches        ?        0           
=======================================
  Hits            ?     6616           
  Misses          ?     1130           
  Partials        ?        0           

@suxiaogang223
Contributor Author

After testing, I found that the Bloom Filter implementation in this PR is inconsistent with the ORC specification :(, so I reimplemented it in a new PR: #72

@suxiaogang223 suxiaogang223 deleted the support_bloom_filter branch December 12, 2025 18:11