feat: Add Bloom Filter Support for Equality Query Optimization #69
progval
left a comment
Added integration tests using existing over1k_bloom.orc file
I don't see them.
Could you also add unit tests on non-empty Bloom Filters?
    // Parse Bloom Filter index if available
    let bloom_filters =
        parse_bloom_filter_index(stripe_stream_map, column, proto_row_index.entry.len())?;

    // Parse into RowGroupIndex with Bloom Filters
    let row_group_index = parse_row_index_with_bloom_filters(
        &proto_row_index,
        column_id as usize,
        rows_per_group,
        bloom_filters,
    )?;
This is probably fine for now, but let's keep in mind that it may be desirable in the future to only fetch/parse Bloom Filters after checking statistics, because they are relatively large.
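The deferred approach suggested above could look roughly like the following. This is a minimal sketch, not code from this PR: `Stats`, `can_skip_row_group`, and the `load_bloom` closure are hypothetical names, and the closure merely stands in for fetching and decoding a Bloom Filter stream.

```rust
/// Min/max statistics for one row group (hypothetical shape for illustration).
struct Stats {
    min: i64,
    max: i64,
}

/// Decide whether a row group can be skipped for the predicate `column = value`.
///
/// `load_bloom` is an assumed stand-in for fetching and decoding the Bloom
/// Filter stream. It is only invoked when statistics alone cannot rule the
/// row group out, so the relatively large filter is never touched otherwise.
fn can_skip_row_group<F>(stats: &Stats, value: i64, load_bloom: F) -> bool
where
    F: FnOnce() -> Option<Box<dyn Fn(i64) -> bool>>,
{
    // Cheap check first: a value outside [min, max] cannot be present.
    if value < stats.min || value > stats.max {
        return true;
    }
    // Statistics were inconclusive, so pay for the Bloom Filter now.
    match load_bloom() {
        Some(might_contain) => !might_contain(value),
        // No Bloom Filter available: the row group has to be read.
        None => false,
    }
}
```

With this ordering, a query such as `id = 9999` against a row group whose ids span 1-1000 never triggers the load at all.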
I generated a new orc file with bloom filter by pyarrow, and added cases with this.
progval
left a comment
I generated a new orc file with bloom filter by pyarrow, and added cases with this.
Please add the Python file so we can easily regenerate it if needed.
    // Should return at least some rows (Bloom Filter allows reading the row group)
    let total_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
    assert!(
        total_rows > 0,
        "Should return rows when Bloom Filter indicates value might exist"
    );
Could you also check that some rows were eliminated by the BF? (And not just by statistics)
    );
    let f = File::open(path).unwrap();

    // Query for id = 9999 (doesn't exist, ids are 1-1000)
This is not sufficient to check the row group was eliminated by BF because statistics would also eliminate it.
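To make the point concrete: a probe value only exercises the Bloom Filter if it lies inside the row group's min/max range while being absent from the data. The sketch below shows one way to pick such values. `probe_candidates` and `stats_can_prune` are hypothetical helpers for illustration, not part of this PR, and the set of present values is assumed to be known to the test author.

```rust
use std::collections::BTreeSet;

/// Return up to `limit` values that lie inside [min, max] of `present` but are
/// not themselves present. Min/max statistics cannot prune these values, so
/// any row-group elimination for them has to come from the Bloom Filter.
/// Note: this scans the range linearly, which is fine for small test data only.
fn probe_candidates(present: &BTreeSet<i64>, limit: usize) -> Vec<i64> {
    let (min, max) = match (present.iter().next(), present.iter().next_back()) {
        (Some(&min), Some(&max)) => (min, max),
        _ => return Vec::new(), // empty data: nothing to probe
    };
    (min..=max)
        .filter(|v| !present.contains(v))
        .take(limit)
        .collect()
}

/// What min/max statistics alone can conclude for an equality predicate.
fn stats_can_prune(min: i64, max: i64, value: i64) -> bool {
    value < min || value > max
}
```

If every id from 1 to 1000 is present, `probe_candidates` returns nothing, so the test file would need gaps in its values. Because a Bloom Filter can report false positives, a test would probably want to try several candidates rather than depend on a single one being rejected.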
    );
    let f = File::open(path).unwrap();

    // Query for name = "nonexistent_user" (doesn't exist)
ditto (probably, depends on values in the file)
    );
    let f = File::open(path).unwrap();

    // Query for age = 100 (doesn't exist, ages are 20-69)
ditto
Codecov Report
❌ Patch coverage is
Additional details and impacted files

    @@           Coverage Diff           @@
    ##             main      #69   +/-   ##
    =======================================
      Coverage        ?   85.41%
    =======================================
      Files           ?       47
      Lines           ?     7746
      Branches        ?        0
    =======================================
      Hits            ?     6616
      Misses          ?     1130
      Partials        ?        0
After testing, I found that the Bloom Filter implementation in this PR was inconsistent with the ORC specification :(, so I reimplemented it in a new PR: #72
Summary
This PR implements Bloom Filter support for ORC files to optimize equality query performance. Bloom Filters allow the reader to quickly determine if a value might exist in a row group, enabling entire row groups to be skipped when the filter indicates the value is definitely not present.
This builds upon the Row Group Index support implemented in #64 and is part of the ongoing predicate pushdown optimization work.
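As a rough illustration of the "might exist" versus "definitely not present" behaviour described above, here is a toy Bloom Filter for 64-bit integers. It is a sketch under stated assumptions, not this PR's implementation: `ToyBloomFilter` is a hypothetical type, the hash is Thomas Wang's original 64-bit mix with logical shifts, and the bit positions are derived by simple double hashing. An ORC-compatible reader has to follow the ORC specification for the exact hashing and bit layout, which may differ from this.

```rust
/// Thomas Wang's 64-bit integer mix, using wrapping arithmetic and logical
/// shifts. Whether a given file format uses logical or arithmetic shifts is
/// something to verify against its specification.
fn thomas_wang_hash64(mut k: u64) -> u64 {
    k = (!k).wrapping_add(k << 21);
    k ^= k >> 24;
    k = k.wrapping_add(k << 3).wrapping_add(k << 8);
    k ^= k >> 14;
    k = k.wrapping_add(k << 2).wrapping_add(k << 4);
    k ^= k >> 28;
    k.wrapping_add(k << 31)
}

/// A toy Bloom Filter (hypothetical, for illustration only).
struct ToyBloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u64,
}

impl ToyBloomFilter {
    fn new(num_bits: u64, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Derive `num_hashes` bit positions from one 64-bit hash (double hashing).
    fn positions(&self, value: i64) -> Vec<u64> {
        let h = thomas_wang_hash64(value as u64);
        let (h1, h2) = (h & 0xffff_ffff, h >> 32);
        (1..=self.num_hashes)
            .map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % self.num_bits)
            .collect()
    }

    fn insert(&mut self, value: i64) {
        for p in self.positions(value) {
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    /// `false` means the value was definitely never inserted;
    /// `true` means it may have been (false positives are possible).
    fn might_contain(&self, value: i64) -> bool {
        self.positions(value)
            .into_iter()
            .all(|p| self.bits[(p / 64) as usize] & (1u64 << (p % 64)) != 0)
    }
}
```

The asymmetry is the useful property for row-group skipping: a `false` answer is always safe to act on, while a `true` answer only means the row group still has to be examined.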
Changes
Core Implementation
New module
bloom_filter.rs: Implements Bloom Filter parsing and querying
- Parses BLOOM_FILTER and BLOOM_FILTER_UTF8 streams
- Provides a might_contain() method to check if a value might be present
Hash functions:
- murmur3_64(): Murmur3-128 hash function for string and binary types (per ORC spec)
- thomas_wang_hash64(): Thomas Wang's 64-bit integer hash for numeric types
Row index integration:
- RowGroupEntry now includes an optional BloomFilter
- parse_bloom_filter_index() parses Bloom Filter streams
- parse_stripe_row_indexes() combines row indexes with Bloom Filters
Predicate pushdown integration:
- evaluate_comparison() in row_group_filter.rs checks Bloom Filters first for equality queries
Benefits
Testing
- Hash function tests (test_thomas_wang_hash, test_murmur3_hash)
- Bloom Filter tests (test_bloom_filter_empty, test_bloom_filter_int64)
- Tests using the bloom_filter_test.orc file
Technical Details
Bloom Filter Structure
According to ORC specification:
- Bloom Filters are stored in BLOOM_FILTER or BLOOM_FILTER_UTF8 streams
Query Optimization Flow
1. For equality comparisons (ComparisonOp::Equal), check Bloom Filter first
2. If the Bloom Filter returns false, skip the row group immediately
3. If it returns true (or is unavailable), continue with statistics-based filtering
Related Issues
This PR extends the predicate pushdown work from #64 by adding Bloom Filter support for equality queries. While not directly addressing #58, this optimization complements the RowSelection API design by providing another layer of filtering at the row group level.
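The query optimization flow listed under Technical Details can be sketched as a small decision function. This is an illustrative sketch only: `decide_row_group`, `MinMax`, and `Decision` are hypothetical names, the `ComparisonOp` enum here is a reduced stand-in for the one in the codebase, and the Bloom Filter is represented as an optional callback rather than the PR's actual `BloomFilter` type.

```rust
/// Reduced stand-in for the comparison operators (assumed subset).
#[derive(Clone, Copy, PartialEq, Debug)]
enum ComparisonOp {
    Equal,
    LessThan,
    GreaterThan,
}

/// Min/max statistics for one row group (hypothetical shape).
struct MinMax {
    min: i64,
    max: i64,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Skip,
    Read,
}

/// Bloom Filter first for equality, then fall through to statistics.
/// `bloom` is `None` when the row group has no Bloom Filter.
fn decide_row_group(
    op: ComparisonOp,
    value: i64,
    stats: &MinMax,
    bloom: Option<&dyn Fn(i64) -> bool>,
) -> Decision {
    // Steps 1 and 2: only equality predicates can use the Bloom Filter.
    if op == ComparisonOp::Equal {
        if let Some(might_contain) = bloom {
            if !might_contain(value) {
                return Decision::Skip;
            }
        }
    }
    // Step 3: filter returned true or is unavailable, so consult statistics.
    let may_match = match op {
        ComparisonOp::Equal => stats.min <= value && value <= stats.max,
        ComparisonOp::LessThan => stats.min < value,
        ComparisonOp::GreaterThan => stats.max > value,
    };
    if may_match {
        Decision::Read
    } else {
        Decision::Skip
    }
}
```

Range predicates bypass the Bloom Filter entirely in this sketch, since a membership test says nothing about ordering.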
Checklist