-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Hash UnionArrays #18718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Hash UnionArrays #18718
Conversation
71fab8e to
e8fb251
Compare
| #[allow(clippy::needless_range_loop)] | ||
| for i in 0..array.len() { | ||
| let type_id = array.type_id(i); | ||
| let child_offset = array.value_offset(i); | ||
|
|
||
| let child_hash = &child_hashes[type_id as usize] | ||
| .as_ref() | ||
| .expect("invalid type_id"); | ||
|
|
||
| hashes_buffer[i] = child_hash[child_offset]; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that clippy's suggestion hurt readability
error: the loop variable `i` is used to index `hashes_buffer`
--> datafusion/common/src/hash_utils.rs:351:14
|
351 | for i in 0..array.len() {
| ^^^^^^^^^^^^^^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#needless_range_loop
= note: `-D clippy::needless-range-loop` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(clippy::needless_range_loop)]`
help: consider using an iterator and enumerate()
|
351 - for i in 0..array.len() {
351 + for (i, <item>) in hashes_buffer.iter_mut().enumerate().take(array.len()) {
|
e8fb251 to
a68f0d8
Compare
datafusion/common/src/hash_utils.rs
Outdated
| unreachable!() | ||
| }; | ||
|
|
||
| let mut child_hashes = vec![None; 128]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is 128 the max? I feel like just using a HashMap and then doing a get() below instead would be more readable and the same performance (maybe unless we 128 single item arrays or something...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Type ids are i8s, and negative values should not be valid so we bound the total number of spaces to 128
In terms of using a direct-indexed Vec by type id over a hashmap, I just chose not to overcomplicate things. Fwiw, I've yet to write benchmarks for this (I'll plan on doing that later on) and if this part shows up in a profile, I'll happily update it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scratch that. I'll use a hashmap
a68f0d8 to
55ddb39
Compare
Which issue does this PR close?
UnionArrays hashable #18717Rationale for this change
This PR adds hash support for Union data types, enabling group by, distinct, hash joins, and aggregations on union-typed columns
hash_union_arrayhashes each child array once. Then for every row, it uses the type id and offset to retrieve the appropriate hash value